Context: My experience is primarily with psychology papers (heuristics & biases, social psych, and similar areas), and it seems to generalize pretty well to other social science research and fields with similar sorts of methods.
One way to think about this is to break it into three main questions:
1. Is this "result" just noise? Or would it replicate?
2. (If there's something besides noise) Is there anything interesting going on here? Or are all the "effects" just confounds, statistical artifacts, demonstrating the obvious, etc.?
3. (If there is something interesting going on here) What is going on here? What's the main takeaway? What can we learn from this? Does it support the claim that some people are tempted to use it to support?
There is some benefit just to explicitly considering all three questions, and keeping them separate.
For #1 ("Is this just noise?") people apparently do a pretty good job of predicting which studies will replicate. Relevant factors include:
1a. How strong is the empirical result (tiny p value, large sample size, precise estimate of effect size, etc.).
1b. How plausible is this effect on priors? Including: How big an effect size would you expect on priors? And: How definitively does the researchers' theory predict this particular empirical result?
1c. Experimenter degrees of freedom / garden of forking paths / possibility of p-hacking. Preregistration is best, visible signs of p-hacking are worst.
1d. How filtered is this evidence? How much publication bias?
1e. How much do I trust the researchers about things like (1c) and (1d)?
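To get a feel for why 1c matters, here is a rough sketch (my own illustration, not taken from any study) of how quickly the chance of a spurious "significant" result grows with the number of analyses tried. It assumes the analyses are independent tests at alpha = 0.05 with no true effect anywhere; real forking paths are usually correlated, which typically makes the actual rate somewhat lower. The function name is hypothetical.

```python
def false_positive_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of num_tests independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** num_tests


for k in (1, 5, 20):
    rate = false_positive_rate(k)
    print(f"{k:>2} analyses tried: {rate:.0%} chance of at least one 'significant' result")
```

Under these assumptions the rate goes from 5% with one analysis to roughly 23% with five and roughly 64% with twenty, which is why preregistration (fixing the analysis in advance) counts for so much.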
I've found that this post on how to think about whether a replication study "failed" also seems to have helped clarify my thinking about whether a study is likely to replicate.
If there are many studies of essentially the same phenomenon, then try to find the methodologically strongest few and focus mainly on those. (Rather than picking one study at random and dismissing the whole area of research if that study is bad, or assuming that just because there are lots of studies they must add up to solid evidence.)
If you care about effect size, it's also worth keeping in mind that the things which turn noise into "statistically significant results" also tend to inflate effect sizes.
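A small simulation makes this inflation concrete. This is only a sketch under made-up assumptions: a true effect of 0.2 standard deviations, 20 participants per group, a known standard deviation of 1 (so a simple z-test works), and a "journal" that only keeps results with p < .05. None of these numbers come from a real study.

```python
import random
import statistics


def simulate_significance_filter(true_effect=0.2, n_per_group=20, n_studies=5000, seed=0):
    """Return (mean estimate over all studies, mean estimate over 'significant' ones)."""
    rng = random.Random(seed)
    # Standard error of a difference in means when both groups have SD = 1 (assumed known).
    std_error = (2 / n_per_group) ** 0.5
    all_estimates, significant_estimates = [], []
    for _ in range(n_studies):
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1) for _ in range(n_per_group)]
        estimate = statistics.fmean(treated) - statistics.fmean(control)
        all_estimates.append(estimate)
        if abs(estimate) / std_error > 1.96:  # two-sided z-test at p < .05
            significant_estimates.append(estimate)
    return statistics.fmean(all_estimates), statistics.fmean(significant_estimates)


overall, filtered = simulate_significance_filter()
print(f"true effect 0.20 | all studies {overall:.2f} | significant only {filtered:.2f}")
```

With samples this small, a study can only reach significance if its estimate lands far from zero, so the studies that pass the filter overstate the true effect by a wide margin even though the full set of studies is unbiased.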
For #2 ("Is there anything interesting going on here?"), understanding methodology & statistics is pretty central. Partly that's background knowledge & expertise that you keep building up over the years, and partly that's taking the time & effort to sort out what's going on in this study (if you care about this study and can't sort it out quickly). Sometimes you can also find other writings that comment on the methodology of this study, which can help a lot. You can try googling for criticisms of this particular study or line of research (or check Google Scholar for papers that have cited it), or google for criticisms of specific methods they used. It is often easier to recognize when someone makes a good argument than to come up with that argument yourself.
One framing that helps me think about a study's methodology (and whether or not there's anything interesting going on here) is to try to flesh out "null hypothesis world": in the world where nothing interesting is going on, what would I expect to see come out of this experimental process? Sometimes I'll come up with more than one world that feels like a null hypothesis world. Exercise: try that with this study (Egan, Santos, Bloom 2007). Another exercise: Try that with the hot hand effect.
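As one possible way to flesh out a null hypothesis world for the hot hand (skip this if you want to try the exercise first), suppose every shot is an independent fair coin flip, and for each short sequence you compute the share of hits that are immediately followed by another hit. The sketch below enumerates every equally likely sequence exactly; the function name and the choice of sequence lengths are my own assumptions.

```python
from fractions import Fraction
from itertools import product


def expected_hit_rate_after_hit(num_shots: int) -> Fraction:
    """Average, over all equally likely fair-coin sequences, of the within-sequence
    share of hits that are immediately followed by another hit."""
    rates = []
    for seq in product((0, 1), repeat=num_shots):
        followers = [seq[i + 1] for i in range(num_shots - 1) if seq[i] == 1]
        if followers:  # sequences with no hit before the last shot are dropped
            rates.append(Fraction(sum(followers), len(followers)))
    return sum(rates, Fraction(0)) / len(rates)


for n in (3, 4, 10):
    rate = expected_hit_rate_after_hit(n)
    print(f"{n} shots: expected hit rate after a hit = {rate} (about {float(rate):.3f})")
```

For 3 shots the answer is 5/12 and for 4 shots it is 17/42, both below 1/2. So in this null world, averaging per-sequence "hit rate after a hit" looks like evidence against streaks even though nothing is going on, which is the kind of thing the null-hypothesis-world framing is meant to surface.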
#3 ("What is going on here?") is the biggest/broadest question of the three. It's the one that I spend the most time on (at least if the study is any good), and it's the one that I could most easily write a whole bunch about (making lots of points and elaborating on them). But it's also the one that is the most distant from Eli's original question, and I don't want to turn this post into a big huge essay, so I'll just highlight a few things here.
A big part of the challenge is thinking for yourself about what's going on and not being too anchored on how things are described by the authors (or the press release or the person who told you about the study). Some moves here:
3a. Imagine (using your inner sim) being a participant in the study, such that you can picture what each part of the study was like. In particular, be sure that you understand every experimental manipulation and measurement in concrete terms (okay, so then they filled out this questionnaire which asked if you agree with statements like such-and-such and blah-blah-blah).
3b. Be sure you can clearly state the pattern of results of the main finding, in a concrete way which is not laden with the authors' theory (e.g. not "this group was depleted" but "this group gave up on the puzzles sooner"). You need this plus 3a to understand what happened in the study, then from there you're trying to draw inferences about what the study implies.
3c. Come up with (one or several) possible models/theories about what could be happening in this study. Especially look for ones that seem commonsensical / that are based in how you'd inner sim yourself or other people in the experimental scenario. It's fine if you have a model that doesn't make a crisp prediction, or if you have a theory that seems a lot like the authors' theory (but without their jargon). Exercise: try that with a typical willpower depletion study.
3d. Have in mind the key takeaway of the study (e.g., the one sentence summary that you would tell a friend; this is the thing that's the main reason why you're interested in reading the study). Poke at that sentence to see if you understand what each piece of it means. As you're looking at the study, see if that key takeaway actually holds up. e.g., Does the main pattern of results match this takeaway or do they not quite match up? Does the study distinguish the various models that you've come up with well enough to strongly support this main takeaway? Can you edit the takeaway claim to make it more precise / to more clearly reflect what happened in the study / to make the specifics of the study unsurprising to someone who heard the takeaway? What sort of research would it take to provide really strong support for that takeaway, and how does the study at hand compare to that?
3e. Look for concrete points of reference outside of this study which resemble the sort of thing the researchers are talking about. Search in particular for ones that seem out-of-sync with this study. e.g., This study says not to tell other people your goals, but the other day I told Alex about something I wanted to do and that seemed useful; do the specifics of this experiment change my sense of whether that conversation with Alex was a good idea?
Some narrower points which don't neatly fit into my 3-category breakdown:
A. If you care about effect sizes then consider doing a Fermi estimate, or otherwise translating the effect size into numbers that are intuitively meaningful to you. Also think about the range of possible effect sizes rather than just the point estimate, and remember that the issues with noise in #1 also inflate effect size.
B. If the paper finds a null effect and claims that it's meaningful (e.g., that the intervention didn't help) then you do care about effect sizes. (e.g., If it claims the intervention failed because it had no effect on mortality rates, then you might assume a value of $10M per life and try to calculate a 95% confidence interval on the value of the intervention based solely on its effect on mortality.)
C. New papers that claim to debunk an old finding are often right when they claim that the old finding has issues with #1 (it didn't replicate) or #2 (it had methodological flaws) but are rarely actually debunkings if they claim that the old finding has issues with #3 (it misdescribes what's really going on). The new study on #3 might be important and cause you to change your thinking in some ways, but it's generally an incremental update rather than a debunking. Examples that look to me like successful debunkings: behavioral social priming research (#1), the Dennis-dentist effect (#2), the hot hand fallacy (#2 and some of B), the Stanford Prison Experiment (closest to #2), various other things that didn't replicate (#1). Examples of alleged "debunkings" which seem like interesting but overhyped incremental research: the bystander effect (#3), loss aversion (this study) (#3), the endowment effect (#3).
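The calculation suggested in point B can be sketched as follows. All trial numbers here are made up for illustration (10,000 people per arm, 100 deaths in each), the $10M per life figure is the one assumed in B, and the interval is a simple Wald (normal approximation) interval on the risk difference, which is only one of several reasonable choices.

```python
from statistics import NormalDist


def value_interval_per_person(deaths_control, n_control, deaths_treated, n_treated,
                              value_per_life=10_000_000, confidence=0.95):
    """Wald interval on mortality risk reduction, converted to dollars per person treated."""
    risk_control = deaths_control / n_control
    risk_treated = deaths_treated / n_treated
    risk_reduction = risk_control - risk_treated
    std_error = (risk_control * (1 - risk_control) / n_control
                 + risk_treated * (1 - risk_treated) / n_treated) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for a 95% interval
    low = (risk_reduction - z * std_error) * value_per_life
    high = (risk_reduction + z * std_error) * value_per_life
    return low, high


# Hypothetical trial: identical death counts in both arms, i.e. a point estimate of "no effect".
low, high = value_interval_per_person(100, 10_000, 100, 10_000)
print(f"95% interval on value per person treated: ${low:,.0f} to ${high:,.0f}")
```

With these made-up numbers the interval runs from roughly -$27,600 to +$27,600 per person treated, so a "null effect on mortality" in a trial this size is compatible with the intervention being extremely valuable or seriously harmful. That is the sense in which a null result forces you to care about effect sizes.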
Awards for the Best Answers
When this question was posted a month ago, I liked it so much that I offered $100 of my own money for what I judged to be the best answer and another $50 to the best distillation. Here's what I think:
Overall prize for best answer ($100): Unnamed
Additional prizes ($25): waveman, Bucky
I will reach out to these authors via DM to arrange payment.
No one attempted what seemed to me like a proper distillation of other responses, so I won't be awarding the distillation prize here. However, I intend to write and publish my own distillation/synthesis of the responses soon.
Some thoughts on each of the replies:
Unnamed [winner]: This answer felt very thorough and detailed, and it feels like it's a guide I could really follow to dramatically improve my ability to assess studies. I'm assuming limitations of LW's current editor meant the formatting couldn't be nicer, but I also really like how Unnamed broke down his overall response into three main questions ("Is this just noise?", "Is there anything interesting going on here?" and "What is going on here?") and then presented further sub-questions and examples to help one assess the high-level questions.
I'd like to better summarize Unnamed's response, but you should really just read it all.
waveman [winner]: waveman's reply hits a solid amount of breadth in how to assess studies. I feel like his response is an easy guide I could pin up on my wall and easily step through while reading papers. What I would really like to see is this response except further fleshed out with examples and resources, e.g. "read these specific papers or books on how studies get rigged." I'll note that I do have some pause with this response since other responders contradicted at least one part of it, e.g., Kristin Lindquist saying not to worry about the funding source of a study. I'd like to see these (perhaps only surface-level) disagreements resolved. Overall though, really solid answer that deserves its karma.
Bucky [winner]: Bucky's answer is deliciously technical. Rather than discussing high-level qualitative considerations to pay attention to (e.g. funding source, have there been reproductions), Bucky dives in and provides actual formulas and guidance about sample sizes, effect sizes, etc. What's more, Bucky discusses how he applied this approach to concrete studies (80k's replication quiz) and the outcome. I love the detail of the reply and it being backed up by concrete usage. I will mention that Bucky opens by saying that he uses subconscious thresholds in his assessments but is interested in discussing the levels other people use.
I do suspect that learning to apply the kinds of calculations Bucky points at is tricky and vulnerable to mistaken application. Probably a longer resource/more training is needed to be able to apply Bucky's approach successfully, but his answer at the least sets one on the right path.
Kristin Lindquist: Kristin's answer is really very solid but feels like it falls short of the leading responses in terms of depth and guidance and doesn't add too much, though I do appreciate the links that were included. It's a pretty good summary. Also one of the best formatted of all answers given. I would like to see waveman and Kristin reach agreement on the question of looking at funding sources.
jimrandomh: Jim's answer was short but added important answers to the conversation that no one else had stated. I think his suggestion of ensuring you ask yourself about how you ended up reading a particular study is excellent and crucial. I'm also intrigued by his response that controlling for confounds is much, much harder than people typically think. I'd very much like to see a longer essay demonstrating this.
Elizabeth: I feel like this answer solidly reminds me to think about core epistemological questions when reading a study, e.g., "how do they know this?"
Romeostevensit: this answer added a few more things to look for that were not included in other responses, e.g. giving more credit to authors who discuss what can't be concluded from their study. Also I like his mentioning that spurious effects can sneak in despite the honest intentions of moderately competent scientists. My experience with data analysis supports this. I'd like to see a discussion between Romeostevensit and jimrandomh since they both seem to have thoughts about confounds (and I further know they both have interest in nutrition research).
Charlie Steiner: Good additional detail in this one, e.g. the instruction to compare papers to other similar papers and general encouragement to get a sense of what methods are reasonable. This is a good answer, just not as good as the very top answers. Would like to see some concrete examples to learn from with this one. I appreciate the clarification that this response is for Condensed Matter Physics. I'd be curious to see how other researchers feel it generalizes to their domains.
whales: Good advice and they could be right that a lot of key knowledge is tacit (in the oral tradition) and not included in papers or textbooks. That seems like something well worth remembering. I'd be rather keen to see whales's course on layperson evaluation of science.
The Major: Response seems congruent with other answers but is much shorter and less detailed than them.
It would be good to know if offering prizes like this is helpful in producing counterfactually more and better responses. So, to all those who responded with the great answers, I have a question:
How did the offer of a prize influence your contribution? Did it make any difference? If so, how come?