There is no real question about whether most published research findings are false or not. We know that's the case due to replication attempts. Ioannidis's paper isn't really _about_ plugging in specific numbers, or showing a priori that this must be the case, so I think you're going at it from a slightly wrong angle.
From another of Ioannidis's own papers:
Of 49 highly cited original clinical research studies, 45 claimed that the intervention was effective. Of these, 7 (16%) were contradicted by subsequent studies, 7 others (16%) had found effects that were stronger than those of subsequent studies, 20 (44%) were replicated, and 11 (24%) remained largely unchallenged.
If 44% of those unchallenged studies in turn replicated, then total replication rates would be roughly 55%. Of course, Ioannidis himself gives a possible reason why some of these haven't been replicated: "Sometimes the evidence from the original study may seem so overwhelming that further similar studies are deemed unethical to perform." So perhaps we should think that more than 44% of the unchallenged studies would replicate.
If we count the 16% that found weaker but still statistically significant effects as replications rather than failures to replicate, and assume that same 60% replication rate for the 24% of unchallenged studies, then we might expect that a total of roughly 74% of biomedical papers in high-impact journals with over 1,000 citations have found a real effect. Is that legit? Well, it's his binary, not mine, and in _Why Most Published Research Findings Are False_ he's talking about the existence, not the strength, of relationships.
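To make that arithmetic explicit, here's a quick back-of-the-envelope sketch in Python; the 60% carry-over rate for the unchallenged studies is my assumption, as described above:

```python
# Percentages are of the 45 highly cited studies claiming an effective intervention.
replicated = 44        # replicated outright
weaker = 16            # replicated with weaker, but still significant, effects
unchallenged = 24      # never re-tested

broad_replication = replicated + weaker                          # 60%
# Assume the unchallenged studies would replicate at that same 60% rate:
total = broad_replication + unchallenged * broad_replication / 100
print(total)  # 74.4 -- the ~74% figure above
```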
Although this paper looked at highly-cited papers, Ioannidis also notes that "The current analysis found that matched studies that were not so highly cited had a greater proportion of “negative” findings and similar or smaller proportions of contradicted results as the highly cited ones." I.e. less-highly-cited findings have fewer problems with lack of replication. So that 74% is, if anything, most likely a lower bound on replication rates in the biomedical literature more broadly.
Ioannidis has refuted himself.
I don't think that paper allows any such estimate because it's based on published results, which are highly biased toward "significant" findings. It's why, for example, in psychology, meta-analyses have effect sizes 3x larger than those of registered replications. For an estimate of the replicability of a field you need something like the Many Labs project (~54% replication, median effect size 1/4 of the original study).
Just glancing at that Many Labs paper, it's looking specifically at psych studies replicable through a web browser. Who knows to what extent that generalizes to psych studies more broadly, or to biomedical research?
I don't think that paper allows any such estimate because it's based on published results, which are highly biased toward "significant" findings.
So it sounds like you're worried that a bunch of failed replication attempts got put in the file drawer, even after there was a published significant finding for the replication attempt to be pushing back against?
I think the OSC's reproducibility project is much more of what you're looking for, if you're worried that Many Labs is selecting only for a specific type of effect.
They focus on selecting studies quasi-randomly and use a variety of reproducibility measures (confidence interval, p-value, effect size magnitude + direction, subjective assessment). They find that around 30-50% of effects replicate, depending on the criteria used. They looked at 100 studies in total.
I don't know enough about the biomedical field, but a brief search on the web yields the following links, which might be useful?
Preparing for a career in biomedical research, I thought it prudent to thoroughly read the leading expositor of profound skepticism toward my intended domain. I'm an undergraduate student with only a very basic understanding of statistics and zero professional experience in scientific research. This is my sentence-by-sentence reaction/live-blog to reading Ioannidis's most famous paper.
I've put his original headings in bold, and quotes from his paper are indented. The one part I didn't respond to is the opening paragraph. I got the last word, so I'll give John the first word:
**Modeling the Framework for False Positive Findings**
Why do replication studies so often fail? One possibility is that scientists are good at predicting which ones are incorrect and targeting them for a repeat trial. Another is that there are a lot of false positives out there, which is the issue Ioannidis is grappling with.
However, note that he's not claiming that most of the research you're citing in your own work is most likely false. Science may have informal methods to separate the wheat from the chaff post-publication: journal prestige, referrals, mechanistic plausibility, durability, and perhaps others.
Instead, he's saying that if you could gather up every single research paper published last year, write each finding on a separate notecard, and draw them randomly out of a hat, most of the claims you drew would be false.
How widespread is this notion, and who are the most common offenders? Medical administrators? Clinicians? Researchers? To what extent is this a problem in the popular press, research articles, and textbooks? How is this notion represented in speech and the written word?
Where should students have their guard up against their own teachers and curriculums, and when can they let down their guard?
Hence, a study that finds p >= 0.05 is not considered a research finding by this definition. In this paper, Ioannidis is making claims only about the rate of false positives, not about the rate of false negatives.
In my undergraduate research class, students came up with their own research ideas. We were required to do a mini-study using the ion chromatographer. One of our group members was planning to become a dietician, and he'd heard that nitrites are linked to stomach cancer. We were curious about whether organic and non-organic apples contained different levels of nitrites. So we bought some of each, blended them up, extracted the juice, and ran several samples through the IC.
We found no statistical difference in the level of nitrites between the organic and inorganic apples, and a very low absolute level of nitrites. But there were high levels of sulfates. Our instructor considered this negative result a problem, a failure of the research. Rather than advising us to use the null result as a useful piece of information, she suggested that we figure out a reason why high levels of sulfates might be a problem, and use that as our research finding.
Well, drinking huge amounts of sulfates seems to cause some health effects, though nothing as exciting as stomach cancer. Overall, though, "The existing data do not identify a level of sulfate in drinking-water that is likely to cause adverse human health effects." I'm sure we managed to find some way to exaggerate the health risks of sulfates in apples in order to get through the assignment. But the experience left a bad taste in my mouth.
My first experience in a research class was not only being told that a null finding was a failure, but that my response to it should be to dredge the data and exaggerate its importance in order to produce something that looked superficially compelling. So that's one way this "misinterpretation" can look in real life.
"The prior probability of it being true before doing the study? How can anybody know that? Isn't that the reason we did the study in the first place?"
Don't worry, John's going to explain!
On this 2x2 table, we have four possibilities - two types of success, and two types of mistakes. The successes are when science discovers a real relationship, or disproves a fake one. The failures are when it mistakenly "discovers" a fake relationship, a false positive or type I error, or "disproves" a real one, a false negative or type II error. False positives, type I errors, are the type of problem Ioannidis is dealing with here.
Scientists have some educated guesses about what as-yet-untested hypotheses about relationships are reasonable enough to be worth a study. Is hair color linked to IQ? Is wake-up time linked to income? Is serotonin linked to depression? Sometimes they'll be right, other times they'll be wrong.
So let's pretend we knew how often scientists were right in their educated guess about the existence of a relationship. R isn't the ratio of findings where p >= 0.05 to findings where p < 0.05 - of null findings to non-null findings. Instead, it's a measure of how often scientists are actually correct when they do a study to test whether a relationship exists.
If scientists are making educated guesses to decide what to study, then R is a measure of just how educated their guesses really are. If three of their hypotheses are actually true for every two that are actually false, then R = 3 / 2 = 1.5. If they make 100 actually false conjectures for every one that is actually true, then R = 1 / 100 = .01.
Let's imagine a psychologist is studying the relationship of hair color and personality type. Ioannidis is saying that we're only considering one of two cases:
Using our examples above, if our scientists make 3 actually true conjectures for every 2 actually false conjectures, then R = 1.5 and they have a 1.5/(1.5 + 1) = 0.6 = 60% chance of any given conjecture that a relationship exists being true. Following the hair color example above, if our scientists have R = 1.5 for their hypotheses about links between hair color and personality, then every time they run a test there is a 60% chance (3 in 5) that the relationship they're trying to detect actually exists.
That doesn't mean there's a 60% chance that they'll find it. That doesn't mean it's a particularly strong relationship. And of course, they have no real way of knowing that their R value is 1.5, because it's a measure of how often their guesses are actually true, not of how often they replicate or how plausible they seem. R is not directly measurable.
These are just more definitions. The power is the complement of the Type II error rate. If an actually true relationship exists and our study has an 80% chance of detecting it, then it also has a 20% (.2) Type II error rate - the chance of failing to detect it. If no actual relationship exists, but our study has a 5% chance of detecting one anyway, our Type I error rate (α) is .05.
If we magically could know the values of R, the Type I and Type II error rates, and the total number of research findings in a given field, we could determine the exact number of research findings that were correctly demonstrated to be actually true or actually false, and the number that were mistakenly found to be true or false.
Of course, we don't know those numbers.
Let's say that we run 100 studies, and 40 of the claims achieve formal statistical significance. That doesn't mean all 40 of them are actually true - some might be Type I errors. So let's say that 10 of the 40 formally statistically significant findings are actually true. That means that PPV is 10/40, or 25%. It means that when we achieve a statistically significant finding, it only has a 25% chance of being actually true.
Of course, that's just a made up number for illustrative purposes. It could be 99%, or 1% - who knows?
So the "false positive report probability" is just 1 - PPV. In the above example with a PPV of 25%, the FPRP would be 75%.
So we have a formula to calculate the PPV based on R and the chances of Type I and Type II errors.
So for example, imagine that our hair color/personality psychologist has a rate of 3 actually true conjectures to 2 actually false conjectures whenever he puts a hypothesis to the test. His R is 3/2 = 1.5.
If he has a 10% chance of missing an actually true relationship, and a 5% chance of finding an actually false relationship, then β = 0.1 and α = 0.05. In this case, (1 - 0.1) * 1.5 = 1.35, which is greater than α, meaning that his research findings are more likely true than false.
On the other hand, imagine that our psychologist has a pretty poor idea of how hair color and personality link up, so every time he conjectures a relationship and puts it to the test, only 1 relationship is real for every 100 he tests. His R value is 1/100 = .01. Using the same values for β and α, our formula is (1 - .1) * .01 = .009, which is less than α, so most of his findings that achieve statistical significance will actually be false.
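To keep myself honest, here's a minimal sketch of the PPV formula from the paper with those two example researchers plugged in (the hair-color psychologist is my running example, not his; α and β are the same values as above):

```python
def ppv(R, alpha=0.05, beta=0.1):
    """Positive predictive value of a "significant" finding (no bias, one team).

    R     -- pre-study odds that the tested relationship is real
    alpha -- Type I error rate (chance of "finding" a relationship that isn't there)
    beta  -- Type II error rate (chance of missing a relationship that is there)
    """
    return (1 - beta) * R / (R - beta * R + alpha)

print(round(ppv(1.5), 2))   # ~0.96: well-aimed guesses, most significant findings are true
print(round(ppv(0.01), 2))  # ~0.15: long-shot guesses, most significant findings are false
```

The comparison in the text - is (1 - β)R bigger than α? - is exactly the condition for this PPV to come out above 50%.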
In general, the more educated his guesses are, and the more sensitive and specific his tests are for the effect he's examining, the more likely any relationship he finds will be real.
By a, I believe Ioannidis means α. He's just filling in the commonly-used threshold of "statistical significance."
So far, Ioannidis has just been offering an equation to model the rate of false positive findings, if we knew the values of R, α, and β and had access to every single experiment our psychologist ever did. Now he's pointing out that we don't have access to every experiment by our psychologist, but rather a biased sample - just the data he was able to get published.
And furthermore, there might be some other hair color/personality psychologists studying the same questions independently. If enough of them look for a relationship that doesn't exist, say between hair color and conscientiousness, then one of them will eventually find a link just due to random sampling error. For example, they'll happen to get a sample of particularly conscientious blondes due to random chance and publish the finding, even though other psychologists studying the same relationship didn't find a link. And the link is indeed not real - this team just happened to usher some particularly hard working blondes into their lab for no other reason than coincidence.
Now, although publication bias and repeated independent testing are real phenomena, Ioannidis so far has made no claims about how common they are. He's just identifying that they probably have some influence in inflating our false positive rate - or, in other words, decreasing our PPV, the proportion of statistically significant findings that are actually true.
Just like Ioannidis was able to give us equations to calculate our statistically significant + actually true, false positive, false negative, and statistically insignificant + actually false rates, he's going to give us equations that can take into account the level of bias and the effect of independent repeat testing on the PPV.
**Bias**
What are some examples?
Maybe a psychologist eyeballs whether a test subject has red or blond hair, and is determining whether they're conscientious by asking them to wash some dishes in the psych lab kitchen and deciding whether the dishes were clean enough. If he thinks that blondes are conscientious, he might inspect the plates before deciding whether the subject who washed them was a strawberry blonde or a redhead.
Alternatively, he might dredge the data, or choose a statistical analysis that is more likely to give a statistically significant result in a borderline case.
So if u = 0.1, then 10% of the studies that should have produced null results have been twisted and distorted into statistically significant findings. Again, we have no idea what u is (so far at least); it's just a formalization of this concept.
If our psychology researcher lets in 100 subjects to his study, and happens by coincidence to get some particularly conscientious blondes, that's not bias. We just call it chance.
This could mean that our researcher has 5 different options for which statistical test to use on his data. Some are more strict, so that it looks impressive if the data is still significant. Others are more lax, but can be reasonable choices under some circumstances. Our researcher might try all 5, and pick the most strict-looking test that still produces a significant result. That's a form of bias.
Another is if he doesn't have any better research ideas than the hair color/personality link. So if his first study produces a null result, he locks the data in his file drawer and runs the study again. He repeats this until he happens to get some particularly hard-working blondes, and then publishes just that data as his finding. Of course, if he does this, he's not finding a real relationship. It's like somebody who films himself flipping a quarter thousands of times until he gets 6 heads in a row, and then posts the clip on Youtube and claims he's mastered the art of flipping a coin and getting heads every time.
Ioannidis is assuming that it doesn't matter whether or not there really is a link between hair color and personality - our researcher will still behave in this biased manner either way. This is an a priori, intuitive assumption that he is making about the behavior of researchers. Why is it OK for him to make this assumption?
Remember, u is the proportion of studies that shouldn't find a relationship, but do anyway, specifically due to bias.
Imagine the hair color/dish washing study used an automated hair-color-o-meter and dish-inspector, which would automatically judge both the hair color of the subjects and the cleanliness of the dishes. Furthermore, the scientist pre-registers the study and analysis plan in advance. Everything is completely roboticized - he's not even physically present at the lab to influence how things proceed, and even the contents of the paper are pre-written. The data flows straight from the computers that measure it to another program that applies the pre-registered statistical analysis formula, spits out the result, and then auto-generates the text of the paper. All bias has been eliminated from the study, meaning that u has dropped to 0.
Note that there is still a chance of a false positive finding, even though u = 0.
Now, Ioannidis is saying that it's reasonable to assume that the psychologist's decision of whether or not to roboticize his studies and eliminate bias has nothing to do with whether or not there really is a link between hair color and personality.
Does that seem reasonable to you? It's an empirical claim, and one offered with no supporting evidence. You're allowed to have your own opinion.
Here's an argument against Ioannidis's assumption:
Maybe researchers end up in labs that are specialized to study a certain relationship. If that relationship actually exists, then they develop a culture of integrity, because they have success in generating significant findings using honest research practices.
On the other hand, if there is no real relationship, they're all too invested, emotionally and materially, in the topic to admit that the relationship is false. So they develop a culture of corruption. They fudge little things and big things, justifying it to themselves and others, until they're able to pretty reliably crank out statistically significant, but false findings. They keep indoctrinating new grad students and filter out the people who can't stomach the bad behavior.
If this "corruption model" is true, then there is a relationship between the existence of a true relationship and u, the proportion of studies that lead to publications specifically due to bias.
Let's bear in mind that some of Ioannidis's further claims might hinge on this assumption he's making without any evidence.
If the extent of bias has no relationship with the existence of a real relationship, then this equation lets us model the chance that a published finding is true given all the other mystery variables we've discussed earlier. As u - the extent of bias - increases, the chance that any given published finding is actually true decreases.
Any correlation between the extent of bias and existence of a real relationship will introduce error into this equation.
If there is no strong link between bias and actual truth or falsehood.
We can plug in fake values for all these different variables and see what the value we really care about, the PPV, will be.
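Here's the same kind of sketch with the bias term u included, using the formula from the paper (the specific R, β, and u values below are made up by me, just to show the shape of the effect):

```python
def ppv_with_bias(R, u, alpha=0.05, beta=0.2):
    """PPV of a "significant" finding when a fraction u of would-be null
    analyses get reported as significant anyway, purely due to bias."""
    num = (1 - beta) * R + u * beta * R
    den = R + alpha - beta * R + u - u * alpha + u * beta * R
    return num / den

for u in (0.0, 0.1, 0.3, 0.5):
    print(u, round(ppv_with_bias(R=1.0, u=u), 2))
# 0.0 -> 0.94, 0.1 -> 0.85, 0.3 -> 0.72, 0.5 -> 0.63: even with 1:1 pre-study
# odds and 80% power, the more bias, the lower the chance a "finding" is real.
```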
So maybe our hair psychologist has no good test for conscientiousness or hair color, and therefore isn't able to find an effect that is actually there. Maybe he does have the data to detect a real effect, but dies of a heart attack and it never gets published. Maybe the Hair Equity Association threatens to end the career of any scientist publishing findings that hair color and personality are linked.
Guess we'll have to make some more assumptions, then!
Why? Guess we'll just have to take Prof. Ioannidis's word for it. Man, honestly, at least with the last assumption about the lack of a link between bias and the existence of a real effect, he offered a reason why.
Let's just note that this is an assumption about the frequency of measurement error and inefficient data use, which is in turn based on an assumption about the link between technological advances and measurement error, and an assumption about the change in researchers' sophistication...
It's assumptions all the way down, I guess.
Can we question them?
Maybe as technology proceeds, we're able to try and detect subtler and more complex effects using our new tools. We push the boundaries of our methods. We fail to fully exploit the gigantic amounts of data that are available to us. And I just have no prior expectation that people are any more "sophisticated" now than they used to be. What does that even mean?
So we can make up some numbers and visualize the combinations in a graph.
Let's say that blondes really do clean the dishes better than people with other hair colors. But on the day of our study, the subjects all just so happen to work equally hard at washing the dishes, so there's a null finding. Ioannidis is reminding us that this isn't bias - just the effect of chance.
**Testing by Several Independent Teams**
So there might be a hair color/personality lab in Beijing and another in New York City, both running their own versions of the dish-washing study.
Another uncited empirical claim, but you know best, John! And honestly, this does seem plausible to me.
So Ioannidis is saying that some scientists have a habit of focusing on just the output of the Beijing lab, or just the New York City lab, but not looking at the output of both labs as they probe the link between hair color and personality. So if the Beijing lab finds a link, but the New York City lab doesn't, then the people following the Beijing lab will have an inflated opinion of the overall, global evidence of the link (and the people following the New York City data will have the opposite problem).
Try giving a TED talk on hair color and personality based around a lack of a relationship. Hard to do. So if you're trying to give that talk, you're going to exclusively talk about the Beijing findings, and completely ignore the contradictory data out of NYC. I'm sure you can think of other examples where this sort of thing goes on. Ioannidis is trying to tell us that we have a habit of ignoring, or just failing to seek out, contradictory or inconvenient data. We can make stories more compelling by ignoring context, or by gathering supporting evidence into a giant mass, then using it to dismiss each contradictory finding as it pops up.
Imagine we knew the values of all our "mystery variables" (R, β, and α, not considering bias), but hadn't run any studies yet. Ioannidis has another equation to tell us the chance that at least one study would turn up a significant finding if we did run a given number of studies.
1 − β is the power of the study - the chance of not getting a false negative (Type II error). So let's say we have α = .05, meaning that we require a 5% or lower chance of a false positive to consider a study "statistically significant." In that case, unless our studies are underpowered, more independent studies on the same question will tend to decrease the PPV.
You'd typically think that more high-powered studies would be a good thing. Why run one study on the link between hair color and conscientiousness when you could run ten such studies?
Well, let's say there really isn't a link. That means that a null result is getting at the truth.
You run one study, and find no result. Now you run another one, and again, no result. So far, you have a perfect track record. If you run another 100 studies, though, you might find a relationship - even though none exists - which will make your track record worse. Doing more testing actually decreased your accuracy.
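A minimal sketch of that dynamic, using the n-team version of the PPV formula from the paper (no bias term; the pre-study odds and power below are my own picks):

```python
def ppv_n_teams(R, n, alpha=0.05, beta=0.2):
    """PPV of an isolated "significant" finding when n independent teams
    are all testing the same relationship."""
    num = R * (1 - beta ** n)
    den = R + 1 - (1 - alpha) ** n - R * beta ** n
    return num / den

for n in (1, 2, 5, 10):
    print(n, round(ppv_n_teams(R=0.25, n=n), 2))
# 1 -> 0.8, 2 -> 0.71, 5 -> 0.52, 10 -> 0.38: with 1:4 pre-study odds and 80%
# power, every extra team chips away at the credibility of any single positive.
```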
Another "plug in the mystery numbers and see what the graph looks like" figure.
**Corollaries**
Ioannidis is able to give us some rules of thumb based on the mathematical models he's presented so far.
Note that while some of his empirical assumptions are separate from his mathematical models, one of his assumptions - that bias and the existence of actual relationships are not linked - is baked into his mathematical model. Insofar as there actually is such a link, and insofar as his corollaries depend on assuming there isn't, his conclusions here will be suspect.
**Box 1. An Example: Science at Low Pre-Study Odds**
This means that we're looking for any possible genetic links with schizophrenia.
So in this field, we actually do have some information on some of our "mystery numbers." If the odds ratio is 1, that means that there's no relationship between a gene polymorphism and schizophrenia. If it's greater or less than 1, there is a relationship. So an odds ratio of 1.3 means that the odds of carrying a particular gene polymorphism are 1.3 times higher among people with schizophrenia than among people without it - roughly a 30% bump in the odds, which is not a dramatic difference.
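A quick illustration of what an odds ratio of 1.3 actually buys you, with a carrier prevalence I'm inventing purely for the example:

```python
# Suppose 10% of people WITHOUT schizophrenia carry a given polymorphism.
baseline_prob = 0.10
baseline_odds = baseline_prob / (1 - baseline_prob)   # 1 to 9, ~0.111

case_odds = 1.3 * baseline_odds                        # apply the odds ratio
case_prob = case_odds / (1 + case_odds)
print(round(case_prob, 3))  # ~0.126 -- about 12.6% of patients carry it
```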
So if we tried to pick one of our candidate gene polymorphisms at random, there would be a 0.01% chance that it's linked to schizophrenia.
We're imagining that 40% of the time, we'll fail to find a true effect because our methods aren't powerful enough.
Every single gene we look at comes with a chance of turning up a false positive. That's nearly 100,000 chances to generate a false positive finding. And so even if our test is pretty specific, it still might turn up quite a few false positives.
By contrast, only 10 of the genes have a chance to generate a true positive finding. Since our study isn't that powerful, we might miss a fair number of those true positives.
Taken together, using these numbers, we're almost certainly going to get some positive findings - and almost all of them will be false positives, even though any given positive finding has much better-than-chance (but still tiny!) odds of being real.
Let's say I set my alarm for a completely random time and hid it. You are trying to guess when it will go off. There are 86,400 seconds in a day, so by guessing at random, you've got a 1/86,400 chance of getting it right.
Now let's say that you can see my fingers moving on the dials of the alarm clock as I set it, but can't actually see what I press. You have to use my finger motions to guess what buttons I might have pressed. Even if this information can rule out enough possibilities to improve your guess by 10x, you still only have a 1/8,640 chance of guessing correctly.
Similarly, our hypothetical gene association test has improved our chances of guessing the real genes linked with schizophrenia, but the odds that our candidates are in fact correct is still very low.
This is where the bias sets in. They'd already identified a bunch of fake associations (and maybe a few real ones mixed in). Now they're adding a bunch more fake associations, further diluting our chances of figuring out which links are real.
The point is that there are a lot of plausible-sounding analysis choices that are in fact nothing more than distortions used to invent a finding. For example, they've got a hard drive full of patient diagnoses. They can choose a threshold for whether or not a patient counts as "schizophrenic," and select the threshold that gives them the most associations while still sounding like a reasonable definition of schizophrenia.
If true, this suggests that through ignorance or incentivized self-justification, there are enough researchers willing to do these kinds of shenanigans to make a market for it. And probably there are convincing-sounding salesmen able to reassure the researchers that what they're doing is fine, normal, and even mission-critical to doing good science. "You wouldn't want to miss a link between genes and schizophrenia, would you? People might die because of your negligence if you don't use our software!!!"
So if 10% of the non-relationships get reported as significant due to bias and there's no link between bias and existence of a relationship, then only .044% of the supposed "links" are actually real.
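To check my reading of Box 1, here are its numbers run through the same two formulas sketched above (R = 10/100,000, 60% power, α = 0.05 as the conventional threshold, bias u = 0.1):

```python
R, alpha, beta, u = 10 / 100_000, 0.05, 0.40, 0.10

# No bias:
ppv_clean = (1 - beta) * R / (R - beta * R + alpha)
# With bias u:
ppv_biased = ((1 - beta) * R + u * beta * R) / (
    R + alpha - beta * R + u - u * alpha + u * beta * R
)

print(round(ppv_clean, 4))    # ~0.0012: 0.12% of "hits" are real with zero bias
print(round(ppv_biased, 5))   # ~0.00044: the 0.044% figure quoted above
```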
And the more times we re-run this same experiment, the less able we'll be to pick out the true links from amongst the false ones. Intuitively, this seems strange. Couldn't we just look at which genes have the most overlap between the ten studies - in other words, do a meta-analysis?
I believe the whole issue here is that Ioannidis is presuming that we're not doing a meta-analysis or in any other way comparing the results between these studies.
That makes sense. Fewer subjects means that random chance can have a bigger effect, tending to create significance where there is none.
This is just based on the equations, not any empirical assumptions.
This doesn't necessarily mean that cardiology is better science than molecular predictors, because there are other factors involved. It just means that molecular predictors could improve the reliability of their findings by increasing the sample size.
If there are links between hair color and various aspects of personality, but they're very small, then any given "discovery" (say a relationship between hair color and Machiavellianism) has a greater chance of being random noise in the data rather than a real effect.
So the power of a study isn't equivalent to, say, the magnification strength of a microscope. It's more like the ability of the microscope to see the thing you're trying to look at. That's a function not only of magnification strength, but also the size of the thing under observation. Are you looking for an insect, a tardigrade, a eukaryotic cell, a bacteria, or a virus?
It's easier to detect obvious effects than subtle effects. That's important to bear in mind, since our bodies are very complex machines, and it's often very hard to see how all the little components add up to a big effect that we care about from a practical standpoint, such as the chance of getting a disease.
I.e., it's increasingly looking for subtle, difficult-to-detect effects.
Ceteris paribus, John, ceteris paribus. They could gather more data, use more reliable methods, or get better at predicting in advance which conjectures are true in order to compensate.
ALL ELSE BEING EQUAL. Is it reasonable to assume that researchers step up their measurement game in proportion to the subtlety of the effects they're looking for? Or should we assume that researchers trying to study genetic risk factors are using the same, say, sample sizes, as are used to determine whether there's a link between seat belt usage and car crash mortality?
What I really wish Ioannidis had done here is shown how that 1.05 number interacts with his equations to make it intractable to produce true findings. This paragraph is the first time he used the term "effect size" in the whole paper, so it's not easy to know if 1.05 is supposed to be R, or some other number.
It's conceivable to me, as a non-statistician, that small effect sizes could make it exponentially more difficult to find a real effect.
But couldn't these fields increase their sample sizes and measurement techniques to compensate for the subtlety of the effects they're looking for? Couldn't the genome-wide association study be repeated on just the relationships discovered the first time around, or a meta-analysis be performed, in order to separate the wheat from the chaff? I understand there's a file-drawer problem involved here, but Ioannidis has already semi-written-off two fields as "utopian endeavors" before he's even addressed this obvious rebuttal.
So as a non-expert, I have to basically decide whether I think he's leaving out these details because they're actually not very important, or whether he himself is biasing his own analysis to make it seem like a bigger deal than it really is.
So "shotgun research" is going to get more false positives than "rifle research."
Ceteris paribus.
That makes perfect sense, and addresses my objection above: if our first study reduces 100,000 candidate genes to 10,000, and our second reduces that to 100, our third study might reliably identify just 2 genes which we can feel 90% sure are candidate genes linked with schizophrenia. We can also get there via meta-analyses or just getting bigger sample sizes.
One constructive way of looking at this corollary is that we can see research as a process of refinement, like extracting valuable minerals from ore. We start with enormous numbers of possible relationships and hack off a lot of the rock, even though we also lose some of the gold. We do repeat testing, meta-analysis, and speculate about mechanisms, culling the false positives, until finally at the end we're left with a few high-carat relationships.
To thoughtfully interpret a study, we need to have a sense for what function it serves in the pipeline. If it's a genome-wide association study, we shouldn't presume that we can pick out the real genetic links from all those candidates. And of course, if there's a lot of bias in our research, then even a lot of refinement might not be enough to get any gold out of the rocks.
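One way I find it helpful to make that refinement picture concrete (this Bayesian-odds framing is mine, not Ioannidis's): each honest positive result multiplies the odds that a candidate relationship is real by roughly (1 - β)/α, so a chain of targeted, well-powered confirmations can haul even GWAS-scale long shots up to respectability. A rough sketch, with stage parameters of my own choosing:

```python
def prob_after_confirmations(prior_odds, stages):
    """Probability a relationship is real after a series of positive results.
    Each stage is an (alpha, beta) pair; a positive result multiplies the odds
    by its likelihood ratio, (1 - beta) / alpha."""
    odds = prior_odds
    for alpha, beta in stages:
        odds *= (1 - beta) / alpha
    return odds / (1 + odds)

# A long-shot GWAS hit (10 real genes in 100,000 candidates), followed by two
# targeted, well-powered replications at a stricter significance threshold:
stages = [(0.05, 0.40), (0.01, 0.10), (0.01, 0.10)]
print(round(prob_after_confirmations(10 / 100_000, stages), 2))  # ~0.91
```

The exact numbers obviously depend on the α and power of each follow-up, but the qualitative point stands: it's the pipeline, not any single study, that buys confidence.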
This is a great reason to do pre-registration and automate data collection. Rather than thinking about whether we buy the researchers' definitions and designs, we just demand that they decide what they're going to do in advance. Tie yourself to the mast, Odysseus!
Of course, the degree to which flexibility creates genuine problems will depend on the field, the researchers in question, etc. But pre-registration seems relatively easy to do and like it would be really beneficial. So why wouldn't you Just Do It?
That's a hopeful sign. But I know there's a Garbage In, Garbage Out problem as well. If the studies they're based on aren't pre-registered, won't we still have problems?
Or else it'll give undeserved credibility to meta-analyses that are covering up a serious file drawer problem. Seems to me like this is a bottom-up problem more than a top-down problem, unfortunately.
It's going to be a balance sometimes, right? We might face a choice between an outcome measure that's unequivocal and an outcome that precisely targets what we're trying to study. Ideally, we just want to do lots of studies, looking at lots of related outcomes, and not rely too much on any one study or measure. This is old accepted wisdom.
So just as, within a field, there might be a pipeline by which the ore of ideas gets refined into the gold of real effects, there might be whole fields that are still in early days, just starting to figure out how to even begin trying to measure the phenomenon of interest. Being savvy about that aspect of the field would also be important to interpreting the science taking place within it.
Of course, once again, the absolute age of the field will only be one aspect of how we guess the reliability of its methods. Artificial intelligence research may still be in relatively early days, but since it takes place on computers, it might be easier to gather lots of data or standardize the tests with precision.
So even if journals published all null findings, researchers might not submit all their null findings to the journal in the first place. They might also manipulate their data to obtain statistical significance, just because that's a splashier outcome. It sounds like pre-registration would help with this problem. Both researchers and journals need to police their own behavior to correct this problem.
I don't see how this corollary follows from Ioannidis's equations. It seems to flow from his empirical assumptions about how the world works. It might be reasonable, but it's an empirical question, unlike corollaries 1-4.
Plausible, but the corollary as stated is about the existence of "financial and other interests," not "conflicts of interest." For example, I could run a nonprofit lab studying whether red meat is associated with heart disease. There's tremendous money in the meat industry and in the pharmaceutical industry. But would you say that my nonprofit lab is running a greater risk of conflict of interest than a for-profit company doing a trial on its own spina bifida drug, just because there's more money in the red meat and heart-disease-treatment industry than spina bifida treatments?
I also think we need to be careful with how we think about "prejudice." If Ioannidis means that the researchers themselves are prejudiced about what their findings will be, it does seem plausible that they'll find ways to distort their study to get the results they want. On the other hand, if a certain field is politically controversial, we can imagine many possibilities. Maybe there are two positions, and both are equally prejudice-laden. Maybe one of them is prejudice-laden, and the other is supported by the facts. Maybe the boundaries between one field and another are difficult to determine.
Without the kinds of clear-cut, pre-registered, well-vetted methods to determine what conflicts of interest and prejudice exist in a given field, and where the boundaries of that field lie, how are we to actually use these corollaries to evaluate the quality of evidence a field is producing? In the end, we're right back to where we started: if you think a finding is bullshit because the researchers are prejudiced or motivated by financial interests, the onus is on you to come up with disconfirming evidence.
That does strike me as a real effect, but I'm not sure about the magnitude of this effect size. So for any given claim of biased research, Ioannidis would counsel me to bear in mind that the smaller the effect of conflicts of interest, the fewer actually true claims of biased false positives we will find.
I'm sure there is a way to study this. For example, imagine there's an overwhelmingly disconfirming study that gets published against a certain relationship. How does that study affect the rate of new studies carried out on that relationship?
Many in relative or in absolute terms? May be conducted? This feels like a weasel-worded sentence to give us the impression of cynicism where none may be warranted. Again, I'm not a scientist (yet), so you can draw your own conclusions.
Just as you can list the many ways in which the flow of knowledge can be blocked, you can also think of the many ways knowledge can flow around barriers. Imagine a field is stymied by a few prestigious investigators protecting their pet theory by blocking competition through the peer review process. To what extent do you think it's plausible that their dominion would block these contradictory findings, and for how long?
Can the scientists who discovered the contradictory findings get them published in a less prestigious journal? Can they publish some other way? Can they have behind-the-scenes discussions at conferences? Can they conspire to produce overwhelming evidence to the contrary? Can they mobilize to push out these prestigious propaganda-pushers?
Why, in this story, are the prestigious investigators so formidable, and the researchers they're repressing so milquetoast, so weak? Why should we have any reason to think that? Here I am, writing a sentence-by-sentence breakdown of a famous paper with over 8,000 citations by a legendary statistician. Nothing's stopping me from posting it on the internet. If people think it's worth reading, they will.
In light of that, how should we interpret the several uncited empirical opinions you've offered earlier in this paper, Prof. Ioannidis?
Also, what precisely does this have to do with prejudice and conflicts of interest? I thought this was about prejudice and conflicts of interest leading to distorted studies, not opinions getting proffered in the absence of a study?
This is the one corollary that doesn't even begin to make immediate, intuitive sense to me.
So if we assume that fields A and B are identical in every respect except the number of teams involved, then the one with fewer teams - i.e. better coordination, more cohesiveness in how data is collected and interpreted - will find a greater proportion of true findings. That makes sense. But I don't think you can use the sheer number of teams working in a field as a referendum on how likely its findings are to be true, relative to other fields. You can only say that adding more teams to the same field will tend to lead to worse coordination, more repeat studies with less meta-analysis.
That makes perfect sense, but a way to phrase that constructively would be to say that we should try to improve coordination between teams working on the same problem. This way makes it sound like we should be automatically suspicious of hot fields, and I don't see a reason for that a priori. Maybe hot fields just have a lot to chew on. Maybe they attract better researchers and enough funding to compensate for the coordination difficulties. Who knows?
Of course, it's reasonable to assume that sometimes Ioannidis is exactly right. But this statement here is a form of cherry-picking. Just because poor research coordination and bias and all sorts of other problematic practices can and sometimes even do lead to swings between excitement and utter disappointment, doesn't mean that it happens all the time. It doesn't mean that the sheer number of teams working in a scientific field is very well correlated with the chance of such an event.
On the other hand, just like in business, scientists decide what field to enter, differentiate themselves from the competition, and specialize or coordinate to focus on different aspects of the same issue, in order to avoid exactly this problem.
Here's what I don't get. We're imagining that there are a lot of false hypotheses, and only a few true relationships. Any given false positive is drawn from the much larger pool of false hypotheses. So why would these teams be feeling a sense of urgency that their competition will beat them to the punch, if they're all racing to publish false positives?
We'd only expect urgency if two competing teams are racing to publish a real relationship. There's plenty of bullshit to go around! From that point of view, the more of a race-to-publish dynamic we see, the more we should expect that the finding is in fact true.
On the other hand, if we see lots and lots of impressive positive results, all different, and we're looking at a new field, with hazy measurements, then we should be getting suspicious. That sounds a lot like social psychology.
A metaphor here might be the importance of aseptic technique in tissue culturing. We want to use lots of safeguards to prevent microbial contamination. If one fails, the others can often safeguard the culture. But if all of our safeguards fail at once, then we should be really worried about the risk of contamination spreading throughout the lab.
So why don't researchers publish their 19 file-drawer null-result studies after the first false positive association on the same question makes it to press? Does that mean that prejudice is even worse than we fear? Are scientists afraid to criticize each other to the point that they'll decide to forgo publication? Even if the data's old, if you're looking at somebody who found a positive association when you know you've got contradictory data locked up in your file drawer, why wouldn't you say "Great! I'll just do a repeat of that study - I already have some practice at it, after all."
It may, it may, it may. That word, "may," turns up a lot in this paper.
Originally, the idea was that journals wouldn't publish null results. Now it's that they will, but scientists didn't find it "attractive" to publish until the false-positive became a prestigious target. So what's the implication? That scientists are doing studies on some obscure phenomenon, getting a null result, then locking it in a file drawer until some other team's false positive makes it into Nature, and only then publishing their contradictory null result in order to contradict "an article that got published in Nature?"
Now that's some 4-dimensional chess. Either scientists are the most Machiavellian crew around, and really picked the wrong profession, or else maybe Prof. Ioannidis is choosing his words to make a worst-case scenario sound like a common phenomenon.
Has been coined by whom? Let's just check citation 29:
"The term Proteus phenomenon has been coined" sure sounds a lot more sciency than "I made up the term 'Proteus phenomenon'."
I'm going to take Ioannidis's empirical claim here as scientific truth. I'm extremely glad he's making effort to do empirical research on the phenomena he's worried about, and I'm dead certain that he's not just pulling all this out of his butt.
But again, if researchers are racing each other to publish, then you'd expect that they've converged on a single truth, not a single falsehood. I guess I'd need to read his empirical paper to decide if it made a compelling case that the "Proteus Phenomenon" in molecular genetics is due to having too many teams working on the same problem, or due to some other cause. Maybe they're just misunderstanding the false positive rate of the techniques they're using, getting their hopes up, and then having it all come crashing down.
This is what I've been complaining about throughout this section. Can't improvements in one area compensate for shortcomings in another?
So for Ioannidis, using a more powerful study to look for a smaller effect is... bad? So what, we should just not look for small effects?
It may! It may! It may!
Or they may create a barrier that stifles efforts... or restricts efforts... or hinders efforts... or slows efforts... or interferes with efforts... or complicates efforts... or really doesn't do very much to efforts at all... or makes efforts look all the more necessary... or makes scientists work all the harder out of spite... or creates a political faction specifically to oppose their efforts in an equal and opposite reaction... or leads to lasting suspicion of industry-funded studies... or leads to lasting suspicion of studies that support industry claims even in the absence of conflicts of interest...
Didn't you just say that larger studies can make the predictive value worse?
"These corollaries consider each factor separately, but these factors often influence each other. For example, investigators working in fields where true effect sizes are perceived to be small may be more likely to perform large studies than investigators working in fields where true effect sizes are perceived to be large."
Why yes, in context you did.
I'm quite confused now.
Anyway, I'm glad to hear Prof. Ioannidis say that hot fields with strong invested interests could be good or bad for the field. Of course, the rhetorical structure of this paper has tended to dwell on the bad. I'm scratching my head, thinking that there's almost a parallel issue... like... I don't know, maybe a scientist who crams a bunch of contradictory findings into a file drawer until we're sold enough on a false positive that publishing the null result makes for a splashy article?
**Most Research Findings Are False for Most Research Designs and for Most Fields**
As shown? Where did you show this? As stated, PPV depends on R, u, and the chance of Type I and Type II errors, and it's rare that we'll have strong evidence about what their true values are. This claim here is uncited, and I truly don't see how it follows from the prior argument.
And furthermore, "true findings" imagines that every claim of statistical significance gets read off one grand, unified list compiled from all the papers in the field. If that very stupid approach were actually how scientists interpreted papers, it would be a huge problem. And I'm sure it is sometimes, and even more often how motivated interests outside of science will behave, waving a paper around and claiming it supports their hokum about some naturopathic treatment or whatever.
But just the fact that a field of science produces claims that are statistically significant doesn't mean that its practitioners are such blathering idiots that they think that every statistically significant finding is God's own truth. Does Ioannidis think he's the only one who's realized that a GWAS is going to turn up a lot of false positives?
How often? How much effort? How do you define "field?" Are we talking about gigantic expenditures of scientific effort with nothing to show for it? Or are we talking about a field that amounted to a few scientists, for a few years, throwing up results, getting some attention, getting refuted, and eventually shutting down, amounting in total to 0.0001% of total scientific effort? I don't know, and I can't know, because this is an uncited empirical claim, without even an example as a reference point. Sure I can think up examples on my own, but I really have no idea whether we're talking about a perpetual disaster or a few dramatic blow-ups here and there.
So Ioannidis is imagining a group of scientists studying the relationship between a person's hair color and the chance of their getting heads on a coin flip. Any relationship discovered by these scientists would be purely due to bias.
Or, as a more obviously null example, the conjecture that people's hair is related to their chance of getting heads on a fair coin flip.
Or that the scientific literature has examined 60 aspects of hair color - shade, texture, length, and so on - and found all of them to be related to the chance of getting heads to a small but meaningful extent.
Well, hair has nothing to do with the chance of a coin coming up heads, so all these claims are just measuring the researchers' tendency to manipulate their data, data dredge, hide null results, and so on.
Earlier in this paper, Ioannidis said that stronger effect sizes improve the PPV. Now he's saying that they might also just indicate the bias in the field itself.
But another of his premises was that bias and the existence of an actual relationship were most likely unrelated.
You can't have it both ways, John. Should we be worried because we see a large effect size (indicating bias)? Or should we be worried because we see a small effect size (indicating that findings are more likely to be false positives)?
So if a field is just getting its feet on the ground figuring out what's related to what, then we should be extra skeptical of any individual finding. Fair enough. That doesn't mean the field will stay in that immature state forever, or that we should be permanently skeptical of its conclusions. It just means we need to have an appreciation that scientific maturity of a field takes time. It is equally an argument for being more patient with a field in the early days, while it figures out its methods and mechanisms.
For a field in early days, too much bias can potentially kill its ability to figure out the real relationships. We can imagine a field struggling to find a reliable, significant result. The grant money starts to dry up. So the researchers who are most invested in it find ways to manipulate their studies so that a couple of relationships suddenly start getting confirmed, again and again. And now they're deeply invested in protecting those spurious "relationships." And if other researchers get deeply attached to the subject area, intrigued by the strength of these findings, then the cycle can continue.
So we have a quandary. How can we distinguish a field in which a few reliable, true findings have been refined out of the conjectural ore, from a field in which a Franken-finding is running around murdering the truth?
Yes, if we totally accept it. But should we?
Under this view, large and highly significant effects are still cause for excitement. It's like the GWAS example earlier. A large, highly significant effect might still be due to nothing but bias. But imagine a study comes out with an effect size of 1.05. Now imagine that the study actually had an effect size of 1.50. Now imagine that the effect size was 2.0. Did the actual truth of the effect seem more likely or less likely as the effect size increased?
It probably depends on context. If a psychologist comes out tomorrow saying that blondes are extraordinarily more neurotic than other hair colors, I think I would have noticed, and would guess that they screwed up their study somehow. On the other hand, if a GWAS finds that a particular genetic polymorphism has an extraordinarily strong link to prostate cancer, I have no particular reason to think that bias is responsible. It's the way that the strength of the effect size fits in to my prior knowledge of the problem under study that informs my interpretation, not the sheer size of the effect alone. And I really can't see why a super-low p-value would make me think that bias is more likely to be a culprit.
Ioannidis is not giving any reason why we'd think that modern research is particularly likely to be susceptible to bias.
I admit that he's the PhD statistician, not I, but if I ignore his credentials and just look at the strength of his argument, I'm just not seeing it. He needs to try harder to convince me.
Which they should be doing anyway, after they sleep off their drunk from all the champagne-drinking.
Which would be a logical conclusion, if there is a low prior probability of any given decades-old field being a "null field."
Once again, Ioannidis presents all these scary barriers to the truth, and once he's got us sold on them, he presents us with the solution we already knew existed as if it were his own idea. Truth flows around the barriers.
Yes, please, I would love to see some actual empirical data on this matter. Of course, we have to remember that the people conducting this research into bias may themselves be equally, if not more, susceptible to bias. After all, who will watch the watchers?
And since we've already established in corollary 4 that greater flexibility in the study design can lead to greater rates of false positives, I'm curious to know how bias researchers will find sufficiently rigid definitions and measures of the field under study and the way bias is measured. With that caveat, I sincerely say more power to anybody who does this research!
**How Can We Improve the Situation?**
Well, I don't think you've come anywhere close to proving that most research findings are false in this paper, so no, I wouldn't say it's unavoidable. In fact, having heard about this paper as some critically important lens for interpreting biomedical research for years now, I'm absolutely shocked at the weakness of the empirical evidence underlying its actual content.
But whatever the PPV is, there are clearly some steps we can take to improve it: pre-registration, making it easier for scientists to publish null results, better measures and greater sophistication, etc.
Except that we can definitely be sure that most published research findings are false, of course.
So one of those large studies you said earlier could also lead to more false positives? Oh, and a "low-bias" meta-analysis might reduce bias? Great idea, keep 'em coming!
And what if they can't be avoided? Then maybe we just need to do more studies... Except wait, that was bad too...
There are something like 2.5 million scientific papers published every year. Trillions of research questions? A trillion is a million million. Is the average research paper posing 400,000 research questions? Or is he just tossing around these numbers as a rhetorical device?
This is a little hard to take from a guy advocating such fastidiousness with the numbers.
So we want to be strategic with our resources. Rather than throwing a lot of money and brainpower at questions with a low a priori chance of being correct, we should focus our energies on reliably answering questions, where possible. Or alternatively, figuring out efficient ways to test many questions with a low a priori chance of being correct... like with those high-throughput studies he was talking about earlier...
So we should only do a 10,000-subject test on the association of hair color and personality when it stands a chance of definitively supporting or rejecting the existence of a broad link between hair color and personality in general, but not when it only stands to support or disprove some narrow link between a particular hair color and a particular personality trait. That seems like a good strategy. But then again, we also need big studies to detect or disprove small effects.
Actually, maybe we just need more funding for science so that we can adequately test all of these claims - both making room for newer fields to get established and for more mature fields to definitively answer the questions they've been wrestling with for years and decades.
We should also encourage young scientists to learn the techniques that make these more powerful tests possible. Are we sufficiently exploiting the massive data being cranked out by the U.K. Biobank? How many biomedical grad students are we failing to teach big-data research, leaving them to waste their efforts producing small data sets in separate labs because their mentors only know how to do wet lab work?
Or more generally, if you have no plausible reason to think a big expensive study will find a real result, then you're just recruiting scientists to write a script for the most boring advertisement in the world.
And to be quite clear, this means that a deeply unimportant, minuscule effect, so tiny we shouldn't be worrying about it at all, is more likely to be found by an extremely large study. It doesn't mean that the results of extremely large studies are more likely to be trivial, or that we should trust a relationship less just because it was detected by a very large study.
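Flipping the earlier sketch around: here's the smallest standardized effect a study of a given size can reliably detect, under the same made-up alpha and power assumptions. Huge studies can resolve ever-tinier effects, which is exactly the point.

```python
# The flip side: the smallest standardized effect a study of a given size
# can detect with decent power (same made-up alpha/power as above).
from scipy.stats import norm

def min_detectable_d(n_per_group, alpha=0.05, power=0.80):
    """Smallest standardized mean difference detectable with the given
    per-group sample size, alpha, and power (normal approximation)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * (2 / n_per_group) ** 0.5

# Bigger studies can resolve tinier effects; that says nothing about
# whether the effects they report are less likely to be real.
for n in (50, 500, 5_000, 500_000):
    print(f"n = {n:>7,} per group -> detectable d ≈ {min_detectable_d(n):.3f}")
```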
Beware the man of one study.
A modest proposal: what if we just feed anybody who's had a study published in Nature to hungry grad students?
I've worked as a music teacher for children for a decade. And this has the ring of truth to me. In my early years, I would constantly have ideas about children and their personalities, techniques that seemed to help get particular ideas across, ways to influence the kids' behavior and maintain my energy levels. The vast majority of these ideas wouldn't pan out, or would only work on one particular kid.
But a few ideas did stick, and occasionally they would redefine the way I taught. I had to really believe in each one, stick with it, be willing to give it a fair chance, tweak it, but set it aside if it didn't work out.
The nice thing is that I got to directly observe the ideas in action, and because I like to see my students succeed, the incentive structure was really nice. It seems logical to me that we need some researchers to babble out new hypotheses with very little barrier, and then a series of ever-more-severe prunings until we arrive at the Cochrane database.
Just a thought. If other, less formal, less visible methods might work to reduce bias, is there maybe a chance that they're already in place to some extent? This paper does a lot of priming us to believe that research is a lot of Wild West faith-healing quackery. What if instead scientists care about the truth and have found ways to circumvent the biases of their field in ways that John Ioannidis hasn't noticed or seen fit to mention?
What if the studies that focus more on creative hypothesis-generation and less on rigorous falsification are just using methods appropriate to the wide end of the research funnel?
In that case, we'd probably need to create grant-making structures that are tailored to hypothesis-generating or hypothesis-confirming/falsifying research. A study that is expected to generate 1,000 relationships, of which 2 will eventually be proven real and novel, might be as valuable as a study that takes a single pretty-solid hypothesis and convincingly determines whether it's true.
But the NIH probably wants to be clear on which type of evidence it's buying. And until everybody's on board with the idea that a 2/1000 hit rate can be worthwhile, provided the true findings can be teased out and are novel and important enough, early-stage hypothesis-generating research will probably keep on pretending that its findings are a lot more solid than they are, just to get statistics cops like John Ioannidis off their backs.
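For what it's worth, here's the kind of toy comparison I have in mind; every number in it is invented purely for illustration, so treat it as a sketch of the argument, not an estimate of anything real.

```python
# Toy comparison of the two kinds of studies described above. Every number
# here is invented for illustration; the only point is that a 2/1000 hit
# rate isn't automatically worthless.

# Hypothesis-generating screen: 1,000 candidates, 2 of which eventually
# pan out as real and novel.
screen_hits = 2
value_per_novel_finding = 50.0   # hypothetical "units of scientific value"
screen_cost = 60.0               # hypothetical cost, same units

# Confirmatory study: one pretty-solid hypothesis decisively settled.
value_of_decisive_answer = 100.0
confirmatory_cost = 60.0

print("screen net value:      ", screen_hits * value_per_novel_finding - screen_cost)
print("confirmatory net value:", value_of_decisive_answer - confirmatory_cost)
```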
I mean, just a thought. I don't really know whether this story I'm spinning really reflects grant-making dynamics in scientific research. But I feel like all of Ioannidis's assertions and assumptions are provoking an equal and opposite reaction from me. And I think that's the proper way to read a paper like this. Counter hypothesis with hypothesis, and let the empirical data be the ultimate arbiter.
A good idea, and I'd love to see the data. Of course, when Ioannidis says he suspects "several" classics will fail the test, that's only interesting if we know the sheer number of studies he's imagining will be carried out, and whether he's defining a "classic" as a study considered overwhelmingly confirmed, or as a popular study that doesn't actually have as much support as you'd expect, given the number of times it gets referenced.
After all the drama, this seems like a perfectly reasonable conclusion. Yes, we shouldn't take every relationship turned up in the first GWAS done for some disease as even likely to be correct. "More research is required," as they say.
If it's impossible to decipher, then that means we should be suspicious of any study that claims to have some fancy statistical method to detect data dredging. But it does seem possible, to some extent.
For example, if an author publishes some very weirdly-specific data in support of what should be a very broad conclusion, I might start to get suspicious. To make it concrete, imagine that our hair color/personality researcher goes around talking a big talk at conferences about how obvious he believes this link to be. And yet every time he comes out with a study, it's something like "blondes with short hair and bangs tend to have slightly more Machiavellian traits when we analyze the coffee-pot-filling behavior of office workers on Mondays using this one unusual set of statistical techniques." Sounds like somebody's been having a little too much fun with an unusually rich and unique data set.
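To see how easy it is to stumble into that kind of weirdly specific "finding," here's a toy forking-paths simulation on pure noise; the subgroups, outcomes, and sample sizes are all arbitrary choices of mine.

```python
# Forking-paths toy: a data set with no real relationship anywhere, tested
# across every combination of hair-color subgroup, day of the week, and
# outcome measure, the way an over-enthusiastic analyst might.
import itertools
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)

hair_colors = ["blonde", "brunette", "redhead", "black"]
days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
outcomes = ["machiavellianism", "agreeableness", "coffee_refills"]

def fake_group(n=30):
    """Pure noise: standard-normal scores for n imaginary office workers."""
    return [random.gauss(0, 1) for _ in range(n)]

significant = 0
tests = 0
for hair, day, outcome in itertools.product(hair_colors, days, outcomes):
    a, b = fake_group(), fake_group()            # subgroup vs. comparison group
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = (mean(a) - mean(b)) / se                 # crude two-sample z-test
    tests += 1
    if abs(z) > 1.96:                            # "p < 0.05"
        significant += 1

print(f"{significant} 'significant' findings out of {tests} tests on pure noise")
```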
This would be an appropriate time to remind the audience that the title of this paper was "Why Most Published Research Findings Are False." Well, by certain approximations and assumptions.
Well, now that we know based on our assumptions and approximations that most published research findings are false, I guess that gives us a basis for estimating the probability of isolated fields? Perhaps with a few more assumptions? And once we've determined those, then we might actually get around to making some assumptions about individual research questions?
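For reference, the quantity all of these assumptions feed into is the paper's positive predictive value. Here's a quick sketch; the prior odds R, the error rates, and the bias term u below are example inputs chosen by me.

```python
def ppv(R, alpha=0.05, beta=0.20, u=0.0):
    """Positive predictive value of a claimed finding, following the
    paper's formula: R is the prior odds that a probed relationship is
    true, and u is the bias term from its Table 2."""
    true_positives = (1 - beta) * R + u * beta * R
    all_positives = R + alpha - beta * R + u * (1 - alpha + beta * R)
    return true_positives / all_positives

# The headline claim lives or dies on what you assume for R (and u):
for R in (1.0, 0.1, 0.01):
    print(f"R = {R:>4}: PPV = {ppv(R):.2f}, with bias u=0.2: {ppv(R, u=0.2):.2f}")
```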
Yes, I have to assume they would be!
Look, there's a very strong argument to be made here. It's that if you're trying to understand a research field, it's helpful to think of it as a funnel. We start with a limited understanding of the relationships at play. Gradually, we are able to elucidate them, but it takes time, so don't just take P < 0.05 as gospel truth. Early on, it's OK to do lots of cheap but not-too-decisive tests on tons of hypotheses. As the field matures, it needs to subject its most durable findings to increasingly decisive pressure tests. The relationships that survive will in turn inform how we interpret the plausibility of the more novel, less-supported findings being generated at the other end of the research pipeline.
This is a vision of an iterative design process, and it makes perfect sense.
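If you wanted to make that funnel concrete, a toy simulation might look something like this; the stage counts, prior probability, and error rates are all invented for illustration.

```python
# Toy funnel: many candidate hypotheses, a small fraction actually true,
# then successive rounds of testing that get stricter (lower alpha) and
# better powered. All parameters here are invented for illustration.
import random

random.seed(0)

n_hypotheses = 10_000
p_true = 0.05
stages = [            # (alpha, power) per round of testing
    (0.05, 0.50),     # cheap exploratory screen
    (0.01, 0.80),     # targeted follow-up
    (0.001, 0.95),    # decisive confirmatory trial
]

pool = [random.random() < p_true for _ in range(n_hypotheses)]
for alpha, power in stages:
    survivors = []
    for is_true in pool:
        p_pass = power if is_true else alpha
        if random.random() < p_pass:
            survivors.append(is_true)
    pool = survivors
    frac_true = sum(pool) / len(pool) if pool else float("nan")
    print(f"alpha={alpha:<6} power={power:<5} -> {len(pool):>5} survive, "
          f"fraction true ≈ {frac_true:.2f}")
```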
My problem with this paper is that it projects a profoundly cynical view of this whole enterprise, and that it bases the title claim on little more than assumption piled on assumption. And it gets used as a tool by other cynics to browbeat people with even a modest appreciation for scientific research.
So let's not do that anymore, OK?