How to Fix Science

by lukeprog

7th Mar 2012

Like The Cognitive Science of Rationality, this is a post for beginners. Send the link to your friends!

Science is broken. We know why, and we know how to fix it. What we lack is the will to change things.

In 2005, several analyses suggested that most published results in medicine are false. A 2008 review showed that perhaps 80% of academic journal articles mistake "statistical significance" for "significance" in the colloquial meaning of the word, an elementary error every introductory statistics textbook warns against. This year, a detailed investigation showed that half of published neuroscience papers contain one particular simple statistical mistake.

Also this year, a respected senior psychologist published in a leading journal a study claiming to show evidence of precognition. The editors explained that the paper was accepted because it was written clearly and followed the usual standards for experimental design and statistical methods.

Science writer Jonah Lehrer asks: "Is there something wrong with the scientific method?"

Yes, there is.

This shouldn't be a surprise. What we currently call "science" isn't the best method for uncovering nature's secrets; it's just the first set of methods we've collected that wasn't totally useless like personal anecdote and authority generally are.

As time passes we learn new things about how to do science better. The Ancient Greeks practiced some science, but few scientists tested hypotheses against mathematical models before Ibn al-Haytham's 11th-century Book of Optics (which also contained hints of Occam's razor and positivism). Around the same time, Al-Biruni emphasized the importance of repeated trials for reducing the effect of accidents and errors. Galileo brought mathematics to greater prominence in scientific method, Bacon described eliminative induction, Newton demonstrated the power of consilience (unification), Peirce clarified the roles of deduction, induction, and abduction, and Popper emphasized the importance of falsification. We've also discovered the usefulness of peer review, control groups, blind and double-blind studies, plus a variety of statistical methods, and added these to "the" scientific method.

In many ways, the best science done today is better than ever — but it still has problems, and most science is done poorly. The good news is that we know what these problems are and we know multiple ways to fix them. What we lack is the will to change things.

This post won't list all the problems with science, nor will it list all the promising solutions for any of these problems. (Here's one I left out.) Below, I only describe a few of the basics.

Problem 1: Publication bias

When the study claiming to show evidence of precognition was published, psychologist Richard Wiseman set up a registry for advance announcement of new attempts to replicate the study.

Carl Shulman explains:

A replication registry guards against publication bias, and at least 5 attempts were registered. As far as I can tell, all of the subsequent replications have, unsurprisingly, failed to replicate Bem's results. However, JPSP and the other high-end psychology journals refused to publish the results, citing standing policies of not publishing straight replications.

From the journals' point of view, this (common) policy makes sense: bold new claims will tend to be cited more and raise journal prestige (which depends on citations per article), even though this means most of the 'discoveries' they publish will be false despite their low p-values (high statistical significance). However, this means that overall the journals are giving career incentives for scientists to massage and mine their data for bogus results, but not to challenge bogus results presented by others.

This is an example of publication bias:

Publication bias is the term for what occurs whenever the research that appears in the published literature is systematically unrepresentative of the population of completed studies. Simply put, when the research that is readily available differs in its results from the results of all the research that has been done in an area, readers and reviewers of that research are in danger of drawing the wrong conclusion about what that body of research shows. In some cases this can have dramatic consequences, as when an ineffective or dangerous treatment is falsely viewed as safe and effective. [Rothstein et al. 2005]
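The mechanism in this definition can be sketched with a small simulation. This is an illustration under stated assumptions, not a model of any real literature: the true effect is set to zero, each hypothetical study takes 30 measurements with known standard deviation 1, and an imaginary journal publishes only results with |z| > 1.96. All names and parameters below are invented for the example.

```python
import random
import statistics

def run_study(true_effect, n, rng):
    """Simulate one study: n noisy measurements, return (mean, z-score)."""
    sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    mean = statistics.fmean(sample)
    z = mean * n ** 0.5  # known sd = 1, so the standard error is 1/sqrt(n)
    return mean, z

def simulate_literature(true_effect=0.0, n=30, studies=2000, seed=0):
    """Return (means of all studies, means of the 'published' studies)."""
    rng = random.Random(seed)
    all_results = [run_study(true_effect, n, rng) for _ in range(studies)]
    # Hypothetical journal policy: publish only "significant" results.
    published = [mean for mean, z in all_results if abs(z) > 1.96]
    return [mean for mean, _ in all_results], published

all_means, published_means = simulate_literature()
share_published = len(published_means) / len(all_means)
avg_abs_all = statistics.fmean(abs(m) for m in all_means)
avg_abs_published = statistics.fmean(abs(m) for m in published_means)

print(f"published: {share_published:.1%} of studies")
print(f"mean |effect|, all studies:    {avg_abs_all:.3f}")
print(f"mean |effect|, published only: {avg_abs_published:.3f}")
```

Since the true effect is zero here, every published result is a false positive, and a reader who sees only the published studies will see effects several times larger than the average across all completed studies.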

Sometimes, publication bias can be more deliberate. The anti-inflammatory drug rofecoxib (Vioxx) is a famous case. The drug was prescribed to 80 million people, but it was later revealed that its maker, Merck, had withheld evidence of the drug's risks. Merck was forced to recall the drug, but it had already resulted in 88,000-144,000 cases of serious heart disease.

Example partial solution

One way to combat publication bias is for journals to only accept experiments that were registered in a public database before they began. This allows scientists to see which experiments were conducted but never reported (perhaps due to negative results). Several prominent medical journals (e.g. The Lancet and JAMA) now operate this way, but this protocol is not as widespread as it could be.
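As a rough sketch of why pre-registration helps, the check a reader could run against such a database is just a set difference. The trial identifiers below are hypothetical placeholders, not entries from any real registry.

```python
# Hypothetical trial identifiers, for illustration only.
registered = {"trial-001", "trial-002", "trial-003", "trial-004", "trial-005"}
published = {"trial-002", "trial-005"}

# Registered but never reported: the studies that publication bias would hide.
unreported = sorted(registered - published)

# Published without prior registration: rejected under the policy described above.
unregistered = sorted(published - registered)

print(f"{len(unreported)} of {len(registered)} registered trials never reported: {unreported}")
print(f"published without registration: {unregistered}")
```

Without the registry, the first set is invisible: there is no way to count studies that were run but never written up.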

Problem 2: Experimenter bias

Scientists are humans. Humans are affected by cognitive heuristics and biases (or, really, humans just are cognitive heuristics and biases), and they respond to incentives that may not align with an optimal pursuit of truth. Thus, we should expect experimenter bias in the practice of science.

There are many stages in research during which experimenter bias can occur:

in reading-up on the field,
in specifying and selecting the study sample,
in [performing the experiment],
in measuring exposures and outcomes,
in analyzing the data,
in interpreting the analysis, and
in publishing the results. [Sackett 1979]

Common biases have been covered elsewhere on Less Wrong, so I'll let those articles explain how biases work.

Example partial solution

There is some evidence that the skills of rationality (e.g. cognitive override) are teachable. Training scientists to notice and meliorate biases that arise in their thinking may help them to reduce the magnitude and frequency of the thinking errors that may derail truth-seeking attempts during each stage of the scientific process.

Problem 3: Bad statistics

I remember when my statistics professor first taught me the reasoning behind "null hypothesis significance testing" (NHST), the standard technique for evaluating experimental results. NHST uses "p-values," which are statements about the probability of getting some data (e.g. one's experimental results) given the hypothesis being tested. I asked my professor, "But don't we want to know the probability of the hypothesis we're testing given the data, not the other way around?" The reply was something about how this was the best we could do. (But that's false, as we'll see in a moment.)
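The gap between the two probabilities can be made concrete with Bayes' theorem. The numbers below are illustrative assumptions only (a 10% prior that the tested effect is real, 80% power, and a 5% significance threshold); the point is that the probability of the effect being real given a significant result is a different quantity from the significance threshold.

```python
def posterior_given_significant(prior, power, alpha):
    """P(effect is real | result is significant), by Bayes' theorem.

    prior: P(effect is real) before the experiment (an assumption)
    power: P(significant | effect is real)
    alpha: P(significant | no effect), the significance threshold
    """
    true_positive = prior * power          # real effect, and the test detects it
    false_positive = (1 - prior) * alpha   # no effect, but the test fires anyway
    return true_positive / (true_positive + false_positive)

posterior = posterior_given_significant(prior=0.10, power=0.80, alpha=0.05)
underpowered = posterior_given_significant(prior=0.10, power=0.20, alpha=0.05)

print(f"well-powered study: {posterior:.2f}")
print(f"underpowered study: {underpowered:.2f}")
```

Under these assumed inputs, a result that is "significant at the 5% level" still leaves a substantial chance that the effect is not real, and the chance grows as power or prior plausibility falls.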

Another problem is that NHST computes the probability of getting data as unusual as the data one collected by considering what might be expected if that particular experiment was repeated many, many times. But how do we know anything about these imaginary repetitions? If I want to know something about a particular earthquake, am I supposed to imagine a few dozen repetitions of that earthquake? What does that even mean?

I tried to answer these questions on my own, but all my textbooks assumed the soundness of the mistaken NHST framework for scientific practice. It's too bad I didn't have a class with biostatistician Steven Goodman, who says:

The p-value is almost nothing sensible you can think of. I tell students to give up trying.

The sad part is that the logical errors of NHST are old news, and have been known ever since Ronald Fisher began advocating NHST in the 1920s. By 1960, Fisher had out-advocated his critics, and philosopher William Rozeboom remarked:

Despite the awesome pre-eminence [NHST] has attained... it is based upon a fundamental misunderstanding of the nature of rational inference, and is seldom if ever appropriate to the aims of scientific research.

There are many more problems with NHST and with "frequentist" statistics in general, but the central one is this: NHST does not follow from the axioms (foundational logical rules) of probability theory. It is a grab-bag of techniques that, depending on how those techniques are applied, can lead to different results when analyzing the same data — something that should horrify every mathematician.
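A standard textbook illustration of "different results from the same data" (not taken from this post) is the stopping-rule problem. Suppose a coin lands 9 heads and 3 tails. The one-sided p-value against "the coin is fair" depends on whether the experimenter planned 12 flips in advance or planned to keep flipping until the third tail. A minimal sketch, with both sampling plans as assumptions:

```python
from math import comb

heads, tails = 9, 3  # the same observed data in both analyses

# Design A: flip exactly 12 times, count heads.
# p-value = P(at least 9 heads in 12 flips | fair coin)
n = heads + tails
p_fixed_n = sum(comb(n, k) for k in range(heads, n + 1)) / 2 ** n

# Design B: flip until the 3rd tail appears, count heads.
# "At least 9 heads before the 3rd tail" is the same event as
# "the first 11 flips contain at most 2 tails".
m = heads + tails - 1
p_stop_at_tails = sum(comb(m, t) for t in range(tails)) / 2 ** m

print(f"fixed number of flips: p = {p_fixed_n:.4f}")
print(f"stop at third tail:    p = {p_stop_at_tails:.4f}")
```

With the conventional 0.05 threshold, one plan declares the result significant and the other does not, even though the flips themselves are identical.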

The inferential method that solves the problems with frequentism — and, more importantly, follows deductively from the axioms of probability theory — is Bayesian inference.
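As a small sketch of what this looks like in practice, consider a coin that lands 9 heads and 3 tails, and compare "the coin is fair" against "the bias is unknown, with a uniform prior" (the prior is an assumption chosen for simplicity). The Bayes factor below is computed for two hypothetical sampling plans, flipping exactly 12 times and flipping until the third tail, using exact fractions.

```python
from fractions import Fraction
from math import comb, factorial

heads, tails = 9, 3

def marginal_uniform(coefficient):
    """Integral of coefficient * p^heads * (1-p)^tails dp over [0, 1]."""
    beta = Fraction(factorial(heads) * factorial(tails),
                    factorial(heads + tails + 1))
    return coefficient * beta

def bayes_factor(coefficient):
    """Evidence for 'unknown bias, uniform prior' over 'fair coin'."""
    fair = coefficient * Fraction(1, 2) ** (heads + tails)
    return marginal_uniform(coefficient) / fair

# The two plans differ only in the counting coefficient of the likelihood.
bf_fixed_n = bayes_factor(comb(12, 9))        # exactly 12 flips
bf_stop_at_tails = bayes_factor(comb(11, 2))  # flip until the 3rd tail

print(float(bf_fixed_n), float(bf_stop_at_tails))
```

The coefficient cancels, so both plans give the same Bayes factor (about 1.4 in favor of a biased coin, which is very weak evidence). The conclusion depends on the data and the prior, not on the experimenter's unobserved intentions about when to stop.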

So why aren't all scientists using Bayesian inference instead of frequentist inference? Partly, we can blame the vigor of NHST's early advocates. But we can also attribute NHST's success to the simple fact that Bayesian calculations can be more difficult than frequentist calculations. Luckily, new software tools like WinBUGS let computers do most of the heavy lifting required for Bayesian inference.

There's also the problem of sheer momentum. Once a practice is enshrined, it's hard to dislodge it, even for good reasons. I took three statistics courses in university and none of my textbooks mentioned Bayesian inference. I didn't learn about it until I dropped out of university and studied science and probability theory on my own.

Remember the study about precognition? Not surprisingly, it was done using NHST. A later Bayesian analysis of the data disconfirmed the original startling conclusion.

Example partial solution

This one is obvious: teach students probability theory instead of NHST. Retrain current scientists in Bayesian methods. Make Bayesian software tools easier to use and more widespread.

Conclusion

If I'm right that there is unambiguous low-hanging fruit for improving scientific practice, this suggests that particular departments, universities, or private research institutions can (probabilistically) out-perform their rivals (in terms of actual discoveries, not just publications) given similar resources.

I'll conclude with one specific hypothesis. If I'm right, then a research group should be able to hire researchers trained in Bayesian reasoning and in catching publication bias and experimenter bias, and have them extract from the existing literature valuable medical truths that the mainstream medical community doesn't yet know about. This prediction, in fact, is about to be tested.

144 Comments

Scott Alexander (14y)

I only had time to double-check one of the scary links at the top, and I wasn't too impressed with what I found:

In 2010, a careful review showed that published industry-sponsored trials are four times more likely to show positive results than published independent studies, even though the industry-sponsored trials tend to use better experimental designs.

But the careful review you link to claims that studies funded by the industry report 85% positive results, compared to 72% positive by independent organizations and 50% positive by government - which is not what I think of when I hear four times! They also give a lot of reasons to think the difference may be benign: industry tends to do different kinds of studies than independent orgs. The industry studies are mainly Phase III/IV - a part of the approval process where drugs that have already been shown to work in smaller studies are tested on a larger population; the nonprofit and government studies are more often Phase I/II - the first check to see whether a promising new chemical works at all. It makes sense that studies on a drug which has already been found to probably work are more positive than the first studies on a totally new chemical. And the degree to which pharma studies are more likely to be late-phase is greater than the degree to which pharma companies are more likely to show positive results, and the article doesn't give stats comparing like to like! The same review finds with p < .001 that pharma studies are bigger, which again would make them more likely to find a result where one exists.

The only mention of the "4x more likely" number is buried in the Discussion section and cites a completely different study, Lexchin et al.

Lexchin reports an odds ratio of 4, which I think is what your first study meant when they say "industry studies are four times more likely to be positive". Odds ratios have always been one of my least favorite statistical concepts, and I always feel like I'm misunderstanding them somehow, but I don't think "odds ratio of 4" and "four times more likely" are connotatively similar (someone smarter, please back me up on this?!). For example, the largest study in Lexchin's meta-analysis, Yaphe et al, finds that 87% of industry studies are positive versus 65% of independent studies, for an odds ratio of 3.45x. But when I hear something like "X is four times more likely than Y", I think of Y being 20% likely and X being 80% likely; not 65% vs. 87%.

This means Lexchin's results are very similar to those of the original study you cite, which provides some confirmation that those are probably the true numbers. Lexchin also provides another hypothesis for what's going on. He says that "the research methods of trials sponsored by drug companies is at least as good as that of non-industry funded research and in many cases better", but that along with publication bias, industry fudges the results by comparing their drug to another drug and then administering the other drug incorrectly. For example, if your company makes Drug X, you sponsor a study to prove that it's better than Drug Y, but give patients Drug Y at a dose that's too low to do any good (or so high that it produces side effects). Then you conduct that study absolutely perfectly and get the correct result that your drug is better than another drug at the wrong dosage. This doesn't seem like the sort of thing Bayesian statistics could fix; in fact, it sounds like study interpretation would require domain-specific medical knowledge: someone who could say "Wait a second, that's not how we usually give penicillin!" I don't know whether this means industry studies that compare their drug against a placebo are more trustworthy.

So, summary. Industry studies seem to hover around 85% positive, non-industry studies around 65%. Part of this is probably because industry studies are more likely to test drugs that already have some evidence of working, and not due to scientific misconduct at all. More of it is due to publication bias and to getting the right answer to a wrong question like "Does this work better than another drug when the other is given improperly?".

Phrases like "Industry studies are four times more likely to show positive results" are connotatively inaccurate and don't support any of these proposals at all, except maybe the one to reduce publication bias.

This reinforces my prejudice that a lot of the literature on how misleading the literature is, is itself among the best examples of how misleading the literature is.

CarlShulman (14y)

This reinforces my prejudice that a lot of the literature on how misleading the literature is, is itself among the best examples of how misleading the literature is.

At the least, it allows one to argue that the claim "scientific papers are generally reliable" is self-undermining. The prior probability is also high, given the revolving door of "study of the week" science reporting we all are regularly exposed to.

Douglas_Knight (14y)

Yes, "four times as likely" is not the same as an odds ratio of four. And the problem here is the same as the problem in army1987's LL link that odds ratios get mangled in transmission.

But I like odds ratios. In the limit of small probability, odds ratios are the same as "times as likely." But there's nothing 4x as likely as 50%. Does that mean that 50% is very similar to all larger probabilities? Odds ratios are unchanged (or inverted) by taking complements: 4% to 1% is an odds ratio of about 4; 99% to 96% is also 4 (actually 4.1 in both cases).

Complementation is exactly what's going on here. The drug companies get 1.2x-1.3x more positive results than the independent studies. That doesn't sound so big, but everyone is likely to get positive results. If we speak in terms of negative results, the independent studies are 2-3x as likely to get negative results as the drug companies. Now it sounds like a big effect.

Odds ratios give a canonical distance between probabilities that doesn't let people cherry-pick between 34% more positives and 3x more negatives. They give us a way to compare any two probabilities that is the obvious one for very small probabilities and is related to the obvious one for very large probabilities. The cost of interpolating between the ends is that they are confusing in the middle. In particular, this "3x more negatives" turns into an odds ratio of 4.

Sometimes 50% really is similar to all larger probabilities. Sometimes you have a specific view on things and should use that, rather than the off-the-shelf odds ratio. But that doesn't seem to be true here.
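The arithmetic in this comment can be checked with a short sketch. The helper names are invented for illustration, and the 87% vs. 65% pair uses the rounded percentages quoted earlier in the thread, so the resulting odds ratio is only approximate.

```python
def odds(p):
    """Convert a probability to odds in favor."""
    return p / (1 - p)

def odds_ratio(p1, p2):
    return odds(p1) / odds(p2)

def compare(p1, p2):
    """Three ways of stating how p1 differs from p2."""
    return {
        "times as likely": p1 / p2,
        "times as likely NOT (p2 vs p1)": (1 - p2) / (1 - p1),
        "odds ratio": odds_ratio(p1, p2),
    }

for p1, p2 in [(0.04, 0.01), (0.99, 0.96), (0.87, 0.65)]:
    summary = ", ".join(f"{name}: {value:.2f}"
                        for name, value in compare(p1, p2).items())
    print(f"{p1:.0%} vs {p2:.0%} -> {summary}")
```

The odds ratio is exactly the product of the other two figures, which is why it is close to "times as likely" when both probabilities are small, close to "times as likely not" when both are large, and bigger than either in the middle of the range.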

Will_Newsome (14y)

A lot of the literature on cognitive biases is itself among the best examples of how biased people are (though unfortunately not usually in ways that would prove their point, with the obvious exception of confirmation bias).
