Against NHST

gwern

A summary of standard non-Bayesian criticisms of common frequentist statistical practices, with pointers into the academic literature.

Frequentist statistics is a wide field, but in practice by innumerable psychologists, biologists, economists etc, frequentism tends to be a particular style called “Null Hypothesis Significance Testing” (NHST) descended from R.A. Fisher (as opposed to eg. Neyman-Pearson) which is focused on

setting up a null hypothesis and an alternative hypothesis
calculating a p-value (possibly via a _<_a href="https://en.wikipedia.org/wiki/Student%27s_t-test">t-test or more complex alternatives like ANOVA)
and rejecting the null if an arbitrary threshold is passed.

NHST became nearly universal between the 1940s & 1960s (see Gigerenzer 2004, pg18), and has been heavily criticized for as long. Frequentists criticize it for:

practitioners & statistics teachers misinterpret the meaning of a p-value (LessWrongers too); Cohen on this persistent illusion:

What’s wrong with NHST? Well, among other things, it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does! What we want to know is, “Given these data, what is the probability that H0 is true?” But as most of us know, what it tells us is “Given that H0 is true, what is the probability of these (or more extreme) data?” These are not the same…

(This misunderstanding is incredibly widespread; once you understand it, you'll see it everywhere. I can't count how many times I have seen a comment or blog explaining that a p=0.05 means "the probability of the null hypothesis not being true is 95%", in many different variants.)

cargo-culting the use of 0.05 as an accept/reject threshold based on historical accident & custom (rather than using a loss function chosen through decision theory to set the threshold based on the cost of false positives).

Similarly, the cargo-culting encourages misuse of two-tailed tests, avoidance of multiple correction, data dredging, and in general, “p-value hacking”.

failing to compare many possible hypotheses or models, and limiting themselves to one - sometimes ill-chosen or absurd - null hypothesis and one alternative
deprecating the value of exploratory data analysis and depicting data graphically (see, for example, Anscombe’s quartet)
ignoring the more important summary statistic of “effect size”
ignoring the more important summary statistic of confidence intervals; this is related to how use of p-values leads to ignorance of the statistical power of a study - a small study may have only a small chance of detecting an effect if it exists, but turn in misleadingly good-looking p-values
because null hypothesis tests cannot accept the alternative, but only reject a null, they inevitably cause false alarms upon repeated testing

(An example from my personal experience of the cost of ignoring effect size and confidence intervals: p-values cannot (easily) be used to compile a meta-analysis (pooling of multiple studies); hence, studies often do not include the necessary information about means, standard deviations, or effect sizes & confidence intervals which one could use directly. So authors must be contacted, and they may refuse to provide the information or they may no longer be available; both have happened to me in trying to do my dual n-back & iodine meta-analyses.)

Critics’ explanations for why a flawed paradigm is still so popular focus on the ease of use and its weakness; from Gigerenzer 2004:

Hays (1963) had a chapter on Bayesian statistics in the second edition of his widely read textbook but dropped it in the subsequent editions. As he explained to one of us (GG) he dropped the chapter upon pressure from his publisher to produce a statistical cookbook that did not hint at the existence of alternative tools for statistical inference. Furthermore, he believed that many researchers are not interested in statistical thinking in the first place but solely in getting their papers published (Gigerenzer, 2000)…When Loftus (1993) became the editor of Memory & Cognition, he made it clear in his editorial that he did not want authors to submit papers in which p-, t-, or F-values are mindlessly being calculated and reported. Rather, he asked researchers to keep it simple and report figures with error bars, following the proverb that “a picture is worth more than a thousand p-values.” We admire Loftus for having had the courage to take this step. Years after, one of us (GG) asked Loftus about the success of his crusade against thoughtless significance testing. Loftus bitterly complained that most researchers actually refused the opportunity to escape the ritual. Even when he asked in his editorial letter to get rid of dozens of p-values, the authors insisted on keeping them in. There is something deeply engrained in the minds of many researchers that makes them repeat the same action over and over again.

Shifts away from NHST have happened in some fields. Medical testing seems to have made such a shift (I suspect due to the rise of meta-analysis):

Fidler et al. (2004b, 626) explain the spread of the reform in part by a shift from testing to estimation that was facilitated by the medical literature, unlike psychology, using a common measurement scale, to “strictly enforced editorial policy, virtually simultaneous reforms in a number of leading journals, and the timely re-writing [of] textbooks to fit with policy recommendations.” But their description of the process suggests that an accidental factor, the coincidence of several strong-willed editors, also mattered. For the classic collection of papers criticizing significance tests in psychology see Morrison and Hankel (1970) [The Significance Test Controversy: A Reader], and for a more recent collection of papers see Harlow et al. (1997) [What If There Were No Significance Tests?]. Nickerson (2000) provides a comprehensive survey of this literature.

0.1 Further reading

More on these topics:

Cohen, “The Earth Is Round (p<.05)” (recommended)
Effect size FAQ (published as The Essential Guide to Effect Sizes, Ellis)
“The Higgs Boson at 5 Sigmas”
The Cult of Statistical Significance, McCloskey & Ziliak 2008; criticism, their reply
“Bayesian estimation supersedes the t test”, Kruschke 2012 (see also Doing Bayesian Data Analysis); an exposition of a Bayesian paradigm, simulation of false alarm performance compared to his Bayesian code; an excerpt:

The perils of NHST, and the merits of Bayesian data analysis, have been expounded with increasing force in recent years (e.g., W. Edwards, Lindman, & Savage, 1963; Kruschke, 2010b, 2010a, 2011c; Lee & Wagenmakers, 2005; Wagenmakers, 2007).

A useful bibliography from "A peculiar prevalence of p values just below .05", Masicampo & Lalande 2012:

Although the primary emphasis in psychology is to publish results on the basis of NHST (Cumming et al., 2007; Rosenthal, 1979), the use of NHST has long been controversial. Numerous researchers have argued that reliance on NHST is counterproductive, due in large part because p values fail to convey such useful information as effect size and likelihood of replication (Clark, 1963; Cumming, 2008; Killeen, 2005; Kline, 2009 [Becoming a behavioral science researcher: A guide to producing research that matters]; Rozeboom, 1960). Indeed, some have argued that NHST has severely impeded scientific progress (Cohen, 1994; Schmidt, 1996) and has confused interpretations of clinical trials (Cicchetti et al., 2011; Ocana & Tannock, 2011). Some researchers have stated that it is important to use multiple, converging tests alongside NHST, including effect sizes and confidence intervals (Hubbard & Lindsay, 2008; Schmidt, 1996). Others still have called for NHST to be completely abandoned (e.g., Carver, 1978).

[http://www.gwern.net/DNB%20FAQ#flaws-in-mainstream-science-and-psychology](http://www.gwern.net/DNB%20FAQ#flaws-in-mainstream-science-and-psychology)
[https://www.reddit.com/r/DecisionTheory/](https://www.reddit.com/r/DecisionTheory/)

A summary of standard non-Bayesian criticisms of common frequentist statistical practices, with pointers into the academic literature.

setting up a null hypothesis and an alternative hypothesis
calculating a p-value (possibly via a _<_a href="https://en.wikipedia.org/wiki/Student%27s_t-test">t-test or more complex alternatives like ANOVA)
and rejecting the null if an arbitrary threshold is passed.

NHST became nearly universal between the 1940s & 1960s (see Gigerenzer 2004, pg18), and has been heavily criticized for as long. Frequentists criticize it for:

practitioners & statistics teachers misinterpret the meaning of a p-value (LessWrongers too); Cohen on this persistent illusion:

(This misunderstanding is incredibly widespread; once you understand it, you'll see it everywhere. I can't count how many times I have seen a comment or blog explaining that a p=0.05 means "the probability of the null hypothesis not being true is 95%", in many different variants.)

cargo-culting the use of 0.05 as an accept/reject threshold based on historical accident & custom (rather than using a loss function chosen through decision theory to set the threshold based on the cost of false positives).

Similarly, the cargo-culting encourages misuse of two-tailed tests, avoidance of multiple correction, data dredging, and in general, “p-value hacking”.

failing to compare many possible hypotheses or models, and limiting themselves to one - sometimes ill-chosen or absurd - null hypothesis and one alternative
deprecating the value of exploratory data analysis and depicting data graphically (see, for example, Anscombe’s quartet)
ignoring the more important summary statistic of “effect size”
ignoring the more important summary statistic of confidence intervals; this is related to how use of p-values leads to ignorance of the statistical power of a study - a small study may have only a small chance of detecting an effect if it exists, but turn in misleadingly good-looking p-values
because null hypothesis tests cannot accept the alternative, but only reject a null, they inevitably cause false alarms upon repeated testing

Critics’ explanations for why a flawed paradigm is still so popular focus on the ease of use and its weakness; from Gigerenzer 2004:

Hays (1963) had a chapter on Bayesian statistics in the second edition of his widely read textbook but dropped it in the subsequent editions. As he explained to one of us (GG) he dropped the chapter upon pressure from his publisher to produce a statistical cookbook that did not hint at the existence of alternative tools for statistical inference. Furthermore, he believed that many researchers are not interested in statistical thinking in the first place but solely in getting their papers published (Gigerenzer, 2000)…When Loftus (1993) became the editor of Memory & Cognition, he made it clear in his editorial that he did not want authors to submit papers in which p-, t-, or F-values are mindlessly being calculated and reported. Rather, he asked researchers to keep it simple and report figures with error bars, following the proverb that “a picture is worth more than a thousand p-values.” We admire Loftus for having had the courage to take this step. Years after, one of us (GG) asked Loftus about the success of his crusade against thoughtless significance testing. Loftus bitterly complained that most researchers actually refused the opportunity to escape the ritual. Even when he asked in his editorial letter to get rid of dozens of p-values, the authors insisted on keeping them in. There is something deeply engrained in the minds of many researchers that makes them repeat the same action over and over again.

Shifts away from NHST have happened in some fields. Medical testing seems to have made such a shift (I suspect due to the rise of meta-analysis):

Fidler et al. (2004b, 626) explain the spread of the reform in part by a shift from testing to estimation that was facilitated by the medical literature, unlike psychology, using a common measurement scale, to “strictly enforced editorial policy, virtually simultaneous reforms in a number of leading journals, and the timely re-writing [of] textbooks to fit with policy recommendations.” But their description of the process suggests that an accidental factor, the coincidence of several strong-willed editors, also mattered. For the classic collection of papers criticizing significance tests in psychology see Morrison and Hankel (1970) [The Significance Test Controversy: A Reader], and for a more recent collection of papers see Harlow et al. (1997) [What If There Were No Significance Tests?]. Nickerson (2000) provides a comprehensive survey of this literature.

0.1 Further reading

More on these topics:

Cohen, “The Earth Is Round (p<.05)” (recommended)
Effect size FAQ (published as The Essential Guide to Effect Sizes, Ellis)
“The Higgs Boson at 5 Sigmas”
The Cult of Statistical Significance, McCloskey & Ziliak 2008; criticism, their reply
“Bayesian estimation supersedes the t test”, Kruschke 2012 (see also Doing Bayesian Data Analysis); an exposition of a Bayesian paradigm, simulation of false alarm performance compared to his Bayesian code; an excerpt:

A useful bibliography from "A peculiar prevalence of p values just below .05", Masicampo & Lalande 2012:

[http://www.gwern.net/DNB%20FAQ#flaws-in-mainstream-science-and-psychology](http://www.gwern.net/DNB%20FAQ#flaws-in-mainstream-science-and-psychology)
[https://www.reddit.com/r/DecisionTheory/](https://www.reddit.com/r/DecisionTheory/)

"Power failure: why small sample size undermines the reliability of neuroscience", Button et al 2013:

A study with low statistical power has a reduced chance of detecting a true effect, but it is less well appreciated that low power also reduces the likelihood that a statistically significant result reflects a true effect. Here, we show that the average statistical power of studies in the neurosciences is very low. The consequences of this include overestimates of effect size and low reproducibility of results. There are also ethical dimensions to this problem, as unreliable research is inefficient and wasteful. Improving reproducibility in neuroscience is a key priority and requires attention to well-established but often ignored methodological principles.

Learned a new term:

Proteus phenomenon: The Proteus phenomenon refers to the situation in which the first published study is often the most biased towards an extreme result (the winner’s curse). Subsequent replication studies tend to be less biased towards the extreme, often finding evidence of smaller effects or even contradicting the findings from the initial study.

One of the interesting, and still counter-intuitive to me, aspects of power/beta is how it also changes the number of fake findings; typically, people think that must be governed by the p-value or alpha ("an alpha of 0.05 means that of the positive findings only 1 in 20 will be falsely thrown up by chance!"), but no:

For example, suppose that we work in a scientific field in which one in five of the effects we test are expected to be truly non-null (that is, R = 1 / (5 – 1) = 0.25) and that we claim to have discovered an effect when we reach p < 0.05; if our studies have 20% power, then PPV = 0.20 × 0.25 / (0.20 × 0.25 + 0.05) = 0.05 / 0.10 = 0.50; that is, only half of our claims for discoveries will be correct. If our studies have 80% power, then PPV = 0.80 × 0.25 / (0.80 × 0.25 + 0.05) = 0.20 / 0.25 = 0.80; that is, 80% of our claims for discoveries will be correct.

Third, even when an underpowered study discovers a true effect, it is likely that the estimate of the magnitude of that effect provided by that study will be exaggerated. This effect inflation is often referred to as the ‘winner’s curse’13 and is likely to occur whenever claims of discovery are based on thresholds of statistical significance (for example, p < 0.05) or other selection filters (for example, a Bayes factor better than a given value or a false-discovery rate below a given value). Effect inflation is worst for small, low-powered studies, which can only detect effects that happen to be large. If, for example, the true effect is medium-sized, only those small studies that, by chance, overestimate the magnitude of the effect will pass the threshold for discovery. To illustrate the winner’s curse, suppose that an association truly exists with an effect size that is equivalent to an odds ratio of 1.20, and we are trying to discover it by performing a small (that is, under-powered) study. Suppose also that our study only has the power to detect an odds ratio of 1.20 on average 20% of the time. The results of any study are subject to sampling variation and random error in the measurements of the variables and outcomes of interest. Therefore, on average, our small study will find an odds ratio of 1.20 but, because of random errors, our study may in fact find an odds ratio smaller than 1.20 (for example, 1.00) or an odds ratio larger than 1.20 (for example, 1.60). Odds ratios of 1.00 or 1.20 will not reach statistical significance because of the small sample size. We can only claim the association as nominally significant in the third case, where random error creates an odds ratio of 1.60. The winner’s curse means, therefore, that the ‘lucky’ scientist who makes the discovery in a small study is cursed by finding an inflated effect.

Publication bias and selective reporting of outcomes and analyses are also more likely to affect smaller, under-powered studies17. Indeed, investigations into publication bias often examine whether small studies yield different results than larger ones18. Smaller studies more readily disappear into a file drawer than very large studies that are widely known and visible, and the results of which are eagerly anticipated (although this correlation is far from perfect). A ‘negative’ result in a high-powered study cannot be explained away as being due to low power 19,20, and thus reviewers and editors may be more willing to publish it, whereas they more easily reject a small ‘negative study as being inconclusive or uninformative21. The protocols of large studies are also more likely to have been registered or otherwise made publicly available, so that deviations in the analysis plans and choice of outcomes may become obvious more easily. Small studies, conversely, are often subject to a higher level of exploration of their results and selective reporting thereof.

The actual strategy is the usual trick in meta-analysis: you take effects which have been studied enough to be meta-analyzed, take the meta-analysis result as the 'true' ground result, and re-analyze other results with that as the baseline. (I mention this because in some of the blogs, this seemed to come as news to them, that you could do this, but as far as I knew it's a perfectly ordinary approach.) This usually turns in depressing results, but actually it's not that bad - it's worse:

Any attempt to establish the average statistical power in neuroscience is hampered by the problem that the true effect sizes are not known. One solution to this problem is to use data from meta-analyses. Meta-analysis provides the best estimate of the true effect size, albeit with limitations, including the limitation that the individual studies that contribute to a meta-analysis are themselves subject to the problems described above. If anything, summary effects from meta-analyses, including power estimates calculated from meta-analysis results, may also be modestly inflated22.

Our results indicate that the median statistical power in neuroscience is 21%. We also applied a test for an excess of statistical significance72. This test has recently been used to show that there is an excess significance bias in the literature of various fields, including in studies of brain volume abnormalities73, Alzheimer’s disease genetics70,74 and cancer biomarkers75. The test revealed that the actual number (349) of nominally significant studies in our analysis was significantly higher than the number expected (254; p < 0.0001). Importantly, these calculations assume that the summary effect size reported in each study is close to the true effect size, but it is likely that they are inflated owing to publication and other biases described above.

Previous analyses of studies using animal models have shown that small studies consistently give more favourable (that is, ‘positive’) results than larger studies78 and that study quality is inversely related to effect size79–82.

Not mentioned, amusingly, are the concerns about applying research to humans:

In order to achieve 80% power to detect, in a single study, the most probable true effects as indicated by the meta-analysis, a sample size of 134 animals would be required for the water maze experiment (assuming an effect size of d = 0.49) and 68 animals for the radial maze experiment (assuming an effect size of d = 0.69); to achieve 95% power, these sample sizes would need to increase to 220 and 112, respectively. What is particularly striking, however, is the inefficiency of a continued reliance on small sample sizes. Despite the apparently large numbers of animals required to achieve acceptable statistical power in these experiments, the total numbers of animals actually used in the studies contributing to the meta-analyses were even larger: 420 for the water maze experiments and 514 for the radial maze experiments.

There is ongoing debate regarding the appropriate balance to strike between using as few animals as possible in experiments and the need to obtain robust, reliable findings. We argue that it is important to appreciate the waste associated with an underpowered study — even a study that achieves only 80% power still presents a 20% possibility that the animals have been sacrificed without the study detecting the underlying true effect. If the average power in neuroscience animal model studies is between 20–30%, as we observed in our analysis above, the ethical implications are clear.

Learned a new term:

Proteus phenomenon: The Proteus phenomenon refers to the situation in which the first published study is often the most biased towards an extreme result (the winner’s curse). Subsequent replication studies tend to be less biased towards the extreme, often finding evidence of smaller effects or even contradicting the findings from the initial study.

Oh great, researchers are going to end up giving this all sorts of names. Joseph Banks Rhine called it the decline effect, while Yitzhak Rabin* calls it the Truth Wears Off effect (after ... (read more)