Against NHST

gwern

21st Dec 2012

A summary of standard non-Bayesian criticisms of common frequentist statistical practices, with pointers into the academic literature.

Frequentist statistics is a wide field, but as practiced by innumerable psychologists, biologists, economists etc., frequentism tends to be a particular style called “Null Hypothesis Significance Testing” (NHST), descended from R.A. Fisher (as opposed to e.g. Neyman-Pearson), which is focused on:

setting up a null hypothesis and an alternative hypothesis
calculating a p-value (possibly via a t-test [https://en.wikipedia.org/wiki/Student%27s_t-test] or more complex alternatives like ANOVA)
and rejecting the null if an arbitrary threshold is passed.
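The three steps above can be sketched in a few lines. This is a minimal illustration, not a recommended analysis: the two samples are made-up numbers, the helper name `permutation_p_value` is hypothetical, and a permutation test is swapped in for the t-test only so that the example runs on the Python standard library alone.

```python
import random

def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    n_a = len(group_a)

    def abs_mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    pooled = list(group_a) + list(group_b)
    observed = abs_mean_diff(pooled)
    hits = 0
    for _ in range(n_resamples):
        # Under H0 the group labels are exchangeable, so shuffle them.
        rng.shuffle(pooled)
        if abs_mean_diff(pooled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)

# Step 1: H0 = "both groups come from the same distribution",
#         H1 = "the group means differ". (Illustrative data only.)
control = [4.1, 5.0, 4.7, 5.3, 4.4, 4.9, 5.1, 4.6]
treated = [5.2, 5.9, 5.4, 6.1, 5.0, 5.8, 5.6, 5.5]

# Step 2: calculate a p-value.
p = permutation_p_value(control, treated)

# Step 3: reject the null if the (arbitrary) threshold is passed.
ALPHA = 0.05
reject_null = p < ALPHA
print(f"p = {p:.4f}, reject H0: {reject_null}")
```

Note that the output of the whole procedure is a single yes/no decision; nothing in it reports how large the difference is or how precisely it was estimated.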

NHST became nearly universal between the 1940s & 1960s (see Gigerenzer 2004, pg18), and has been heavily criticized for as long. Frequentists criticize it for:

practitioners & statistics teachers misinterpreting the meaning of a p-value (LessWrongers too); Cohen on this persistent illusion:

What’s wrong with NHST? Well, among other things, it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does! What we want to know is, “Given these data, what is the probability that H0 is true?” But as most of us know, what it tells us is “Given that H0 is true, what is the probability of these (or more extreme) data?” These are not the same…

(This misunderstanding is incredibly widespread; once you understand it, you'll see it everywhere. I can't count how many times I have seen a comment or blog explaining that a p=0.05 means "the probability of the null hypothesis not being true is 95%", in many different variants.)
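To see how far apart the two quantities can be, here is a small Bayes' theorem calculation. Every number in it is an assumption chosen for illustration (90% of tested nulls being true, 80% power, a 0.05 threshold), and it conditions on "the result was significant" rather than on an exact p-value; the function name is hypothetical.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """P(H0 is true | test came out significant), by Bayes' theorem."""
    false_positive = alpha * prior_null         # H0 true, yet significant
    true_positive = power * (1.0 - prior_null)  # H1 true and significant
    return false_positive / (false_positive + true_positive)

# Assumed inputs, not estimates from any real literature.
posterior = prob_null_given_significant(prior_null=0.9, alpha=0.05, power=0.8)
print(f"P(H0 | significant) = {posterior:.2f}")  # prints 0.36
```

Under these assumptions a "significant" result still leaves a 36% chance that the null is true, nowhere near the 5% that the common misreading suggests. The answer moves with the prior and the power, which is exactly the information a bare p-value does not contain.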

cargo-culting the use of 0.05 as an accept/reject threshold based on historical accident & custom (rather than using a loss function chosen through decision theory to set the threshold based on the cost of false positives).
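What setting a threshold from a loss function might look like, as a rough sketch: pick the alpha that minimizes expected loss given the costs of false positives and false negatives. The setup is entirely assumed for illustration (a one-sided z-test with known variance, a standardized effect of 0.3, n = 50, a 50% prior on the null, and made-up costs), and the function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_one_sided_z(alpha, effect_size, n):
    """Power of a one-sided z-test for a standardized effect size."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - STD_NORMAL.cdf(z_crit - effect_size * n ** 0.5)

def expected_loss(alpha, cost_fp, cost_fn, prior_null, effect_size, n):
    """Expected cost of false positives plus false negatives."""
    p_false_pos = prior_null * alpha
    p_false_neg = (1.0 - prior_null) * (
        1.0 - power_one_sided_z(alpha, effect_size, n)
    )
    return cost_fp * p_false_pos + cost_fn * p_false_neg

def best_alpha(cost_fp, cost_fn, prior_null=0.5, effect_size=0.3, n=50):
    """Grid search for the loss-minimizing threshold."""
    grid = [i / 1000 for i in range(1, 1000)]
    return min(
        grid,
        key=lambda a: expected_loss(a, cost_fp, cost_fn, prior_null, effect_size, n),
    )

# False positives 20x as costly as false negatives, and the reverse.
cautious = best_alpha(cost_fp=20.0, cost_fn=1.0)
lenient = best_alpha(cost_fp=1.0, cost_fn=20.0)
print(f"costly false positives: alpha = {cautious:.3f}")
print(f"costly false negatives: alpha = {lenient:.3f}")
```

In this toy setup the two cost structures land on opposite sides of 0.05, which is the point: no single threshold is right for every problem.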

Similarly, the cargo-culting encourages misuse of two-tailed tests, avoidance of multiple-comparison correction, data dredging, and in general, “p-value hacking”.
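The cost of skipping the correction is easy to compute. A minimal sketch, assuming the tests are independent and every null is true (real tests are often correlated, so treat the figures as illustrative); Bonferroni is used here only as the simplest correction, and the function names are hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests

N_TESTS = 20
fwer_uncorrected = familywise_error_rate(0.05, N_TESTS)
fwer_corrected = familywise_error_rate(bonferroni_threshold(0.05, N_TESTS), N_TESTS)

print(f"uncorrected: {fwer_uncorrected:.3f}")
print(f"Bonferroni:  {fwer_corrected:.3f}")
```

With 20 uncorrected tests at 0.05, the chance of at least one spurious "finding" is roughly 64%; dividing the threshold by the number of tests brings it back under 5%.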

failing to compare many possible hypotheses or models, and limiting themselves to one - sometimes ill-chosen or absurd - null hypothesis and one alternative
deprecating the value of exploratory data analysis and depicting data graphically (see, for example, Anscombe’s quartet)
ignoring the more important summary statistic of “effect size”
ignoring the more important summary statistic of confidence intervals; this is related to how use of p-values leads to ignorance of the statistical power of a study - a small study may have only a small chance of detecting an effect if it exists, but turn in misleadingly good-looking p-values
because null hypothesis tests cannot accept the alternative, but only reject a null, they inevitably cause false alarms upon repeated testing
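For contrast with a bare p-value, here is a sketch of the two summaries named above: a standardized effect size (Cohen's d) and a confidence interval for the mean difference. The data are made up, the function names are hypothetical, and the interval uses a normal approximation, which is too narrow for samples this small (a t-based interval would be wider).

```python
from statistics import NormalDist, mean, stdev

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * stdev(a) ** 2 + (n_b - 1) * stdev(b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(b) - mean(a)) / pooled_var ** 0.5

def mean_difference_ci(a, b, level=0.95):
    """Approximate CI for mean(b) - mean(a), normal approximation."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    diff = mean(b) - mean(a)
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    return diff - z * se, diff + z * se

# Illustrative data only.
control = [4.1, 5.0, 4.7, 5.3, 4.4, 4.9, 5.1, 4.6]
treated = [5.2, 5.9, 5.4, 6.1, 5.0, 5.8, 5.6, 5.5]

d = cohens_d(control, treated)
ci_low, ci_high = mean_difference_ci(control, treated)
print(f"Cohen's d = {d:.2f}")
print(f"95% CI for the difference: ({ci_low:.2f}, {ci_high:.2f})")
```

Unlike "p < 0.05", these numbers say how big the effect is and how uncertain the estimate is, and they are what a later meta-analysis can actually pool.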

(An example from my personal experience of the cost of ignoring effect size and confidence intervals: p-values cannot (easily) be used to compile a meta-analysis (pooling of multiple studies); hence, studies often do not include the necessary information about means, standard deviations, or effect sizes & confidence intervals which one could use directly. So authors must be contacted, and they may refuse to provide the information or they may no longer be available; both have happened to me in trying to do my dual n-back & iodine meta-analyses.)

Critics’ explanations for why a flawed paradigm is still so popular focus on the ease of use and its weakness; from Gigerenzer 2004:

Hays (1963) had a chapter on Bayesian statistics in the second edition of his widely read textbook but dropped it in the subsequent editions. As he explained to one of us (GG) he dropped the chapter upon pressure from his publisher to produce a statistical cookbook that did not hint at the existence of alternative tools for statistical inference. Furthermore, he believed that many researchers are not interested in statistical thinking in the first place but solely in getting their papers published (Gigerenzer, 2000)…When Loftus (1993) became the editor of Memory & Cognition, he made it clear in his editorial that he did not want authors to submit papers in which p-, t-, or F-values are mindlessly being calculated and reported. Rather, he asked researchers to keep it simple and report figures with error bars, following the proverb that “a picture is worth more than a thousand p-values.” We admire Loftus for having had the courage to take this step. Years after, one of us (GG) asked Loftus about the success of his crusade against thoughtless significance testing. Loftus bitterly complained that most researchers actually refused the opportunity to escape the ritual. Even when he asked in his editorial letter to get rid of dozens of p-values, the authors insisted on keeping them in. There is something deeply engrained in the minds of many researchers that makes them repeat the same action over and over again.

Shifts away from NHST have happened in some fields. Medical testing seems to have made such a shift (I suspect due to the rise of meta-analysis):

Fidler et al. (2004b, 626) explain the spread of the reform in part by a shift from testing to estimation that was facilitated by the medical literature, unlike psychology, using a common measurement scale, to “strictly enforced editorial policy, virtually simultaneous reforms in a number of leading journals, and the timely re-writing [of] textbooks to fit with policy recommendations.” But their description of the process suggests that an accidental factor, the coincidence of several strong-willed editors, also mattered. For the classic collection of papers criticizing significance tests in psychology see Morrison and Hankel (1970) [The Significance Test Controversy: A Reader], and for a more recent collection of papers see Harlow et al. (1997) [What If There Were No Significance Tests?]. Nickerson (2000) provides a comprehensive survey of this literature.

0.1 Further reading

66 Comments

gwern (10y)

"Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence", McShane & Gal 2015

Statistical training helps individuals analyze and interpret data. However, the emphasis placed on null hypothesis significance testing in academic training and reporting may lead researchers to interpret evidence dichotomously rather than continuously. Consequently, researchers may either disregard evidence that fails to attain statistical significance or undervalue it relative to evidence that attains statistical significance. Surveys of researchers across a wide variety of fields (including medicine, epidemiology, cognitive science, psychology, business, and economics) show that a substantial majority does indeed do so. This phenomenon is manifest both in researchers’ interpretations of descriptions of evidence and in their likelihood judgments. Dichotomization of evidence is reduced though still present when researchers are asked to make decisions based on the evidence, particularly when the decision outcome is personally consequential. Recommendations are offered.

...Formally defined as the probability of observing data as extreme or more extreme than that actually observed assuming the null hypothesis is true, the p-value has often been misinterpreted as, inter alia, (i) the probability that the null hypothesis is true, (ii) one minus the probability that the alternative hypothesis is true, or (iii) one minus the probability of replication (Bakan 1966, Sawyer and Peter 1983, Cohen 1994, Schmidt 1996, Krantz 1999, Nickerson 2000, Gigerenzer 2004, Kramer and Gigerenzer 2005).

...As an example of how dichotomous thinking manifests itself, consider how Messori et al. (1993) compared their findings with those of Hommes et al. (1992):

The result of our calculation was an odds ratio of 0.61 (95% CI [confidence interval]: 0.298–1.251; p>0.05); this figure differs greatly from the value reported by Hommes and associates (odds ratio: 0.62; 95% CI: 0.39–0.98; p<0.05)...we concluded that subcutaneous heparin is not more effective than intravenous heparin, exactly the opposite to that of Hommes and colleagues. (p. 77)

In other words, Messori et al. (1993) conclude that their findings are “exactly the opposite” of Hommes et al. (1992) because their odds ratio estimate failed to attain statistical significance whereas that of Hommes et al. attained statistical significance. In fact, however, the odds ratio estimates and confidence intervals of Messori et al. and Hommes et al. are highly consistent (for additional discussion of this example and others, see Rothman et al. 1993 and Healy 2006).
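The consistency of the two results can be checked directly from the reported numbers. A rough sketch: recover each log odds ratio and its standard error from the published 95% interval, then test the difference between the two estimates. It assumes the intervals are symmetric on the log scale and that the two estimates are independent, which may not hold if the studies share underlying trials; the helper name is hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

def log_or_and_se(odds_ratio, ci_low, ci_high, level=0.95):
    """Log odds ratio and its standard error, backed out of a CI."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    se = (log(ci_high) - log(ci_low)) / (2.0 * z)
    return log(odds_ratio), se

# Figures as quoted from Messori et al. (1993) and Hommes et al. (1992).
messori_est, messori_se = log_or_and_se(0.61, 0.298, 1.251)
hommes_est, hommes_se = log_or_and_se(0.62, 0.39, 0.98)

# z-test for the difference between the two log odds ratios.
diff = messori_est - hommes_est
se_diff = sqrt(messori_se ** 2 + hommes_se ** 2)
z_diff = diff / se_diff
p_diff = 2.0 * (1.0 - NormalDist().cdf(abs(z_diff)))

print(f"difference in log OR = {diff:.3f} (SE {se_diff:.3f})")
print(f"z = {z_diff:.2f}, p = {p_diff:.2f}")
```

The difference between the two estimates is a tiny fraction of its standard error, so on this calculation there is no evidence that the studies disagree at all. The later one is simply less precise.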

Graph of how a p-value crossing a threshold dramatically increases choosing that option, regardless of effect size: http://andrewgelman.com/wp-content/uploads/2016/04/Screen-Shot-2016-04-06-at-3.03.29-PM-1024x587.png

via Gelman:

In a forthcoming paper, my colleague David Gal and I survey top academics across a wide variety of fields including the editorial board of Psychological Science and authors of papers published in the New England Journal of Medicine, the American Economic Review, and other top journals. We show:

Researchers interpret p-values dichotomously (i.e., focus only on whether p is below or above 0.05).

They fixate on them even when they are irrelevant (e.g., when asked about descriptive statistics).

These findings apply to likelihood judgments about what might happen to future subjects as well as to choices made based on the data.

We also show they ignore the magnitudes of effect sizes.
