Against NHST
A summary of standard non-Bayesian criticisms of common frequentist statistical practices, with pointers into the academic literature.
Frequentist statistics is a wide field, but as practiced by innumerable psychologists, biologists, economists, etc., frequentism tends to be a particular style called “Null Hypothesis Significance Testing” (NHST) descended from R.A. Fisher (as opposed to, e.g., Neyman-Pearson), which is focused on
- setting up a null hypothesis and an alternative hypothesis
- calculating a p-value (possibly via a t-test or more complex alternatives like ANOVA)
- and rejecting the null hypothesis if the p-value falls below an arbitrary threshold.
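The whole ritual fits in a few lines of code. A minimal sketch in Python, using a normal approximation to the t-test and made-up numbers:

```python
import math
from statistics import NormalDist, mean, stdev

def two_sample_p_value(xs, ys):
    """Two-sided p-value for the null hypothesis 'equal means', using a
    normal approximation to the t-test (adequate for larger samples)."""
    se = math.sqrt(stdev(xs)**2 / len(xs) + stdev(ys)**2 / len(ys))
    z = (mean(xs) - mean(ys)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The NHST ritual: reject the null if p falls below the customary 0.05.
control   = [5.1, 4.9, 5.0, 5.2, 4.8, 5.1, 5.0, 4.9, 5.3, 4.7]
treatment = [5.6, 5.4, 5.7, 5.3, 5.8, 5.5, 5.6, 5.4, 5.9, 5.2]
p = two_sample_p_value(treatment, control)
print(p < 0.05)  # the ritual's only output: "significant" or not
```

Note how little the procedure reports: no effect size, no interval, no loss function, just a binary verdict against a conventional cutoff.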
NHST became nearly universal between the 1940s & 1960s (see Gigerenzer 2004, pg. 18), and has been heavily criticized for just as long. Frequentists criticize it for:

practitioners & statistics teachers misinterpret the meaning of a p-value (LessWrongers too); Cohen on this persistent illusion:
What’s wrong with NHST? Well, among other things, it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does! What we want to know is, “Given these data, what is the probability that H_{0} is true?” But as most of us know, what it tells us is “Given that H_{0} is true, what is the probability of these (or more extreme) data?” These are not the same…
(This misunderstanding is incredibly widespread; once you understand it, you'll see it everywhere. I can't count how many times I have seen a comment or blog explaining that a p=0.05 means "the probability of the null hypothesis not being true is 95%", in many different variants.)
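The gap between Pr(data | H0) and Pr(H0 | data) is easy to demonstrate by simulation. In this toy Monte Carlo (the base rate of true nulls, effect size, and sample size are all invented for illustration), we condition on "significance" and ask how often the null was actually true:

```python
import random
from statistics import NormalDist

random.seed(0)
norm = NormalDist()
n, sigma = 25, 1.0          # per-experiment sample size, known sd
alpha, effect = 0.05, 0.3   # significance cutoff; true effect when H0 is false
true_null_rate = 0.8        # assume 80% of tested hypotheses are in fact null

sig_and_null = sig_total = 0
for _ in range(200_000):
    null_true = random.random() < true_null_rate
    mu = 0.0 if null_true else effect
    xbar = random.gauss(mu, sigma / n**0.5)   # observed sample mean
    z = xbar / (sigma / n**0.5)
    p = 2 * (1 - norm.cdf(abs(z)))            # two-sided p-value
    if p < alpha:
        sig_total += 1
        sig_and_null += null_true

print(sig_and_null / sig_total)  # Pr(H0 true | p < .05): far above 5%
```

Under these (arbitrary) assumptions, a substantial fraction of "significant" results come from true nulls, so p < 0.05 is nothing like "95% probability the null is false".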

cargo-culting the use of 0.05 as an accept/reject threshold based on historical accident & custom (rather than using a loss function chosen through decision theory to set the threshold based on the cost of false positives). Similarly, the cargo-culting encourages misuse of two-tailed tests, avoidance of multiple-comparison correction, data dredging, and, in general, “p-value hacking”.
- failing to compare many possible hypotheses or models, limiting themselves to one (sometimes ill-chosen or absurd) null hypothesis and one alternative
- deprecating the value of exploratory data analysis and of depicting data graphically (see, for example, Anscombe’s quartet)
- ignoring the more important summary statistic of “effect size”
- ignoring the more important summary statistic of confidence intervals; this is related to how use of p-values leads to ignorance of the statistical power of a study: a small study may have only a small chance of detecting an effect if it exists, yet turn in misleadingly good-looking p-values

because null hypothesis tests cannot accept the alternative, but only reject a null, they inevitably produce false alarms upon repeated testing
(An example from my personal experience of the cost of ignoring effect size and confidence intervals: p-values cannot (easily) be used to compile a meta-analysis (pooling of multiple studies); hence, studies often do not include the necessary information about means, standard deviations, or effect sizes & confidence intervals which one could use directly. So authors must be contacted, and they may refuse to provide the information or may no longer be reachable; both have happened to me in trying to do my dual n-back & iodine meta-analyses.)
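One concrete version of the repeated-testing problem: test after every batch of data and stop as soon as p < 0.05 ("optional stopping"). A toy simulation (my own illustrative parameters) shows the false-alarm rate ballooning past the nominal 5%:

```python
import random
from statistics import NormalDist

random.seed(1)
norm = NormalDist()

def peeking_trial(max_n=500, check_every=10, alpha=0.05):
    """Simulate one experiment on pure noise (the null is true by
    construction), testing after every batch of 10 observations and
    stopping as soon as p < alpha."""
    total = 0.0
    for i in range(1, max_n + 1):
        total += random.gauss(0, 1)
        if i % check_every == 0:
            z = total / i**0.5                    # z-statistic of the running mean
            if 2 * (1 - norm.cdf(abs(z))) < alpha:
                return True                       # a false alarm
    return False

false_alarms = sum(peeking_trial() for _ in range(2000)) / 2000
print(false_alarms)  # well above the nominal 5%
```

Each individual peek has a 5% false-positive rate, but given enough peeks at noise, some peek will eventually cross the threshold.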
Critics’ explanations for why so flawed a paradigm remains so popular focus on its ease of use and its very weakness; from Gigerenzer 2004:
Hays (1963) had a chapter on Bayesian statistics in the second edition of his widely read textbook but dropped it in the subsequent editions. As he explained to one of us (GG), he dropped the chapter upon pressure from his publisher to produce a statistical cookbook that did not hint at the existence of alternative tools for statistical inference. Furthermore, he believed that many researchers are not interested in statistical thinking in the first place but solely in getting their papers published (Gigerenzer, 2000)…When Loftus (1993) became the editor of Memory & Cognition, he made it clear in his editorial that he did not want authors to submit papers in which p, t, or F-values are mindlessly being calculated and reported. Rather, he asked researchers to keep it simple and report figures with error bars, following the proverb that “a picture is worth more than a thousand p-values.” We admire Loftus for having had the courage to take this step. Years after, one of us (GG) asked Loftus about the success of his crusade against thoughtless significance testing. Loftus bitterly complained that most researchers actually refused the opportunity to escape the ritual. Even when he asked in his editorial letter to get rid of dozens of p-values, the authors insisted on keeping them in. There is something deeply engrained in the minds of many researchers that makes them repeat the same action over and over again.
Shifts away from NHST have happened in some fields. Medical testing seems to have made such a shift (I suspect due to the rise of meta-analysis):
Fidler et al. (2004b, 626) explain the spread of the reform in part by a shift from testing to estimation that was facilitated by the medical literature, unlike psychology, using a common measurement scale, to “strictly enforced editorial policy, virtually simultaneous reforms in a number of leading journals, and the timely rewriting [of] textbooks to fit with policy recommendations.” But their description of the process suggests that an accidental factor, the coincidence of several strong-willed editors, also mattered. For the classic collection of papers criticizing significance tests in psychology see Morrison and Henkel (1970) [The Significance Test Controversy: A Reader], and for a more recent collection of papers see Harlow et al. (1997) [What If There Were No Significance Tests?]. Nickerson (2000) provides a comprehensive survey of this literature.
Further reading
More on these topics:
- Cohen, “The Earth Is Round (p < .05)” (recommended)
- Effect size FAQ (published as The Essential Guide to Effect Sizes, Ellis)
- “The Higgs Boson at 5 Sigmas”
- The Cult of Statistical Significance, Ziliak & McCloskey 2008; criticism, their reply

“Bayesian estimation supersedes the t test”, Kruschke 2012 (see also Doing Bayesian Data Analysis); an exposition of a Bayesian paradigm, simulation of false alarm performance compared to his Bayesian code; an excerpt:
The perils of NHST, and the merits of Bayesian data analysis, have been expounded with increasing force in recent years (e.g., W. Edwards, Lindman, & Savage, 1963; Kruschke, 2010b, 2010a, 2011c; Lee & Wagenmakers, 2005; Wagenmakers, 2007).

Unfortunately, I seem to have lost my source for the following quote, but it is a useful bibliography anyway:
Although the primary emphasis in psychology is to publish results on the basis of NHST (Cumming et al., 2007; Rosenthal, 1979), the use of NHST has long been controversial. Numerous researchers have argued that reliance on NHST is counterproductive, due in large part because p values fail to convey such useful information as effect size and likelihood of replication (Clark, 1963; Cumming, 2008; Killeen, 2005; Kline, 2009 [Becoming a behavioral science researcher: A guide to producing research that matters]; Rozeboom, 1960). Indeed, some have argued that NHST has severely impeded scientific progress (Cohen, 1994; Schmidt, 1996) and has confused interpretations of clinical trials (Cicchetti et al., 2011; Ocana & Tannock, 2011). Some researchers have stated that it is important to use multiple, converging tests alongside NHST, including effect sizes and confidence intervals (Hubbard & Lindsay, 2008; Schmidt, 1996). Others still have called for NHST to be completely abandoned (e.g., Carver, 1978).

http://www.gwern.net/DNB%20FAQ#flawsinmainstreamscienceandpsychology
Comments (57)
Can you give me a concrete course of action to take when I am writing a paper reporting my results? Suppose I have created two versions of a website, and timed 30 people completing a task on each website. The people on the second website were faster. I want my readers to believe that this wasn't merely a statistical coincidence. Normally, I would do a t-test to show this. What are you proposing I do instead? I don't want a generalization like "use Bayesian statistics," but a concrete example of how one would test the data and report it in a paper.
You could use Bayesian estimation to compute credible differences in mean task completion time between your groups.
Described in excruciating detail in this pdf.
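For a concrete (if toy) version of that suggestion: under a normal model with flat priors, the posterior for the difference in means has a closed form in the large-sample limit, so a credible interval takes only a few lines. This is a sketch with invented timing data, not Kruschke's actual BEST model (which uses a t-distributed likelihood and MCMC):

```python
from statistics import NormalDist, mean, stdev

def credible_difference(xs, ys, level=0.95):
    """Posterior credible interval for the difference in means under a
    normal model with flat priors (a large-sample shortcut, not the
    full BEST model)."""
    se = (stdev(xs)**2 / len(xs) + stdev(ys)**2 / len(ys)) ** 0.5
    diff = mean(xs) - mean(ys)
    z = NormalDist().inv_cdf((1 + level) / 2)
    return diff - z * se, diff + z * se

# Hypothetical task-completion times (seconds) on the two site versions:
site_a = [41, 38, 45, 50, 39, 44, 47, 36, 42, 40]
site_b = [33, 35, 30, 38, 31, 36, 29, 34, 32, 37]
lo, hi = credible_difference(site_a, site_b)
print(f"site A slower by {lo:.1f} to {hi:.1f} seconds (95% credible)")
```

The paper would then report the estimated difference and its credible interval (an effect size with uncertainty), rather than a bare accept/reject verdict.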
Perhaps you would suggest showing the histograms of completion times on each site, along with the 95% confidence error bars?
Presumably not actually 95%, but, as gwern said, a threshold based on the cost of false positives.
Yes, in this case you could keep using p-values (if you really wanted to...), but with reference to the value of, say, each customer. (This is what I meant by setting the threshold with respect to decision theory.) If the goal is to use it on a site making millions of dollars*, 0.01 may be too loose a threshold, but if he's just messing with his personal site to help readers, a p-value like 0.10 may be perfectly acceptable.
* If the results were that important, I think there'd be better approaches than a once-off A/B test. Adaptive multi-armed bandit algorithms sound really cool from what I've read of them.
I'd suggest more of a scattergram than a histogram; superimposing 95% CIs would then cover the exploratory data/visualization & confidence intervals. Combine that with an effect size and one has made a good start.
I think (hope?) most people already realize NHST is terrible. I would be much more interested in hearing if there were an equally easy-to-use alternative without any baggage (preferably not requiring priors?)
NHST has been taught as The Method Of Science to lots of students. I remember setting these up explicitly in science class. I expect it will remain in the fabric of any given quantitative field until removed with force.
If you're right that that's how science works then that should make you distrustful of science. If they deserve any credibility, scientists must have some process by which they drop bad truthfinding methods instead of repeating them out of blind tradition. Do you believe scientific results?
Plenty of otherwise-good science is done based on poor statistics. Keep in mind, there are tons and tons of working scientists, and they're already pretty busy just trying to understand the content of their fields. Many are likely to view improved statistical methods as an unneeded step in getting a paper published. Others are likely to view overthrowing NHST as a good idea, but not something that they themselves have the time or energy to do. Some might repeat it out of "blind tradition", but keep in mind that the "blind tradition" is an expensive-to-move Schelling point in a very complex system.
I do expect that serious scientific fields will, eventually, throw out NHST in favor of more fundamentally sound statistical analyses. But, like any social change, it'll probably take decades at least.
Unconditionally? No, and neither should you. Beliefs don't work that way.
If a scientific paper gives a fundamentally sound statistical analysis of the effect it purports to prove, I'll give it more credence than a paper rejecting the null hypothesis at p < 0.05. On the other hand, a study rejecting the null hypothesis at p < 0.05 is going to provide far more useful information than a small collection of anecdotes, and both are probably better than my personal intuition in a field I have no experience with.
I should have said, "do you believe any scientific results?"
To clarify, I wasn't saying that maybe you shouldn't believe scientific results because they use NHST specifically. I meant that if you think that scientists tend to stick with bad methods for decades then NHST probably isn't the only bad method they're using.
As you say though, NHST is helpful in many cases even if other methods might be more helpful. So I guess it doesn't say anything that awful about the way science works.
Confidence intervals.
p < .05 means that the null hypothesis is excluded from the 95% confidence interval. Thus there is no political cost, and every p-value recipe is a fragment of an existing confidence-interval recipe.
added: also, the maximum likelihood estimate is a single number that is closely related to confidence intervals, but I don't know if it is sufficiently well-known among statistically-ignorant scientists to avoid controversy.
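That duality is easy to check numerically; a sketch with a simple z-test and illustrative numbers:

```python
from statistics import NormalDist

norm = NormalDist()

def z_test_p(xbar, mu0, se):
    """Two-sided p-value for H0: mean == mu0 (normal model, known se)."""
    return 2 * (1 - norm.cdf(abs((xbar - mu0) / se)))

def ci(xbar, se, level=0.95):
    """Confidence interval for the mean at the given level."""
    z = norm.inv_cdf((1 + level) / 2)
    return xbar - z * se, xbar + z * se

# The duality: p < .05 exactly when mu0 = 0 falls outside the 95% CI.
for xbar in [0.1, 0.2, 0.3, 0.4, 0.5]:
    lo, hi = ci(xbar, se=0.15)
    p = z_test_p(xbar, mu0=0.0, se=0.15)
    agree = (p < 0.05) == (not lo <= 0 <= hi)
    print(f"xbar={xbar}: CI=({lo:.2f}, {hi:.2f}), p={p:.3f}, agree={agree}")
```

The CI carries strictly more information: it reports the whole range of plausible effect sizes, of which "does it exclude zero?" is only one bit.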
This might be a good place to note that full Bayesianism is getting easier to practice in statistics. Fully Bayesian analysis has been tough for many models because it is computationally difficult: standard MCMC methods often don't scale well, so you can only fit models with few parameters.
However, there are at least two statistical libraries, Stan and PyMC3 (which I help out with), which implement Hamiltonian Monte Carlo (which scales well) and provide an easy language for model building. This allows you to fit relatively complex models without thinking too much about how to do it.
Join the revolution!
"Power failure: why small sample size undermines the reliability of neuroscience", Button et al 2013:
Learned a new term:
One of the interesting, and still counterintuitive to me, aspects of power/beta is how it also changes the number of fake findings; typically, people think that must be governed by the p-value or alpha ("an alpha of 0.05 means that of the positive findings, only 1 in 20 will be falsely thrown up by chance!"), but no:
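The arithmetic behind this is short; a sketch in the spirit of Ioannidis and Colquhoun (the base-rate and power numbers are illustrative):

```python
def false_discovery_rate(alpha, power, prior_true):
    """Fraction of 'significant' results that are false alarms, given the
    significance cutoff alpha, the power against real effects, and the
    fraction of tested hypotheses that are actually real."""
    false_pos = alpha * (1 - prior_true)   # true nulls crossing the threshold
    true_pos = power * prior_true          # real effects detected
    return false_pos / (false_pos + true_pos)

# Same alpha = 0.05 throughout, but low power multiplies the fake findings:
for power in [0.8, 0.5, 0.2]:
    fdr = false_discovery_rate(alpha=0.05, power=power, prior_true=0.1)
    print(f"power={power}: {fdr:.0%} of significant findings are false")
```

Holding alpha fixed, halving the power roughly doubles the share of false findings among the "discoveries", because fewer real effects survive to dilute the constant stream of false positives.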
The actual strategy is the usual trick in meta-analysis: you take effects which have been studied enough to be meta-analyzed, take the meta-analytic result as the 'true' ground result, and reanalyze other results with that as the baseline. (I mention this because in some of the blogs this seemed to come as news, but as far as I know it's a perfectly ordinary approach.) This usually turns up depressing results, but actually it's not that bad: it's worse:
Not mentioned, amusingly, are the concerns about applying research to humans:
Oh great, researchers are going to end up giving this all sorts of names. Joseph Banks Rhine called it the decline effect, while Yitzhak Rabin* calls it the Truth Wears Off effect (after the Jonah Lehrer article). And now we have the Proteus phenomenon. Clearly, I need to write a paper declaring my discovery of the It Was Here, I Swear! effect.
* Not that one.
Make sure you cite my paper "Selection Effects and Regression to the Mean In Published Scientific Studies"
"Do We Really Need the Sword?" (American Scientist) covers many of the same points. I enjoyed one anecdote:
Via http://www.scottbot.net/HIAL/?p=24697 I learned that Wikipedia actually has a good roundup of misunderstandings of p-values:
http://library.mpibberlin.mpg.de/ft/gg/GG_Null_2004.pdf
"Do Studies of Statistical Power Have an Effect on the Power of Studies?", Sedlmeier & Gigerenzer 1989
Typo?
If you run together an enumerated list like that, of course it looks weird...
"Accepting the null if the threshold is passed." Not rejecting?
Oh. Yeah, should be rejecting.
Should this be:
(deprecating the value of exploratory data analysis) and (depicting data graphically)
or
deprecating [(the value of exploratory data analysis) and (depicting data graphically)]?
ETA: Also, very nice article! I'm glad that you point out that NHST is only a small part of frequentist statistics.
I guess the latter.
And yet even you, who are more against frequentist statistics than most (given that you are writing this, among other things, on the topic), inevitably use the frequentist tools. What I'd be interested in is a good, short (as short as it can be) summary of what methods should be followed to remove as many of the problems of frequentist statistics as possible, with properly defined cutoffs for p-values and everything else: where we can fully adopt Bayes, where we can minimize the problems of the frequentist tools, and so on. You know, something I could use on its own to interpret the data if I were to conduct an experiment today in the way that currently seems best.
No, I don't. My self-experiments have long focused on effect sizes (an emphasis which is very easy to adopt without disruptive changes), and I have been using BEST as a replacement for t-tests for a while, only including an occasional t-test as a safety blanket for my frequentist readers.
If non-NHST frequentism or even full Bayesianism were taught as much as NHST and as well supported by software like R, I don't think it would be much harder to use.
I can't find BEST (as a statistical test or similar...) on Google. What test do you refer to?
http://www.indiana.edu/~kruschke/BEST/
That'd be essentially Bayesianism with the (uninformative improper) priors (uniform for location parameters and logarithms of scale parameters) swept under the rug, right?
Not at all (I wrote a post refuting this a couple months ago but can't link it from my phone)
http://lesswrong.com/lw/f7t/beyond_bayesians_and_frequentists/ I presume.
Thanks!
I really couldn't presume to say.
'Frequentist tools' are common approximations, loaded with sometimesapplicable interpretations. A Bayesian can use the same approximation, even under the same name, and yet not be diving into Frequentism.
Another problem with NHST in particular: the choice of a null and a null distribution is itself a modeling assumption, but is rarely checked, and in real-world datasets, it's entirely possible for the null distribution to be much more extreme than assumed, and hence the nominal alpha/false-positive-conditional-on-null error rates are incorrect & too forgiving. Two links on that:
"Theorytesting in psychology and physics: a methodological paradox" (Meehl 1967; excerpts) makes an interesting argument: because NHST encourages psychologists to frame their predictions in directional terms (nonzero point estimates) and because everything is correlated with everything (see Cohen), the possible amount of confirmation for any particular psychology theory compared to a 'random theory'  which predicts the sign at random  is going to be very limited.
"Robust misinterpretation of confidence intervals", Hoekstra et al 2014
"Reflections on methods of statistical inference in research on the effect of safety countermeasures", Hauer 1983; and "The harm done by tests of significance", Hauer 2004 (excerpts):
(These deadly examples obviously lend themselves to Bayesian critique, but could just as well be classified by a frequentist under several of the rubrics in OP: under failures to adjust thresholds based on decision theory, and failure to use metaanalysis or other techniques to pool data and turn a collection of nonsignificant results into a significant result.)
If the papers in the OP and comments are not enough reading material, there's many links and citations in http://stats.stackexchange.com/questions/10510/whataregoodreferencescontainingargumentsagainstnullhypothesissignifican (which is only partially redundant with this page, skimming).
"P Values are not Error Probabilities", Hubbard & Bayarri 2003
An interesting bit:
Lengthier excerpts.
"Science or Art? How Aesthetic Standards Grease the Way Through the Publication Bottleneck but Undermine Science", GinerSorolla 2012 has some good quotes.
"Not Even Scientists Can Easily Explain Pvalues"
Why not? Most people misunderstand it, but in the frequentist framework its actual meaning is quite straightforward.
A definition is not a meaning, in the same way the meaning of a hammer is not 'a long piece of metal with a round bit at one end'.
Everyone with a working memory can define a p-value, as indeed Goodman and the others can, but what does it mean?
What kind of answer, other than philosophical deepities, would you expect in response to "...but what does it mean"? Meaning almost entirely depends on the subject and the context.
Is the meaning of a hammer describing its role and use, as opposed to a mere definition describing some physical characteristics, really a 'philosophical deepity'?
When you mumble some jargon about 'the frequency of a class of outcomes in sampling from a particular distribution', you may have defined a p-value, but you have not given a meaning. It is numerology if left there, some gematriya played with distributions. You have not given any reason to care whatsoever about this particular arbitrary construct, or explained what a p = 0.04 vs a p = 0.06 means, or why any of this is important, or what you should do upon seeing one p-value rather than another, or explained what other people value about it or how it affects beliefs about anything. (Maybe you should go back and reread the Sequences, particularly the ones about words.)
Okay, stupid question :/
But
Aren't these basically the same? Can't you paraphrase them both as "the probability that you would get this result if your hypothesis were wrong"? Am I failing to understand what they mean by 'direct information'? Or am I being overly binary in assuming that the hypothesis and the null hypothesis are the only two possibilities?
Not at all. To quote Andrew Gelman,
Also see more of Gelman on the same topic.
What p-values actually mean:
What they're commonly taken to mean?
That is, p-values measure Pr(observations | null hypothesis), whereas what you want is more like Pr(alternative hypothesis | observations).
(Actually, what you want is more like a probability distribution for the size of the effect; that's the "overly binary" thing, but never mind that for now.)
So what are the relevant differences between these?
If your null hypothesis and alternative hypothesis are one another's negations (as they're supposed to be) then you're looking at the relationship between Pr(A|B) and Pr(B|A). These are famously related by Bayes' theorem, but they are certainly not the same thing. We have Pr(A|B) = Pr(A&B)/Pr(B) and Pr(B|A) = Pr(A&B)/Pr(A), so the ratio between the two is the ratio of the probabilities of A and B. So, e.g., suppose you are interested in ESP and you do a study on precognition or something whose result has a p-value of 0.05. If your priors are like mine, your estimate of Pr(precognition) will still be extremely small, because precognition is (in advance of the experimental evidence) much more unlikely than just randomly getting however many correct guesses it takes to get a p-value of 0.05.
In practice, the null hypothesis is usually something like "X = Y" or "X <= Y". Then your alternative is "X != Y" or "X > Y". But in practice what you actually care about is that X and Y are substantially unequal, or X is substantially bigger than Y, and that's probably the alternative you actually have in mind even if you're doing statistical tests that just accept or reject the null hypothesis. So a small p-value may come from a very carefully measured difference that's too small to care about. E.g., suppose that before you do your precognition study you think (for whatever reason) that precog is about as likely to be real as not. Then after the study results come in, you should in fact think it's probably real. But if you then think "aha, time to book my flight to Las Vegas" you may be making a terrible mistake even if you're right about precognition being real. Because maybe your study looked at someone predicting a million die rolls and they got 500 more right than you'd expect by chance; that would be very exciting scientifically but probably useless for casino gambling because it's not enough to outweigh the house's advantage.
[EDITED to fix a typo and clarify a bit.]
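The role of the prior in that ESP example can be put in a few lines of arithmetic (the 20:1 likelihood ratio in favor of precognition is an arbitrary illustrative number, chosen generously):

```python
def posterior(prior, p_data_given_h1, p_data_given_h0):
    """Bayes' theorem: Pr(H1 | data) from a prior and two likelihoods."""
    num = prior * p_data_given_h1
    return num / (num + (1 - prior) * p_data_given_h0)

# Treat a p = 0.05 result as (generously) 20x likelier under precognition:
likelihood_ratio = 20
for prior in [0.5, 1e-6]:
    post = posterior(prior, p_data_given_h1=likelihood_ratio * 0.05,
                     p_data_given_h0=0.05)
    print(f"prior={prior}: posterior={post:.6f}")
```

With an even-odds prior, the same evidence is nearly conclusive; with a sceptic's prior, the posterior barely moves off zero. The p-value alone determines neither.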
Thank you  I get it now.
"An investigation of the false discovery rate and the misinterpretation of pvalues", Colquhoun 2014; basically a more extended tutoriallike version of Ioannides's 'Why Most Published Findings are False', putting more emphasis on working through the cancerscreening metaphor to explain why a p<0.05 is much less impressive than it looks & has such high error rates.
No one understands p-values: "Unfounded Fears: The Great Power-Line Cover-Up Exposed", IEEE 1996, on the electricity/cancer panic (emphasis added to the parts clearly committing the misunderstanding of interpreting p-values as having anything at all to do with the probability of a fact or with subjective beliefs):
No one understands p-values, not even the ones who use Bayesian methods in their other work... From "When Is Evidence Sufficient?", Claxton et al 2005:
Another fun one is a piece which quotes someone making the classic misinterpretation and then someone else immediately correcting them. From "Drug Trials: Often Long On Hype, Short on Gains; The delusion of ‘significance’ in drug trials":
Also fun, "You do not understand what a p-value is (p < 0.001)":
Another entry from the 'no one understands p-values' files; "Policy: Twenty tips for interpreting scientific claims", Sutherland et al 2013, Nature. There's a lot to like in this article, and it's definitely worth remembering most of the 20 tips, except for the one on p-values:
Whups. p=0.01 does not mean our subjective probability that the effect is zero is now just 1%, and there's a 99% chance the effect is nonzero.
(The Bayesian probability could be very small or very large depending on how you set it up; if your prior is small, then data with p=0.01 will not shift your probability very much, for exactly the reason Sutherland et al 2013 explains in their section on base rates!)