Via http://www.scottbot.net/HIAL/?p=24697 I learned that Wikipedia actually has a good roundup of misunderstandings of p-values:
The p-value does not in itself allow reasoning about the probabilities of hypotheses; this requires multiple hypotheses or a range of hypotheses, with a [prior distribution][1] of likelihoods between them, as in [Bayesian statistics][2], in which case one uses a [likelihood function][3] for all possible values of the prior, instead of the p-value for a single null hypothesis.
The p-value refers only to a single hypothesis, called the null hypothesis, and does not make reference to or allow conclusions about any other hypotheses, such as the [alternative hypothesis][4] in Neyman–Pearson [statistical hypothesis testing][5]. In that approach one instead has a decision function between two alternatives, often based on a [test statistic][6], and one computes the rate of [Type I and type II errors][7] as α and β. However, the p-value of a test statistic cannot be directly compared to these error rates α and β – instead it is fed into a decision function.
There are several common misunderstandings about p-values.[[16]][8][[17]][9]
- The p-value is not the probability that the null hypothesis is true, nor is it the probability that the alternative hypothesis is false – it is not connected to either of these.
In fact, [frequentist statistics][10] does not, and cannot, attach probabilities to hypotheses. Comparison of [Bayesian][11] and classical approaches shows that a p-value can be very close to zero while the [posterior probability][12] of the null is very close to unity (if there is no alternative hypothesis with a large enough a priori probability and which would explain the results more easily). This is [Lindley's paradox][13]. But there are also a priori probability distributions where the [posterior probability][12] and the p-value have similar or equal values.[[18]][14]
- The p-value is not the probability that a finding is "merely a fluke."
As the calculation of a p-value is based on the assumption that a finding is the product of chance alone, it patently cannot also be used to gauge the probability of that assumption being true. This is different from the real meaning which is that the p-value is the chance of obtaining such results if the null hypothesis is true.
- The p-value is not the probability of falsely rejecting the null hypothesis. This error is a version of the so-called [prosecutor's fallacy][15].
- The p-value is not the probability that a replicating experiment would not yield the same conclusion. Quantifying the replicability of an experiment was attempted through the concept of [p-rep][16] (which is heavily [criticized][17])
- The significance level, such as 0.05, is not determined by the p-value.
Rather, the significance level is decided before the data are viewed, and is compared against the p-value, which is calculated after the test has been performed. (However, reporting a p-value is more useful than simply saying that the results were or were not significant at a given level, and allows readers to decide for themselves whether to consider the results significant.)
- The p-value does not indicate the size or importance of the observed effect (compare with [effect size][18]). The two do vary together however – the larger the effect, the smaller sample size will be required to get a significant p-value.
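To make the last point in that list concrete, here is a minimal sketch of how the p-value mixes effect size and sample size together. It assumes a two-sided z-test on the difference between two equal-sized groups with known unit variance; the function names and the particular effect sizes and sample sizes are mine, chosen purely for illustration.

```python
import math

def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def z_test_p(d: float, n_per_group: int) -> float:
    """p-value for an observed standardized mean difference d between two
    equal-sized groups, assuming known unit variance (a z-test)."""
    # The standard error of the difference in means is sqrt(2 / n).
    z = d * math.sqrt(n_per_group / 2)
    return two_sided_p_from_z(z)

# A large effect in a small study versus a trivial effect in a huge study
# (hypothetical numbers).
p_big_effect_small_n = z_test_p(d=0.5, n_per_group=10)
p_tiny_effect_huge_n = z_test_p(d=0.01, n_per_group=1_000_000)

print(f"d=0.50, n=10 per group:        p = {p_big_effect_small_n:.3f}")
print(f"d=0.01, n=1,000,000 per group: p = {p_tiny_effect_huge_n:.2e}")
```

Under these assumptions the half-standard-deviation effect is "not significant" while the hundredth-of-a-standard-deviation effect is overwhelmingly so, which is why a p-value on its own says little about importance.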
Frequentist statistics is a wide field, but as practiced by innumerable psychologists, biologists, economists, etc., frequentism tends to be a particular style called “Null Hypothesis Significance Testing” (NHST), descended from R.A. Fisher (as opposed to e.g. Neyman-Pearson), which is focused on
NHST became nearly universal between the 1940s & 1960s (see Gigerenzer 2004, pg18), and has been heavily criticized for as long. Frequentists criticize it for:
What’s wrong with NHST? Well, among other things, it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does! What we want to know is, “Given these data, what is the probability that H0 is true?” But as most of us know, what it tells us is “Given that H0 is true, what is the probability of these (or more extreme) data?” These are not the same…
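The gap between "the probability of the data given H0" and "the probability of H0 given the data" can be sketched numerically. The following is a rough illustration in the spirit of Lindley's paradox, not anything taken from the quoted authors: it assumes binomial data, a point null of θ = 0.5, a uniform prior on θ under the alternative, and equal prior odds on the two hypotheses. The counts are illustrative, and the p-value uses a normal approximation.

```python
import math

def log_binom_pmf(k: int, n: int, theta: float) -> float:
    """Log of the binomial probability of k successes in n trials."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(theta) + (n - k) * math.log(1 - theta)

def two_sided_p_normal_approx(k: int, n: int) -> float:
    """Two-sided p-value for H0: theta = 0.5, via the normal approximation."""
    mean, sd = n / 2, math.sqrt(n) / 2
    z = (k - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))

def posterior_h0(k: int, n: int, prior_h0: float = 0.5) -> float:
    """P(H0 | data) when H1 puts a uniform prior on theta over [0, 1]."""
    like_h0 = math.exp(log_binom_pmf(k, n, 0.5))
    # Integrating the binomial likelihood over a uniform prior gives 1 / (n + 1).
    like_h1 = 1.0 / (n + 1)
    return prior_h0 * like_h0 / (prior_h0 * like_h0 + (1 - prior_h0) * like_h1)

# Illustrative counts: successes out of a very large number of trials.
k, n = 49_581, 98_451
p_value = two_sided_p_normal_approx(k, n)
post_h0 = posterior_h0(k, n)

print(f"p-value (data or more extreme, given H0): {p_value:.4f}")
print(f"posterior probability of H0 given data:   {post_h0:.3f}")
```

With these assumptions the null is "rejected" at the 0.05 level even though its posterior probability is above 0.9. A different prior under the alternative would give a different posterior, which is exactly why the p-value cannot stand in for it.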
Similarly, the cargo-culting encourages misuse of two-tailed tests, avoidance of multiple-comparison correction, data dredging, and in general, “p-value hacking”.
(An example from my personal experience of the cost of ignoring effect size and confidence intervals: p-values cannot (easily) be used to compile a meta-analysis (pooling of multiple studies); hence, studies often do not include the necessary information about means, standard deviations, or effect sizes & confidence intervals which one could use directly. So authors must be contacted, and they may refuse to provide the information or they may no longer be available; both have happened to me in trying to do my dual n-back & iodine meta-analyses.)
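To show why the summary statistics matter, here is a minimal sketch of a fixed-effect (inverse-variance) pooling of standardized mean differences. It is only meant to show which inputs a meta-analysis needs, and it is not the procedure used in either of the meta-analyses mentioned above. The study numbers are entirely hypothetical, and the variance formula for d is the usual large-sample approximation.

```python
import math

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference and its approximate sampling variance,
    computed from each group's mean, standard deviation, and sample size."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / pooled_sd
    var_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
    return d, var_d

def fixed_effect_pool(effects):
    """Inverse-variance weighted mean of (effect, variance) pairs."""
    weights = [1 / var for _, var in effects]
    pooled = sum(w * d for w, (d, _) in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return pooled, se

# Hypothetical studies: (treatment mean, SD, n, control mean, SD, n).
studies = [
    (105.0, 15.0, 20, 100.0, 15.0, 20),
    (103.0, 14.0, 50, 101.0, 16.0, 50),
    (108.0, 15.0, 12, 100.0, 13.0, 12),
]

effects = [cohens_d(*s) for s in studies]
pooled_d, pooled_se = fixed_effect_pool(effects)
ci_low, ci_high = pooled_d - 1.96 * pooled_se, pooled_d + 1.96 * pooled_se

print(f"pooled d = {pooled_d:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

Every input here is a mean, a standard deviation, or a sample size. A paper that reports only "p < 0.05" supplies none of them, so its result cannot be weighted and pooled this way without going back to the authors.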
Critics’ explanations for why a flawed paradigm is still so popular focus on the ease of use and its weakness; from Gigerenzer 2004:
Shifts away from NHST have happened in some fields. Medical testing seems to have made such a shift (I suspect due to the rise of meta-analysis):
Further reading
More on these topics:
The perils of NHST, and the merits of Bayesian data analysis, have been expounded with increasing force in recent years (e.g., W. Edwards, Lindman, & Savage, 1963; Kruschke, 2010b, 2010a, 2011c; Lee & Wagenmakers, 2005; Wagenmakers, 2007).
Although the primary emphasis in psychology is to publish results on the basis of NHST (Cumming et al., 2007; Rosenthal, 1979), the use of NHST has long been controversial. Numerous researchers have argued that reliance on NHST is counterproductive, due in large part because p values fail to convey such useful information as effect size and likelihood of replication (Clark, 1963; Cumming, 2008; Killeen, 2005; Kline, 2009 [Becoming a behavioral science researcher: A guide to producing research that matters]; Rozeboom, 1960). Indeed, some have argued that NHST has severely impeded scientific progress (Cohen, 1994; Schmidt, 1996) and has confused interpretations of clinical trials (Cicchetti et al., 2011; Ocana & Tannock, 2011). Some researchers have stated that it is important to use multiple, converging tests alongside NHST, including effect sizes and confidence intervals (Hubbard & Lindsay, 2008; Schmidt, 1996). Others still have called for NHST to be completely abandoned (e.g., Carver, 1978).
- [http://www.gwern.net/DNB%20FAQ#flaws-in-mainstream-science-and-psychology](http://www.gwern.net/DNB%20FAQ#flaws-in-mainstream-science-and-psychology)
- [https://www.reddit.com/r/DecisionTheory/](https://www.reddit.com/r/DecisionTheory/)