No one understands p-values: consider "Unfounded Fears: The Great Power-Line Cover-Up Exposed", IEEE 1996, on the electricity/cancer panic (emphasis added to the passages committing the misunderstanding of interpreting p-values as saying anything at all about the probability of a hypothesis or about subjective belief):
> Unless the number of cases is very large, an apparent cluster can rarely be distinguished from a pure chance occurrence. Thus epidemiologists check for statistical significance of data, usually at the 95% level. They use statistical tools to help distinguish chance occurrences (like the "runs" of numbers on the dice throws above) from non-random increases, i.e. those due to an external cause. If pure chance cannot be excluded with at least 95% certainty, as is very frequently the case in EMF studies, the result is usually called not significant. The observation may not mean a thing outside the specific population studied. Most often the statistical information available is expressed as an odds ratio (OR) and confidence interval (CI). The OR is the estimate of an exposed person's risk of the disease in question relative to an unexposed person's risk of the same disease. *The CI is the range of ORs within which the true OR is 95% likely to lie*, and when the CI includes 1.0 (no difference in risk), the OR is commonly defined as not statistically significant... Mr. Brodeur notes, "the 50% increased risk of leukemia they observed in the highest exposure category--children in whose bedrooms magnetic fields of two and two-thirds milligauss or above were recorded--was not considered to be statistically significant", as though this is an opinion. It is, however, a statement with a particular mathematical definition. The numbers of cases and controls in each category limit the certainty of the results, so that *it cannot be said with 95% certainty* that the association seen is not a pure chance occurrence. In fact, *it is within a 95% probability that the association is really inverse* and residence in such high fields (compared to the rest of the population) actually protects against cancer.
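The emphasized CI claim is the giveaway: in frequentist statistics, the 95% attaches to the long-run behavior of the interval-constructing procedure, not to any single computed interval. A minimal simulation (a Python sketch with an assumed normal model and made-up parameters, not a reconstruction of any EMF study) makes the distinction concrete:

```python
import numpy as np

# We *know* the true parameter here, and ask how often the standard
# 95% CI recipe captures it across repeated experiments.
rng = np.random.default_rng(0)
true_mean, n, trials = 1.0, 30, 10_000

covered = 0
for _ in range(trials):
    sample = rng.normal(true_mean, 1.0, n)
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
    covered += lo <= true_mean <= hi

print(covered / trials)  # ~0.95: the *procedure* covers the truth 95% of the
                         # time; any one interval simply does or does not.
```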
No one understands p-values, not even those who use Bayesian methods in their other work... From "When Is Evidence Sufficient?", Claxton et al 2005:
> ...Classical statistics addresses this problem by calculating the probability that any difference observed between the treatment and the comparator (in this case the placebo) reflects noise rather than a “real” difference. Only if this probability is sufficiently small—typically 5 percent—is the treatment under investigation declared superior. In the example of the pain medication, a conventional de...
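What the quoted description gets backwards is the direction of the conditional. A p-value is computed assuming the null hypothesis, as in this hedged sketch with invented group data (no relation to any actual trial):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
drug    = rng.normal(0.5, 1.0, n)   # invented treatment-group scores
placebo = rng.normal(0.0, 1.0, n)   # invented placebo-group scores

def t_stat(a, b):
    # Welch t statistic for a difference in means
    return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1)/len(a) + b.var(ddof=1)/len(b))

observed = t_stat(drug, placebo)

# The p-value is defined *under the null*: simulate a world where H0 is true
# (both groups drawn from the same distribution) and count how often data at
# least this extreme arise by chance alone.
null_ts = np.array([t_stat(rng.normal(0, 1, n), rng.normal(0, 1, n))
                    for _ in range(100_000)])
p = (np.abs(null_ts) >= abs(observed)).mean()
print(p)  # P(data this extreme | H0) -- not P(H0 | data)
```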
Frequentist statistics is a wide field, but as practiced by innumerable psychologists, biologists, economists, etc., it tends to take a particular form called “Null Hypothesis Significance Testing” (NHST), descended from R.A. Fisher (as opposed to, e.g., Neyman-Pearson), which is focused on rejecting a “null hypothesis” of no effect whenever a test statistic crosses an arbitrary significance threshold, usually p < 0.05.
NHST became nearly universal between the 1940s & 1960s (see Gigerenzer 2004, pg. 18), and has been heavily criticized for nearly as long, by frequentists and Bayesians alike; as Cohen 1994 put it:
> What’s wrong with NHST? Well, among other things, it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does! What we want to know is, “Given these data, what is the probability that H0 is true?” But as most of us know, what it tells us is “Given that H0 is true, what is the probability of these (or more extreme) data?” These are not the same…
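How different the two conditionals can be is easy to check with a toy application of Bayes' theorem; all three numbers below are assumptions chosen purely for illustration:

```python
# Toy Bayes' rule calculation (all inputs assumed, for illustration only):
prior_h1 = 0.10   # fraction of tested hypotheses with a real effect
power    = 0.50   # P(p < .05 | real effect)
alpha    = 0.05   # P(p < .05 | H0 true)

p_sig = prior_h1 * power + (1 - prior_h1) * alpha
p_h0_given_sig = (1 - prior_h1) * alpha / p_sig
print(p_h0_given_sig)  # ~0.47: 'significant', yet nearly even odds H0 is true
```

Under these assumed base rates, nearly half of all “significant” results are false positives, even though every one of them cleared p < 0.05.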
Similarly, the cargo-culting encourages misuse of two-tailed tests, avoidance of multiple-comparison correction, data dredging, and, in general, “p-value hacking”.
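To see why skipping multiple-comparison correction amounts to p-hacking, here is a quick simulation (assuming scipy is available; every variable in it is pure noise) of an experiment that measures 20 independent null outcomes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, tests, runs = 30, 20, 2_000
hits = 0
for _ in range(runs):
    # 20 outcomes, none of which has a real effect
    ps = [stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue
          for _ in range(tests)]
    hits += min(ps) < 0.05   # report the best-looking one, uncorrected

print(hits / runs)  # ~0.64 = 1 - 0.95**20: a spurious 'finding'
                    # in almost two-thirds of experiments
```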
(An example from my personal experience of the cost of ignoring effect sizes and confidence intervals: p-values cannot (easily) be used to compile a meta-analysis (a pooling of multiple studies); hence, studies often fail to include the means, standard deviations, or effect sizes & confidence intervals which one could use directly. The authors must then be contacted, and they may refuse to provide the information or may no longer be reachable; both have happened to me while compiling my dual n-back & iodine meta-analyses.)
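For contrast, when studies do report group means, standard deviations, & sample sizes, effect sizes can be computed and pooled directly. A minimal fixed-effect sketch (with made-up summary data; the variance formula is the usual large-sample approximation for Cohen's d):

```python
import numpy as np

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference from the summary stats a meta-analysis needs."""
    sp = np.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp
    var = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))  # approx. variance of d
    return d, var

# Invented summary data for three studies: (mean, sd, n) per group
studies = [((5.1, 1.9, 40), (4.2, 2.0, 40)),
           ((3.3, 1.1, 25), (3.0, 1.2, 25)),
           ((7.8, 2.5, 60), (7.1, 2.4, 55))]

ds, ws = [], []
for g1, g2 in studies:
    d, v = cohens_d(*g1, *g2)
    ds.append(d)
    ws.append(1 / v)   # inverse-variance weights

pooled = np.average(ds, weights=ws)
se = np.sqrt(1 / sum(ws))
print(pooled, pooled - 1.96 * se, pooled + 1.96 * se)  # pooled d with 95% CI
```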
Critics’ explanations for why so flawed a paradigm remains so popular focus on its ease of use: its very weakness is part of its appeal. From Gigerenzer 2004:
Shifts away from NHST have happened in some fields. Medical testing seems to have made such a shift (I suspect due to the rise of meta-analysis):
## Further reading
More on these topics:
> The perils of NHST, and the merits of Bayesian data analysis, have been expounded with increasing force in recent years (e.g., W. Edwards, Lindman, & Savage, 1963; Kruschke, 2010b, 2010a, 2011c; Lee & Wagenmakers, 2005; Wagenmakers, 2007).
> Although the primary emphasis in psychology is to publish results on the basis of NHST (Cumming et al., 2007; Rosenthal, 1979), the use of NHST has long been controversial. Numerous researchers have argued that reliance on NHST is counterproductive, in large part because p values fail to convey such useful information as effect size and likelihood of replication (Clark, 1963; Cumming, 2008; Killeen, 2005; Kline, 2009 [Becoming a behavioral science researcher: A guide to producing research that matters]; Rozeboom, 1960). Indeed, some have argued that NHST has severely impeded scientific progress (Cohen, 1994; Schmidt, 1996) and has confused interpretations of clinical trials (Cicchetti et al., 2011; Ocana & Tannock, 2011). Some researchers have stated that it is important to use multiple, converging tests alongside NHST, including effect sizes and confidence intervals (Hubbard & Lindsay, 2008; Schmidt, 1996). Others still have called for NHST to be completely abandoned (e.g., Carver, 1978).
- [http://www.gwern.net/DNB%20FAQ#flaws-in-mainstream-science-and-psychology](http://www.gwern.net/DNB%20FAQ#flaws-in-mainstream-science-and-psychology)
- [https://www.reddit.com/r/DecisionTheory/](https://www.reddit.com/r/DecisionTheory/)