No one understands p-values, not even researchers who use Bayesian methods in their other work. From "When Is Evidence Sufficient?", Claxton et al 2005:
Classical statistics addresses this problem by calculating the probability that any difference observed between the treatment and the comparator (in this case the placebo) reflects noise rather than a “real” difference. Only if this probability is sufficiently small—typically 5 percent—is the treatment under investigation declared superior. In the example of the pain medication, a conventional decisionmaker would therefore reject adoption of this new treatment if the chance that the study results represent noise exceeds 5 percent...For example, suppose we know that the new pain medication has a low risk of side effects, low cost, and the possibility of offering relief for patients with severe symptoms. In that case, does it really make sense to hold the candidate medication to the stringent 5 percent adoption criterion? Similarly, let us suppose that there is a candidate medication for patients with a terminal illness. If the evidence suggesting that it works has a 20 percent chance of representing only noise (and hence an 80 percent chance that the observed efficacy is real), does it make sense to withhold it from patients who might benefit from its use?
Another fun one is a piece which quotes someone making the classic misinterpretation and then someone else immediately correcting them. From "Drug Trials: Often Long On Hype, Short on Gains; The delusion of ‘significance’ in drug trials":
...Part of the problem, said Alex Adjei, PhD, the senior vice president of clinical research and professor and chair of the Department of Medicine at Roswell Park Cancer Institute in Buffalo, N.Y., is that oncology has lost focus on what exactly a P value means. “A P value of less than 0.05 simply means that there
Frequentist statistics is a wide field, but as practiced by innumerable psychologists, biologists, economists, etc., frequentism tends to be a particular style called “Null Hypothesis Significance Testing” (NHST) descended from R.A. Fisher (as opposed to eg. Neyman-Pearson) which is focused on
NHST became nearly universal between the 1940s & 1960s (see Gigerenzer 2004, pg18), and has been heavily criticized for as long. Frequentists criticize it for:
What’s wrong with NHST? Well, among other things, it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does! What we want to know is, “Given these data, what is the probability that H0 is true?” But as most of us know, what it tells us is “Given that H0 is true, what is the probability of these (or more extreme) data?” These are not the same…
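The gap between the two conditional probabilities can be made concrete with Bayes' theorem. The following is a minimal sketch, not anything from the quoted papers: it simplifies "these data" down to the single event "the test came out significant at `alpha`", and the prior probability of the null and the statistical power are numbers I made up for illustration. The function name `prob_null_given_significant` is likewise hypothetical.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """Return P(H0 | test significant) via Bayes' theorem.

    prior_null: assumed prior probability that H0 is true (illustrative)
    alpha:      significance threshold, i.e. P(significant | H0)
    power:      assumed P(significant | H0 is false) (illustrative)
    """
    # Joint probability of "H0 true and test significant" (false positive).
    false_positive = prior_null * alpha
    # Joint probability of "H0 false and test significant" (true positive).
    true_positive = (1.0 - prior_null) * power
    return false_positive / (false_positive + true_positive)


if __name__ == "__main__":
    # Same alpha and power, different priors: the answer changes a lot,
    # which alpha alone can never tell you.
    for prior in (0.5, 0.9):
        posterior = prob_null_given_significant(prior, alpha=0.05, power=0.8)
        print(f"prior P(H0)={prior}: P(H0 | significant)={posterior:.3f}")
```

Under these assumed inputs, a field where 90% of tested nulls are true and power is 80% would see roughly a third of its "significant" results be noise, even though every one of them cleared the 5% threshold. The 5% is P(significant | H0), not P(H0 | significant).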
Similarly, the cargo-culting encourages misuse of two-tailed tests, avoidance of multiple-comparison correction, data dredging, and in general, “p-value hacking”.
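To see why skipping the correction matters, here is a rough sketch of what happens when one tests many outcomes and reports the best one. It assumes the tests are independent and that p-values are uniformly distributed under the null (true for continuous test statistics); the function names and the choice of 20 outcomes are illustrative assumptions, not taken from any cited study.

```python
import random


def familywise_error_rate(alpha, k):
    """Chance of at least one p < alpha among k independent true nulls."""
    return 1.0 - (1.0 - alpha) ** k


def simulate_min_p_rate(alpha, k, trials=20000, seed=0):
    """Monte Carlo check: draw k null p-values per 'study', keep the smallest."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Under the null, each p-value is a Uniform(0, 1) draw.
        if min(rng.random() for _ in range(k)) < alpha:
            hits += 1
    return hits / trials


def bonferroni_threshold(alpha, k):
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / k


if __name__ == "__main__":
    alpha, k = 0.05, 20
    print("analytic :", round(familywise_error_rate(alpha, k), 4))
    print("simulated:", round(simulate_min_p_rate(alpha, k), 4))
    corrected = bonferroni_threshold(alpha, k)
    print("Bonferroni:", round(familywise_error_rate(corrected, k), 4))
```

With 20 outcomes and no effect anywhere, the chance of at least one "significant" result is about 64%, not 5%. Bonferroni is only one (conservative) correction; it is used here because it is the simplest to state.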
(An example from my personal experience of the cost of ignoring effect size and confidence intervals: p-values cannot (easily) be used to compile a meta-analysis (pooling of multiple studies); hence, studies often do not include the necessary information about means, standard deviations, or effect sizes & confidence intervals which one could use directly. So authors must be contacted, and they may refuse to provide the information or they may no longer be available; both have happened to me in trying to do my dual n-back & iodine meta-analyses.)
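For a sense of why those numbers are needed, here is a minimal sketch of the simplest kind of pooling, a fixed-effect inverse-variance meta-analysis. Each study must supply an effect size and its standard error; a bare p-value gives neither. The study values below are invented for illustration and are not from my dual n-back or iodine analyses, and a real meta-analysis would usually also consider a random-effects model.

```python
import math


def fixed_effect_pool(studies):
    """Pool (effect_size, standard_error) pairs by inverse-variance weighting.

    Returns (pooled_effect, pooled_se, (ci_low, ci_high)) with a 95%
    confidence interval based on the normal approximation.
    """
    # More precise studies (smaller SE) get proportionally more weight.
    weights = [1.0 / se ** 2 for _, se in studies]
    total_weight = sum(weights)
    pooled = sum(w * d for w, (d, _) in zip(weights, studies)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci


if __name__ == "__main__":
    # Hypothetical studies: (standardized effect size, standard error).
    studies = [(0.50, 0.10), (0.30, 0.10), (0.10, 0.25)]
    pooled, se, (low, high) = fixed_effect_pool(studies)
    print(f"pooled d = {pooled:.3f}, SE = {se:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Nothing in this calculation can be recovered from "p < 0.05" alone, which is why authors end up being emailed.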
Critics’ explanations for why a flawed paradigm is still so popular focus on its ease of use and its weakness; from Gigerenzer 2004:
Shifts away from NHST have happened in some fields. Medical testing seems to have made such a shift (I suspect due to the rise of meta-analysis):
Further reading
More on these topics:
The perils of NHST, and the merits of Bayesian data analysis, have been expounded with increasing force in recent years (e.g., W. Edwards, Lindman, & Savage, 1963; Kruschke, 2010b, 2010a, 2011c; Lee & Wagenmakers, 2005; Wagenmakers, 2007).
Although the primary emphasis in psychology is to publish results on the basis of NHST (Cumming et al., 2007; Rosenthal, 1979), the use of NHST has long been controversial. Numerous researchers have argued that reliance on NHST is counterproductive, due in large part because p values fail to convey such useful information as effect size and likelihood of replication (Clark, 1963; Cumming, 2008; Killeen, 2005; Kline, 2009 [Becoming a behavioral science researcher: A guide to producing research that matters]; Rozeboom, 1960). Indeed, some have argued that NHST has severely impeded scientific progress (Cohen, 1994; Schmidt, 1996) and has confused interpretations of clinical trials (Cicchetti et al., 2011; Ocana & Tannock, 2011). Some researchers have stated that it is important to use multiple, converging tests alongside NHST, including effect sizes and confidence intervals (Hubbard & Lindsay, 2008; Schmidt, 1996). Others still have called for NHST to be completely abandoned (e.g., Carver, 1978).
[http://www.gwern.net/DNB%20FAQ#flaws-in-mainstream-science-and-psychology](http://www.gwern.net/DNB%20FAQ#flaws-in-mainstream-science-and-psychology)

[https://www.reddit.com/r/DecisionTheory/](https://www.reddit.com/r/DecisionTheory/)