"Robust misinterpretation of confidence intervals", Hoekstra et al 2014
Confidence intervals (CIs) have frequently been proposed as a more useful alternative to NHST, and their use is strongly encouraged in the APA Manual. Nevertheless, little is known about how researchers interpret CIs. In this study, 120 researchers and 442 students-all in the field of psychology-were asked to assess the truth value of six particular statements involving different interpretations of a CI. Although all six statements were false, both researchers and students endorsed, on average, more than three statements, indicating a gross misunderstanding of CIs. Self-declared experience with statistics was not related to researchers' performance, and, even more surprisingly, researchers hardly outperformed the students, even though the students had not received any education on statistical inference whatsoever. Our findings suggest that many researchers do not know the correct interpretation of a CI.
...Falk and Greenbaum (1995) found similar results in a replication of Oakes's study, and Haller and Krauss (2002) showed that even professors and lecturers teaching statistics often endorse false statements about the results from NHST. Lecoutre, Poitevineau, and Lecoutre (2003) found the same for statisticians working for pharmaceutical companies, and Wulff and colleagues reported misunderstandings in doctors and dentists (Scheutz, Andersen, & Wulff, 1988; Wulff, Andersen, Brandenhoff, & Guttler, 1987). Hoekstra et al. (2006) showed that in more than half of a sample of published articles, a nonsignificant outcome was erroneously interpreted as proof for the absence of an effect, and in about 20% of the articles, a significant finding was considered absolute proof of the existence of an effect. In sum, p-values are often misinterpreted, even by researchers who use them on a regular basis.
- Falk, R., & Greenbaum, C. W. (1995). "Significance tests die hard: The amazing persistence of a probabilistic misconception". Theory and Psychology, 5, 75-98.
- Haller, H., & Krauss, S. (2002). "Misinterpretations of significance: a problem students share with their teachers?" Methods of Psychological Research Online [On-line serial], 7, 120.
- Lecoutre, M.-P., Poitevineau, J., & Lecoutre, B. (2003). "Even statisticians are not immune to misinterpretations of null hypothesis tests". International Journal of Psychology, 38, 37–45.
- Scheutz, F., Andersen, B., & Wulff, H. R. (1988). "What do dentists know about statistics?" Scandinavian Journal of Dental Research, 96, 281–287
- Wulff, H. R., Andersen, B., Brandenhoff, P., & Guttler, F. (1987). "What do doctors know about statistics?" Statistics in Medicine, 6, 3–10
- Hoekstra, R., Finch, S., Kiers, H. A. L., & Johnson, A. (2006). "Probability as certainty: Dichotomous thinking and the misuse of p-values". Psychonomic Bulletin & Review, 13, 1033–1037
...Our sample consisted of 442 bachelor students, 34 master students, and 120 researchers (i.e., PhD students and faculty). The bachelor students were first-year psychology students attending an introductory statistics class at the University of Amsterdam. These students had not yet taken any class on inferential statistics as part of their studies. The master students were completing a degree in psychology at the University of Amsterdam and, as such, had received a substantial amount of education on statistical inference in the previous 3 years. The researchers came from the universities of Groningen (n = 49), Amsterdam (n = 44), and Tilburg (n = 27).
...The questionnaire featured six statements, all of which were incorrect. This design choice was inspired by the p-value questionnaire from Gigerenzer (2004). Researchers who are aware of the correct interpretation of a CI should have no difficulty checking all "false" boxes. The (incorrect) statements are the following:
- "The probability that the true mean is greater than 0 is at least 95%."
- "The probability that the true mean equals 0 is smaller than 5%."
- "The 'null hypothesis' that the true mean equals 0 is likely to be incorrect."
- "There is a 95% probability that the true mean lies between 0.1 and 0.4."
- "We can be 95% confident that the true mean lies between 0.1 and 0.4."
- "If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 and 0.4."
Statements 1, 2, 3, and 4 assign probabilities to parameters or hypotheses, something that is not allowed within the frequentist framework. Statements 5 and 6 mention the boundaries of the CI (i.e., 0.1 and 0.4), whereas, as was stated above, a CI can be used to evaluate only the procedure and not a specific interval. The correct statement, which was absent from the list, is the following: "If we were to repeat the experiment over and over, then 95% of the time the confidence intervals contain the true mean."
...The mean numbers of items endorsed for first-year students, master students, and researchers were 3.51 (99% CI = [3.35, 3.68]), 3.24 (99% CI = [2.40, 4.07]), and 3.45 (99% CI = [3.08, 3.82]), respectively. The item endorsement proportions are presented per group in Fig. 1. Notably, despite the first-year students' complete lack of education on statistical inference, they clearly do not form an outlying group...Indeed, the correlation between endorsed items and experience was even slightly positive (0.04; 99% CI = [−0.20; 0.27]), contrary to what one would expect if experience decreased the number of misinterpretations.
Frequentist statistics is a wide field, but in practice by innumerable psychologists, biologists, economists etc, frequentism tends to be a particular style called “Null Hypothesis Significance Testing” (NHST) descended from R.A. Fisher (as opposed to eg. Neyman-Pearson) which is focused on
NHST became nearly universal between the 1940s & 1960s (see Gigerenzer 2004, pg18), and has been heavily criticized for as long. Frequentists criticize it for:
What’s wrong with NHST? Well, among other things, it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does! What we want to know is, “Given these data, what is the probability that H0 is true?” But as most of us know, what it tells us is “Given that H0 is true, what is the probability of these (or more extreme) data?” These are not the same…
Similarly, the cargo-culting encourages misuse of two-tailed tests, avoidance of multiple correction, data dredging, and in general, “p-value hacking”.
(An example from my personal experience of the cost of ignoring effect size and confidence intervals: p-values cannot (easily) be used to compile a meta-analysis (pooling of multiple studies); hence, studies often do not include the necessary information about means, standard deviations, or effect sizes & confidence intervals which one could use directly. So authors must be contacted, and they may refuse to provide the information or they may no longer be available; both have happened to me in trying to do my dual n-back & iodine meta-analyses.)
Critics’ explanations for why a flawed paradigm is still so popular focus on the ease of use and its weakness; from Gigerenzer 2004:
Shifts away from NHST have happened in some fields. Medical testing seems to have made such a shift (I suspect due to the rise of meta-analysis):
0.1 Further reading
More on these topics:
The perils of NHST, and the merits of Bayesian data analysis, have been expounded with increasing force in recent years (e.g., W. Edwards, Lindman, & Savage, 1963; Kruschke, 2010b, 2010a, 2011c; Lee & Wagenmakers, 2005; Wagenmakers, 2007).
Although the primary emphasis in psychology is to publish results on the basis of NHST (Cumming et al., 2007; Rosenthal, 1979), the use of NHST has long been controversial. Numerous researchers have argued that reliance on NHST is counterproductive, due in large part because p values fail to convey such useful information as effect size and likelihood of replication (Clark, 1963; Cumming, 2008; Killeen, 2005; Kline, 2009 [Becoming a behavioral science researcher: A guide to producing research that matters]; Rozeboom, 1960). Indeed, some have argued that NHST has severely impeded scientific progress (Cohen, 1994; Schmidt, 1996) and has confused interpretations of clinical trials (Cicchetti et al., 2011; Ocana & Tannock, 2011). Some researchers have stated that it is important to use multiple, converging tests alongside NHST, including effect sizes and confidence intervals (Hubbard & Lindsay, 2008; Schmidt, 1996). Others still have called for NHST to be completely abandoned (e.g., Carver, 1978).
[http://www.gwern.net/DNB%20FAQ#flaws-in-mainstream-science-and-psychology](http://www.gwern.net/DNB%20FAQ#flaws-in-mainstream-science-and-psychology)[https://www.reddit.com/r/DecisionTheory/](https://www.reddit.com/r/DecisionTheory/)