Cyan comments on Case study: abuse of frequentist statistics - Less Wrong

Post author: Cyan, 21 February 2010 06:35AM (25 points)




Comment author: cupholder 22 February 2010 02:57:29AM 7 points

I'm not seeing why what you call "the real WTF" is evidence of a problem with frequentist statistics. The fact that the hypothesis test would have given a statistically insignificant p-value whatever the actual 6 data points were just indicates that whatever the population distributions, 6 data points are simply not enough to disconfirm the null hypothesis. In fact you can see this if you look at Mann & Whitney's original paper! (See the n=3 subtable in table I, p. 52.)

I can picture someone counterarguing that this is not immediately obvious from the details of the statistical test, but I would hope that any competent statistician, frequentist or not, would be sceptical of a nonparametric comparison of means for samples of size 3!
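The arithmetic behind the point above can be checked directly. The following sketch (my own, not from the thread; the data values are hypothetical) enumerates the exact null distribution of the rank sum for two groups of three and shows that even maximally separated samples yield p = 0.1 under the exact rank-sum (Mann-Whitney) test:

```python
from itertools import combinations

# With n = 3 per group, even maximally separated data cannot reach
# p < 0.05 under the exact rank-sum (Mann-Whitney) test.
x = [1.0, 2.0, 3.0]  # hypothetical group A: the most extreme separation possible
y = [4.0, 5.0, 6.0]  # hypothetical group B

pooled = sorted(x + y)
observed = sum(pooled.index(v) + 1 for v in x)  # rank sum of group A = 1+2+3 = 6

# Exact null distribution: under H0, every choice of 3 ranks out of 6
# for group A is equally likely -- C(6,3) = 20 possibilities.
rank_sums = [sum(c) for c in combinations(range(1, 7), 3)]
mean = sum(rank_sums) / len(rank_sums)  # 10.5 by symmetry

# Two-sided p-value: fraction of rank sums at least as far from the mean
# as the observed one.
p = sum(abs(s - mean) >= abs(observed - mean) for s in rank_sums) / len(rank_sums)
print(p)  # 0.1 -- the smallest p-value the test can produce at 3 + 3
```

Only the two most extreme of the 20 equally likely rank assignments are as extreme as the observed one, hence 2/20 = 0.1, matching the n=3 entry in Mann & Whitney's table.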

Comment author: Cyan 22 February 2010 03:26:57AM 1 point

Thanks for the pointer to the original paper.

> I'm not seeing why what you call "the real WTF" is evidence of a problem with frequentist statistics.

Check out the title: abuse of frequentist statistics. Yes, at the end, I argue from a Bayesian perspective, but you don't have to be a Bayesian to see the structural problems with frequentist statistics as currently taught to and practiced by working scientists.

> I would hope that any competent statistician, frequentist or not, would be sceptical of a nonparametric comparison of means for samples of size 3!

Me too. But not all papers with shoddy statistics are sent to statisticians for review. Experimental biologists in particular have a reputation for math-phobia. (Does the fact that when I saw the sample size the word "underpowered" instantly jumped into my head count as evidence that I am competent?)

Comment author: cupholder 22 February 2010 03:49:59AM 12 points

> Check out the title: abuse of frequentist statistics. Yes, at the end, I argue from a Bayesian perspective, but you don't have to be a Bayesian to see the structural problems with frequentist statistics as currently taught to and practiced by working scientists.

I agree that frequentist statistics are often poorly taught and understood, and that this holds however you like to do your statistics. Still, the main post feels to me like a sales pitch for Bayes brand chainsaws that's trying to scare me off Neyman-Pearson chainsaws by pointing out how often people using Neyman-Pearson chainsaws accidentally cut off a limb with them. (I am aware that I may be the only reader who feels this way about the post.)

> (Does the fact that when I saw the sample size the word "underpowered" instantly jumped into my head count as evidence that I am competent?)

Yes, but it is not sufficient evidence to reject the null hypothesis of incompetence at the 0.05 significance level. (I keed, I keed.)

Comment author: thomblake 22 February 2010 01:50:20PM 4 points

> a sales pitch for Bayes brand chainsaws

I get that impression a lot around here.

Comment author: Cyan 22 February 2010 04:17:04AM 2 points

> Still, the main post feels to me like a sales pitch...

It's a fair point; I'm not exactly attacking the strongest representative of frequentist statistical practice. My only defense is that this actually happened, so it makes a good case study.

Comment author: cupholder 22 February 2010 04:40:54AM 2 points

That's true, and having been reminded of that, I think I may have been unduly pedantic about a fine detail at the expense of the main point.

Comment author: PhilGoetz 25 February 2010 02:25:43PM 0 points

It's a good case study, but it's not evidence of a problem with frequentist statistics.

Comment author: Cyan 25 February 2010 02:36:05PM 0 points

I assert that it is evidence in my concluding paragraph, but it's true that I don't give an actual argument. Whether one counts it as evidence would seem to depend on the causal assumptions one makes about the teaching and practice of statistics.

Comment author: PhilGoetz 25 February 2010 10:25:58PM 1 point

Perhaps it's frequentist evidence against frequentist statistics.

Comment author: Cyan 26 February 2010 12:30:07AM 1 point

I think this is just a glib rejoinder, but if there's a deeper thought there, I'd be interested to hear it.

Comment author: PhilGoetz 27 February 2010 04:04:02AM 2 points

The critique of frequentist statistics, as I understand it - and I don't think I do - is that frequentists like to count things, and trust that having large sample sizes will take care of biases for them. Therefore, a case in which frequentist statistics co-occurs with bad results counts against use of frequentist statistics, and you don't have to worry about why the results were bad.

The whole Bayesian vs. frequentist argument seems a little silly to me. It's like arguing that screws are better than nails. It's true that, for any particular joint you wish to connect, a screw will probably connect it more securely and reversibly than a nail. That doesn't mean there's no use for nails.

Comment author: brian_jaress 23 February 2010 06:02:49PM 3 points

I think that, in this case, the underlying problem was not caused by the way frequentist statistics are commonly taught and practiced by working scientists:

> In the present case, the null hypothesis is that the old method and the new method produce data from the same distribution; the authors would like to see data that do not lead to rejection of the null hypothesis.

I'm no statistician, but I'm pretty sure you're not supposed to make your favored hypothesis the null hypothesis. That's a pretty simple rule and I think it's drilled into students and enforced in peer review.

I see that as the underlying problem because it reverses the burden of proof. If they had done it the right way around, six data points would have been not enough to support their method instead of being not enough to reject it. Making your favored hypothesis the null hypothesis can allow you, in the extreme, to rely on a single data point.
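The burden-of-proof reversal can be made concrete. In this sketch (my own construction; the helper `min_two_sided_p` is not from the thread), the smallest two-sided p-value an exact two-group rank-sum test can ever produce is 2/C(2n, n), since only the two most extreme of the C(2n, n) equally likely rank assignments are at least as extreme as themselves:

```python
from math import comb

def min_two_sided_p(n):
    """Smallest achievable two-sided p-value for an exact rank-sum test
    comparing two groups of size n (assuming no ties)."""
    return 2 / comb(2 * n, n)

for n in range(1, 5):
    print(n, round(min_two_sided_p(n), 4))
# n=1 -> 1.0, n=2 -> 0.3333, n=3 -> 0.1, n=4 -> 0.0286
```

With fewer than four observations per group, failing to reject at the 0.05 level is guaranteed, so a favored hypothesis placed in the null "survives" automatically, no matter how little data is collected.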

Comment author: Cyan 23 February 2010 06:18:08PM 1 point

In the OP I did refer to that when I wrote:

> Now even from a frequentist perspective, this is wacky. Hypothesis testing can reject a null hypothesis, but cannot confirm it, as discussed in the first paragraph of the Wikipedia article on null hypotheses.

You wrote:

> That's a pretty simple rule and I think it's drilled into students and enforced in peer review.

Not all papers are reviewed by people who know the rule. I was taught that rule over ten years ago, and I didn't remember it when my colleague showed me the analysis. (I did recall it eventually, just after I ran the sanity check. Evidence against my competence!) My colleague whose job it was to review the paper didn't know/recall the rule either.

Comment author: PhilGoetz 25 February 2010 02:09:39PM -1 points

> Check out the title: abuse of frequentist statistics. Yes, at the end, I argue from a Bayesian perspective, but you don't have to be a Bayesian to see the structural problems with frequentist statistics as currently taught to and practiced by working scientists.

Well, I don't see the structural problems. (I don't even know what a structural problem is.)

Somebody, please write a top-level post addressing this. Stop saying "Frequentists are bad" and leaving it at that. This is a great story, but it's not valid argumentation to try to convert it into an anti-frequentist tract.

Comment author: Kevin 25 February 2010 02:18:42PM 1 point

I'd love to see a top-level post where someone suggests the best and/or most realistic way for scientists to do their statistics. I'm actually rather ignorant with regard to probability theory. I got a D in second-semester frequentist statistics (hard teacher + I didn't go to class or try very hard on the homework), which is indicative of how little I learned in that class. I did better in my applied statistics classes.

When is it good for scientists to do null hypothesis testing?

Comment author: Cyan 25 February 2010 02:13:08PM 0 points

What specifically is the "this" you want addressed? I'm not sure what its referent is.