Quick check: do you agree or disagree with the following statement:

If a study finds a result significant at a p=0.05 level, that means they have followed a methodology which produces this conclusion correctly 95 % of the time.

Yes or no? Keep that in mind, and we’ll get back to it.

I’m reading the Fisher book where he popularised the p-value[1], and I noticed he’s actually quite sensible about it:

The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant. Using this criterion we should be led to follow up a false indication only once in 22 trials, even if the statistics were the only guide available.

He is talking here about the normal distribution, and saying that if you have two dice that somehow generate numbers from the normal distribution, and you get an unexpectedly large value from the d6, you should double check that you are not accidentally throwing the d20. Makes complete sense.
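Those numbers are easy to verify (a minimal sketch in Python, standard library only; the helper name is mine):

```python
# Two-sided tail probability of a standard normal beyond z standard deviations.
from math import erfc, sqrt

def two_sided_tail(z):
    """P(|X| > z) for X ~ N(0, 1)."""
    return erfc(z / sqrt(2))

print(two_sided_tail(1.96))  # ~0.0500 -> Fisher's "1 in 20"
print(two_sided_tail(2.00))  # ~0.0455 -> about 1 in 22, his "once in 22 trials"
```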

It turns out that many situations in hypothesis testing are equivalent to “Wait am I still holding a d6?” so this is a useful rule of thumb.

But! In science there are so many other things that can go wrong. For example, Bakker and Wicherts[2] found that 15 % of the studies they looked at drew the wrong conclusion because of simple mistakes in the final significance calculation.

Think about that! The nominal significance level pales in comparison.

Regardless of what significance level is used in the hypothesis test, and regardless of any selection effects on accuracy, the base rate of getting the most fundamental maths of the final step right is only 85 %. Other problems pile on top of that[3], so no, a result significant at p=0.05 does not mean the methodology is right 95 % of the time. It's just a sign that you might be holding a different die and that it's time to double-check (e.g. through replication, or further investigation).
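To put rough numbers on how the failure modes stack up (a back-of-the-envelope sketch; the 15 % figure comes from [2], the independence assumption is mine):

```python
# Illustrative only: compare the nominal false-positive rate of the test with
# the base rate of a botched final calculation.
alpha = 0.05            # what the significance test is supposed to control
p_gross_error = 0.15    # studies drawing the wrong conclusion from their own numbers (BW11)

# Even with a perfect test, a rough upper bound on how often the reported
# conclusion survives both failure modes (treated, simplistically, as independent):
print((1 - p_gross_error) * (1 - alpha))  # ~0.81 -- nowhere near "95 % reliable"
print(p_gross_error / alpha)              # the arithmetic slip rate is 3x the "allowed" error rate
```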

1. Statistical Methods for Research Workers; Fisher; Oliver & Boyd; 1925.

2. The (mis)reporting of statistical results in psychology journals; Bakker, Wicherts; Behavior Research Methods; 2011.

3. Consider for example the Forbes report of 88 % of spreadsheets containing errors.

Comments (7)

My initial answer to your starting question was "I disagree with this statement because they likely used 20 different ways of calculating the p-value and selected the one that was statistically significant". Also https://en.m.wikipedia.org/wiki/Replication_crisis
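For reference, the arithmetic behind that suspicion (a sketch assuming, unrealistically, that the 20 analyses are independent):

```python
# Chance that at least one of k independent tests comes out "significant"
# at alpha = 0.05 even when there is no real effect.
alpha, k = 0.05, 20
print(1 - (1 - alpha) ** k)  # ~0.64
```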

If a study finds a result significant at a p=0.05 level, that means they have followed a methodology which produces this conclusion correctly 95 % of the time.

 

This smells like a confidence interval; is that what you had in mind? I think the default definition of p-values is more like "this means that their methodology would produce this conclusion at most 5% of the time if the conclusion was actually false."

(rest of the article still holds; if their chance of drawing the wrong conclusion from their data is 15%, the claimed p-value becomes completely useless until their methodology is checked)
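That reading is easy to check by simulation (a sketch using numpy and scipy; the one-sample t-test here is just a stand-in for "their methodology"):

```python
# If the null hypothesis is true, a test at alpha = 0.05 should reject
# (i.e. "find" an effect) only about 5% of the time.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
alpha, n_studies, n = 0.05, 10_000, 30

false_positives = 0
for _ in range(n_studies):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)   # true mean is 0, so the null holds
    _, p = ttest_1samp(sample, popmean=0.0)
    false_positives += p < alpha

print(false_positives / n_studies)  # ~0.05
```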

I'm here to say that this is not some property specific to p-values; it's just about the credibility of the communicator.

If researchers make a bunch of errors all the time, especially errors that change their conclusions, then indeed you can't trust them. Turns out (BW11) that researchers are more credible than that: the errors they make tend not to change the conclusions of the test (i.e., the chance of drawing a wrong conclusion from their data (a "gross error" in BW11) was much lower than the headline rate), and (admittedly I'm going out on a limb here) it is very possible that the errors which do change the conclusion of a particular test do not change the overall conclusion about the general theory.

For example, if theory says X, Y, and Z should happen, and you find support for X and Y and marginal support for Z that is no longer significant after the correction, the theory is still pretty much intact, unless you really care about using p-values in a binary fashion. If theory says X, Y, and Z should happen, and you find support for X and Y while formerly solid support for Z is no longer significant, that's more of an issue. But given how many tests are in a paper, it's also possible that theory says X, Y, and Z should happen, you find support for X, Y, and Z, and it turns out your conclusion about some side result W reverses, which may or may not really say anything about your theory.

I don't think it is wise to throw the baby out with the bathwater.

Do you expect anyone to answer "agree" to the starting question?

Agree. I have never understood why p=0.05 is a holy threshold. People (and journals) toss out research if they get p=0.06 but think they are on their way to a Nobel prize with p=0.04. Madness.

Fisher, The arrangement of field experiments, Journal of the Ministry of Agriculture (1926):

If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent. point), or one in a hundred (the 1 per cent. point). Personally, the writer prefers to set a low standard of significance at the 5 per cent. point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.

This is why. Relevant smbc.

Yeah, that comic summarizes it all!

As a side note, I wonder how many would get their PhD degree if the requirement were to publish 3-4 papers (2 single-author and 1-2 co-authored) where the main result (where applicable) needed to have p<0.01? Perhaps the paper-publishing frenzy would slow down a little bit if the monograph came back into fashion?