The Universal Medical Journal Article Error

PhilGoetz

TL;DR: When people read a journal article that concludes, "We have proved that it is not the case that for every X, P(X)", they generally credit the article with having provided at least weak evidence in favor of the proposition ∀x !P(x). This is not necessarily so.

Authors using statistical tests are making precise claims, which must be quantified correctly. Pretending that all quantifiers are universal because we are speaking English is one error. It is not, as many commenters are claiming, a small error. ∀x !P(x) is very different from !∀x P(x).

A more-subtle problem is that when an article uses an F-test on a hypothesis, it is possible (and common) to fail the F-test for P(x) with data that supports the hypothesis P(x). The 95% confidence level was chosen for the F-test in order to count false positives as much more expensive than false negatives. Applying it therefore removes us from the world of Bayesian logic. You cannot interpret the failure of an F-test for P(x) as being even weak evidence for not P(x).

I used to teach logic to undergraduates, and they regularly made the same simple mistake with logical quantifiers. Take the statement "For every X there is some Y such that P(X,Y)" and represent it symbolically:

∀x∃y P(x,y)

Now negate it:

!∀x∃y P(x,y)

You often don't want a negation to be outside quantifiers. My undergraduates would often just push it inside, like this:

∀x∃y !P(x,y)

If you could just move the negation inward like that, then these claims would mean the same thing:

A) Not everything is a raven: !∀x raven(x)

B) Everything is not a raven: ∀x !raven(x)

To move a negation inside quantifiers, flip each quantifier that you move it past.

!∀x∃y P(x,y) = ∃x!∃y P(x,y) = ∃x∀y !P(x,y)

Here's the findings of a 1982 article [1] from JAMA Psychiatry (formerly Archives of General Psychiatry), back in the days when the medical establishment was busy denouncing the Feingold diet:

Previous studies have not conclusively demonstrated behavioral effects of artificial food colorings ... This study, which was designed to maximize the likelihood of detecting a dietary effect, found none.

Now pay attention; this is the part everyone gets wrong, including most of the commenters below.

The methodology used in this study, and in most studies, is as follows:

Divide subjects into a test group and a control group.
Administer the intervention to the test group, and a placebo to the control group.
Take some measurement that is supposed to reveal the effect they are looking for.
Compute the mean and standard deviation of that measure for the test and control groups.
Do either a t-test or an F-test of the hypothesis that the intervention causes a statistically-significant effect on all subjects.
If the test succeeds, conclude that the intervention causes a statistically-significant effect (CORRECT).
If the test does not succeed, conclude that the intervention does not cause any effect to any subjects (ERROR).

People make the error because they forget to explicitly state what quantifiers they're using. Both the t-test and the F-test work by assuming that every subject has the same response function to the intervention:

response = effect + normally distributed error

where the effect is the same for every subject. If you don't understand why that is so, read the articles about the t-test and the F-test. The null hypothesis is that the responses of all subjects in both groups were drawn from the same distribution. The one-tailed versions of the tests take a confidence level C and compute a cutoff Z such that, if the null hypothesis is false,

P(average effect(test) - average effect(control)) < Z = C

ADDED: People are making comments proving they don't understand how the F-test works. This is how it works: You are testing the hypothesis that two groups respond differently to food dye.

Suppose you measured the number of times a kid shouted or jumped, and you found that kids fed food dye shouted or jumped an average of 20 times per hour, and kids not fed food dye shouted or jumped an average of 17 times per hour. When you run your F-test, you compute that, assuming all kids respond to food dye the same way, you need a difference of 4 to conclude with 95% confidence that the two distributions (test and control) are different.

If the food dye kids had shouted/jumped 21 times per hour, the study would conclude that food dye causes hyperactivity. Because they shouted/jumped only 20 times per hour, it failed to prove that food dye affects hyperactivity. You can only conclude that food dye affects behavior with 84% confidence, rather than the 95% you desired.

Finding that food dye affects behavior with 84% confidence should not be presented as proof that food dye does not affect behavior!

If half your subjects have a genetic background that makes them resistant to the effect, the threshold for the t-test or F-test will be much too high to detect that. If 10% of kids become more hyperactive and 10% become less hyperactive after eating food coloring, such a methodology will never, ever detect it. A test done in this way can only accept or reject the hypothesis that for every subject x, the effect of the intervention is different than the effect of the placebo.

So. Rephrased to say precisely what the study found:

This study tested and rejected the hypothesis that artificial food coloring affects behavior in all children.

Converted to logic (ignoring time):

!( ∀child ( eats(child, coloring) ⇨ behaviorChange(child) ) )

Move the negation inside the quantifier:

∃child !( eats(child, coloring) ⇨ behaviorChange(child) )

Translated back into English, this study proved:

There exist children for whom artificial food coloring does not affect behavior.

However, this is the actual final sentence of that paper:

The results of this study indicate that artificial food colorings do not affect the behavior of school-age children who are claimed to be sensitive to these agents.

Translated into logic:

!∃child ( eats(child, coloring) ⇨ hyperactive(child) ) )

or, equivalently,

∀child !( eats(child, coloring) ⇨ hyperactive(child) ) )

This refereed medical journal article, like many others, made the same mistake as my undergraduate logic students, moving the negation across the quantifier without changing the quantifier. I cannot recall ever seeing a medical journal article prove a negation and not make this mistake when stating its conclusions.

A lot of people are complaining that I should just interpret their statement as meaning "Food colorings do not affect the behavior of MOST school-age children."

But they didn't prove that food colorings do not affect the behavior of most school-age children. They proved that there exists at least one child whose behavior food coloring does not affect. That isn't remotely close to what they have claimed.

For the record, the conclusion is wrong. Studies that did not assume that all children were identical, such as studies that used each child as his or her own control by randomly giving them cookies containing or not containing food dye [2], or a recent study that partitioned the children according to single-nucleotide polymorphisms (SNPs) in genes related to food metabolism [3], found large, significant effects in some children or some genetically-defined groups of children. Unfortunately, reviews failed to distinguish the logically sound from the logically unsound articles, and the medical community insisted that food dyes had no influence on behavior until thirty years after their influence had been repeatedly proven.

[1] Jeffrey A. Mattes & Rachel Gittelman (1981). Effects of Artificial Food Colorings in Children With Hyperactive Symptoms: A Critical Review and Results of a Controlled Study. Archives of General Psychiatry 38(6):714-718. doi:10.1001/archpsyc.1981.01780310114012.

[2] K.S. Rowe & K.J. Rowe (1994). Synthetic food coloring and behavior: a dose response effect in a double-blind, placebo-controlled, repeated-measures study. The Journal of Pediatrics Nov;125(5 Pt 1):691-8.

[3] Stevenson, Sonuga-Barke, McCann et al. (2010). The Role of Histamine Degradation Gene Polymorphisms in Moderating the Effects of Food Additives on Children’s ADHD Symptoms. Am J Psychiatry 167:1108-1115.

TL;DR: When people read a journal article that concludes, "We have proved that it is not the case that for every X, P(X)", they generally credit the article with having provided at least weak evidence in favor of the proposition ∀x !P(x). This is not necessarily so.

Authors using statistical tests are making precise claims, which must be quantified correctly. Pretending that all quantifiers are universal because we are speaking English is one error. It is not, as many commenters are claiming, a small error. ∀x !P(x) is very different from !∀x P(x).

A more-subtle problem is that when an article uses an F-test on a hypothesis, it is possible (and common) to fail the F-test for P(x) with data that supports the hypothesis P(x). The 95% confidence level was chosen for the F-test in order to count false positives as much more expensive than false negatives. Applying it therefore removes us from the world of Bayesian logic. You cannot interpret the failure of an F-test for P(x) as being even weak evidence for not P(x).

∀x∃y P(x,y)

Now negate it:

!∀x∃y P(x,y)

You often don't want a negation to be outside quantifiers. My undergraduates would often just push it inside, like this:

∀x∃y !P(x,y)

If you could just move the negation inward like that, then these claims would mean the same thing:

A) Not everything is a raven: !∀x raven(x)

B) Everything is not a raven: ∀x !raven(x)

To move a negation inside quantifiers, flip each quantifier that you move it past.

!∀x∃y P(x,y) = ∃x!∃y P(x,y) = ∃x∀y !P(x,y)

Here's the findings of a 1982 article [1] from JAMA Psychiatry (formerly Archives of General Psychiatry), back in the days when the medical establishment was busy denouncing the Feingold diet:

Previous studies have not conclusively demonstrated behavioral effects of artificial food colorings ... This study, which was designed to maximize the likelihood of detecting a dietary effect, found none.

Now pay attention; this is the part everyone gets wrong, including most of the commenters below.

The methodology used in this study, and in most studies, is as follows:

Divide subjects into a test group and a control group.
Administer the intervention to the test group, and a placebo to the control group.
Take some measurement that is supposed to reveal the effect they are looking for.
Compute the mean and standard deviation of that measure for the test and control groups.
Do either a t-test or an F-test of the hypothesis that the intervention causes a statistically-significant effect on all subjects.
If the test succeeds, conclude that the intervention causes a statistically-significant effect (CORRECT).
If the test does not succeed, conclude that the intervention does not cause any effect to any subjects (ERROR).

response = effect + normally distributed error

P(average effect(test) - average effect(control)) < Z = C

ADDED: People are making comments proving they don't understand how the F-test works. This is how it works: You are testing the hypothesis that two groups respond differently to food dye.

Finding that food dye affects behavior with 84% confidence should not be presented as proof that food dye does not affect behavior!

So. Rephrased to say precisely what the study found:

This study tested and rejected the hypothesis that artificial food coloring affects behavior in all children.

Converted to logic (ignoring time):

!( ∀child ( eats(child, coloring) ⇨ behaviorChange(child) ) )

Move the negation inside the quantifier:

∃child !( eats(child, coloring) ⇨ behaviorChange(child) )

Translated back into English, this study proved:

There exist children for whom artificial food coloring does not affect behavior.

However, this is the actual final sentence of that paper:

The results of this study indicate that artificial food colorings do not affect the behavior of school-age children who are claimed to be sensitive to these agents.

Translated into logic:

!∃child ( eats(child, coloring) ⇨ hyperactive(child) ) )

or, equivalently,

∀child !( eats(child, coloring) ⇨ hyperactive(child) ) )

A lot of people are complaining that I should just interpret their statement as meaning "Food colorings do not affect the behavior of MOST school-age children."

Another thing I have accused you of, in my head, is a failure to appropriately apply a multiple test correction when doing some data exploration for trends in the less wrong survey.

It's true I didn't do any multiple correction for the 2012 survey, but I think you're simply not understanding the point of multiple correction.

First, 'Data exploration' is precisely when you don't want to do multiple correction, because when data exploration is being done properly, it's being done as exploration, to guide future work, to discern what signals may be there for followup. But multiple correction controls the false positive rate at the expense of then producing tons of false negatives; this is not a trade-off we want to make in exploration. If you look at the comments, dozens of different scenarios and ideas are being looked at, and so we know in advance that any multiple correction is going to trash pretty much every single result, and so we won't wind up with any interesting hypotheses at all! Predictably defeating the entire purpose of looking. Why would you do this wittingly? It's one thing to explore data and find no interesting relationships at all (shit happens), but it's another thing entirely to set up procedures which nearly guarantee that you'll ignore any relationships you do find. And which multiple correction, anyway? I didn't come up with a list of hypotheses and then methodically go through them, I tested things as people suggested them or I thought of them; should I have done a single multiple correction of them all yesterday? (But what if I think of a new hypothesis tomorrow...?)

Second, thresholds for alpha and beta are supposed to be set by decision-theoretic considerations of cost-benefit. A false positive in medicine can be very expensive in lives and money, and hence any exploratory attitude, or undeclared data mining/dredging, is a serious issue (and one I fully agree with Ioannides on). In those scenarios, we certainly do want to reduce the false positives even if we're forced to increase the false negatives. But this is just an online survey. It's done for personal interest, kicks, and maybe a bit of planning or coordination by LWers. It's also a little useful for rebutting outside stereotypes about intellectual monoculture or homogeneity. In this context, a false positive is not a big deal, and no worse than a false negative. (In fact, rather than sacrifice a disproportionate amount of beta in order to decrease alpha more, we might want to actually increase our alpha!)

This cost-benefit is a major reason why if you look through my own statistical analyses and experiments, I tend to only do multiple correction in cases where I've pre-specified my metrics (self-experiments are not data exploration!) and where a false positive is expensive (literally, in the case of supplements, since they cost a non-trivial amount of $ over a lifetime). So in my Zeo experiments, you will see me use multiple correction for melatonin, standing, & 2 Vitamin D experiments (and also in a recent non-public self-experiment); but you won't see any multiple correction in my exploratory weather analysis.

What I would recommend you do for data exploration is decide ahead of time if you have some particularly interesting hypothesis or not. If not and you're just going to check lots of stuff, then commit to that and the appropriate multiple test correction at the end.

See above on why this is pointless and inappropriate.

That level of correction then also saves your 'noticing' something interesting and checking it specifically being circular (because you were already checking 'everything' and correcting appropriately).

If you were doing it at the end, then this sort of 'double-testing' would be a concern as it might lead your "actual" number of tests to differ from your "corrected against" number of tests. But it's not circular, because you're not doing multiple correction. The positives you get after running a bunch of tests will not have a very high level of confidence, but that's why you then take them as your new fixed set of specific hypotheses to run against the next dataset and, if the results are important, then perhaps do multiple correction.

So for example, if I cared that much about the LW survey results from the data exploration, what I should ideally do is collect the n positive results I care about, announce in advance the exact analysis I plan to do with the 2013 dataset, and decide in advance whether and what kind of multiple correction I want to do. The 2012 results using 2012 data suggest n hypotheses, and I would then actually test them with the 2013 data. (As it happens, I don't care enough, so I haven't.)