The Universal Medical Journal Article Error

PhilGoetz


29th Apr 2014


TL;DR: When people read a journal article that concludes, "We have proved that it is not the case that for every X, P(X)", they generally credit the article with having provided at least weak evidence in favor of the proposition ∀x !P(x). This is not necessarily so.

Authors using statistical tests are making precise claims, which must be quantified correctly. Pretending that all quantifiers are universal because we are speaking English is one error. It is not, as many commenters are claiming, a small error. ∀x !P(x) is very different from !∀x P(x).

A more subtle problem is that when an article uses an F-test on a hypothesis, it is possible (and common) to fail the F-test for P(x) with data that supports the hypothesis P(x). The 95% confidence level was chosen for the F-test in order to count false positives as much more expensive than false negatives. Applying it therefore removes us from the world of Bayesian logic. You cannot interpret the failure of an F-test for P(x) as being even weak evidence for not P(x).

I used to teach logic to undergraduates, and they regularly made the same simple mistake with logical quantifiers. Take the statement "For every X there is some Y such that P(X,Y)" and represent it symbolically:

∀x∃y P(x,y)

Now negate it:

!∀x∃y P(x,y)

You often don't want a negation to be outside quantifiers. My undergraduates would often just push it inside, like this:

∀x∃y !P(x,y)

If you could just move the negation inward like that, then these claims would mean the same thing:

A) Not everything is a raven: !∀x raven(x)

B) Everything is not a raven: ∀x !raven(x)

To move a negation inside quantifiers, flip each quantifier that you move it past.

!∀x∃y P(x,y) = ∃x!∃y P(x,y) = ∃x∀y !P(x,y)
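The flipping rule can be checked by brute force on a small model. The sketch below is only an illustration: it assumes a two-element domain and enumerates every possible predicate P over it; the function names are made up for this example.

```python
from itertools import product

DOMAIN = (0, 1)  # tiny finite domain, an assumption made so we can enumerate everything

def all_predicates():
    """Yield every binary predicate P(x, y) over DOMAIN as a dict keyed by (x, y)."""
    pairs = list(product(DOMAIN, DOMAIN))
    for values in product((False, True), repeat=len(pairs)):
        yield dict(zip(pairs, values))

def negated(P):
    """!∀x∃y P(x,y)"""
    return not all(any(P[x, y] for y in DOMAIN) for x in DOMAIN)

def flipped(P):
    """∃x∀y !P(x,y): negation moved inward, each quantifier flipped."""
    return any(all(not P[x, y] for y in DOMAIN) for x in DOMAIN)

def pushed_in(P):
    """∀x∃y !P(x,y): negation pushed inward without flipping (the mistake)."""
    return all(any(not P[x, y] for y in DOMAIN) for x in DOMAIN)

preds = list(all_predicates())
always_equal = all(negated(P) == flipped(P) for P in preds)
counterexamples = [P for P in preds if negated(P) != pushed_in(P)]
print(always_equal, len(counterexamples))
```

If the rule holds, `always_equal` is true for every predicate, while `counterexamples` is non-empty: there are predicates for which the pushed-in version and the real negation disagree.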

Here are the findings of a 1981 article [1] from Archives of General Psychiatry (now JAMA Psychiatry), back in the days when the medical establishment was busy denouncing the Feingold diet:

Previous studies have not conclusively demonstrated behavioral effects of artificial food colorings ... This study, which was designed to maximize the likelihood of detecting a dietary effect, found none.

Now pay attention; this is the part everyone gets wrong, including most of the commenters below.

The methodology used in this study, and in most studies, is as follows:

1. Divide subjects into a test group and a control group.
2. Administer the intervention to the test group, and a placebo to the control group.
3. Take some measurement that is supposed to reveal the effect they are looking for.
4. Compute the mean and standard deviation of that measure for the test and control groups.
5. Do either a t-test or an F-test of the hypothesis that the intervention causes a statistically-significant effect on all subjects.
6. If the test succeeds, conclude that the intervention causes a statistically-significant effect (CORRECT).
7. If the test does not succeed, conclude that the intervention does not cause any effect to any subjects (ERROR).
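The steps above can be sketched in a few lines. This is a minimal illustration, not the analysis from the study: the measurements are invented, the test is Welch's t statistic computed by hand, and `CUTOFF` is a placeholder rather than a properly looked-up critical value.

```python
import math
import statistics

def welch_t(test, control):
    """Welch's two-sample t statistic for the difference in group means."""
    m1, m2 = statistics.mean(test), statistics.mean(control)
    v1, v2 = statistics.variance(test), statistics.variance(control)
    se = math.sqrt(v1 / len(test) + v2 / len(control))
    return (m1 - m2) / se

# Hypothetical hourly counts of shouting or jumping, invented for illustration.
test_group = [20, 23, 17, 25, 19, 21, 16, 22]
control_group = [17, 18, 15, 20, 16, 19, 14, 17]

t = welch_t(test_group, control_group)

# Placeholder critical value (an assumption); a real analysis would look it up
# for the chosen confidence level and degrees of freedom.
CUTOFF = 2.0

if t > CUTOFF:
    conclusion = "the group means differ"
else:
    # The only supportable statement on failure: nothing was detected.
    conclusion = "no difference in group means was detected"
print(round(t, 2), conclusion)
```

Note what the failing branch says. It reports that no difference in means was detected; it does not say that no subject was affected.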

People make the error because they forget to explicitly state what quantifiers they're using. Both the t-test and the F-test work by assuming that every subject has the same response function to the intervention:

response = effect + normally distributed error

where the effect is the same for every subject. If you don't understand why that is so, read the articles about the t-test and the F-test. The null hypothesis is that the responses of all subjects in both groups were drawn from the same distribution. The one-tailed versions of the tests take a confidence level C and compute a cutoff Z such that, if the null hypothesis is true,

P(average effect(test) - average effect(control) < Z) = C
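To make the cutoff concrete, here is a simplified sketch. It assumes the per-subject standard deviation is known and the groups are the same size, which turns the problem into a z-test rather than a t-test or F-test; `sigma` and `n_per_group` are hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist

def one_tailed_cutoff(sigma, n_per_group, confidence=0.95):
    """Cutoff Z for the difference in group means under the null hypothesis.

    Assumes a known per-subject standard deviation `sigma` and equal group
    sizes, so the difference in means is normal with mean 0 and standard
    error sigma * sqrt(2 / n_per_group).
    """
    se = sigma * math.sqrt(2.0 / n_per_group)
    return NormalDist(0.0, se).inv_cdf(confidence)

# Hypothetical numbers: standard deviation of 10 counts per hour, 34 children per group.
z = one_tailed_cutoff(sigma=10.0, n_per_group=34)
print(round(z, 2))
```

With these made-up inputs the cutoff comes out close to 4 counts per hour: under the null hypothesis, the observed difference in means falls below it 95% of the time.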

ADDED: People are making comments proving they don't understand how the F-test works. This is how it works: You are testing the hypothesis that two groups respond differently to food dye.

Suppose you measured the number of times a kid shouted or jumped, and you found that kids fed food dye shouted or jumped an average of 20 times per hour, and kids not fed food dye shouted or jumped an average of 17 times per hour. When you run your F-test, you compute that, assuming all kids respond to food dye the same way, you need a difference of 4 to conclude with 95% confidence that the two distributions (test and control) are different.

If the food dye kids had shouted/jumped 21 times per hour, the study would conclude that food dye causes hyperactivity. Because they shouted/jumped only 20 times per hour, it failed to prove that food dye affects hyperactivity. You can only conclude that food dye affects behavior with 84% confidence, rather than the 95% you desired.

Finding that food dye affects behavior with 84% confidence should not be presented as proof that food dye does not affect behavior!
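One way to see this is to compare how probable the observed difference is with and without an effect. The sketch below is an illustration under loud assumptions: the difference in means is treated as normally distributed, the standard error is backed out from the cutoff of 4, and the alternative effect size (`ALT_EFFECT`) is a hypothetical value picked for the example. A different alternative would give a different ratio.

```python
from statistics import NormalDist

CUTOFF = 4.0      # difference needed for 95% one-tailed significance (from the example)
OBSERVED = 3.0    # observed difference: 20 - 17 shouts or jumps per hour
ALT_EFFECT = 4.0  # hypothetical true effect, chosen only for illustration

# Standard error implied by the cutoff, assuming a normal sampling
# distribution for the difference in means.
se = CUTOFF / NormalDist().inv_cdf(0.95)

significant = OBSERVED >= CUTOFF

# Likelihood of the observed difference under "no effect" and under the
# hypothetical effect.
likelihood_null = NormalDist(0.0, se).pdf(OBSERVED)
likelihood_alt = NormalDist(ALT_EFFECT, se).pdf(OBSERVED)
ratio = likelihood_alt / likelihood_null
print(significant, round(ratio, 2))
```

The result is not significant, yet the ratio is greater than 1: under these assumptions the data are more probable if there is an effect than if there is none.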

If half your subjects have a genetic background that makes them resistant to the effect, the threshold for the t-test or F-test will be much too high to detect that. If 10% of kids become more hyperactive and 10% become less hyperactive after eating food coloring, such a methodology will never, ever detect it. A test done in this way can only accept or reject the hypothesis that for every subject x, the effect of the intervention is different than the effect of the placebo.
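A small simulation illustrates the 10%/10% case. All numbers here are invented: a baseline of 17 counts per hour with standard deviation 3, and a shift of 10 counts in either direction for the children who respond.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def simulate_child(treated):
    """Return one hypothetical hourly activity count."""
    baseline = random.gauss(17.0, 3.0)
    if not treated:
        return baseline
    r = random.random()
    if r < 0.10:
        return baseline + 10.0   # 10% become more hyperactive
    if r < 0.20:
        return baseline - 10.0   # 10% become less hyperactive
    return baseline              # the remaining 80% do not respond

N = 20000
control = [simulate_child(False) for _ in range(N)]
test = [simulate_child(True) for _ in range(N)]

# A comparison of means sees almost nothing...
mean_gap = statistics.mean(test) - statistics.mean(control)
# ...while the spread of the treated group is visibly larger.
sd_ratio = statistics.stdev(test) / statistics.stdev(control)
print(round(mean_gap, 2), round(sd_ratio, 2))
```

The difference in means stays near zero however many children are enrolled, because the two responding subgroups cancel. The effect shows up only in the spread, which a test of means does not examine.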

So. Rephrased to say precisely what the study found:

This study tested and rejected the hypothesis that artificial food coloring affects behavior in all children.

Converted to logic (ignoring time):

!( ∀child ( eats(child, coloring) ⇨ behaviorChange(child) ) )

Move the negation inside the quantifier:

∃child !( eats(child, coloring) ⇨ behaviorChange(child) )

Translated back into English, this study proved:

There exist children for whom artificial food coloring does not affect behavior.

However, this is the actual final sentence of that paper:

The results of this study indicate that artificial food colorings do not affect the behavior of school-age children who are claimed to be sensitive to these agents.

Translated into logic:

!∃child ( eats(child, coloring) ⇨ hyperactive(child) )

or, equivalently,

∀child !( eats(child, coloring) ⇨ hyperactive(child) )

This refereed medical journal article, like many others, made the same mistake as my undergraduate logic students, moving the negation across the quantifier without changing the quantifier. I cannot recall ever seeing a medical journal article prove a negation and not make this mistake when stating its conclusions.

A lot of people are complaining that I should just interpret their statement as meaning "Food colorings do not affect the behavior of MOST school-age children."

But they didn't prove that food colorings do not affect the behavior of most school-age children. They proved that there exists at least one child whose behavior food coloring does not affect. That isn't remotely close to what they have claimed.

For the record, the conclusion is wrong. Studies that did not assume that all children were identical, such as studies that used each child as his or her own control by randomly giving them cookies containing or not containing food dye [2], or a recent study that partitioned the children according to single-nucleotide polymorphisms (SNPs) in genes related to food metabolism [3], found large, significant effects in some children or some genetically-defined groups of children. Unfortunately, reviews failed to distinguish the logically sound from the logically unsound articles, and the medical community insisted that food dyes had no influence on behavior until thirty years after their influence had been repeatedly proven.

[1] Jeffrey A. Mattes & Rachel Gittelman (1981). Effects of Artificial Food Colorings in Children With Hyperactive Symptoms: A Critical Review and Results of a Controlled Study. Archives of General Psychiatry 38(6):714-718. doi:10.1001/archpsyc.1981.01780310114012.

[2] K.S. Rowe & K.J. Rowe (1994). Synthetic food coloring and behavior: a dose response effect in a double-blind, placebo-controlled, repeated-measures study. The Journal of Pediatrics 125(5 Pt 1):691-698.

[3] Stevenson, Sonuga-Barke, McCann et al. (2010). The Role of Histamine Degradation Gene Polymorphisms in Moderating the Effects of Food Additives on Children’s ADHD Symptoms. American Journal of Psychiatry 167:1108-1115.


191 Comments

TitaniumDragon (13y)

The problem is that you don't understand the purpose of the studies at all and you're violating several important principles which need to be kept in mind when applying logic to the real world.

Our primary goal is to determine net harm or benefit. If I do a study as to whether or not something causes harm or benefit, and see no change in underlying rates, then it is non-harmful. If it is making some people slightly more likely to get cancer, and others slightly less likely to get cancer, then there's no net harm - there are just as many cancers as there were before. I may have changed the distribution of cancers in the population, but I have certainly not caused any net harm to the population.

This study's purpose is to look at the net effect of the treatment. If we see the same amount of hyperactivity in the population prior to and after the study, then we cannot say that the dye causes hyperactivity in the general population.

"But," you complain, "Clearly some people are being harmed!" Well yes, some people are worse off after the treatment in such a theoretical case. But here's the key: for the effect NOT to show up in the general population, then you have only three major possibilities:

1) The people who are harmed are such a small portion of the population as to be statistically irrelevant.

2) There are just as many people who are benefitting from the treatment and as such NOT suffering from the metric in question, who would be otherwise, as there are people who would not be suffering from the metric without the treatment but are as a result of it. (this is extremely unlikely, as the magnitude of the effects would have to be extremely close to cancel out in this manner)

3) There is no effect.

If our purpose is to make [b]the best possible decision with the least possible amount of money spent[/b] (as it should always be), then a study on the net effect is the most efficient way of doing so. Testing every single possible SNP substitution is not possible, ergo, it is an irrational way to perform a study on the effects of anything. The only reason you would do such a study is if you had good reason to believe that a specific substitution had an effect either way.

Another major problem you run into when you try to run studies "your way" (more commonly known as "the wrong way") is the blue M&M problem. You see, if you take even 10 things, and test them for an effect, you have a 40% chance of finding at least one false correlation. This means that in order to have a high degree of confidence in the results of your study, you must increase the threshold for detection - massively. Not only do you have to account for the fact that you're testing more things, you also have to account for all the studies that don't get published which would contradict your findings (publication bias - people are far more likely to report positive effects than non-effects).
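The 40% figure can be reproduced with a short calculation. It assumes the 10 tests are independent and each uses a 5% false-positive rate; neither assumption is stated in the comment, and the function name is made up for this sketch.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of `num_tests` independent tests
    crosses the significance threshold when no real effect exists."""
    return 1.0 - (1.0 - alpha) ** num_tests

p = chance_of_false_positive(10)
print(round(p, 3))  # 0.401
```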

In other words, you are not actually making a rational criticism of these studies. In fact, you can see exactly where you go wrong:

[quote]If 10% of kids become more hyperactive and 10% become less hyperactive after eating food coloring, such a methodology will never, ever detect it.[/quote]

While possible, how [b]likely[/b] is this? The answer is "Not very." And given Occam's Razor, we can mostly discard this barring evidence to the contrary. And no, moronic parents are not evidence to the contrary; you will find all sorts of idiots who claim that all sorts of things that don't do anything do something. Anecdotes are not evidence.

This is a good example of someone trying to apply logic without actually trying to understand what the underlying problem is. Without understanding what is going on in the first place, you're in real trouble.

I will note that your specific example is flawed in any case; the idea that these people are in fact being affected is deeply controversial, and unfortunately a lot of it seems to involve the eternal crazy train (choo choo!) that somehow, magically, artificially produced things are more harmful than "naturally" produced things. Unfortunately this is largely based on the (obviously false and irrational) premise that things which are natural are somehow good for you, or things which are "artificial" are bad for you - something which has utterly failed to have been substantiated by and large. You should always automatically be deeply suspect of any such people, especially when you see "parents claim".

The reason that the FDA says that food dyes are okay is because there is no evidence to the contrary. Food dye does not cause hyperactivity according to numerous studies, and in fact the studies that fail to show the effect are massively more convincing than those which do due to publication bias and the weakness of the studies which claim positive effects.

Decius (13y)

Suppose there exists a medication that kills 10% of the rationalists who take it (but kills nobody of other thought patterns), and saves the lives of 10% of the people who take it, but only by preventing a specific type of heart disease that is equally prevalent in rationalists as in the general population.

A study on the general population would show benefits, while a study on rationalists would show no effects, and a study on people at high risk for a specific type of heart disease would show greater benefits.

Food dye is allegedly less than 95% likely to ...

PhilGoetz (13y)

Correct. But neither can we say that the dye does not cause hyperactivity in anyone. [...] Like that. That's what we can't say from the result of this study, and some other similar studies. For the reasons I explained in detail above. Your making the claim "no evidence to the contrary" shows that you have not read the literature, have not done a PubMed search on "ADHD, food dye", and have no familiarity with toxicity studies in general. There is always evidence to the contrary. An evaluation weighs the evidence on both sides. You can take any case where the FDA has said "There is no evidence that X", and look up the notes from the panel they held where they considered the evidence for X and decided that the evidence against X outweighed it. If you believe that there is no evidence that food dyes cause hyperactivity, fine. That is not the point of this post. This post analyzes the use of a statistical test in one study, and shows that it was used incorrectly to justify a conclusion which the data does not justify. [...] (A) I analyzed their use of math and logic in an attempt to prove a conclusion, and showed that they used them incorrectly and their conclusions are therefore not logically correct. They have not proven what they claim to have proven. (B) The answer is, "This is very likely." This is how studies turn out all the time, partly due to genetics. Different people have different genetics, different bacteria in their gut, different lifestyles, etc. This makes them metabolize food differently. It makes their brain chemistry different. Different people are different. [...] That's one of the problems I was pointing out! The F-test did not pass the threshold for detection. The threshold is set so that things that pass it are considered to be proven, NOT so that things that don't pass it are considered disproven. Because of the peculiar nature of an F-test, not passing the threshold is not even weak evidence that the hypothesis being tested is false.
