I think you're interpreting the F test a little more strictly than you should. Isn't it fairer to say a null result on an F test is "It is not the case that for most x, P(x)", with "most" defined in a particular way?
You're correct that an F-test is miserable at separating out different classes of responders. (In fact, it should be easy to develop a test that does separate out different classes of responders; I'll have to think about that. Maybe just fit a GMM with three modes in a way that tries to maximize the distance between the modes?)
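As a rough illustration of the three-mode idea, here is a minimal sketch of a one-dimensional Gaussian mixture fitted by plain EM. Everything in it is an assumption made for illustration: the function name `fit_gmm_1d`, the synthetic per-child score changes (10% worse, 80% unchanged, 10% better), and the initialisation at the minimum, median, and maximum of the data. It is ordinary maximum-likelihood EM and adds no term that pushes the modes apart.

```python
import math
import random


def fit_gmm_1d(data, iterations=200):
    """Fit a 3-component 1-D Gaussian mixture with plain EM (illustrative sketch)."""
    k = 3
    n = len(data)
    srt = sorted(data)
    # Hypothetical initialisation: one mode at each extreme, one at the median.
    means = [srt[0], srt[n // 2], srt[-1]]
    overall_mean = sum(data) / n
    var0 = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [var0] * k
    weights = [1.0 / k] * k

    for _ in range(iterations):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in data:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300  # guard against division by zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)  # keep variances strictly positive
    return weights, means, variances


# Synthetic change in hyperactivity score per child (made-up numbers).
random.seed(1)
changes = (
    [random.gauss(-6.0, 1.0) for _ in range(100)]    # children who improve
    + [random.gauss(0.0, 1.0) for _ in range(800)]   # children who do not respond
    + [random.gauss(6.0, 1.0) for _ in range(100)]   # children who get worse
)

weights, means, variances = fit_gmm_1d(changes)
for w, m, v in sorted(zip(weights, means, variances), key=lambda t: t[1]):
    print(f"weight={w:.2f}  mean={m:+.2f}  sd={math.sqrt(v):.2f}")
```

The synthetic classes here are deliberately well separated. With small samples or overlapping classes, the fitted modes would be much less trustworthy.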
But I think the detail that you suppressed for brevity also makes a significant difference in how the results are interpreted. This paper doesn't make the mistake of saying "artificial food coloring does not cause hyperactivity in every child, therefore artificial food coloring affects no children." The paper says "artificial food coloring does not cause hyperactivity in every child whose parents confidently expect them to respond negatively to artificial food coloring, therefore their parents' expectation is mistaken at the 95% confidence level."
Now, it could be the case that there are children who do respond negatively to artificial food coloring, but the Feingold association is terrible at finding them and at screening out the children on whom it has no effect. (This is unsurprising from a Hawthorne Effect or confirmation bias perspective.) As well, for small sample sizes, it seems better to use F and t tests than to try to separate out the various classes of responders, because the class sizes will be tiny; if one child responds poorly after being given artificial food dye, that's not much to go on, compared to a distinct subpopulation of 20 children in a sample of 1000.
The section of the paper where they describe their reference class:
If artificial additives affect only a small proportion of hyperactive children, significant dietary effects are unlikely to be detected in heterogeneous samples of hyperactive children. Therefore, children who had been placed on the Feingold diet by their parents and who were reported by their parents to have derived marked behavioral benefit from the diet and to experience marked deterioration when given artificial food colorings were targeted for this study. This sampling approach, combined with high dosage, was chosen to maximize the likelihood of observing behavioral deterioration with ingestion of artificial colorings.
(I should add that the first sentence is especially worth contemplating, here.)
I think I disagree with both of you here. The failure to reject a null hypothesis is a failure. It doesn't allow or even encourage you to conclude anything.
I used to teach logic to undergraduates, and they regularly made the same simple mistake with logical quantifiers. Take the statement "For every X there is some Y such that P(X,Y)" and represent it symbolically:
∀x∃y P(x,y)
Now negate it:
!∀x∃y P(x,y)
You often don't want a negation to be outside quantifiers. My undergraduates would often just push it inside, like this:
∀x∃y !P(x,y)
If you could just move the negation inward like that, then these claims would mean the same thing:
A) Not everything is a raven: !∀x raven(x)
B) Everything is not a raven: ∀x !raven(x)
To move a negation inside quantifiers, flip each quantifier that you move it past.
!∀x∃y P(x,y) = ∃x!∃y P(x,y) = ∃x∀y !P(x,y)
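The flip rule can be checked by brute force on a small finite domain. The sketch below is only an illustration (the function names and the two-element domains are my own choices): it enumerates every predicate P over a 2×2 domain and compares the negated statement with the correctly flipped form and with the naive "just push it inside" form.

```python
from itertools import product


def negation_outside(P, xs, ys):
    """not (for all x there exists y such that P(x, y))"""
    return not all(any(P(x, y) for y in ys) for x in xs)


def negation_flipped(P, xs, ys):
    """there exists x such that for all y, not P(x, y)  (quantifiers flipped)"""
    return any(all(not P(x, y) for y in ys) for x in xs)


def negation_naive(P, xs, ys):
    """for all x there exists y such that not P(x, y)  (quantifiers left alone)"""
    return all(any(not P(x, y) for y in ys) for x in xs)


xs = [0, 1]
ys = [0, 1]
pairs = list(product(xs, ys))

flipped_always_agrees = True
mismatches = []  # truth tables where the naive form gives the wrong answer

# Enumerate all 16 possible truth tables for P on the 2x2 domain.
for values in product([False, True], repeat=len(pairs)):
    table = dict(zip(pairs, values))
    P = lambda x, y, t=table: t[(x, y)]
    target = negation_outside(P, xs, ys)
    if negation_flipped(P, xs, ys) != target:
        flipped_always_agrees = False
    if negation_naive(P, xs, ys) != target:
        mismatches.append(table)

print("flipped form always agrees:", flipped_always_agrees)
print("naive form disagrees on", len(mismatches), "of 16 predicates")
```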
Here are the findings of a 1981 article [1] from JAMA Psychiatry (formerly Archives of General Psychiatry), back in the days when the medical establishment was busy denouncing the Feingold diet:
Now pay attention; this is the part everyone gets wrong, including most of the commenters below.
The methodology used in this study, and in most studies, is as follows:
People make the error because they forget to explicitly state what quantifiers they're using. Both the t-test and the F-test work by assuming that every subject has the same response function to the intervention:
response = effect + normally distributed error
where the effect is the same for every subject. If you don't understand why that is so, read the articles about the t-test and the F-test. The null hypothesis is that the responses of all subjects in both groups were drawn from the same distribution. The one-tailed versions of the tests take a confidence level C and compute a cutoff Z such that, if the null hypothesis is true,
P( average effect(test) - average effect(control) < Z ) = C
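To make the cutoff concrete, here is a minimal sketch that computes Z under the null hypothesis using a normal approximation. The assumptions are mine, not the paper's: the per-subject standard deviation `sigma` is treated as known, both groups have the same size, and the numbers are made up. A real t-test would estimate the variance from the data and use Student's t distribution instead.

```python
from statistics import NormalDist


def one_tailed_cutoff(sigma, n_per_group, confidence):
    """Cutoff Z for the difference in group means, assuming the null is true.

    Under the null, both groups come from one distribution with standard
    deviation sigma, so the difference of the two group means is roughly
    normal with mean 0 and standard error sigma * sqrt(2 / n_per_group).
    """
    standard_error = sigma * (2.0 / n_per_group) ** 0.5
    return NormalDist(mu=0.0, sigma=standard_error).inv_cdf(confidence)


# Hypothetical numbers: sigma = 5 events/hour, 50 children per group, C = 0.95.
cutoff = one_tailed_cutoff(sigma=5.0, n_per_group=50, confidence=0.95)
print(f"observed difference must exceed {cutoff:.3f} to reject the null")
```

Note that the cutoff depends only on the spread and the sample size; nothing in it asks whether individual subjects respond differently from one another.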
ADDED: People are making comments proving they don't understand how the F-test works. This is how it works: You are testing the hypothesis that two groups respond differently to food dye.
Suppose you measured the number of times a kid shouted or jumped, and you found that kids fed food dye shouted or jumped an average of 20 times per hour, and kids not fed food dye shouted or jumped an average of 17 times per hour. When you run your F-test, you compute that, assuming all kids respond to food dye the same way, you need a difference of 4 to conclude with 95% confidence that the two distributions (test and control) are different.
If the food dye kids had shouted/jumped 21 times per hour, the study would conclude that food dye causes hyperactivity. Because they shouted/jumped only 20 times per hour, it failed to prove that food dye affects hyperactivity. You can only conclude that food dye affects behavior with 84% confidence, rather than the 95% you desired.
Finding that food dye affects behavior with 84% confidence should not be presented as proof that food dye does not affect behavior!
If half your subjects have a genetic background that makes them resistant to the effect, the threshold for the t-test or F-test will be much too high to detect that. If 10% of kids become more hyperactive and 10% become less hyperactive after eating food coloring, such a methodology will never, ever detect it. A test done in this way can only accept or reject the hypothesis that for every subject x, the effect of the intervention is different than the effect of the placebo.
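A quick simulation shows the 10%/10% case. This is a sketch with made-up numbers (baseline of 17 events per hour, standard deviation 4, individual effects of ±6, 10,000 children per group), not a reanalysis of any study: a fifth of the treated children respond strongly, yet the difference in group means sits near zero.

```python
import random
import statistics

random.seed(0)
N = 10_000  # hypothetical number of children per group


def baseline():
    """Events per hour for a child with no dye effect (made-up distribution)."""
    return random.gauss(17.0, 4.0)


def individual_effect():
    """10% of children get worse, 10% get better, 80% do not respond."""
    u = random.random()
    if u < 0.10:
        return 6.0
    if u < 0.20:
        return -6.0
    return 0.0


control = [baseline() for _ in range(N)]
effects = [individual_effect() for _ in range(N)]
treated = [baseline() + e for e in effects]

mean_diff = statistics.fmean(treated) - statistics.fmean(control)
standard_error = (
    statistics.variance(treated) / N + statistics.variance(control) / N
) ** 0.5
t_stat = mean_diff / standard_error  # Welch-style statistic for the mean difference
fraction_affected = sum(e != 0.0 for e in effects) / N
variance_ratio = statistics.variance(treated) / statistics.variance(control)

print(f"fraction of treated children affected: {fraction_affected:.2f}")
print(f"difference in group means: {mean_diff:+.3f} (t = {t_stat:+.2f})")
print(f"variance ratio treated/control: {variance_ratio:.2f}")
```

The effect does show up in the spread of the treated group, but a test that only compares means never looks there.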
So. Rephrased to say precisely what the study found:
Converted to logic (ignoring time):
!( ∀child ( eats(child, coloring) ⇨ behaviorChange(child) ) )
Move the negation inside the quantifier:
∃child !( eats(child, coloring) ⇨ behaviorChange(child) )
Translated back into English, this study proved:
However, this is the actual final sentence of that paper:
Translated into logic:
!∃child ( eats(child, coloring) ⇨ hyperactive(child) )
or, equivalently,
∀child !( eats(child, coloring) ⇨ hyperactive(child) )
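The gap between the two formulas is easy to see on a toy data set. The three children below are invented purely for illustration, and a single `responds` flag stands in for both behaviorChange and hyperactive. All three eat the coloring, and only one fails to respond. That single child makes "not every child responds" true, while "no child responds" is plainly false.

```python
def implies(p, q):
    """Material implication: p ⇨ q."""
    return (not p) or q


# Hypothetical children: all eat the coloring, two of the three respond.
children = [
    {"name": "A", "eats": True, "responds": True},
    {"name": "B", "eats": True, "responds": False},
    {"name": "C", "eats": True, "responds": True},
]

# !( ∀child ( eats ⇨ responds ) )   what a failed test of "every child" supports
not_for_all = not all(implies(c["eats"], c["responds"]) for c in children)

# ∃child !( eats ⇨ responds )       the correct result of moving the negation inward
exists_not = any(not implies(c["eats"], c["responds"]) for c in children)

# ∀child !( eats ⇨ responds )       the form with the quantifier left unflipped
for_all_not = all(not implies(c["eats"], c["responds"]) for c in children)

print("not every child responds:    ", not_for_all)
print("some child does not respond: ", exists_not)
print("no child responds:           ", for_all_not)
```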
This refereed medical journal article, like many others, made the same mistake as my undergraduate logic students, moving the negation across the quantifier without changing the quantifier. I cannot recall ever seeing a medical journal article prove a negation and not make this mistake when stating its conclusions.
A lot of people are complaining that I should just interpret their statement as meaning "Food colorings do not affect the behavior of MOST school-age children."
But they didn't prove that food colorings do not affect the behavior of most school-age children. They proved that there exists at least one child whose behavior food coloring does not affect. That isn't remotely close to what they have claimed.
For the record, the conclusion is wrong. Studies that did not assume that all children were identical, such as studies that used each child as his or her own control by randomly giving them cookies containing or not containing food dye [2], or a recent study that partitioned the children according to single-nucleotide polymorphisms (SNPs) in genes related to food metabolism [3], found large, significant effects in some children or some genetically-defined groups of children. Unfortunately, reviews failed to distinguish the logically sound from the logically unsound articles, and the medical community insisted that food dyes had no influence on behavior until thirty years after their influence had been repeatedly proven.
[1] Jeffrey A. Mattes & Rachel Gittelman (1981). Effects of Artificial Food Colorings in Children With Hyperactive Symptoms: A Critical Review and Results of a Controlled Study. Archives of General Psychiatry 38(6):714-718. doi:10.1001/archpsyc.1981.01780310114012.
[2] K.S. Rowe & K.J. Rowe (1994). Synthetic food coloring and behavior: a dose response effect in a double-blind, placebo-controlled, repeated-measures study. The Journal of Pediatrics 125(5 Pt 1):691-698.
[3] Stevenson, Sonuga-Barke, McCann et al. (2010). The Role of Histamine Degradation Gene Polymorphisms in Moderating the Effects of Food Additives on Children’s ADHD Symptoms. Am J Psychiatry 167:1108-1115.