You claim that medical researchers are doing logical inference incorrectly. But they are in fact doing statistical inference and arguing inductively.
Statistical inference and inductive arguments belong in a Bayesian framework. You are making a straw man by translating them into a deductive framework.
Rephrased to say precisely what the study found:
This study tested and rejected the hypothesis that artificial food coloring causes hyperactivity in all children.
No. Mattes and Gittelman's finding is stronger than your rephrasing—your rephrasing omits evidence useful for Bayesian reasoners. For instance, they repeatedly pointed out that they “[studied] only children who were already on the Feingold diet and who were reported by their parents to respond markedly to artificial food colorings.” They claim that this is important because “the Feingold diet hypothesis did not originate from observations of carefully diagnosed children but from anecdotal reports on children similar to the ones we studied.” In other words, they are making an inductive argument:
[1] Jeffrey A. Mattes & Rachel Gittelman (1981). Effects of Artificial Food Colorings in Children With Hyperactive Symptoms: A Critical Review and Results of a Controlled Study. Archives of General Psychiatry 38(6):714-718. doi:10.1001/archpsyc.1981.01780310114012.
[2] K.S. Rowe & K.J. Rowe (1994). Synthetic Food Coloring and Behavior: A Dose Response Effect in a Double-Blind, Placebo-Controlled, Repeated-Measures Study. The Journal of Pediatrics 125(5 Pt 1):691-698.
[3] Stevenson, Sonuga-Barke, McCann et al. (2010). The Role of Histamine Degradation Gene Polymorphisms in Moderating the Effects of Food Additives on Children’s ADHD Symptoms. American Journal of Psychiatry 167:1108-1115.
I wouldn't have posted this if I'd noticed earlier links, but independent links are still useful.
Both the t-test and the F-test work by assuming that every subject has the same response function to the intervention:
response = effect + normally distributed error
where the effect is the same for every subject.
The F test / t test doesn't quite say that. It makes statements about population averages. More specifically, if you're comparing the mean of two groups, the t or F test says whether the average response of one group is the same as the other group. Heterogeneity just gets captured by the error term. In fact, econometricians define the error term as the difference between the true response and what their model says the mean response is (usually conditional on covariates).
The fact that the authors ignored potential heterogeneity in responses IS a problem for their analysis, but their result is still evidence against heterogeneous responses. If there really are heterogeneous responses, we should see them show up in the population average unless the subgroup effects cancel each other out, or the affected subgroup is too small for the test to detect.
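The cancellation case is easy to see directly. A minimal simulation sketch, with hypothetical numbers (the 17-events-per-hour baseline is borrowed from the post's example; the response mix is invented):

```python
# Minimal sketch, hypothetical numbers: heterogeneous effects that cancel
# in the mean are invisible to a test of group averages.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10_000  # children per group

# 10% of children respond +5 events/hour, 10% respond -5, 80% not at all.
effect = rng.choice([5.0, -5.0, 0.0], p=[0.1, 0.1, 0.8], size=n)

control = rng.normal(17, 4, size=n)
treated = rng.normal(17, 4, size=n) + effect

t, p = stats.ttest_ind(treated, control)
print(f"mean difference = {treated.mean() - control.mean():+.2f}, p = {p:.2f}")
# The mean difference sits near zero and p is typically large, even though
# 20% of the treated children changed behavior markedly.
```

(A test on variances would notice the widened spread of the treated group; that is exactly the kind of analysis a means-only comparison forgoes.)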
The results of this study indicate that artificial food colorings do not affect the behavior of school-age children who are claimed to be sensitive to these agents.
Translated into logic:
!∃child ( eats(child, coloring) ⇨ hyperactive(child) )
That's an uncharitable interpretation of that sentence. It would mean that if there were a word such as “any” before the phrase “school-age children”, but there isn't. The zero article before plural nouns in English doesn't generally denote a universal quantifier; “men are taller than women” doesn't mean ∀x ∈ {men} ∀y ∈ {women} x.height > y.height. The actual meaning of the zero article before plural nouns in English is context-dependent and non-trivial to formalize.
Are you a non-native English speaker by any chance? (So am I FWIW, but the definite article in my native language has a very similar meaning to the zero article in English in contexts like these.)
If whether this particular paper exemplifies this error is disputed (as it appears to be!) and the author's claim that he "cannot recall ever seeing a medical journal article prove a negation and not make this mistake" is correct, then it should be easy for the author to give several more examples which more clearly display the argument given here. I would encourage PhilGoetz or someone else to do so.
Previous studies have not conclusively demonstrated behavioral effects of artificial food colorings ... This study, which was designed to maximize the likelihood of detecting a dietary effect, found none.
Rephrased to say precisely what the study found:
This study tested and rejected the hypothesis that artificial food coloring causes hyperactivity in all children.
Interesting. Those two statements seem quite different; more than just a rephrasing.
Probabilistically, it sounds like the study tested P(hyper|dye) > P(hyper|~dye), rejected it, and correctly concluded P(hyper|dye) = P(hyper|~dye) (no connection).
I think your logical interpretation of their result throws out most of the information. Yes, they concluded that it is not true that all children who ate dye were hyperactive, but they also found that the proportion of dye-eaters who were hyperactive did not differ from the base rate. That is a much stronger statement: it implies their conclusion, but it can't be captured by the logical formulation you gave.
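That proportion-versus-base-rate comparison is a statistical object, not a quantifier. A toy sketch, with made-up counts (not from the paper), of the kind of test involved:

```python
# Sketch, hypothetical counts: comparing the proportion of hyperactive
# dye-eaters against the control proportion, which no ∃/∀ formula expresses.
from scipy import stats

hyper_dye, n_dye = 6, 30     # assumed counts among dye-eaters
hyper_ctl, n_ctl = 6, 30     # assumed counts among controls

# Fisher's exact test on the 2x2 table (hyperactive, not) x (dye, no dye)
table = [[hyper_dye, n_dye - hyper_dye], [hyper_ctl, n_ctl - hyper_ctl]]
odds, p = stats.fisher_exact(table)
print(f"p = {p:.2f}")  # large p: the proportions are indistinguishable here
```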
Translated back into English, this study proved:
There exist children for whom artificial food coloring does not affect behavior.
The whole point of inductive reasoning is that this is evidence for artificial food coloring not affecting the behavior of any children (given a sufficiently large sample). You cannot do purely deductive reasoning about the real world and expect to get anything meaningful. This should be obvious.
The problem is that you don't understand the purpose of the studies at all and you're violating several important principles which need to be kept in mind when applying logic to the real world.
Our primary goal is to determine net harm or benefit. If I do a study as to whether something causes harm or benefit, and see no change in underlying rates, then it causes no net harm. If it is making some people slightly more likely to get cancer, and others slightly less likely to get cancer, then there's no net harm - there are just as many cancers as there were before. I may have changed the distribution of cancers in the population, but I have certainly not caused any net harm to the population.
This study's purpose is to look at the net effect of the treatment. If we see the same amount of hyperactivity in the population prior to and after the study, then we cannot say that the dye causes hyperactivity in the general population.
"But," you complain, "Clearly some people are being harmed!" Well yes, some people are worse off after the treatment in such a theoretical case. But here's the key: for the effect NOT to show up in the general population, then you have only ...
I've similarly griped in the past about the mistaken ways medical tests are analyzed here and elsewhere, but I think you overcomplicated things.
The fundamental error is misinterpreting a failure to reject a null hypothesis for a particular statistical test, a particular population, and a particular treatment regime as a generalized demonstration of the null hypothesis that the medication "doesn't work". And yes, you see it very often, and almost universally in press accounts.
You make a good point about how modeling response = effect + error leads to confusion. I think the mistake is clearer written as "response = effect + noise", where noise is taken as a random process injecting ontologically inscrutable perturbations of the response. If you start with the assumption that all differences from the mean effect are due to ontologically inscrutable magic, you've ruled out any analysis of that variation by construction.
If that meant the same thing, then so would these claims
OK, I may be dense today, but you lost me there. I tried to puzzle out how the raven sentences could be put symbolically so that they each corresponded to one of the negations of your original logic sentence, and found that fruitless. Please clarify?
The rest of the post made sense. I'll read through the comments and figure out why people seem to be disagreeing first, which will give me time to think whether to upvote.
If 11 out of 11 children studied have a property (no food coloring hyperactivity response), that's a bit stronger than "there exist 11 children with this property", though perhaps not quite "all children have this property".
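One way to quantify "a bit stronger", as a sketch using the comment's 11-of-11 figure: an exact one-sided binomial bound on the fraction of responders.

```python
# Sketch: if the true fraction of responders is p, the chance of observing
# 0 responders among 11 children is (1 - p)**11. The largest p consistent
# with that observation at the 95% level:
n = 11
p_upper = 1 - 0.05 ** (1 / n)
print(f"95% upper bound on the responder fraction: {p_upper:.1%}")  # ~23.8%
# So 11-of-11 rules out "more than roughly a quarter of such children
# respond", but falls well short of "no children respond".
```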
You can only conclude that food dye affects behavior with 84% confidence, rather than the 95% you desired.
Or rather, you can conclude that, if there were no effect of food dye on hyperactivity and we did this test a whole lotta times, then we'd get data like this 16% of the time, rather than beneath the 5%-of-the-time maximum cutoff you were hoping for.
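That reading can be checked concretely by simulation. A sketch with assumed numbers (the paper's group sizes and SDs are not given here; these are chosen only so the example lands near 16%):

```python
# Simulate many experiments under the null (no effect) and count how often
# the difference in group means is at least as large as the one observed.
import numpy as np

rng = np.random.default_rng(1)
n, sd, observed_diff = 20, 6.75, 3.0   # assumed values, not from the paper
trials = 100_000

a = rng.normal(17, sd, (trials, n)).mean(axis=1)
b = rng.normal(17, sd, (trials, n)).mean(axis=1)
frac = (np.abs(a - b) >= observed_diff).mean()
print(f"null experiments with a difference this large: {frac:.2f}")  # ~0.16
# The "84% confidence" is a statement about data frequencies under the
# null, not a probability that the hypothesis itself is true.
```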
It's not so easy to jump from frequentist confidence intervals to confidence for or against a hypothesis. We'd need a bunch of assumptions. I don't have access to the original article so I'll just make ...
I think part of the problem is that there is a single confidence threshold, usually 95%. Setting the threshold high enough to compensate for random flukes and file-drawer effects causes problems when people start interpreting anything just below the threshold as meaning the null hypothesis has been proven. Maybe it would be better to have two thresholds, with results between them interpreted as inconclusive.
This post makes a point that is both correct and important. It should be in Main.
This post makes a point that is both correct and important. A post that makes this point should be in Main.
The reception of this post indicates that the desired point is not coming through to the target audience. That matters.
Not even that. It takes the zero-article plural as used in everyday language and pretends it is intended to be precisely the same as the logical "all" operator, which of course it is not.
Does what you're saying here boil down to "failing to reject the null (H0) does not entail rejecting the alternative (H1)"? I have read this before elsewhere, but not framed in quantifier language.
I think the picture is not actually so grim: the study does reject an entire class of (distributions of) effects on the population.
Specifically, it cannot be the case (with 95% certainty or whatever) that a significant proportion of children are made hyperactive, while the remainder are unaffected. This does leave a few possibilities:
Only a small fraction of the children were affected by the intervention.
Although a significant fraction of the children were affected by the intervention in one direction, the remainder were affected in the opposite direction.
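The first possibility can be bounded with back-of-the-envelope arithmetic. A sketch with assumed numbers (the detection threshold of 4 is borrowed from the post's example; the per-responder effect size is hypothetical):

```python
# If a fraction f of children shift by d points and the rest shift by 0,
# the group mean shifts by f * d. The study only rules out scenarios where
# that product clears its detection threshold.
threshold = 4.0   # minimum detectable mean difference (post's example)
d = 10.0          # assumed effect size per responding child

f_max = threshold / d
print(f"responder fractions below {f_max:.0%} are invisible to the test")
# Half the children shifting by 10 (mean shift 5) would have been caught;
# 10% of them shifting by 10 (mean shift 1) would not.
```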
When people do studies of the effects of food coloring on children, are the children blindfolded?
That is, can the studies discern the neurochemical effects of coloring molecules from the psychological effects of eating brightly-colored food?
I expect that beige cookies are not as exciting as vividly orange cookies.
My read of the Mattes & Gittelman paper is that they're comparing natural and artificial food coloring.
Moreover, no type of rater (parents, teachers, psychiatrists, or children) guessed the type of cookie beyond chance.
The tests compute a difference in magnitude of response such that, 95% of the time, if the measured effect difference is that large, the null hypothesis (that the responses of all subjects in both groups were drawn from the same distribution) is false.
I think that should be: the tests compute a difference in magnitude of response such that, if the null hypothesis is true, then 95% of the time the measured effect difference will not be that large.
Frequentist statistics cannot make the claim that with some probability the null hypothesis is true or false. Ever. You must have a prior and invoke Bayes' theorem to do that.
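The corrected statement is exactly what the standard machinery computes. A minimal sketch, with an assumed group size and SD (the paper's values are not given here):

```python
# Find the cutoff Z such that, if the null hypothesis is true, the measured
# difference in group means stays below Z in 95% of experiments.
import numpy as np
from scipy import stats

n, sd = 20, 6.75                            # assumed per-group size and SD
se = sd * np.sqrt(2 / n)                    # standard error of the mean difference
Z = stats.t.ppf(0.95, df=2 * n - 2) * se    # one-tailed 95% cutoff
print(f"cutoff Z = {Z:.2f}")
# This is a claim about data given the null; turning it into a probability
# that the null is true or false requires a prior and Bayes' theorem.
```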
I'm not as interested in proving my point, as in figuring out why people resist it so strongly. It seems people are eager to disagree with me and reluctant to agree with me.
How did the post make you feel, and why?
I've found previously that many people here are extremely hostile to criticisms of the statistical methods of the medical establishment. It's extremely odd at a site that puts Jaynes on a pedestal, as no one rants more loudly and makes the case clearer than Jaynes did, but there it is.
(Eliezer does, anyway. I can't say I see very many quotes or invocations from others.)
I am hostile to some criticisms, because in some cases when I see them being done online, it's not in the spirit of 'let us understand how these methods make this research fundamentally flawed, what this implies, and how much we can actually extract from this research'*, but in the spirit of 'the earth is actually not spherical but an oblate spheroid thus you have been educated stupid and Time has Four Corners!' Because the standard work has flaws, they feel free to jump to whatever random bullshit they like best. 'Everything is true, nothing is forbidden.'
* eg. although extreme and much more work than I realistically expect anyone to do, I regard my dual n-back meta-analysis as a model of how to react to potentially valid criticisms. Instead of learning that passive control groups are a serious methodological iss...
I think that the universal quantifier in
!( ∀child ( eats(child, coloring) ⇨ hyperactive(child) ) )
is not appropriate.
The original statement
artificial food coloring causes hyperactivity in all children.
only implies that artificial food coloring was responsible for all children's hyperactivity, not that children who ever ate artificial food coloring would inevitably become hyperactive. So the formula without the universal quantifier is more reasonable, and thus the final statement of the article is unproblematic.
It would've been very helpful if some sort of glossary or even a Wikipedia link was provided before diving into the use of the notational characters such as those used in "∀x !P(x)".
Although this post covers an important topic, the first few sentences almost lost me completely, even though I learned what all those characters meant at one time.
And, as LessWrong is rather enamored with statistics, consider that by writing P(x,y), the readers have an exactly 50% chance of getting the opposite meaning unless they have very good recall. :)
This refereed medical journal article, like many others, made the same mistake as my undergraduate logic students, moving the negation across the quantifier without changing the quantifier. I cannot recall ever seeing a medical journal article prove a negation and not make this mistake when stating its conclusions.
That would be interesting if true. I recommend finding another one, since you say they're so plentiful. And I also recommend reading it carefully, as the study you chose to make an example of is not the study you were looking for. (If you don...
Unfortunately, there's an error in your logic: You call that type of medical journal article error "universal", i.e. applicable in all cases. Clearly a universal quantifier if I ever saw one.
That means that for all medical journal articles, it is true that they contain that error.
However, there exists a medical journal article that does not contain that error.
Hence the medical journal error is not universal, in contradiction to the title.
First logical error ... and we're not even out of the title? Oh dear.
I used to teach logic to undergraduates, and they regularly made the same simple mistake with logical quantifiers. Take the statement "For every X there is some Y such that P(X,Y)" and represent it symbolically:
∀x∃y P(x,y)
Now negate it:
!∀x∃y P(x,y)
You often don't want a negation to be outside quantifiers. My undergraduates would often just push it inside, like this:
∀x∃y !P(x,y)
If you could just move the negation inward like that, then these claims would mean the same thing:
A) Not everything is a raven: !∀x raven(x)
B) Everything is not a raven: ∀x !raven(x)
To move a negation inside quantifiers, flip each quantifier that you move it past.
!∀x∃y P(x,y) = ∃x!∃y P(x,y) = ∃x∀y !P(x,y)
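The rule can be machine-checked on a finite domain, since all/any mirror ∀/∃ there. A toy sketch (not part of the original argument) where naive pushing gives the wrong answer:

```python
# Finite-domain check: moving the negation while flipping quantifiers
# preserves meaning; merely pushing it inside does not.
X = Y = range(2)
P = lambda x, y: x == 0      # P ignores y: true exactly when x is 0

lhs   = not all(any(P(x, y) for y in Y) for x in X)   # !∀x∃y P(x,y)
rhs   = any(all(not P(x, y) for y in Y) for x in X)   # ∃x∀y !P(x,y)  (flip both)
naive = all(any(not P(x, y) for y in Y) for x in X)   # ∀x∃y !P(x,y)  (just push)

print(lhs, rhs, naive)   # True True False: flipping agrees, pushing does not
```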
Here are the findings of a 1981 article [1] from JAMA Psychiatry (formerly Archives of General Psychiatry), back in the days when the medical establishment was busy denouncing the Feingold diet:
Previous studies have not conclusively demonstrated behavioral effects of artificial food colorings ... This study, which was designed to maximize the likelihood of detecting a dietary effect, found none.
Now pay attention; this is the part everyone gets wrong, including most of the commenters below.
The methodology used in this study, and in most studies, is as follows:
People make the error because they forget to explicitly state what quantifiers they're using. Both the t-test and the F-test work by assuming that every subject has the same response function to the intervention:
response = effect + normally distributed error
where the effect is the same for every subject. If you don't understand why that is so, read the articles about the t-test and the F-test. The null hypothesis is that the responses of all subjects in both groups were drawn from the same distribution. The one-tailed versions of the tests take a confidence level C and compute a cutoff Z such that, if the null hypothesis is false,
P( averageEffect(test) − averageEffect(control) < Z ) = C
ADDED: People are making comments proving they don't understand how the F-test works. This is how it works: You are testing the hypothesis that two groups respond differently to food dye.
Suppose you measured the number of times a kid shouted or jumped, and you found that kids fed food dye shouted or jumped an average of 20 times per hour, and kids not fed food dye shouted or jumped an average of 17 times per hour. When you run your F-test, you compute that, assuming all kids respond to food dye the same way, you need a difference of 4 to conclude with 95% confidence that the two distributions (test and control) are different.
If the food dye kids had shouted/jumped 21 times per hour, the study would conclude that food dye causes hyperactivity. Because they shouted/jumped only 20 times per hour, it failed to prove that food dye affects hyperactivity. You can only conclude that food dye affects behavior with 84% confidence, rather than the 95% you desired.
Finding that food dye affects behavior with 84% confidence should not be presented as proof that food dye does not affect behavior!
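For the record, the arithmetic behind numbers like these, as a sketch: the means (20 vs 17) and the threshold of 4 come from the example above; the implied standard error is backed out from the threshold, and the exact confidence depends on unstated details such as group size and one- versus two-tailedness.

```python
# Back out the standard error implied by a two-tailed 95% threshold of 4,
# then compute the confidence actually attained by an observed difference of 3.
from scipy import stats

se = 4 / stats.norm.ppf(0.975)            # threshold = z_crit * se
z = (20 - 17) / se                        # observed standardized difference
confidence = 1 - 2 * stats.norm.sf(z)     # two-tailed confidence attained
print(f"attained confidence ~ {confidence:.0%}")   # ~86% under these assumptions
# In the neighborhood of the 84% quoted above, and either way short of 95%.
```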
If half your subjects have a genetic background that makes them resistant to the effect, the threshold for the t-test or F-test will be much too high to detect that. If 10% of kids become more hyperactive and 10% become less hyperactive after eating food coloring, such a methodology will never, ever detect it. A test done in this way can only accept or reject the hypothesis that for every subject x, the effect of the intervention is different than the effect of the placebo.
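A quick power simulation makes the resistant-half case concrete. A sketch with hypothetical numbers (a genuine effect in half the children, none in the other half):

```python
# With half the subjects resistant, the group mean effect is halved, and a
# test calibrated for a homogeneous effect loses most of its power.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, sd, effect, trials = 30, 5.0, 4.0, 2000

hits = 0
for _ in range(trials):
    control = rng.normal(17, sd, n)
    responders = rng.normal(17 + effect, sd, n // 2)   # half respond fully
    resistant = rng.normal(17, sd, n - n // 2)         # half don't respond
    t, p = stats.ttest_ind(np.concatenate([responders, resistant]), control)
    hits += (p < 0.05) and (t > 0)

print(f"detection rate with a 50%-resistant group: {hits / trials:.0%}")
# Rerun with every subject responding and the detection rate climbs
# dramatically: the heterogeneity silently drains the test's power.
```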
So. Rephrased to say precisely what the study found:
This study tested and rejected the hypothesis that artificial food coloring causes hyperactivity in all children.
Converted to logic (ignoring time):
!( ∀child ( eats(child, coloring) ⇨ behaviorChange(child) ) )
Move the negation inside the quantifier:
∃child !( eats(child, coloring) ⇨ behaviorChange(child) )
Translated back into English, this study proved:
There exist children for whom artificial food coloring does not affect behavior.
However, this is the actual final sentence of that paper:
The results of this study indicate that artificial food colorings do not affect the behavior of school-age children who are claimed to be sensitive to these agents.
Translated into logic:
!∃child ( eats(child, coloring) ⇨ hyperactive(child) )
or, equivalently,
∀child !( eats(child, coloring) ⇨ hyperactive(child) )
This refereed medical journal article, like many others, made the same mistake as my undergraduate logic students, moving the negation across the quantifier without changing the quantifier. I cannot recall ever seeing a medical journal article prove a negation and not make this mistake when stating its conclusions.
A lot of people are complaining that I should just interpret their statement as meaning "Food colorings do not affect the behavior of MOST school-age children."
But they didn't prove that food colorings do not affect the behavior of most school-age children. They proved that there exists at least one child whose behavior food coloring does not affect. That isn't remotely close to what they have claimed.
For the record, the conclusion is wrong. Studies that did not assume that all children were identical, such as studies that used each child as his or her own control by randomly giving them cookies containing or not containing food dye [2], or a recent study that partitioned the children according to single-nucleotide polymorphisms (SNPs) in genes related to food metabolism [3], found large, significant effects in some children or some genetically-defined groups of children. Unfortunately, reviews failed to distinguish the logically sound from the logically unsound articles, and the medical community insisted that food dyes had no influence on behavior until thirty years after their influence had been repeatedly proven.
[1] Jeffrey A. Mattes & Rachel Gittelman (1981). Effects of Artificial Food Colorings in Children With Hyperactive Symptoms: A Critical Review and Results of a Controlled Study. Archives of General Psychiatry 38(6):714-718. doi:10.1001/archpsyc.1981.01780310114012.
[2] K.S. Rowe & K.J. Rowe (1994). Synthetic Food Coloring and Behavior: A Dose Response Effect in a Double-Blind, Placebo-Controlled, Repeated-Measures Study. The Journal of Pediatrics 125(5 Pt 1):691-698.
[3] Stevenson, Sonuga-Barke, McCann et al. (2010). The Role of Histamine Degradation Gene Polymorphisms in Moderating the Effects of Food Additives on Children’s ADHD Symptoms. American Journal of Psychiatry 167:1108-1115.