I don't think you responded to my criticisms and I have nothing further to add. However, there are a few critical mistakes in what you have added that you need to correct:
Now pay attention; this is the part everyone gets wrong, including most of the commenters below.
The methodology used in this study, and in most studies, is as follows:
- Divide subjects into a test group and a control group.
No, Mattes and Gittelman ran an order-randomized crossover study. In crossover studies, subjects serve as their own controls and they are not partitioned into test and control groups.
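To illustrate what "subjects serve as their own controls" means in practice, here is a minimal sketch of a paired (within-subject) analysis. The counts are made up for illustration and are not data from Mattes and Gittelman; the paired t statistic is one common way to analyze such a design, not necessarily the analysis they ran.

```python
import math
import statistics

# Hypothetical shouts/jumps per hour for six children, each measured
# under BOTH conditions (order randomized in a real crossover study).
dye = [20, 18, 25, 17, 22, 19]
placebo = [17, 18, 21, 16, 20, 16]

# Each child is compared with themselves, so the unit of analysis is
# the within-child difference, not a difference between two groups.
diffs = [d - p for d, p in zip(dye, placebo)]

mean_diff = statistics.fmean(diffs)
se_diff = statistics.stdev(diffs) / math.sqrt(len(diffs))
t_stat = mean_diff / se_diff  # paired t statistic, len(diffs) - 1 degrees of freedom

print(f"mean within-child difference = {mean_diff:.2f}, paired t = {t_stat:.2f}")
```

There is no test group or control group anywhere in this calculation; between-child variability cancels out of the differences.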
If you don't understand why that is so, read the articles about the t-test and the F-test. The tests compute a difference in magnitude of response such that, 95% of the time, if the measured effect difference is that large, the null hypothesis (that the responses of all subjects in both groups were drawn from the same distribution) is false.
No, the correct form is:
- The tests compute a difference in magnitude of response such that if the null hypothesis is true, then 95% of the time the measured effect is not that large.
The form you quoted is a deadly undergraduate mistake.
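The corrected statement can be checked with a small simulation sketch. It uses the two-group framing of the quoted post rather than the crossover design, and the group size (30), baseline (17) and standard deviation (8) are assumptions picked only so that the threshold lands near the "difference of 4" in the quoted example.

```python
import random
import statistics

def simulate_null_diffs(n_per_group=30, mean=17.0, sd=8.0, n_sims=5000, seed=0):
    """Sorted absolute differences in group means when the null is true by construction."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_sims):
        # Both groups are drawn from the SAME distribution: the null hypothesis holds.
        a = [rng.gauss(mean, sd) for _ in range(n_per_group)]
        b = [rng.gauss(mean, sd) for _ in range(n_per_group)]
        diffs.append(abs(statistics.fmean(a) - statistics.fmean(b)))
    return sorted(diffs)

def threshold_at(sorted_diffs, alpha=0.05):
    """Difference that is exceeded in roughly a fraction alpha of the null runs."""
    idx = min(len(sorted_diffs) - 1, round((1 - alpha) * len(sorted_diffs)))
    return sorted_diffs[idx]

diffs = simulate_null_diffs()
threshold = threshold_at(diffs)
false_alarm_rate = sum(d > threshold for d in diffs) / len(diffs)
print(f"threshold ~ {threshold:.2f}, exceeded in {false_alarm_rate:.1%} of null runs")
```

The 5% here is a statement about runs in which the null is true. Nothing in the simulation gives the probability that the null is false given a large observed difference; that quantity requires a prior.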
ADDED: People are making comments proving they don't understand how the F-test works. This is how it works: You are testing the hypothesis that two groups respond differently to food dye.
Suppose you measured the number of times a kid shouted or jumped, and you found that kids fed food dye shouted or jumped an average of 20 times per hour, and kids not fed food dye shouted or jumped an average of 17 times per hour. When you run your F-test, you compute that, assuming all kids respond to food dye the same way, you need a difference of 4 to conclude that the two distributions (test and control) are different.
If the food dye kids had shouted/jumped 21 times per hour, the study would conclude that food dye causes hyperactivity. Because they shouted/jumped only 20 times per hour, it failed to prove that food dye causes hyperactivity. That failure to prove is then taken as having proved that food dye does not cause hyperactivity, even though the evidence indicated that food dye causes hyperactivity.
This is wrong. There are reasonable prior distributions for which the observation of a small positive sample difference is evidence for a non-positive population difference. For example, this happens when the prior distribution for the population difference can be roughly factored into a null hypothesis and an alternative hypothesis that predicts a very large positive difference.
In particular, contrary to your claim, the small increase of 3 can be evidence that food dye does not cause hyperactivity if the prior distribution can be factored into a null hypothesis and an alternative hypothesis that predicts a positive response much greater than 3. This is analogous to one of Mattes and Gittelman's central claims (they claim to have studied children for whom the alternative hypothesis predicted a very large response).
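A minimal numerical sketch of this point, under loudly stated assumptions: the sample difference is treated as normally distributed with a standard error of 2 (roughly what a significance threshold of 4 implies), the prior is split 50/50 between the null and a single point alternative, and the "very large" alternative effect of 15 is a hypothetical number, not one taken from the study.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def posterior_null(observed, se, alt_effect, prior_null=0.5):
    """P(null | observed difference) when the prior is a two-point mixture:
    true difference 0 (null) or true difference alt_effect (alternative)."""
    like_null = normal_pdf(observed, 0.0, se)
    like_alt = normal_pdf(observed, alt_effect, se)
    num = prior_null * like_null
    return num / (num + (1 - prior_null) * like_alt)

observed, se = 3.0, 2.0  # observed difference from the example; se is an assumption

# Alternative predicts a very large effect: observing 3 favors the null.
print(posterior_null(observed, se, alt_effect=15.0))

# Alternative predicts an effect of about 3: the same observation favors the alternative.
print(posterior_null(observed, se, alt_effect=3.0))
```

The same observed difference of 3 moves the posterior toward or away from the null depending entirely on what the alternative hypothesis predicted.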
Many people with statistics degrees, including working statisticians and statistics professors, commit the p-value fallacy, so perhaps your standards are too high if LWers merely being as good as statistics professors comes as a disappointment to you.
I've pointed out the misinterpretation of p-values many times (most recently, by Yvain), and wrote a post with the commonness of the misinterpretation as a major point (http://lesswrong.com/lw/g13/against_nhst/), so I would be a little surprised if I have made that error.
Sorry, Gwern, I may be slandering you, but I thought I noticed it long before that (I've been reading, despite my silence). Another thing I have accused you of, in my head, is failing to apply an appropriate multiple test correction when doing some data exploration for trends in the Less Wrong survey. Again, I may have you misidentified. Such behavior would be striking, if true, since it seems to me to be one of the most basic complaints Less Wrong has about science (somewhat incorrectly).
Edited: Gwern is right (on my misremembering). Either I was skimming and didn't notice Gwern was quoting, or I just mixed up corrector with corrected. Sorry about that.
In possible recompense: what I would recommend you do for data exploration is decide ahead of time whether you have some particularly interesting hypothesis or not. If not, and you're just going to check lots of stuff, then commit to that and to the appropriate multiple test correction at the end. That level of correction also keeps it from being circular when you 'notice' something interesting and then check it specifically (because you were already checking 'everything' and correcting appropriately).
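As a rough sketch of what "the appropriate multiple test correction" could look like, here are the Bonferroni and Holm procedures applied to a set of p-values. The p-values are hypothetical, not results from the Less Wrong survey, and these two procedures are just common choices; the comment above does not name a specific correction.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm step-down: compare the k-th smallest p-value (k = 0, 1, ...) with
    alpha / (m - k) and stop at the first one that fails."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # every larger p-value is also not rejected
    return reject

# Hypothetical p-values from five exploratory comparisons.
p_values = [0.001, 0.012, 0.03, 0.04, 0.2]
print(bonferroni(p_values))  # [True, False, False, False, False]
print(holm(p_values))        # [True, True, False, False, False]
```

Four of the five p-values are below 0.05 on their own, but only one or two survive correction. Committing to the full list of tests beforehand is what fixes the value of m.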