TL;DR: When people read a journal article that concludes, "We have proved that it is not the case that for every X, P(X)", they generally credit the article with having provided at least weak evidence in favor of the proposition ∀x !P(x). This is not necessarily so.

Authors using statistical tests are making precise claims, which must be quantified correctly. Pretending that all quantifiers are universal because we are speaking English is one error. It is not, as many commenters are claiming, a small error. ∀x !P(x) is very different from !∀x P(x).

A more-subtle problem is that when an article uses an F-test on a hypothesis, it is possible (and common) to fail the F-test for P(x) with data that supports the hypothesis P(x). The 95% confidence level was chosen for the F-test in order to count false positives as much more expensive than false negatives. Applying it therefore removes us from the world of Bayesian logic. You cannot interpret the failure of an F-test for P(x) as being even weak evidence for not P(x).


I used to teach logic to undergraduates, and they regularly made the same simple mistake with logical quantifiers. Take the statement "For every X there is some Y such that P(X,Y)" and represent it symbolically:

∀x∃y P(x,y)


Now negate it:

!∀x∃y P(x,y)


You often don't want a negation to be outside quantifiers. My undergraduates would often just push it inside, like this:

∀x∃y !P(x,y)


If you could just move the negation inward like that, then these claims would mean the same thing:

A) Not everything is a raven: !∀x raven(x)

B) Everything is not a raven: ∀x !raven(x)


To move a negation inside quantifiers, flip each quantifier that you move it past.

!∀x∃y P(x,y) = ∃x!∃y P(x,y) = ∃x∀y !P(x,y)
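If you want to check the rule mechanically, here is a minimal brute-force check in Python over a small finite domain (purely illustrative; any small domain works the same way):

```python
# Brute-force check over a 3-element domain: the flipped-quantifier form is
# logically equivalent to the negation, while the "just push it inside" form is not.
from itertools import product

domain = [0, 1, 2]
pairs = list(product(domain, repeat=2))

def evaluate(P):
    not_forall_exists = not all(any(P[(x, y)] for y in domain) for x in domain)   # !∀x∃y P(x,y)
    exists_forall_not = any(all(not P[(x, y)] for y in domain) for x in domain)   # ∃x∀y !P(x,y)
    forall_exists_not = all(any(not P[(x, y)] for y in domain) for x in domain)   # ∀x∃y !P(x,y)
    return not_forall_exists, exists_forall_not, forall_exists_not

flipped_always_matches = True
pushed_inside_differs_somewhere = False
for bits in product([False, True], repeat=len(pairs)):   # every possible relation P
    P = dict(zip(pairs, bits))
    a, b, c = evaluate(P)
    flipped_always_matches &= (a == b)
    pushed_inside_differs_somewhere |= (a != c)

print(flipped_always_matches)            # True: !∀x∃y P(x,y) ≡ ∃x∀y !P(x,y)
print(pushed_inside_differs_somewhere)   # True: ∀x∃y !P(x,y) is not equivalent
```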


Here are the findings of a 1981 article [1] from JAMA Psychiatry (formerly Archives of General Psychiatry), back in the days when the medical establishment was busy denouncing the Feingold diet:


Previous studies have not conclusively demonstrated behavioral effects of artificial food colorings ... This study, which was designed to maximize the likelihood of detecting a dietary effect, found none.

Now pay attention; this is the part everyone gets wrong, including most of the commenters below.


The methodology used in this study, and in most studies, is as follows:

  • Divide subjects into a test group and a control group.
  • Administer the intervention to the test group, and a placebo to the control group.
  • Take some measurement that is supposed to reveal the effect they are looking for.
  • Compute the mean and standard deviation of that measure for the test and control groups.
  • Do either a t-test or an F-test of the hypothesis that the intervention has an effect, under the assumption that every subject responds to it identically.
  • If the test succeeds, conclude that the intervention causes a statistically-significant effect (CORRECT).
  • If the test does not succeed, conclude that the intervention has no effect on any subject (ERROR).

People make the error because they forget to explicitly state what quantifiers they're using. Both the t-test and the F-test work by assuming that every subject has the same response function to the intervention:

response = effect + normally distributed error


where the effect is the same for every subject. If you don't understand why that is so, read the articles about the t-test and the F-test. The null hypothesis is that the responses of all subjects in both groups were drawn from the same distribution. The one-tailed versions of the tests take a confidence level C and compute a cutoff Z such that, if the null hypothesis is true,

P( average effect(test) - average effect(control) < Z ) = C
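Here is a minimal sketch of that machinery (the numbers are invented for illustration and are not the paper's data): simulate a control group and a test group, compute the one-tailed cutoff Z for C = 0.95, and compare it with the observed difference in means.

```python
# Illustrative two-group test with made-up numbers (not the paper's data).
# The cutoff Z is chosen so that, if the null hypothesis is true, the observed
# difference in means falls below Z a fraction C of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 40                                     # children per group (assumed)
control = rng.normal(17.0, 6.0, size=n)    # shouts/jumps per hour, placebo (assumed)
test    = rng.normal(20.0, 6.0, size=n)    # shouts/jumps per hour, food dye (assumed)

diff = test.mean() - control.mean()
dof  = 2 * n - 2
sp2  = ((n - 1) * test.var(ddof=1) + (n - 1) * control.var(ddof=1)) / dof   # pooled variance
se   = np.sqrt(sp2 * (2.0 / n))

C = 0.95
Z = stats.t.ppf(C, dof) * se               # one-tailed cutoff: P(diff < Z | null) = C

print(f"observed difference {diff:.2f} vs cutoff {Z:.2f}")
print(f"one-sided p-value {stats.t.sf(diff / se, dof):.3f}")
# diff < Z means "not significant at level C"; it does not mean the effect is zero,
# and it says nothing about subgroups whose responses differ from the average.
```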


ADDED: People are making comments proving they don't understand how the F-test works. This is how it works: You are testing the hypothesis that two groups respond differently to food dye.


Suppose you measured the number of times a kid shouted or jumped, and you found that kids fed food dye shouted or jumped an average of 20 times per hour, and kids not fed food dye shouted or jumped an average of 17 times per hour. When you run your F-test, you compute that, assuming all kids respond to food dye the same way, you need a difference of 4 to conclude with 95% confidence that the two distributions (test and control) are different.


If the food dye kids had shouted/jumped 21 times per hour, the study would conclude that food dye causes hyperactivity. Because they shouted/jumped only 20 times per hour, it failed to prove that food dye affects hyperactivity. You can only conclude that food dye affects behavior with 84% confidence, rather than the 95% you desired.


Finding that food dye affects behavior with 84% confidence should not be presented as proof that food dye does not affect behavior!


If half your subjects have a genetic background that makes them resistant to the effect, the threshold for the t-test or F-test will be much too high to detect that. If 10% of kids become more hyperactive and 10% become less hyperactive after eating food coloring, such a methodology will never, ever detect it. A test done in this way can only accept or reject the hypothesis that for every subject x, the effect of the intervention is different than the effect of the placebo.
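Here is a sketch of that failure mode, with all numbers assumed purely for illustration: when 10% of children respond strongly positively and 10% strongly negatively, a test that compares only group means rejects the null about as often as its false-positive rate, no matter how large the sample.

```python
# Heterogeneous responses that cancel in the mean are invisible to a mean-comparison test.
# All numbers are assumed for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, trials, rejections = 10_000, 200, 0

for _ in range(trials):
    placebo = rng.normal(17.0, 6.0, size=n)                              # shouts/jumps per hour
    effects = rng.choice([+8.0, -8.0, 0.0], size=n, p=[0.1, 0.1, 0.8])   # per-child true effect
    dye     = rng.normal(17.0, 6.0, size=n) + effects
    _, p = stats.ttest_ind(dye, placebo)
    rejections += (p < 0.05)

print(rejections / trials)   # roughly 0.05: no better than the false-positive rate,
                             # even though 20% of children have a large individual response
```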


So. Rephrased to say precisely what the study found:


This study tested and rejected the hypothesis that artificial food coloring affects behavior in all children.

Converted to logic (ignoring time):

!( ∀child ( eats(child, coloring) ⇨ behaviorChange(child) ) )


Move the negation inside the quantifier:

∃child !( eats(child, coloring) ⇨ behaviorChange(child) )


Translated back into English, this study proved:


There exist children for whom artificial food coloring does not affect behavior.

However, this is the actual final sentence of that paper:


The results of this study indicate that artificial food colorings do not affect the behavior of school-age children who are claimed to be sensitive to these agents.

Translated into logic:

!∃child ( eats(child, coloring) ⇨ hyperactive(child) )


or, equivalently,

∀child !( eats(child, coloring) ⇨ hyperactive(child) )


This refereed medical journal article, like many others, made the same mistake as my undergraduate logic students, moving the negation across the quantifier without changing the quantifier. I cannot recall ever seeing a medical journal article prove a negation and not make this mistake when stating its conclusions.


A lot of people are complaining that I should just interpret their statement as meaning "Food colorings do not affect the behavior of MOST school-age children."


But they didn't prove that food colorings do not affect the behavior of most school-age children. They proved that there exists at least one child whose behavior food coloring does not affect. That isn't remotely close to what they have claimed.
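A toy illustration of how far apart the two claims are (the numbers are hypothetical):

```python
# Hypothetical world: 99 affected children and 1 unaffected child.
children_affected = [True] * 99 + [False]

exists_unaffected_child = any(not a for a in children_affected)   # what the study supports
no_child_is_affected    = all(not a for a in children_affected)   # what the conclusion claims

print(exists_unaffected_child)   # True
print(no_child_is_affected)      # False: the two claims come apart completely
```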


For the record, the conclusion is wrong. Studies that did not assume that all children were identical, such as studies that used each child as his or her own control by randomly giving them cookies containing or not containing food dye [2], or a recent study that partitioned the children according to single-nucleotide polymorphisms (SNPs) in genes related to food metabolism [3], found large, significant effects in some children or some genetically-defined groups of children. Unfortunately, reviews failed to distinguish the logically sound from the logically unsound articles, and the medical community insisted that food dyes had no influence on behavior until thirty years after their influence had been repeatedly proven.



[1] Jeffrey A. Mattes & Rachel Gittelman (1981). Effects of Artificial Food Colorings in Children With Hyperactive Symptoms: A Critical Review and Results of a Controlled Study. Archives of General Psychiatry 38(6):714-718. doi:10.1001/archpsyc.1981.01780310114012.


[2] K.S. Rowe & K.J. Rowe (1994). Synthetic food coloring and behavior: a dose response effect in a double-blind, placebo-controlled, repeated-measures study. The Journal of Pediatrics 125(5 Pt 1):691-698.


[3] Stevenson, Sonuga-Barke, McCann et al. (2010). The Role of Histamine Degradation Gene Polymorphisms in Moderating the Effects of Food Additives on Children’s ADHD Symptoms. American Journal of Psychiatry 167:1108-1115.

The Universal Medical Journal Article Error
191 comments
Some comments are truncated due to high volume.

You claim that medical researchers are doing logical inference incorrectly. But they are in fact doing statistical inference and arguing inductively.

Statistical inference and inductive arguments belong in a Bayesian framework. You are making a straw man by translating them into a deductive framework.

Rephrased to say precisely what the study found:

This study tested and rejected the hypothesis that artificial food coloring causes hyperactivity in all children.

No. Mattes and Gittelman's finding is stronger than your rephrasing—your rephrasing omits evidence useful for Bayesian reasoners. For instance, they repeatedly pointed out that they “[studied] only children who were already on the Feingold diet and who were reported by their parents to respond markedly to artificial food colorings.” They claim that this is important because “the Feingold diet hypothesis did not originate from observations of carefully diagnosed children but from anecdotal reports on children similar to the ones we studied.” In other words, they are making an inductive argument:

  1. Most evidence for the Feingold diet hypothesis comes from anecdotal reports.
  2. Most of these anecdotal reports are mistaken.
  3. Thus,
... (read more)
3Kawoomba
The number of upvotes for the OP is depressing. It's a good example for forcing your toolset into every situation you encounter. If all you have is a hammer ...
1David_Gerard
Don't worry, we'll have Metamed to save us!
0PhilGoetz
Well, dammit, I wanted to delete this and rewrite above, but you can't delete comments anymore. This is not retracted, but I can't un-retract it. You are wrong, and you have not learned to reconsider your logic when many smart people disagree with you.
2Douglas_Knight
You can delete retracted comments by reloading the page and clicking on a new delete icon that replaces the retract icon.
4arundelo
Only if no-one's replied to them.
4TimS
I'm not sure that's true. See here
5ArisKatsaris
People can reply to any comments that they still see in their browser page, even though they've been "deleted", if the replier has not refreshed said browser page. EDIT TO ADD: As I see that wedrifid also mentions below.
0Eugine_Nier
Possibly there is also a similar effect if the deleter hasn't refreshed his browser page.
0wedrifid
Possibly. Specifically it would be if you (as the example) had retracted the page then refreshed it (to get the 'delete' button available to you) and then there is an arbitrary period of time after which you click the delete button without first refreshing again. (Untested, but those are the circumstances under which it would be at all possible if the code is not specifically designed to prevent it.)
0Douglas_Knight
Why are you not sure of facts that are subject to easy experiments? (update: arundelo is correct)
4TimS
Experiment clutters the venue, and being less blunt avoids the appearance of a status conflict.
0Douglas_Knight
If deletion is possible, there is very little clutter. If deletion is not possible, and the comment says "I can't figure out how to delete this," at least it discourages other people's experiments. But this thread is itself clutter, so I don't think that is your true rejection. As to bluntness, I conclude that my being less blunt caused you to confabulate bullshit. PS - I experiment on the open thread.
3TimS
On reflection, it is probably more accurate for me to say, "I wasn't interested in experimenting, including for concern that the experimenting would look low status, and I have higher preferred ways of acting low status." As for my own choice not to be blunt, you are not correctly modelling my thought process. In short, I gave two reasons for my action, and you might be right that one was confabulation, but not the one you identify as confabulation.
2wedrifid
I have performed the experiment in question and it seems to support arundelo's claim. I am not able to remove this comment. At the very least it demonstrates that the experiment required to prove arundelo's fully general claim is false is not the 'easy' one.
0TimS
Well, now I'm totally confused. Checking Eugine_Nier's account on ibiblio shows that the comment is missing. (Searching for the word "sarcasm" will get you to about when the comment took place, at least as of the date of this comment)
0wedrifid
See my investigation. Short answer: race condition.
0Douglas_Knight
Thanks actually experimenting. My beliefs were two months out of date. I stand by my objection to Tim's comment.
-2wedrifid
It is possible that the comment was banned by a moderator rather than deleted by the author. (If so, it will still appear if you look at the user's comment page.) EDIT after retraction: TimS, I can't seem to delete this comment even after refreshing.
2TimS
As it happens, I remember what Eugine_Nier wrote, and I am certain it did not meet the local criteria for mod-blocking. (Anonymous downvoter: What is it in wedrifid's post you'd like to see less of? Helpful commentary about the mechanics of this site is not on my list of things to downvote).
9wedrifid
Interesting. This suggests that a feature has changed at some point since the retraction-then-delete feature was first implemented. (I have memories of needing to be careful to edit the text to blank then retract so as to best emulate the missing 'delete' feature.) I notice that I am confused. Investigates.
  • Testing deletion feature. Deletion of (grandparent) comment that you have already replied to: Fail. It is still not (usually) possible to delete comments with replies.
  • Check for moderator deletion. (ie. Moderator use of the ban feature, actual delete per se is extremely rare). Confirm absence of a reply on Eugine_Nier's page that fits that part of history. The comment is, indeed, deleted not banned.
  • Check timestamps for plausibility of race condition. Ahh. Yes. Tim, you replied to Eugine within 3 minutes of him writing the comment. This means that most likely Eugine deleted his message while you were writing your reply. Your comment was still permitted to be made despite the deleted parent. The reverse order may also be possible, depending on the details of implementation. Either way, the principle is the same.
0TimS
ArisKatsaris suggests browser refresh, not timestamps, is the issue.
2wedrifid
He is describing the same phenomenon. The timestamps give an indication as to how likely the race condition is to occur based on the delays between GETs and POSTs. If the comments were a day apart I would have tentatively suggested "Perhaps one of you deleted or replied to a comments page that was old?". Whereas given that the timestamps were within 3 minutes I could more or less definitively declare the question solved.
2TimS
Thanks. I'm not technologically fluent enough to tell the difference between what you said and what he said without the explanation.
2Eugine_Nier
For the record, I did in fact delete the comment.
-1buybuydandavis
Jaynes argued that probability theory was an extension of logic, so this seems like quite a quibbling point. They do, but did the paper he dealt with write within a Bayesian framework? I didn't read it, but it sounded like standard "let's test a null hypothesis" fare. Which is not a valid objection to Phil's analysis if Mattes and Gittelman weren't doing a Bayesian analysis in the first place. Were they? I'll apologize for not checking myself if I'm wrong, but right now my priors are extremely low so I don't see value in expending the effort to verify. If they did their calculations in a Bayesian framework. Did they?
7Matt_Simpson
You don't just ignore evidence because someone used a hypothesis test instead of your favorite Bayesian method. P(null | p value) != P(null)
2buybuydandavis
I ignore evidence when the evidence doesn't relate to the point of contention. Phil criticized a bit of paper, noting that the statistical analysis involved did not justify the conclusion made. The conclusion did not follow the analysis. Phil was correct in that criticism. It's just not an argument against Phil that someone might take some of the data in the paper and do a Bayesian analysis that the authors did not do.
5Matt_Simpson
That's not what I'm saying. I'm saying that what the authors did do IS evidence against the hypothesis in question. Evidence against a homogenous response is evidence against any response (it makes some response less likely)
3buybuydandavis
What they did do? Are you saying the measurements they took make their final claim more likely, or that their analysis of the data is correct and justifies their claim? Yes, if you arrange things moderately rationally, evidence against a homogenous response is evidence against any response, but much less so. I think Phil agrees with that too, and is objecting to a conclusion based on much less so evidence pretending to have much more justification than it does.
4Matt_Simpson
Ok, yeah, translating what the researchers did into a Bayesian framework isn't quite right either. Phil should have translated what they did into a frequentist framework - i.e. he still straw manned them. See my comment here.
-2PhilGoetz
I know that. That's not the point. They claimed to have proven something they did not prove. They did not present this claim in a Bayesian framework.
-6PhilGoetz
-7PhilGoetz

[1] Jeffrey A. Mattes & Rachel Gittelman (1981). Effects of Artificial Food Colorings in Children With Hyperactive Symptoms: A Critical Review and Results of a Controlled Study. Archives of General Psychiatry 38(6):714-718. doi:10.1001/archpsyc.1981.01780310114012. ungated

[2] K.S. Rowe & K.J. Rowe (1994). Synthetic food coloring and behavior: a dose response effect in a double-blind, placebo-controlled, repeated-measures study. The Journal of Pediatrics Nov;125(5 Pt 1):691-8. ungated

[3 open access] Stevenson, Sonuga-Barke, McCann et al. (2010). The Role of Histamine Degradation Gene Polymorphisms in Moderating the Effects of Food Additives on Children’s ADHD Symptoms. Am J Psychiatry 167:1108-1115.

I wouldn't have posted this if I'd noticed earlier links, but independent links are still useful.

Both the t-test and the F-test work by assuming that every subject has the same response function to the intervention:

response = effect + normally distributed error

where the effect is the same for every subject.

The F test / t test doesn't quite say that. It makes statements about population averages. More specifically, if you're comparing the mean of two groups, the t or F test says whether the average response of one group is the same as the other group. Heterogeneity just gets captured by the error term. In fact, econometricians define the error term as the difference between the true response and what their model says the mean response is (usually conditional on covariates).

The fact that the authors ignored potential heterogeneity in responses IS a problem for their analysis, but their result is still evidence against heterogeneous responses. If there really are heterogeneous responses we should see that show up in the population average unless:

  • The positive and negative effects cancel each other out exactly once you average across the population. (this seems very unlikely)
  • The population average effect size is nonzero but very small, possibly because the effect only oc
... (read more)
-4PhilGoetz
Why do you say that? Did you look at the data? They found F values of 0.77, 2.161, and 1.103. That means they found different behavior in the two groups. But those F-values were lower than the thresholds they had computed assuming homogeneity. They therefore said "We have rejected the hypothesis", and claimed that the evidence, which interpreted in a Bayesian framework might support that hypothesis, refuted it.
4Matt_Simpson
I didn't look at the data. I was commenting on your assessment of what they did, which showed that you didn't know how the F test works. Your post made it seem as if all they did was run an F test that compared the average response of the control and treatment groups and found no difference.

The results of this study indicate that artificial food colorings do not affect the behavior of school-age children who are claimed to be sensitive to these agents.

Translated into logic:

!∃child ( eats(child, coloring) ⇨ hyperactive(child) )

That's an uncharitable interpretation of that sentence. It would only mean that if there were a word such as “any” before the phrase “school-age children”, but there isn't. The zero article before plural nouns in English doesn't generally denote a universal quantifier; “men are taller than women” doesn't mean ∀x ∈ {men} ∀y ∈ {women} x.height > y.height. The actual meaning of the zero article before plural nouns in English is context-dependent and non-trivial to formalize.

Are you a non-native English speaker by any chance? (So am I FWIW, but the definite article in my native language has a very similar meaning to the zero article in English in contexts like these.)

-9PhilGoetz
[-]tgb170

If whether this particular paper exemplifies this error is disputed (as it appears to be!) and the author's claim that he "cannot recall ever seeing a medical journal article prove a negation and not make this mistake" is correct, then it should be easy for the author to give several more examples which more clearly display the argument given here. I would encourage PhilGoetz or someone else to do so.

-26PhilGoetz
[-][anonymous]170

Previous studies have not conclusively demonstrated behavioral effects of artificial food colorings ... This study, which was designed to maximize the likelihood of detecting a dietary effect, found none.

Rephrased to say precisely what the study found:

This study tested and rejected the hypothesis that artificial food coloring causes hyperactivity in all children.

Interesting. Those two statements seem quite different; more than just a rephrasing.

Probabilistically, it sounds like the study found P(hyper|dye) = P(hyper|~dye), that is they rejected P(hyper|dye) > P(hyper|~dye), and concluded P(hyper|dye) = P(hyper|~dye) (no connection) correctly.

I think your logical interpretation of their result throws out most of the information. Yes they concluded that it is not true that all children that ate dye were hyperactive, but they also found that the proportion of dye-eaters who were hyperactive was not different from the base rate, which is a much stronger statement, which does imply their conclusion, but can't be captured by the logical formulation you gave.

-4PhilGoetz
You are making the same mistake by ignoring the quantification. The test used to reject P(hyper|dye) > P(hyper|~dye) uses a cutoff that is set from the sample size using the assumption that all the children have the identical response. They didn't find P(hyper|dye) = P(hyper|~dye), they rejected the hypothesis that for all children, P(hyper|dye) > P(hyper|~dye), and then inappropriately concluded that for all children, !( P(hyper|dye) > P(hyper|~dye) ).

Translated back into English, this study proved:

There exist children for whom artificial food coloring does not affect behavior.

The whole point of inductive reasoning is that this is evidence for artificial food coloring not affecting the behavior of any children (given a statistically significant sample size). You cannot do purely deductive reasoning about the real world and expect to get anything meaningful. This should be obvious.

-4PhilGoetz
They measured a difference between the behavior of the test and the control group. They chose an F-value that this difference would have to surpass in order to prove the proposition that food color affects the behavior of all children. The specific number they chose requires the word "all" there. The differences they found were smaller than the F-value. We don't know whether the differences were or were not large enough to pass an F-value computed for the proposition that food color affects all but one child, or most children, or one-fifth of all children. Where, exactly, is the evidence that artificial food color doesn't affect the behavior of any children?
9lfghjkl
Then this is what you should have critiqued in your post. Ranting about their inductive reasoning being deductively wrong gets you nowhere. Since your post is the first time I've heard of this: I have no idea, but I assume google has the answer.
-3PhilGoetz
Why would I critique them for finding values smaller than the F-value? The values were smaller than the F-value. That means the test failed. What I then critiqued was their logical error in interpreting the test's failure. I mean where in the paper. There is no evidence in the paper that artificial food color doesn't affect the behavior of any children. Your claim that they are using inductive logic shows that you didn't understand the paper. Your response that I should have critiqued their not finding a high enough F-value shows you really don't have the first clue about what an F-test is. Please learn a little about what you're critiquing before you critique it so confidently in the future.
0lfghjkl
No, what you (originally) critiqued was the lack of rigorous deductive reasoning in their statistical analysis, as shown by both your introduction and conclusion solely focusing on that. Even whatever point you tried to make about the F-values was lost in a rant about deduction. In your own words you stated the following: And that sentence (if true as you claim) is inductive evidence for their conclusion. How many times do I have to tell you this? All statistical reasoning is inductive reasoning. You claiming the opposite shows that you don't understand statistics. Since you completely missed my point, I'll try again: Focus on critiquing what's statistically wrong in this paper, not what's deductively wrong. I simply chose that sentence as it seemed to be the most coherent one in your response. Now, you seem to be under the assumption that I am defending this paper and its conclusion, so let me make it clear that I do not. I have neither read it, nor plan to. I merely found you attacking a statistical analysis for being deductively wrong, and chose to try and help you clear up whatever misunderstanding you had about statistics being a part of deductive reasoning. I'm guessing you've recently started learning about discrete mathematics, and seek to apply your new knowledge on quantifiers to everything you come across. Don't worry about being wrong, almost everyone goes through such a phase.
-4PhilGoetz
So you chose a sentence without understanding it. There is nothing statistically wrong with the paper. The error is not in the statistical analysis. The error is in the deductions they made when interpreting it. You are claiming that logic and statistics are like oil and water, and can never co-occur in the same paper. This is incorrect. As I mentioned in my post, I used to teach logic at a university. So now you have also proved you didn't read my post. And you have proved you don't know the difference between logic and discrete math. So now we know that you
  • don't know what an F-test is
  • didn't read the whole post
  • don't know the difference between logic and discrete math
And I also kinda doubt you read the paper. Did you? I'm sorry that LessWrong has made you stupider, by giving you false confidence to speak with authority on matters you are ignorant of. As to being wrong, identify an error in anything I've said, instead of spouting platitudes about induction without any connection to specific facts and arguments.
-1lfghjkl
And the whole point of science is that it is built on inductive (and not deductive) reasoning. Well, I'll give you points for creativity in your straw man, at least. So, you were a TA then? No, it merely "proves" that I skimmed the personal biography part of your post in favor of focusing on your actual content. Please tell me how you "proved" this. Well, every course in discrete mathematics usually has at least a lesson or two on logic and quantifiers. I just assumed you learned it in such a setting since then you would have had an excuse to not understand it properly (as opposed to having spent an entire course focused solely on logic). Funny how you use the fact that I skimmed the non-essential parts of your original post as proof that I didn't read any of it, and then go on to completely ignore what I wrote here: I also find your usage of the word "proof" very interesting for someone who claims to have taught logic. Do you always insult people who try to help you? It might help your future discussions if you don't take criticism so personally. I have already done so numerous times. Maybe you should try to read my arguments instead of just skimming and assuming?
-2PhilGoetz
Identifying an error means taking a specific claim and showing a mistake. Saying "You cannot do purely deductive reasoning about the real world and expect to get anything meaningful" is not identifying an error. Saying "There exists a child for whom X" is inductive proof of "For all children, X" is ridiculous. It gives a tiny tiny bit of support, but not anything anyone would call a proof, any more than "2+2 = 4" is proof for "For all X, X+2 = 4." The paper is making errors, but not that one. If you find one child and prove that he's not affected by food dye, and you write a paper saying "This child's behavior is not affected by dye, therefore no children's behavior are affected by dye", it will not be published. That was not their intent. I doubt anyone has ever published a paper using the "inductive proof" you think is standard. In light of the fact that you didn't read the post closely, didn't read the paper, and don't understand how an F-test works, you really should stop being so confident. The claims you're making require you to have done all three. You are claiming that I interpreted their reasoning incorrectly, and you didn't read it!
0lfghjkl
It seems you are confused about how statistics work. When you wish to study if group X has property Y, you take a statistically significant sample from group X and see if this sample has property Y. You then use the results from this sample to conclude whether the group as a whole has property Y (with a high or low probability). And this conclusion is always an inductive conclusion, never deductive. As reported by you in your original post, their sample did not have the property they were looking for and they therefore concluded that the group as a whole does not have this property. You even reported that their statistics was sound. So, where is the error? Edit to add: In other words, every statistical study ever done has always had a conclusion of the following form: There exists a statistically significant sample where this property does (not) hold, therefore the property does (not) hold for the whole group. Which is just the general form of what you critiqued here: So, by critiquing this study for being deductively wrong, you are in fact critiquing every statistical study ever done for being deductively wrong. Do you now see the problem with this?
0PhilGoetz
Look, what you've written above is based on misunderstanding how an F-test works. I've already explained repeatedly why what you're saying here, which is the same thing you've said each time before, is not correct. This study contains a failure of an F-test. Because of how the F-test is structured, failure of an F-test to prove forall X P(X) is not inductive evidence, nor evidence of any kind at all, that P(X) is false for most X. I will try to be more polite, but you need to a) read the study, and b) learn how an F-test works, before you can talk about this. But I just don't understand why you keep making confident assertions about a study you haven't read, using a test you don't understand. The F-test is especially tricky, because you know you're going to find some difference between the groups. What difference D would you expect to find if there is in fact no effect? That's a really hard question, and the F-test dodges it by using the arbitrary but standard 95% confidence interval to pick a higher threshold, F. Results between D and F would still support the hypothesis that there is an effect, while results below D would be evidence against that hypothesis. Not knowing what D is, we can't say whether failure of an F-test is evidence for or against a hypothesis.
0lfghjkl
And I've repeatedly told you that you should've focused your critique on this instead of ranting about deduction. The last time I said it, you claimed the following: Now to answer your question: I haven't been discussing this study, I've been trying to help you understand why your critique of it has been misguided. As for this claim you undoubtedly have an interesting "proof" for, I've simply avoided confusing you further with a discussion of statistics until you realized the following: 1. All statistical conclusions are deductively wrong. 2. A statistical study must be critiqued for its misuse of statistics (and obviously, then you must first claim that there is something statistically wrong with the paper).

Moved to Discussion. (Again.)

[-][anonymous]120

why is it back in main?

4Eliezer Yudkowsky
I assume Phil reposted it there. Now banning.
2Paul Crowley
Phil says it wasn't him above. I'd be somewhat surprised if that was a barefaced lie.
4wedrifid
Problem already solved. I had noticed this subthread didn't have acknowledgement of the resolution and considered whether it was necessary for me to post a note saying so. I decided that would be more spammy than helpful so I didn't. Error!
0Paul Crowley
Thank you!
3PhilGoetz
Wasn't me.

The problem is that you don't understand the purpose of the studies at all and you're violating several important principles which need to be kept in mind when applying logic to the real world.

Our primary goal is to determine net harm or benefit. If I do a study as to whether or not something causes harm or benefit, and see no change in underlying rates, then it is non-harmful. If it is making some people slightly more likely to get cancer, and others slightly less likely to get cancer, then there's no net harm - there are just as many cancers as there were before. I may have changed the distribution of cancers in the population, but I have certainly not caused any net harm to the population.

This study's purpose is to look at the net effect of the treatment. If we see the same amount of hyperactivity in the population prior to and after the study, then we cannot say that the dye causes hyperactivity in the general population.

"But," you complain, "Clearly some people are being harmed!" Well yes, some people are worse off after the treatment in such a theoretical case. But here's the key: for the effect NOT to show up in the general population, then you have only ... (read more)

6PhilGoetz
Correct. But neither can we say that the dye does not cause hyperactivity in anyone. Like that. That's what we can't say from the result of this study, and some other similar studies. For the reasons I explained in detail above. Your making the claim "no evidence to the contrary" shows that you have not read the literature, have not done a PubMed search on "ADHD, food dye", and have no familiarity with toxicity studies in general. There is always evidence to the contrary. An evaluation weighs the evidence on both sides. You can take any case where the FDA has said "There is no evidence that X", and look up the notes from the panel they held where they considered the evidence for X and decided that the evidence against X outweighed it. If you believe that there is no evidence that food dyes cause hyperactivity, fine. That is not the point of this post. This post analyzes the use of a statistical test in one study, and shows that it was used incorrectly to justify a conclusion which the data does not justify. (A) I analyzed their use of math and logic in an attempt to prove a conclusion, and showed that they used them incorrectly and their conclusions are therefore not logically correct. They have not proven what they claim to have proven. (B) The answer is, "This is very likely." This is how studies turn out all the time, partly due to genetics. Different people have different genetics, different bacteria in their gut, different lifestyles, etc. This makes them metabolize food differently. It makes their brain chemistry different. Different people are different. That's one of the problems I was pointing out! The F-test did not pass the threshold for detection. The threshold is set so that things that pass it are considered to be proven, NOT so that things that don't pass it are considered disproven. Because of the peculiar nature of an F-test, not passing the threshold is not even weak evidence that the hypothesis being tested is false.
4CronoDAS
People aren't that different. I really doubt that, for example, there are people whose driving skills improve after drinking the amount of alcohol contained in six cans of beer.
5Decius
You haven't searched hard: Consider the negative effects of high nervousness on driving skills, the nervousness-reducing effects of alcohol, the side effects of alcohol withdrawal on alcoholics, and the mediating effects of high body mass on the effects of alcohol: A severely obese alcoholic who is nervous enough about driving and suffering from the shakes might perform worse stone-cold sober than he does with the moderate BAC that he has after drinking a six-pack. What are the odds that there exists at least one sufficiently obese alcoholic who is nervous about driving? That data point would not provide notable evidence that alcohol improves driving in the general population.
4buybuydandavis
The phrase "There is no evidence that X" is the single best indicator of someone statistically deluded or dishonest.
2A1987dM
I'd normally take “evidence that [clause]” or “evidence for [noun phrase]” to mean ‘(non-negligible) positive net evidence’. (But of course that can still be a lie, or the result of motivated cognition.) If I'm talking about evidence of either sign, I'd say “evidence whether [clause]” or “evidence about [noun phrase]”.
0buybuydandavis
I think your usage is idiosyncratic. People routinely talk about evidence for and against, and evidence for is not the net, but the evidence in favor. It's quite standard to talk about evidence for and against a proposition in exactly this way, as he reports the FDA did. Having talked about "the evidence for" and weighing against the "evidence against", you don't then deny the existence of the "evidence for" just because, in balance, you find the evidence against more convincing. You're slicing the language so thinly, and in such a nonstandard way, it seems like rationalization and motivated reasoning. No evidence means no evidence. No means no. It can mean *very very little too". Fine. But it doesn't mean "an appreciable amount that has a greater countervailing amount".
2Decius
But here the FDA has taken "The balance of the evidence is not enough for us to be sure enough" and said "There is no evidence for". The evidence cited as "no evidence" should move the estimate towards 84% certain that there is an effect in the general population.
0buybuydandavis
Very good point. In this case, honest eyeballing of the data would lead one to conclude that there is an effect. There actually isn't any evidence against an effect hypothesis, because they're not testing an effect hypothesis for falsification at all. There just isn't enough evidence against the null by their arbitrarily too high standard. And this is the standard statistical test in medicine, whereby people think they're being rigorously scientific. Still just 2 chromosomes away from chimpanzees.
2TitaniumDragon
This is why you never eyeball data. Humans are terrible at understanding randomness. This is why statistical analysis is so important. Something that is at 84% is not at 95%, which is a low level of confidence to begin with - it is a nice rule of thumb, but really if you're doing studies like this you want to crank it up even further to deal with problems with publication bias. Publish regardless of whether you find an effect or not, and encourage others to do the same. Publication bias (positive results are much more likely to be reported than negative results) further hurts your ability to draw conclusions. The reason that the FDA said what they did is that there isn't evidence to suggest that it does anything. If you don't have statistical significance, then you don't really have anything, even if your eyes tell you otherwise.
0buybuydandavis
Some are more terrible than others. A little bit of learning is a dangerous thing. Grown ups eyeball their data and know the limits of standard hypothesis testing. Yeah, evidence that the FDA doesn't accept doesn't exist.
5TitaniumDragon
The people who believe that they are grown-ups who can eyeball their data and claim results which fly in the face of statistical rigor are almost invariably the people who are unable to do so. I have seen this time and again, and Dunning-Kruger suggests the same - the least able are very likely to do this based on the idea that they are better able to do it than most, whereas the most able people will look at it and then try to figure out why they're wrong, and consider redoing the study if they feel that there might be a hidden effect which their present data pool is insufficient to note. However, repeating your experiment is always dangerous if you are looking for an outcome (repeating your experiment until you get the result you want is bad practice, especially if you don't adjust things so that you are looking for a level of statistical rigor that is sufficient to compensate for the fact that you're doing it over again), so you have to keep it very carefully in mind and control your experiment and set your expectations accordingly.
0buybuydandavis
The problem we started with was that "statistical rigor" is generally not rigorous. Those employing it don't know what it would mean under the assumptions of the test, and fewer still know that the assumptions make little sense.
0TitaniumDragon
Correct. But neither can we say that the dye does not cause hyperactivity in anyone.

No, but that is not our goal in the first place. Doing a test on every single possible trait is economically infeasible and unreasonable; ergo, net impact is our best metric. The benefit is "we get a new food additive to use". The net cost is zero in terms of health impact (no more hyperactivity in the general population). Ergo, the net benefit is a new food additive. This is very simple math here. Net benefit is what we care about in this case, as it is what we are studying. If it redistributes ailments amongst the population, then there may be even more optimal uses, but we're still looking at a benefit. If you want to delve deeper, that's going to be a separate experiment.

Your making the claim "no evidence to the contrary" shows that you have not read the literature, have not done a PubMed search on "ADHD, food dye", and have no familiarity with toxicity studies in general. There is always evidence to the contrary. An evaluation weighs the evidence on both sides. You can take any case where the FDA has said "There is no evidence that X", and look up the notes from the panel they held where they considered the evidence for X and decided that the evidence against X outweighed it.

Your making the claim "evidence to the contrary" suggests that any of this is worth anything. The problem is that, unfortunately, it isn't. If someone does a study on 20 different colors of M&Ms, then they will, on average, find that one of the M&Ms will change someone's cancer risk. The fact that their study showed that, with 95% confidence, blue M&Ms increased your odds of getting cancer, is not evidence for the idea that blue M&M's cause cancer. Worse, the odds of the negative finding studies being published are considerably less than the probability of the positive finding study being published. This is known as "publication bias". Additionally, people are
-1PhilGoetz
* This is very basic and is absolutely essential to understanding any sort of data of this sort. When I say that there is no evidence for it, I am saying precisely that - just because someone studied 20 colors of M&M's and found that one has a 95% chance of causing more cancer tells me nothing. It isn't evidence for anything. It is entirely possible that it DOES cause cancer, but the study has failed to provide me for evidence of that fact. When I said that "making the claim "no evidence to the contrary" shows that you have not read the literature, have not done a PubMed search on "ADHD, food dye", and have no familiarity with toxicity studies in general," I meant that literally. I'm well-aware of what 95% means and what publication bias means. If you had read the literature on ADHD and food dye, you would see that it is closer to a 50-50 split between studies concluding that there is or is not an effect on hyperactivity. You would know that some particular food dyes, e.g., tartrazine, are more controversial than others. You would also find that over the past 40 years, the list of food dyes claimed not to be toxic by the FDA and their European counterparts has been shrinking. If you were familiar with toxicity studies in general, you would know that this is usually the case for any controversial substance. For instance, the FDA says there is "no evidence" that aspartame is toxic, and yet something like 75% of independent studies of aspartame concluded that it was toxic. The phrase "no evidence of toxicity", when used by the FDA, is shorthand for something like "meta-analysis does not provide us with a single consistent toxicity narrative that conforms to our prior expectations". You would also know that toxicity studies are frequently funded by the companies trying to sell the product being tested, and so publication bias works strongly against findings of toxicity.
0Decius
Suppose there exists a medication that kills 10% of the rationalists who take it (but kills nobody of other thought patterns), and saves the lives of 10% of the people who take it, but only by preventing a specific type of heart disease that is equally prevalent in rationalists as in the general population. A study on the general population would show benefits, while a study on rationalists would show no effects, and a study on people at high risk for a specific type of heart disease would show greater benefits. Food dye is allegedly less than 95% likely to cause hyperactivity in the general population. It has been alleged to be shown that it is more than 95% likely to cause hyperactivity in specific subgroups. It is possible for both allegations to be true.
0TitaniumDragon
Yes, but it is not a probable outcome, as for it to be true, it would require a counterbalancing group of people who benefit from it or for the subgroups to be extremely small; however, the allegations are that the subgroups are NOT small enough that the effect could have been hidden in this manner, suggesting that there is no effect on said subgroups as the other possibility is unlikely.
0Decius
Strictly speaking, the subgroup in question only has to be one person smaller than everybody for those two statements to be compatible. Suppose that there is no effect on 10% of the population, and a consistent effect in 90% of the population that just barely meets the p<.05 standard when measured using that subgroup. If that measurement is made using the whole population, p>.05.
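A rough check of that arithmetic, with assumed numbers (and ignoring the extra variance from mixing the two subgroups, which would only push the whole-population p higher):

```python
# Rough check of the dilution claim above (illustrative numbers): an effect that just
# barely reaches p = 0.05 in the affected 90% of the population falls below
# significance when the unaffected 10% are averaged in.
from math import sqrt
from scipy import stats

n = 1000                     # total sample per arm (assumed)
sigma = 1.0                  # response standard deviation (assumed)
z_sub = 1.96                 # the affected 90% just barely meets two-sided p < .05

# Effect size d implied by z_sub for the 0.9*n affected subjects:
d = z_sub * sigma * sqrt(2.0 / (0.9 * n))

# Whole-population test: mean effect is 0.9*d, sample size n per arm.
z_full = 0.9 * d / (sigma * sqrt(2.0 / n))
p_full = 2 * stats.norm.sf(z_full)
print(f"z_full = {z_full:.2f}, p = {p_full:.3f}")   # ~1.86, p ~ 0.063 > 0.05
```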
0TitaniumDragon
95% is an arbitrarily chosen number which is a rule of thumb. Very frequently you will see people doing further investigation into things where p>0.10, or if they simply feel like there was something interesting worth monitoring. This is, of course, a major cause of publication bias, but it is not unreasonable or irrational behavior. If the effect is really so minor it is going to be extremely difficult to measure in the first place, especially if there is background noise.
0Decius
It's not a rule of thumb; it's used as the primary factor in making policy decisions, incorrectly. In this specific example, the regulatory agency made the statement "There is no evidence that artificial colorings are linked to hyperactivity" based on the data that artificial colorings are linked to hyperactivity with p ≈ .13. There are many other cases in medicine where 0.05 < p < .5 is used as evidence against the hypothesis being tested.

I've similarly griped here in the past about the mistaken ways medical tests are analyzed here and elsewhere, but I think you over complicated things.

The fundamental error is misinterpreting a failure to reject a null hypothesis for a particular statistical test, a particular population, and a particular treatment regime as a generalized demonstration of the null hypothesis that the medication "doesn't work". And yes, you see it very often, and almost universally in press accounts.

You make a good point about how modeling response = effect + error leads to confusion. I think the mistake is clearer written as "response = effect + noise", where noise is taken as a random process injecting ontologically inscrutable perturbations of the response. If you start with the assumption that all differences from the mean effect are due to ontologically inscrutable magic, you've ruled out any analysis of that variation by construction.

If that meant the same thing, then so would these claims

OK, I may be dense today, but you lost me there. I tried to puzzle out how the raven sentences could be put symbolically so that they each corresponded to one of the negations of your original logic sentence, and found that fruitless. Please clarify?

The rest of the post made sense. I'll read through the comments and figure out why people seem to be disagreeing first, which will give me time to think whether to upvote.

2Vaniver
First, we start with the symbolic statement: Next, we replace the variables with English names: Next, we replace the symbols with English phrases: Then we clean up the English: We can repeat the process with the other sentence, being careful to use the same words when we replace the variables: becomes becomes and finally: (I should note that my English interpretation of ∃y P(x,y) is probably a bit different and more compact than PhilGoetz's, but I think that's a linguistic rather than logical difference.)
3PhilGoetz
You certainly gave me the most-favorable interpretation. But I just goofed. I fixed it above. This is what I was thinking, but my mind wanted to put "black(x)" in there because that's what you do with ravens in symbolic logic.
A) Not everything is a raven: !∀x raven(x)
B) Everything is not a raven: ∀x !raven(x)
2Vaniver
The new version is much clearer. My interpretation of the old version was that y was something like "attribute," so you could say "Not every black thing has being a raven as one of its attributes" or "for every black thing, it does not have an attribute which is being a raven." Both of those are fairly torturous sentences in English but the logic looks the same.
0Morendil
That's where I don't follow. I read the original sentence as "for every x there is an y such that the relationship P obtains between x and y". I'm OK with your assigning "black things" to x but "raven-nature" needs explanation; I don't see how to parse it as a relationship between two things previously introduced. The edited version makes more sense to me now.
-1PhilGoetz
You're right! I goofed on that example. I will change it to a correct example.

If 11 out of 11 children studied have a property (no food coloring hyperactivity response), that's a bit stronger than "there exist 11 children with this property", though perhaps not quite "all children have this property".

7PhilGoetz
That's not how it works. You measure the magnitude of an effect, then do a statistical test of the hypothesis that all of the children have a response, which gives a cutoff that the effect magnitude must reach to accept that hypothesis with 95% confidence. If only 10% of the children have such a response, you won't reach that cutoff. If 10% have a positive response and 10% have a negative response, you will detect nothing, no matter how big your sample is.

You can only conclude that food dye affects behavior with 84% confidence, rather than the 95% you desired.

Or rather, you can conclude that, if there were no effect of food dye on hyperactivity and we did this test a whole lotta times, then we'd get data like this 16% of the time, rather than beneath the 5%-of-the-time maximum cutoff you were hoping for.

It's not so easy to jump from frequentist confidence intervals to confidence for or against a hypothesis. We'd need a bunch of assumptions. I don't have access to the original article so I'll just make ... (read more)

-4PhilGoetz
All you're saying is that studies should use Bayesian statistics. No medical journal articles use Bayesian statistics. Given that the frequentist approach behind these tests is "correct", the article's claim is incorrect. The authors intended to use frequentist statistics, and so they made an error. If a weak default complexity prior of 1 nat for 1 extra variable wipes out 84% confidence, that implies that many articles have incorrect conclusions, because 95% confidence might not be enough to account for a one-variable complexity penalty. In any case, you are still incorrect, because your penalty cannot prove that the null hypothesis is correct. It can only make it harder to prove it's incorrect. Failure to prove that it is incorrect is not proof that it is correct. Which is a key point of this post.
2selylindi
Nah, they're welcome to use whichever statistics they like. We might point out interpretation errors, though, if they make any. Under the assumptions I described, a p-value of 0.16 is about 0.99 nats of evidence which is essentially canceled by the 1 nat prior. A p-value of 0.05 under the same assumptions would be about 1.92 nats of evidence, so if there's a lot of published science that matches those assumptions (which is dubious), then they're merely weak evidence, not necessarily wrong. It's not the job of the complexity penalty to "prove the null hypothesis is correct". Proving what's right and what's wrong is a job for evidence. The penalty was merely a cheap substitute for an informed prior.
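(The commenter's exact assumptions are truncated above, so the following is a guess that happens to reproduce the quoted numbers, not a statement of what selylindi actually did: treat the two-sided p-value as coming from a z-test and take the log of the maximum likelihood ratio, exp(z²/2).)

```python
# Guessed conversion from a two-sided p-value to "nats of evidence":
# the log of the maximum likelihood ratio exp(z^2 / 2) for a z-test.
from scipy import stats

def nats(p_two_sided):
    z = stats.norm.isf(p_two_sided / 2)   # z-score corresponding to the two-sided p-value
    return z ** 2 / 2

print(f"{nats(0.16):.2f}")   # ~0.99 nats
print(f"{nats(0.05):.2f}")   # ~1.92 nats
```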

I think part of the problem is that there is a single confidence threshold, usually 90%. The problem is that setting the threshold high enough to compensate for random flukes and file drawer effects causes problems when people start interpreting threshold - epsilon to mean the null hypothesis has been proven. Maybe it would be better to have two thresholds with results between them interpreted as inconclusive.

-1PhilGoetz
That is part of the problem. If it weren't for using a cutoff, then it would be the case that "proving" "! forall X P(X)" with high confidence would be evidence for "for many X, !P(X)", as several of the comments below are claiming. But even if they'd used some kind of Bayesian approach, assuming that all children are identical would still mean they were measuring evidence about the claim "X affects all Y", and that evidence could not be used to conclusively refute the claim that X affects some fraction of Y. Using a cutoff, though, isn't an error. It's a non-Bayesian statistical approach that loses a lot of information, but it can give useful answers. It would be difficult to use a Bayesian approach in any food toxicity study, because setting the priors would be a political problem. They did their statistical analysis correctly.
[-]Cyan50

This post makes a point that is both correct and important. It should be in Main.


This post makes a point that is both correct and important. It should be in Main.

This post makes a point that is both correct and important. A post that makes this point should be in Main.

The reception of this post indicates that the desired point is not coming through to the target audience. That matters.

7Kawoomba
No it doesn't. It takes the word "all" as used in everyday language and pretends it is intended to be precisely the same as the logical "all" operator, which it of course it is not. It's the worst kind of nitpicking, the kind of "all people who have a heart attack should go to a licensed hospital" - "nuh-uh, not if the hospital is on fire / not if they are billionaires with a fully equipped medical team in their attic". What on Earth is "important" about such a point?

Not even that. It takes the zero-article plural as used in everyday language and pretends it is intended to be precisely the same as the logical "all" operator, which of course it is not.

-2Kawoomba
But ... but ... Science?
0[anonymous]
They tend to be used either for keeping crows from eating your crops or making rivals look bad by misrepresenting them.
-6PhilGoetz
4buybuydandavis
This post makes a point that is both correct and important, but Phil has clearly lost much of the audience and is ticked off besides, and I don't blame him. I think we've got two issues: the general issue of how one tests a null hypothesis and what it does and does not mean to reject the null, and the particular issue of food dyes. The general issue seems important, while the particular could provide a helpful illustration of the general. But I would think that someone else, and probably multiple someones, have already plowed this ground. Jaynes must have an article on this somewhere. Anyone got a good article?
6gwern
Depends on what you want. You could probably get something useful out of my http://lesswrong.com/lw/g13/against_nhst/ collection.
0buybuydandavis
Thanks. Interesting, but it doesn't really get at the heart of the problem here, of mistaken interpretation of a "failure to reject" result as confirmation of the null hypothesis, thereby privileging the null. That just shouldn't happen, but often does. I saw the Gigerenzer 2004 paper (you're talking about the Null Ritual paper, right?) earlier today, and it rang a few bells. Definitely liked the chart about the delusions surrounding p=0.01. Appalling that even the profs did so poorly. GG has another 2004 paper with a similar theme: "Mindless statistics", The Journal of Socio-Economics 33 (2004), 587–606. http://people.umass.edu/~bioep740/yr2009/topics/Gigerenzer-jSoc-Econ-1994.pdf
2gwern
Isn't that a major criticism of NHST, that almost all users and interpreters of it reverse the conditionality - a fallacy/confusion pointed out by Cohen, Gigerenzer, and almost every paper I cited there?
2Kindly
I think that's a separate mistake. This paper shows Pr[data|H0] > 0.05. The standard mistake you refer to switches this to falsely conclude Pr[H0|data] > 0.05. However, neither of these is remotely indicative of H0 being true.
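To see how far apart the two quantities can be, here is a toy calculation (the numbers are invented, and the tail probability is treated loosely as a likelihood purely for illustration):

    # Posterior probability of H0 given a prior and the likelihood of the data
    # under each hypothesis. Same Pr[data|H0] in both calls; very different Pr[H0|data].
    def posterior_h0(p_data_given_h0, p_data_given_h1, prior_h0):
        prior_h1 = 1.0 - prior_h0
        joint_h0 = p_data_given_h0 * prior_h0
        joint_h1 = p_data_given_h1 * prior_h1
        return joint_h0 / (joint_h0 + joint_h1)

    print(posterior_h0(0.16, 0.40, prior_h0=0.5))  # about 0.29
    print(posterior_h0(0.16, 0.40, prior_h0=0.9))  # about 0.78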
3[anonymous]
Thanks; I was trying to write a comment that said the same thing, but failed to do so.
8[anonymous]
disagree because not correct. Phil's logical interpretation procedure would call shenanigans whether or not the statistical reasoning was correct. The whole point of statistics is that it can tell us things logic cannot. If there is an important point to be made here, it needs to be made with a statistical analysis, not a logical one.
5Eugine_Nier
Logical analysis is a limiting case of statistical analysis, thus problems with logical reasoning have corresponding problems with statistical reasoning. I agree that Phil should have spelled out this distinction explicitly.
-5PhilGoetz

Does what you're saying here boil down to "failing to reject the null (H0) does not entail rejecting the alternative (H1)"? I have read this before elsewhere, but not framed in quantifier language.

7Cyan
No, it's more subtle than that. I think it's more clearly stated in terms of effect sizes. (Down with null hypothesis significance testing!) The study measured the average effect of food dye on hyperactivity in the population and showed it was not distinguishable from zero. The quoted conclusion makes the unfounded assumption that all children can be characterized by that small average effect. This ignores unmeasured confounders, which is another way of phrasing PhilGoetz's correct (CORRECT, PEOPLE, CORRECT!) point. The document I linked mentions doing a "sensitivity analysis for the possibility of unmeasured confounding, to see the sorts of changes one could expect if there were such a confounder." In the above study (assuming PhilGoetz described it correctly; I haven't read the original paper), the data permitted such a sensitivity analysis. It would have given an upper bound for the effect of the unmeasured confounder as a function of an assumed prevalence of the confound in the population. (A smaller assumed prevalence gives a larger upper bound.) But if you don't even notice that it's possible for children to have heterogeneous responses to the treatment, you'll never even think of doing such a sensitivity analysis.
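As a crude sketch of the kind of bound such an analysis produces (a simplification of the idea, not the linked document's actual procedure): if only some fraction of children respond and the rest are unaffected, the population-average effect is the per-responder effect scaled down by that fraction, so an upper bound on the average effect translates into a much looser bound on the per-responder effect.

    # Hypothetical illustration: mean_upper_bound is a 95% upper bound on the
    # average effect in the whole sample; prevalence is the assumed fraction of responders.
    def responder_effect_upper_bound(mean_upper_bound, prevalence):
        return mean_upper_bound / prevalence

    for prevalence in (0.50, 0.15, 0.05):
        bound = responder_effect_upper_bound(1.0, prevalence)
        print(f"assumed prevalence {prevalence:.0%}: per-responder effect could be {bound:.1f}x the average bound")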
2PhilGoetz
Yes, but I also spelled out why I think they're making that mistake. They're trying to claim the authority of logic, but not being rigorous and forgetting that logical statements shouldn't contain unquantified variables.

I think the picture is not actually so grim: the study does reject an entire class of (distributions of) effects on the population.

Specifically, it cannot be the case (with 95% certainty or whatever) that a significant proportion of children are made hyperactive, while the remainder are unaffected. This does leave a few possibilities:

  • Only a small fraction of the children were affected by the intervention.

  • Although a significant fraction of the children were affected by the intervention in one direction, the remainder were affected in the opposite direction

... (read more)
5PhilGoetz
Specifically, it cannot be the case, with 95% certainty, that all children are made hyperactive. That is exactly what they proved with their F-tests (though if you look at the raw data, the measures they used of hyperactivity conflicted with each other so often that it's hard to believe they measured anything at all). They did not prove, for instance, that it cannot be the case, with 95% certainty, that all but one of the children are made hyperactive. Yet they claimed, as I quoted, that they proved that no children are made hyperactive. It's a logic error with large consequences in healthcare and in other domains. You're correct that the study data is useful and rules out some possibilities. But the claim they made in their summary is much stronger than what they showed.
5Kindly
They did not say this but I am confident that if this bizarre hypothesis (all but one of what group of children, exactly?) were tested, the test would reject it as well. (Ignoring the conflicting-measures-of-hyperactivity point, which I am not competent to judge.) In general, the F-test does not reject all alternate hypotheses equally, which is a problem but a different one. However, it provides evidence against all hypotheses that imply an aggregate difference between the test group and control group: equivalently, we're testing if the means are different. If all children are made hyperactive, the means will be different, and the study rejects this hypothesis. But if 99% of children are made hyperactive to the same degree, the means will be different by almost the same amount, and the test would also reject this hypothesis, though not as strongly. I don't care how you wish to do the calculations, but any hypothesis that suggests the means are different is in fact made less likely by the study. And hence my list of alternate hypotheses that may be worth considering, and are not penalized as much as others. Let's recap:

  • If the effect is present but weak, we expect the means to be close to equal, so the statistical test results don't falsify this hypothesis. However, we also don't care about effects that are uniformly weak.

  • If the effect is strong but present in a small fraction of the population, the means will also be close to equal, and we do care about such an effect. Quantifying "strong" lets us quantify "small".

  • We can allow the effect to be strong and present in a larger fraction of the population, if we suppose that some or all of the remaining children are actually affected negatively.
1PhilGoetz
This is math. You can't say "If 2+2 = 4, then 2+1.9 = 4." There is no "as strongly" being reported here. There is only accept or reject. The study rejects a hypothesis using a specific number that was computed using the assumption that the effect is the same in all children. That specific number is not the correct number to reject the hypothesis that the effect is the same in all but one. It might so happen that the data used in the study would reject that hypothesis, if the correct threshold for it were computed. But the study did not do that, so it cannot claim to have proven that. The reality in this case is that food dye promotes hyperactivity in around 15% of children. The correct F-value threshold to reject that hypothesis would be much, much lower!
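To make that gap concrete, here is a rough simulation with invented numbers (20 children per group, a 1-standard-deviation effect in responders; not the study's actual design), showing how much more often a two-sample t-test rejects the null when every child responds than when 15% do:

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)

    def rejection_rate(responder_fraction, n=20, effect=1.0, trials=2000, alpha=0.05):
        rejections = 0
        for _ in range(trials):
            control = rng.normal(0.0, 1.0, n)
            treated = rng.normal(0.0, 1.0, n)
            responders = rng.random(n) < responder_fraction
            treated[responders] += effect          # only the responders get the effect
            if ttest_ind(treated, control).pvalue < alpha:
                rejections += 1
        return rejections / trials

    print(rejection_rate(1.00))  # roughly 0.85-0.9: failing the test would be informative
    print(rejection_rate(0.15))  # well under 0.2: failing the test says very little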
2Kindly
I don't think we actually disagree. Edit: Nor does reality disagree with either of us.
-1PhilGoetz
You're correct in a broader sense that passing the F-test under one set of assumptions is strong evidence that you'll pass it with a similar set of assumptions. But papers such as this use logic and math in order to say things precisely, and while what they claimed is supported by, and similar to, what they proved, it isn't the same thing, so it's still an error, just as 3.9 is similar to 4 for most purposes, but it is an error to say that 2 + 1.9 = 4.
0Kindly
The thing is, some such reasoning has to be done in any case to interpret the paper. Even if no logical mistake was made, the F-test can't possibly disprove a hypothesis such as "the means of these two distributions are different". There is always room for an epsilon difference in the means to be compatible with the data. A similar objection was stated elsewhere on this thread already: And of course it's legitimate to give up at this step and say "the null hypothesis has not been rejected, so we have nothing to say". But if we don't do this, then our only recourse is to say something like: "with 95% certainty, the difference in means is less than X". In other words, we may be fairly certain that 2 + 1.9 is less than 5, and we're a bit less certain that 2 + 1.9 is less than 4, as well. Incidentally, is there some standard statistical test that produces this kind of output?
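One standard tool that produces exactly that kind of output is a one-sided confidence bound on the difference in means (equivalence testing, e.g. TOST, is the hypothesis-test version of the same idea). A minimal sketch with toy data, not the study's:

    import numpy as np
    from scipy import stats

    control = np.array([3.1, 2.8, 3.5, 3.0, 2.9, 3.3, 3.2, 2.7])
    treated = np.array([3.2, 3.0, 3.4, 3.1, 3.3, 2.9, 3.5, 3.0])

    diff = treated.mean() - control.mean()
    se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))
    df = len(treated) + len(control) - 2          # crude; a Welch correction would be safer
    upper_95 = diff + stats.t.ppf(0.95, df) * se  # one-sided 95% upper bound on the true difference

    print(f"observed difference: {diff:.2f}")
    print(f"95% upper bound:     {upper_95:.2f}")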

When people do studies of the effects of food coloring on children, are the children blindfolded?

That is, can the studies discern the neurochemical effects of coloring molecules from the psychological effects of eating brightly-colored food?

I expect that beige cookies are not as exciting as vividly orange cookies.

My read of the Mattes & Gittelman paper is that they're comparing natural and artificial food coloring.

Moreover, no type of rater (parents, teachers, psychiatrists, or children) guessed the type of cookie at better than chance rates.

The tests compute what a difference in magnitude of response such that, 95% of the time, if the measured effect difference is that large, the null hypothesis (that the responses of all subjects in both groups were drawn from the same distribution) is false.

I think that should be: The tests compute what is a difference in magnitude of response such that, if the null hypothesis is true, then 95% of the time the measured effect difference will not be that large.

Frequentist statistics cannot make the claim that with some probability the null hypothesis is true or false. Ever. You must have a prior and invoke Bayes' theorem to do that.
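Concretely, under the corrected reading the cutoff is just the 95th percentile of the test statistic's distribution computed as if the null hypothesis were true; for example, for a one-tailed two-sample t statistic with 20 subjects per group (invented sizes, not the study's):

    from scipy import stats

    n1 = n2 = 20
    df = n1 + n2 - 2
    cutoff = stats.t.ppf(0.95, df)  # under H0, the t statistic exceeds this only 5% of the time
    print(cutoff)                   # about 1.69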

I'm not as interested in proving my point, as in figuring out why people resist it so strongly. It seems people are eager to disagree with me and reluctant to agree with me.

How did the post make you feel, and why?

8buybuydandavis
It's not just their feelings, it's their priors. I've found previously that many people here are extremely hostile to criticisms of the statistical methods of the medical establishment. It's extremely odd at a site that puts Jaynes on a pedestal, as no one rants more loudly and makes the case clearer than Jaynes did, but there it is.

But consider if you're not a statistician. You're not into the foundations of statistical inference. You haven't read Jaynes. Maybe you've had one semester of statistics in your life. When you're taught hypothesis testing, you're taught a method. That method is statistics. There's no discussion about foundations. And you look at medical journals. This is how they do it. This is how science is done. And if you're a rationalist, you're on Team Science.

Elsewhere in the thread, there are links to a Gigerenzer paper showing how statistics students and their professors are making fundamental errors in their interpretations of the results of confidence interval testing. If stat professors can't get it right, the number of people who have any notion that there is a possibility of an issue is vanishingly small. Higher here than usual, but still a minority.

Meanwhile, you've shown up and attacked Team Science in general in medicine. To add the cherry on top, you did it in the context of exactly the kind of issue that Team Science most rants about - some "anecdotal hysteria" where parents are going ballistic about some substance that is supposedly harming their precious little lumpkins. But everyone knows there is nothing wrong with food dyes. They've been around for decades. The authorities have tested them and declared them safe. There is no evidence for the claim, and it's harmful to be drumming up these scares.

My PhD was in EE, doing statistical inference in a machine learning context. Rather unsophisticated stuff, but that's something of the point. Science is not an integrated whole where the best thoughts, best practices, and best met
[-]gwern180

I've found previously that many people here are extremely hostile to criticisms of the statistical methods of the medical establishment. It's extremely odd at a site that puts Jaynes on a pedestal, as no one rants more loudly and makes the case clearer than Jaynes did, but there it is.

(Eliezer does, anyway. I can't say I see very many quotes or invocations from others.)

I am hostile to some criticisms, because in some cases when I see them being done online, it's not in the spirit of 'let us understand how these methods make this research fundamentally flawed, what this implies, and how much we can actually extract from this research'*, but in the spirit of 'the earth is actually not spherical but an oblate spheroid thus you have been educated stupid and Time has Four Corners!' Because the standard work has flaws, they feel free to jump to whatever random bullshit they like best. 'Everything is true, nothing is forbidden.'

* eg. although extreme and much more work than I realistically expect anyone to do, I regard my dual n-back meta-analysis as a model of how to react to potentially valid criticisms. Instead of learning that passive control groups are a serious methodological iss... (read more)

0buybuydandavis
Put Jaynes on a pedestal, you mean? Hmmmn. I had that problem before with Korzybski. I saw The Map is Not the Territory, and assumed people were familiar with Korzybski and his book Science and Sanity. Turns out even Eliezer hadn't read it, and he got his general semantics influence from secondary writers such as Hayakawa. I've only found one guy here with confirmed knowledge of a breadth of general semantics literature. How many people do you think have read substantial portions of Jaynes's book? Have you?
4gwern
Yes. To put it one way, a site search for 'Jaynes' (which will hit other people sometimes discussed on LW, like the Bicameral Mind Jaynes) turns up 718 hits; in contrast, to name some other statisticians or like-minded folks - 'Ioannidis' (which is hard to spell) turns up 89 results, 'Cochrane' 57, 'Fisher' 127, 'Cohen' 193, 'Gelman' 128, 'Shalizi' 109... So apparently in the LW-pantheon-of-statisticians-off-the-top-of-my-head, Jaynes can barely muster a majority (718 vs 703). For someone on a pedestal, he just isn't discussed much. Most of those in the book reading clubs fail, he is rarely quoted or cited... <5%. I bought a copy (sitting on my table now, actually), read up to chapter 4, some sections of other chapters that were interesting, and concluded that a number of reviews were correct in claiming it was not the best introduction for a naive non-statistician. So I've been working through other courses/papers/books and running experiments and doing analyses of my own to learn statistics. I do plan to go back to Jaynes, but only once I have some more learning under my belt - the Probabilistic Graphical Models Coursera is starting today, and I'm going to see if I can handle it, and after that I'm going to look through and pick one of Kruschke's Doing Bayesian Data Analysis, Sivia's Data Analysis: A Bayesian Tutorial, Bolstad's Introduction to Bayesian Statistics, and Albert's Bayesian Computation with R. But we'll see how things actually go.
0ChristianKl
The problem is that it's very hard to change your worldview, or even to coherently understand the worldview of someone else. Understanding that you might be wrong about things you take for granted is hard. Among new atheists, even the notion that the nature of truth is up for discussion is very threatening. Even if they read Jaynes from cover to cover, they would take the notion of truth they learned as children for granted and not think deeply about where Jaynes's notion of truth differs from their own. The discussion about Bayesianism with David Chapman illustrates how he and senior LW people didn't even get clear about the points on which they disagree.
0buybuydandavis
I don't know if it's threatening, and I doubt that it applies to Dennett, but the other guys can't seem to even conceive of truth beyond correspondence. But if it's a matter of people being open to changing their world view, to even understanding that they have one and that other people have other world views, it's Korzybski they need to read, not Jaynes.

The guy with the blog is Chapman? I don't see a discussion. I see a pretty good video, and blog comments that I don't see any value at all in. I had characterized them more colorfully, but seeing that Chapman is on the list, I decided to remove the color. I'm not trying to be rude here, but his comments are just very wrong about probability, and thereby entirely clueless about the people he is criticizing. As an example: No! Probability as inference most decidedly is not "just arithmetic". Math tells you nothing axiomatically about the world. All our various mathematics are conceptual structures that may or may not be useful in the world. That's where Jaynes, and I guess Cox before him, adds in the magic.

Jaynes doesn't proceed axiomatically. He starts with the problem of representing confidence in a computer, and proceeds to show how the solution to that problem entails certain mathematics. He doesn't proceed by "proof by axiomatic definitions"; he shows that the conceptual structures work for the problem attacked. Also, in Jaynes's presentation of probability theory as an extension of logic, P(A|B) isn't axiomatically defined as P(AB)/P(B); it is the mathematical value assigned to the plausibility of a proposition A given that proposition B is taken to be true. It's not about counting, it's about reasoning about the truth of propositions given our knowledge.

I guess if he's failing utterly to understand what people are talking about, what they're saying might look like ritual incantation to him. I'm sure it is for some people. Is there some reason I should take David Chapman as particularly authoritative? Why do
0ChristianKl
Because senior LW people spent effort in replying to him. The post led to LW posts such as "what bayesianism taught me". Scott Alexander wrote in response: "on first looking into Chapman's pop-bayesianism". Kaj Sotala had a lively exchange in the comments of that article. I think in total that exchange provides a foundation for clearing up the question of what Bayesianism is. I do consider that an important question.

As far as authority goes, David Chapman did publish academic papers about artificial intelligence. He did develop solutions for previously unsolved AI problems. When he says that there's no sign of Bayes' theorem in the code that he used to solve an AI problem, he just might be right.

Dennett is pretty interesting. Instead of asking what various people mean when they say consciousness, he just assumes he knows and declares it nonexistent. The idea that maybe he doesn't understand what other people mean by the term doesn't come up in his thought. Dennett writes about how detailed visual hallucinations are impossible. I have had experiences where what I visually perceived didn't change much whether or not I closed my eyes. It was after I spent 5 days in an artificial coma. I know two additional people, whom I meet face to face, who have had similar experiences. I also have access to various accounts of people hallucinating things in other contexts via hypnosis. My own ability to let myself go is unfortunately not good, so I still lack some first-hand accounts of some other hallucinations.

A week ago I spoke at our local LW meetup with someone who said that while "IQ" obviously exists, "free will" obviously doesn't. At that point in time I didn't know exactly how to resolve the issue, but it seems to me that those are both concepts that exist somehow on the same level. You won't find any IQ atoms and you won't find any free will atoms, but they are still mental concepts that can be used to model things about the real world. That's a problem that arises from not having a well
0buybuydandavis
I think you're being unfair to Dennett. He actually has availed himself of the findings of other fields, and has been at the consciousness shtick for decades. He may not agree, but it's unlikely he is unaware. And when did he say consciousness was nonexistent? Cite? That seems a rather odd thing for him to say, and not particularly in his ideological interests. Cite here? Again, except for supernatural bogeymen, my experience of him is that he recognizes that all sorts of mental events exist, but maybe not in the way that people suppose. Not accurate. If those things don't fit in their world views, they don't exist in them, so they're not pretending. On the general brouhaha with Chapman, I seem to have missed most of that. He did one post on Jaynes and A_p, which I read as I've always been interested in that particular branch of Jaynes' work. But the post made a fundamental mistake, IMO, and in the opinion of others, and I think Chapman admitted as much before all of his exchanges were over. So even with Chapman running the scoreboard, he's behind in points.
0Jayson_Virissimo
Well, for one thing, Chapman was (at least at one point) a genuine, credentialed AI researcher and a good fraction of content on Less Wrong seems to be a kind of armchair AI-research. That's the outside view, anyway. The inside view (from my perspective) matches your evaluation: he seems just plain wrong.
0buybuydandavis
I think a few people here are credentialed, or working on their credentials in machine learning. But almost everything useful I learned, I learned by just reading the literature. There were three main guys I thought had good answers - David Wolpert, Jaynes, and Pearl. I think time has put its stamp of approval on my taste. Reading more from Chapman, he seems fairly reasonable as far as AI goes, but he's got a few ideological axes to grind against some straw men. On his criticisms of LW and Bayesianism, is there anyone here who doesn't realize you need algorithms and representations beyond Bayes' Rule? I think not too long ago we had a similar straw-man massacre where everyone said "yeah, we have algorithms that do information processing other than Bayes' rule - duh". And he really should have stuck it out longer in AI, as Hinton has gone a long way toward solving the problem Chapman thought was insurmountable - getting a proper representation of the space to analyze from the data without human spoon-feeding. You need a hidden variable model of the observable data, and should be able to get it from prediction of subsets of the observables using the other observables. That much was obvious; it just took Hinton to find a good way to do it. Others are coming up with generalized learning modules and mapping them to brain constructs. There was never any need to despair of progress.
-2buybuydandavis
But you don't have a complete fossil record, therefore Creationism! Obviously that's a problem. This somewhat confirms my comment to Phil, that linking the statistical issue to food dyes made reception of his claims harder, as it better fit your pattern than a general statistical argument. But from the numbers he reported, the basic eyeball test of the data leaves me thinking that food dyes may have an effect. Certainly if you take the data alone without priors, I'd conclude that more likely than not, food dyes have an effect. That's how I would interpret the 84% significance threshold - probably there is a difference. Do you agree? Unfortunately, I don't have JAMA access to the paper to really look at the data, so I'm going by the 84% significance threshold.
2PhilGoetz
I made up the 84% threshold in my example, to show what can happen in the worst case. In this study, what they found was that food dye decreased hyperactivity, but not enough to pass the threshold. (I don't know what the threshold was or what confidence level it was set for; they didn't say in the tables. I assume 95%.) If they had passed the threshold, they would have concluded that food dye affects behavior, but would probably not have published because it would be an embarrassing outcome that both camps would attack.
2buybuydandavis
To be clear, then, you're not claiming that any evidence in the paper amounts to any kind of good evidence that an effect exists? You're making a general argument about the mistaken conclusion of jumping from "failure to reject the null" to a denial that any effect exists.
2PhilGoetz
Yes, I'm making a general argument about that mistaken conclusion. The F-test is especially tricky, because you know you're going to find some difference between the groups. What difference D would you expect to find if there is in fact no effect? That's a really hard question, and the F-test dodges it by using the arbitrary but standard 95% confidence interval to pick a higher threshold, F. Results between D and F would still support the hypothesis that there is an effect, while results below D would be evidence against that hypothesis. Not knowing what D is, we can't say whether failure of an F-test is evidence for or against the hypothesis.
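A quick simulation of that gap, with invented numbers (20 subjects per group, unit-variance responses): D is the difference you would typically see when there is no effect at all, and F is the 95th-percentile cutoff the test actually uses. An observed difference between the two fails the test, yet is larger than what "no effect" typically produces.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 20
    null_diffs = np.array([
        abs(rng.normal(0, 1, n).mean() - rng.normal(0, 1, n).mean())
        for _ in range(10_000)
    ])

    D = np.median(null_diffs)          # the difference you'd typically see with no effect
    F = np.quantile(null_diffs, 0.95)  # the cutoff a 95%-confidence test uses
    print(f"D (typical under the null): {D:.2f}")  # about 0.21
    print(f"F (95% cutoff):             {F:.2f}")  # about 0.62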
0buybuydandavis
I'd add to the basic statistical problem the vast overgeneralization and bad decision theory. You hit on one part of that, the generalization to the entire population. People are different.

But even if they're the same, U-shaped response curves make it unlikely to find a signal - you have to have the goldilocks amount to show an improvement. People vary over time, going in and out of the goldilocks range. So when you add something, you'll be pushing some people into the goldilocks range, and some people out. It also comes from multiple paths to the same disease. A disease is a set of observable symptoms, not the varying particular causes of the same symptoms. Of course it's hard to find the signal in a batch of people clustered into a dozen different underlying causes for the same symptoms.

But the bad decision theory is the worst part, IMO. If you have a chronic problem, a 5% chance of a cure from a low risk, low cost intervention is great. But getting a 5% signal out of black-box testing regimes biased against false positives is extremely unlikely, and the bias against interventions that "don't work" keeps many doctors from trying perfectly safe treatments that have a reasonable chance of working.

The whole outlook is bad. It shouldn't be "find me a proven cure that works for everyone". It should be "find me interventions to control the system in a known way." Get me knobs to turn, and let's see if any of the knobs work for you.
2gwern
I believe Knight posted links to fulltext at http://lesswrong.com/lw/h56/the_universal_medical_journal_article_error/8pne I haven't looked but I suspect I would not agree and that you may be making the classic significance misinterpretation.
2Eugine_Nier
I think the problem is that you talked about statistics in a hand-wavy way and as a result people misunderstood you. It also didn't help that the way you interpreted the logical structure of the paper ignored the standard steelmanning of frequentist papers.
1PhilGoetz
What do you mean by your second sentence?
3Eugine_Nier
For example, in messy topics like biology, most instances of "all" should be replaced with "most". In other words, people were translating the universal statements into probabilistic statements. They were subsequently confused when you insisted on treating the problem as logical rather than statistical.
6drethelin
This seems to be a very common nerd argument failure mode.
3Eugine_Nier
What is the antecedent of "this"? This isn't a rhetorical question, I honestly can't figure out which of several possibilities you're referring to.
4drethelin
responding to claims as if they are meant literally or arguments as if they're deductive logical arguments.
-2PhilGoetz
It is because it is a statistical problem that you can't replace "all" with "most". The F-value threshold was calculated assuming "all", not "most". You'd need a different threshold if you don't mean "all". Also, the people I am complaining about explicitly use "all" when they interpret medical journal articles in which a test for an effect failed as having proven that the effect does not exist for any patients.
2Kindly
I'm not sure if I was included among "people" but in retrospect it seems like I was simply doing this. I'm sorry.
1[anonymous]
I skimmed over the initial first-order logic bits, was bored by the statistics bits, and came away feeling like the conclusion was pedantic and mostly well-known among my peers.
1PhilGoetz
Thank you for the answer. Now we know (from the many upvoted disagreements in the comments, and the downvoting of every comment I made correcting their mistakes) that the conclusion is not well-known among readers of LessWrong.
[-]leiz10

I think that the universal quantifier in

!( ∀child ( eats(child, coloring) ⇨ hyperactive(child) ) )

is not appropriate.

The original statement

artificial food coloring causes hyperactivity in all children.

only implies that artificial food coloring was responsible for all children's hyperactivity, not that any child who ever ate artificial food coloring would inevitably become hyperactive. So the formula without the universal quantifier is more reasonable, and thus the final statement of the article is unproblematic.

-3PhilGoetz
No, you are picking up on the lack of details about time that I mentioned. You really don't want me to write a proposition incorporating the time relationship between when cookies were eaten and when and how behavior was measured. The formula without a quantifier wouldn't even be well-formed. It would have no meaning.
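For the record, pushing the negation inward correctly (and still ignoring the timing details) keeps a quantifier; it just flips to an existential:

!( ∀child ( eats(child, coloring) ⇨ hyperactive(child) ) ) = ∃child ( eats(child, coloring) ∧ !hyperactive(child) )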
3leiz
OK, I agree that the word 'inevitably' is ambiguous. Regardless of the accuracy of the literal-to-logical translations, I think the reason the logical expression of the statement of the article does not match that of the final conclusion, as your logical reasoning proves, is that the writers were not doing that very logical reasoning but doing medical research, and thus proposing something new, not something of equivalent logical consequence. Their first statement: only implies that they did not buy the hypothesis, which did not necessarily imply that they accepted the negation of the hypothesis, which corresponds to your first formula: equivalently: Even though they actually accepted the negation of the hypothesis, that is to say, accepted your first formula, the final conclusion they got through the research is that: whose corresponding logical expression is your second formula: This formula seems stronger than the first one: From my point of view, I don't think that the medical researchers were intentionally, arbitrarily generalizing their results or just making logical mistakes. They just took a position on an existing hypothesis and gave a new one through the article, in this case, a new stronger hypothesis (the word 'stronger' depends on whether they actually just negated the original hypothesis). I think their only fault is that they failed to keep their own medical views within one article logically consistent.
[-]Zian00

It would've been very helpful if some sort of glossary or even a Wikipedia link was provided before diving into the use of the notational characters such as those used in "∀x !P(x)".

Although this post covers an important topic, the first few sentences almost lost me completely, even though I learned what all those characters meant at one time.

And, as LessWrong is rather enamored with statistics, consider that by writing P(x,y), readers have exactly a 50% chance of getting the opposite meaning unless they have very good recall. :)

This refereed medical journal article, like many others, made the same mistake as my undergraduate logic students, moving the negation across the quantifier without changing the quantifier. I cannot recall ever seeing a medical journal article prove a negation and not make this mistake when stating its conclusions.

That would be interesting if true. I recommend finding another one, since you say they're so plentiful. And I also recommend reading it carefully, as the study you chose to make an example of is not the study you were looking for. (If you don... (read more)

Why is this post not visible under "Main->New"?

4Kawoomba
The post has been moved to Discussion, I don't know by whom. Edit: My guess would be EY or Vladimir Nesov. Edit2: Back to Main we go. The Wheels on the Bus Go Round and Round ... Edit3: Back to Discussion. Edit4: Back to Main. Whodunnit? Edit 5: Deleted! Wait ... sighting confirmed, back in Discussion! Is there no stop condition? Is it a ghost? Is it unkillable?
-3PhilGoetz
By whom? Okay, folks. Now you're just being assholes. Why are you downvoting me for asking who moved my post?
7wedrifid
You have not submitted to their social pressure in other parts of this thread. This offends people (for socio-politically rational reasons). They will now attack just about anything you write in this context. You could write the equations for an intuitive and correct physical Theory of Everything and you would still get (some) downvotes. Note that calling them assholes makes it easier for them and seems to benefit you not in the slightest.

Unfortunately, there's an error in your logic: You call that type of medical journal article error "universal", i.e. applicable in all cases. Clearly a universal quantifier if I ever saw one.

That means that for all medical journal articles, it is true that they contain that error.

However, there exists a medical journal article that does not contain that error.

Hence the medical journal error is not universal, in contradiction to the title.

First logical error ... and we're not even out of the title? Oh dear.

4Dan_Moore
Perhaps a clearer title would have been 'A Universal Quantifier Medical Journal Article Error'. Bit of a noun pile, but the subject of the post is an alleged unjustified use of a universal quantifier in a certain article's conclusion. By the way, I think PhilGoetz is 100% correct on this point - i.e., upon failure to prove a hypothesis using standard frequentist techniques, it is not appropriate to claim that the opposite has been established.
0[anonymous]
Oh come on.