srdiamond comments on Simpson's Paradox - Less Wrong

68 Post author: bentarm 12 January 2011 11:01PM

Comments (58)

Comment author: bentarm 14 January 2011 03:12:46AM 1 point

Well, an important question to ask is how the data were generated. If the only thing we know about each patient is whether they were male or female and whether they were born under a Fire sign, and being born under a Fire sign seems to have some explanatory power, then by all means go for it. As Dave suggests below, it is perfectly possible that the astrological data are hiding some genuine phenomenon.

However, if someone collected every possible piece of astrological data, and tried splitting the patients along every one of the 2^11 possible partitions of the twelve star signs into two groups, you should not be surprised to find that at least one of them displayed this sort of behaviour.
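To make the partition-search point concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the 600 simulated patients, the 50% recovery chance that ignores star sign entirely, and the helper names `two_way_splits` and `rate_gap`. It compares the recovery-rate gap for one split chosen in advance (the three Fire signs) with the largest gap found by searching every split. Dropping the trivial split where one group is empty leaves 2^11 - 1 = 2047 splits to search.

```python
import random
from itertools import combinations

N_SIGNS = 12  # star signs, indexed 0..11

def two_way_splits(n):
    """Yield each split of range(n) into two non-empty groups exactly once.

    Sign 0 is always placed in the yielded group, so a split and its
    mirror image are not counted twice.
    """
    rest = range(1, n)
    for k in range(n - 1):  # k = n - 1 would leave the other group empty
        for extra in combinations(rest, k):
            yield frozenset((0, *extra))

def rate_gap(patients, group):
    """Absolute difference in recovery rate between `group` and everyone else."""
    inside = [ok for sign, ok in patients if sign in group]
    outside = [ok for sign, ok in patients if sign not in group]
    if not inside or not outside:
        return 0.0
    return abs(sum(inside) / len(inside) - sum(outside) / len(outside))

# Simulated data with NO real effect: recovery is a coin flip for every sign.
rng = random.Random(0)
patients = [(rng.randrange(N_SIGNS), rng.random() < 0.5) for _ in range(600)]

fire = frozenset({0, 4, 8})  # Aries, Leo, Sagittarius, as indices (illustrative)
planned = rate_gap(patients, fire)
searched = max(rate_gap(patients, g) for g in two_way_splits(N_SIGNS))

print(f"gap for the split chosen in advance: {planned:.3f}")
print(f"largest gap over all splits:         {searched:.3f}")
```

The searched gap can never be smaller than the planned one, since the planned split is among those searched. That is why a gap found by searching needs much stronger evidence behind it than one that was hypothesised beforehand.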

I think the key message is that you shouldn't be making causal inferences from correlational findings unless you have some good reason to do so.

Comment author: [deleted] 14 January 2011 05:33:00AM 2 points

The urge to infer causation from correlation must be powerful. We can easily spot unwarranted causal inferences in familiar forms, apparently because we have overtrained on recognizing certain patterns, but as soon as the same caveat is expressed in a novel way, we have to work to apply the principle to the new form. Simpson's Paradox seems to be not just the bearer of the message that you shouldn't make automatic causal inferences from mere correlation; it is an explanation of why that inference is invalid. A blind correlation 1) doesn't screen out confounds, and 2) might screen out the causal factor.

It seems that we've learned part 1 well, but the complete explanation for the possibility that correlations hide causes includes part 2. It seems part 2 is harder. While we've all learned to spot instances of part 1, we still founder on part 2. We're inclined to think partitioning the data can't make the situation epistemically worse, but it can by screening out the wrong variable, that is, the causal variable.

So in the real life example, we don't find it so counter-intuitive that data about the success rates of men and women fail to prove discrimination when you don't control for the confounds. But we do stumble when it goes the other way. If we had data showing that women do better than men for the competitive positions as well as the easy positions, we would still find it hard to see that this doesn't prove that women do better than men overall.
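A small worked example of that second direction, sketched in Python. The numbers are invented purely for illustration and are not taken from the original post: women have the higher success rate for both kinds of position, yet the lower success rate overall, because most women applied for the competitive positions and most men for the easy ones.

```python
# Invented figures, (hired, applied), chosen only to exhibit the reversal.
applications = {
    "competitive": {"women": (20, 100), "men": (1, 10)},
    "easy":        {"women": (9, 10),   "men": (80, 100)},
}

def rate(position, group):
    """Success rate for one group within one kind of position."""
    hired, applied = applications[position][group]
    return hired / applied

def overall_rate(group):
    """Success rate for one group with both kinds of position pooled."""
    hired = sum(applications[p][group][0] for p in applications)
    applied = sum(applications[p][group][1] for p in applications)
    return hired / applied

for position in applications:
    print(f"{position:12s} women {rate(position, 'women'):.0%}  men {rate(position, 'men'):.0%}")
print(f"{'overall':12s} women {overall_rate('women'):.0%}  men {overall_rate('men'):.0%}")
```

Within each row women come out ahead, but the pooled row favours men. Which of the two comparisons is the right one to trust depends on the causal story, which is the point of the comment above.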