I don't think I ever said anything about relative comparisons between correlation coefficients (only that the overall correlation is positive).
The main line I'm thinking of is:
the data is telling a very simple story, which is that A) blue men are more educated and B) more educated people get paid more.
I don't think this story quite captures the data, because I can construct a model where both of these are true but we don't get this effect. Suppose each group has the same link between income and education conditional on group membership (and a net positive relationship without conditioning on group membership), but the blue group's mean is simply to the right of (i.e. more educated than) the green group's. Then you don't have this effect, because along equal-education lines blues don't earn more than greens (they earn less; this is a straightforward 'discrimination against blues' story).
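This construction is easy to check numerically. A sketch (the means, slope, and the blue intercept penalty below are my own illustrative choices, not numbers from the data under discussion): both groups share the same education-income slope, the blue line is shifted down, and the blue education distribution is shifted right.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Blue group: more educated on average, but penalized at equal education.
edu_blue = rng.normal(14, 2, n)
edu_green = rng.normal(10, 2, n)
inc_blue = 2 * edu_blue - 5 + rng.normal(0, 1, n)   # same slope, lower intercept
inc_green = 2 * edu_green + rng.normal(0, 1, n)

# A) blues are more educated on average
assert edu_blue.mean() > edu_green.mean()

# B) pooled, more education goes with more income (positive correlation)
edu = np.concatenate([edu_blue, edu_green])
inc = np.concatenate([inc_blue, inc_green])
assert np.corrcoef(edu, inc)[0, 1] > 0

# ...yet at any fixed education level (e.g. near 12), blues earn less
at_12_blue = inc_blue[np.abs(edu_blue - 12) < 0.25].mean()
at_12_green = inc_green[np.abs(edu_green - 12) < 0.25].mean()
assert at_12_blue < at_12_green
```

All three facts hold at once: blues are more educated, education predicts income in the pooled data, and yet blues earn less at any fixed education level.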
I would use the language of B to mean "in the three-node model with color, education, and income, the direct effect of education on income is positive," which does not appear to be true in the graphs; they look like they could be generated from an E<-C->I model. It could also be used to mean "in the two-node model with education and income, the direct effect of education on income is positive," but that seems unnatural in a case where you know that the link from E to I is spurious (that is, it flows backwards along the causal path from C to E; changing your level of education can't change your color). But this could just be me expecting an unreasonable level of precision from your wording, since the straightforward interpretation, though unnatural, does fit the data (although I think it reduces the strength of the "this doesn't show discrimination" claim, because it does show that what looked like a virtuous education-income link is now a questionable color-income link).
I don't quite see the difference in interpretation from this writing.
It's very possible I've imagined the difference / misunderstood what you've written. My appreciation of the filtering effect of the slices is also very recent, and I may think it's more important as I think about it more.
It seems that I'm quick to jump to a graphical model whose nodes capture the effects between these groups, and I want to keep direct, indirect, and total effects separate. I think that's why I see the fundamental broad mechanism here as a reversal effect: if you learn about a node C that was previously hidden, the apparent effect of node A on node B can reverse sign when the indirect path through C (from A to C and from C to B) overwhelms the old direct connection from A to B. (This places requirements on C, which I think matches up with your statement about inhomogeneous data.)
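A toy linear version of that reversal (all coefficients are my own illustrative choices): A and B are marginally positively associated because both lean on C, but conditioning on C exposes a negative direct effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

c = rng.normal(0, 1, n)                  # hidden node C
a = c + rng.normal(0, 0.5, n)            # A depends on C
b = 2 * c - a + rng.normal(0, 0.5, n)    # B: strong + path via C, direct - from A

# Marginally, A and B look positively related (the C path dominates)
assert np.corrcoef(a, b)[0, 1] > 0

# Conditioning on C (here: restricting to a thin slice of C) reveals the reversal
mask = np.abs(c) < 0.1
assert np.corrcoef(a[mask], b[mask])[0, 1] < 0
```

The thin slice of C plays the same role as the equal-education lines above: within it, only the direct (negative) connection is left to show through.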
In this view, noise is just a very convenient way to add new nodes which can cause these reversals, especially when the noise is multidimensional. So when I look at Simpson's paradox and RRE, the underlying mechanism I see is a reversal effect in each, and so they look fundamentally similar to me. I got the impression from your post that you think there's a fundamental difference between them, and I don't see that difference, but I might have misread you, or overestimated how large you think the difference is.
An article by Judea Pearl, available here. It's quick at 8 pages, and worth reading if you enjoy statistics (though I think people who are already familiar with the math of causality[1] will get more out of it than others[2]). I'll talk here about the part that I think is generally interesting:
I've never really liked the name "paradox," because what it seems to mean is "unintuitive phenomenon." (Wikipedia puts it as "something which seems false and yet might be true.") The trouble is that "unintuitive" is a two-place word, and it makes sense to think like reality, so that true things seem true to you, instead of still seeming false. (For example, when I first learned about Zeno's Paradox, I already knew calculus, and so Zeno's position was the one that seemed confusing and false.)
What I like most about Pearl's article is that it explicitly recognizes the importance of fully dissolving the paradox,[3] and seems to do so. Simpson's Paradox isn't an unsolvable problem in statistics; it's a straightforward reversal effect, but only if you use the language of causality.
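As a concrete instance, here is one classic dataset used to illustrate Simpson's paradox (the kidney-stone numbers from Charig et al. 1986; not necessarily the example Pearl uses), with the reversal dissolved by the back-door adjustment formula:

```python
# Kidney-stone data: treatment A wins in each stratum but loses overall.
data = {  # (successes, trials) keyed by (treatment, stone size)
    ("A", "small"): (81, 87),   ("A", "large"): (192, 263),
    ("B", "small"): (234, 270), ("B", "large"): (55, 80),
}

def rate(s, n):
    return s / n

# Overall (unadjusted) recovery rates: B looks better.
overall = {t: rate(sum(data[t, z][0] for z in ("small", "large")),
                   sum(data[t, z][1] for z in ("small", "large")))
           for t in ("A", "B")}
assert overall["B"] > overall["A"]

# Back-door adjustment on stone size: P(R | do(T)) = sum_z P(R | T, z) P(z)
total = sum(n for _, n in data.values())
p_z = {z: sum(data[t, z][1] for t in ("A", "B")) / total
       for z in ("small", "large")}
adjusted = {t: sum(rate(*data[t, z]) * p_z[z] for z in ("small", "large"))
            for t in ("A", "B")}
assert adjusted["A"] > adjusted["B"]  # the causal ordering puts A back on top
```

Once stone size is recognized as a confounder and adjusted for, the "paradox" is just the reversal effect described above, computed in three lines.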
1. My review of Causality gives a taste of what it would look like to be familiar with the math, but you'd need to actually read the book to pick it up. The Highly Advanced Epistemology 101 for Beginners sequence is relevant, and contains Eliezer's attempt to explain the basics of causality in Causal Diagrams and Causal Models.
2. Pearl discusses how you would go about using simulations to show that the do-calculus gives you the right result, but leaves it as an exercise for the reader.
3. How An Algorithm Feels From Inside is probably a better place to start than Dissolving the Question, and I can't help but echo a question from it: "So what kind of math design corresponds to [Simpson's Paradox]?"
See also: bentarm's explanation of Simpson's Paradox.