Understanding Simpson's Paradox

Vaniver

18th Sep 2013

An article by Judea Pearl, available here. It's quick at 8 pages, and worth reading if you enjoy statistics (though I think people who are already familiar with the math of causality¹ will get more out of it than others²). I'll talk here about the part that I think is generally interesting:

Any claim to a resolution of a paradox, especially one that has resisted a century of attempted resolution must meet certain criteria. First and foremost, the solution must explain why people consider the phenomenon surprising or unbelievable. Second, the solution must identify the class of scenarios in which the paradox may surface, and distinguish it from scenarios where it will surely not surface. Finally, in those scenarios where the paradox leads to indecision, we must identify the correct answer, explain the features of the scenario that lead to that choice, and prove mathematically that the answer chosen is indeed correct. The next three subsections will describe how these three requirements are met in the case of Simpson's paradox and, naturally, will proceed to convince readers that the paradox deserves the title "resolved."

I've never really liked the name "paradox," because what it seems to mean is "unintuitive phenomenon." (Wikipedia puts it as "something which seems false and yet might be true.") The trouble is that "unintuitive" is a two-place word, and it makes sense to think like reality, so that true things seem true to you, instead of still seeming false. (For example, when I first learned about Zeno's Paradox, I already knew calculus, and so Zeno's position was the one that seemed confusing and false.)

What I like most about Pearl's article is that it explicitly recognizes the importance of fully dissolving the paradox,³ and seems to do so. Simpson's Paradox isn't an unsolvable problem in statistics; it's a straightforward reversal effect, but only if you use the language of causality.
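To make the "reversal effect" concrete, here is a minimal sketch in Python. The counts are hypothetical, chosen only to exhibit the reversal; they are not taken from Pearl's article, and the names `counts` and `rate` are illustrative.

```python
from fractions import Fraction

# Hypothetical (recovered, total) counts, chosen only to exhibit the reversal.
counts = {
    ("treated", "mild"): (9, 10),
    ("control", "mild"): (80, 100),
    ("treated", "severe"): (30, 100),
    ("control", "severe"): (2, 10),
}

def rate(arm, stratum=None):
    """Recovery rate for one arm, within a stratum or pooled over all strata."""
    cells = [v for (a, s), v in counts.items()
             if a == arm and (stratum is None or s == stratum)]
    recovered = sum(r for r, _ in cells)
    total = sum(n for _, n in cells)
    return Fraction(recovered, total)

for stratum in ("mild", "severe", None):
    label = stratum or "pooled"
    t, c = rate("treated", stratum), rate("control", stratum)
    print(f"{label:>7}: treated {float(t):.3f} vs control {float(c):.3f}")
```

In these made-up numbers the treated arm does better within each stratum and worse once the strata are pooled. The arithmetic alone does not say which comparison to trust; that depends on the causal story behind the data, which is the point of the article.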

1. My review of Causality gives a taste of what it would look like to be familiar with the math, but you'd need to actually read the book to pick it up. The Highly Advanced Epistemology 101 for Beginners sequence is relevant, and contains Eliezer's attempt to explain the basics of causality in Causal Diagrams and Causal Models.

2. Pearl discusses how you would go about using simulations to show that do calculus gives you the right result, but leaves it as an exercise for the reader.

3. How An Algorithm Feels From Inside is probably a better place to start than Dissolving the Question, and I can't help but echo a question from it: "So what kind of math design corresponds to [Simpson's Paradox]?"
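Footnote 2 mentions checking the causal calculus by simulation. One way such a check might look is sketched below. This is not Pearl's own setup: the toy model (a confounder Z that influences both treatment X and outcome Y), its probabilities, and all function names are assumptions made for illustration. It compares the naive observational difference, the back-door adjusted estimate sum_z [P(y|x=1,z) - P(y|x=0,z)] P(z), and the effect measured when X is set by intervention.

```python
import random

def simulate(n, rng, do_x=None):
    """Draw n units from a toy model with Z -> X, Z -> Y and X -> Y.

    If do_x is 0 or 1, X is set by fiat (an intervention) instead of
    being influenced by Z. All probabilities are made up.
    """
    rows = []
    for _ in range(n):
        z = int(rng.random() < 0.5)                      # confounder
        if do_x is None:
            x = int(rng.random() < (0.9 if z else 0.1))  # Z pushes toward X=1
        else:
            x = do_x
        y = int(rng.random() < (0.7 + 0.1 * x - 0.5 * z))
        rows.append((z, x, y))
    return rows

def mean_y(rows, x, z=None):
    """Average outcome among rows with the given x (and z, if specified)."""
    ys = [y for (zz, xx, y) in rows if xx == x and (z is None or zz == z)]
    return sum(ys) / len(ys)

def adjusted_effect(rows):
    """Back-door adjustment over Z, using observational rows only."""
    total = 0.0
    for z in (0, 1):
        p_z = sum(1 for (zz, _, _) in rows if zz == z) / len(rows)
        total += (mean_y(rows, 1, z) - mean_y(rows, 0, z)) * p_z
    return total

rng = random.Random(0)
N = 100_000
observed = simulate(N, rng)

naive = mean_y(observed, 1) - mean_y(observed, 0)
adjusted = adjusted_effect(observed)
do_effect = mean_y(simulate(N, rng, do_x=1), 1) - mean_y(simulate(N, rng, do_x=0), 0)

print(f"naive difference:      {naive:+.3f}")
print(f"back-door adjusted:    {adjusted:+.3f}")
print(f"simulated do(X) effect: {do_effect:+.3f}")
```

By construction the true effect of X on Y in this toy model is +0.1. The naive difference comes out negative because Z drives both X and Y, while the adjusted estimate and the simulated intervention should both land near +0.1.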

See also: bentarm's explanation of Simpson's Paradox.

20 Comments

tgb, 13y ago

I have a question that I can't work out. From Pearl's Causality book (the 2000 version with the excellent commentary in the back), I read on page 356:

Another example involves a controversy called "reverse regression," which occupied the social science literature in the 1970s. Should we, in salary discrimination cases, compare salaries of equally qualified men and women or instead compare qualifications of equally paid men and women?

Remarkably, the two choices led to opposite conclusions. It turned out that men earned a higher salary than equally qualified women and, simultaneously, men were more qualified than equally paid women. The moral is that all conclusions are extremely sensitive to which variables we choose to hold constant when we are comparing, and that is why the adjustment problem is so critical in the analysis of observational studies.

My problem is that I cannot imagine a world in which men earn more than equally qualified women, men are more qualified than equally paid women, and more qualified men (respectively, women) are paid more than less qualified men (respectively, women). There does not appear to be such a set of points in the space (Wages) x (Qualifications) x (Genders) where all of these conditions hold true. Since Pearl asserts the first two, do I have to get rid of the idea that more qualifications lead to more pay? I can't see any other way out of the bind.

(My reasoning for why this appears to be impossible: start with the assumption of the first two conditions (i.e. Pearl's assertions). Consider a man of some qualifications and pay. A woman A who is as qualified as him earns less. A woman B who earns as much as him is more qualified. But the slope of the qualifications-wages line between woman A and woman B goes the wrong way for qualifications to be positively correlated with wages: the less qualified woman earns more! So if this is possible, there's something quite unintuitive going on with the distributions.)

Emile, 13y ago

Let's take a world with 10 people and 4 jobs:

Engineer (high-education, high-pay): 2 men and 1 woman
Teacher (high-education, low-pay): 1 man and 1 woman
Plumber (low-education, high-pay): 1 man and 1 woman
Cleaner (low-education, low-pay): 1 man and 2 women

If you control for education:

50% of uneducated men have high-paying jobs, versus 33% of uneducated women
66% of educated men have high-paying jobs, versus 50% of educated women

... and if you control for pay:

66% of high-salary men are educated, versus 50% of high-salary women
50% of low-salary

... (read more)
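The tallies in this toy world can be checked mechanically. Below is a small Python sketch that encodes the four jobs listed above and recomputes the conditional shares; the helper name `share` and the tuple layout are illustrative choices. The low-pay row it prints is computed from the job table, not taken from the truncated comment.

```python
# (education, pay, men, women) for each job in the comment's toy world.
jobs = [
    ("high", "high", 2, 1),  # Engineer
    ("high", "low", 1, 1),   # Teacher
    ("low", "high", 1, 1),   # Plumber
    ("low", "low", 1, 2),    # Cleaner
]

def share(sex, want, given):
    """Fraction of `sex` having `want` among those having `given`.

    `want` and `given` are (field, level) pairs, with field "edu" or "pay".
    """
    col = 2 if sex == "men" else 3

    def has(job, cond):
        field, level = cond
        return job[0 if field == "edu" else 1] == level

    pool = sum(j[col] for j in jobs if has(j, given))
    hits = sum(j[col] for j in jobs if has(j, given) and has(j, want))
    return hits / pool

# Holding education fixed: who is more often in a high-paying job?
for edu in ("low", "high"):
    m = share("men", ("pay", "high"), ("edu", edu))
    w = share("women", ("pay", "high"), ("edu", edu))
    print(f"{edu}-education, share with high pay: men {m:.0%}, women {w:.0%}")

# Holding pay fixed: who is more often highly educated?
for pay in ("high", "low"):
    m = share("men", ("edu", "high"), ("pay", pay))
    w = share("women", ("edu", "high"), ("pay", pay))
    print(f"{pay}-pay, share with high education: men {m:.0%}, women {w:.0%}")
```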

Vaniver, 13y ago

You have a typo here, I think. Suppose the man has qualification 2 and salary 2. A has qualification 2 and earns 1, but B has salary 2 and qualification 3. The line is positive. If B has salary 2 and qualification 1 (i.e. the man is more qualified, not the woman more qualified) then this matches the description and the line is negative.

roystgnr, 13y ago

I guess the trick is that, if you're using a standard least-squares fitting to find your regressions, the linear fit that you get by minimizing the sum of squared errors in one variable is not the same as the linear fit that you get by minimizing the sum of squared errors in the other variable. So as long as the true data isn't a simple line, but rather a noisy distribution or a nonlinear relation, you can get different pairs of lines depending on which minimization problem you solve. This discussion of the problem is a little less hand-wavy than my above guess, it includes a (visual) example of the paradox and it seems to agree that having noisy data is critical to the problem. Oh, and it seems to have been written by a LessWrong reader. I thought that lingo looked oddly familiar.
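The point about the two least-squares fits can be checked on a tiny example. The sketch below uses five made-up (qualification, salary) points, purely hypothetical, and a hand-written `ols_slope` helper; it fits salary on qualification and qualification on salary, then expresses both lines on the same axes.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line predicting ys from xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Hypothetical noisy data: qualification and salary in arbitrary units.
qualification = [1, 2, 3, 4, 5]
salary = [2, 1, 4, 3, 5]

# Minimize squared error in salary (the usual "forward" regression).
salary_on_qual = ols_slope(qualification, salary)

# Minimize squared error in qualification (the "reverse" regression).
qual_on_salary = ols_slope(salary, qualification)

# Drawn on salary-vs-qualification axes, the reverse fit has slope 1 / qual_on_salary.
reverse_as_salary_slope = 1 / qual_on_salary

print("forward fit, salary per unit qualification:", salary_on_qual)
print("reverse fit on the same axes:              ", reverse_as_salary_slope)
print("product of the two raw slopes (r squared):  ", salary_on_qual * qual_on_salary)
```

The two fitted lines coincide only when the points lie exactly on a line (squared correlation of 1). With any noise the reverse fit is steeper on these axes, so comparisons made "at equal qualification" and "at equal salary" are read off different lines.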
