This is my first attempt at an elementary statistics post, which I hope is suitable for Less Wrong. I am going to present a discussion of a statistical phenomenon known as Simpson's Paradox. This isn't a paradox, and it wasn't actually discovered by Simpson, but that's the name everybody uses for it, so it's the name I'm going to stick with. Along the way, we'll get some very basic practice at calculating conditional probabilities.
A worked example
The example I've chosen is an exercise from a university statistics course that I have taught on for the past few years. It is by far the most interesting exercise in the entire course, and it goes as follows:
You are a doctor in charge of a large hospital, and you have to decide which treatment should be used for a particular disease. You have the following data from last month: there were 390 patients with the disease. Treatment A was given to 160 patients of whom 100 were men and 60 were women; 20 of the men and 40 of the women recovered. Treatment B was given to 230 patients of whom 210 were men and 20 were women; 50 of the men and 15 of the women recovered. Which treatment would you recommend we use for people with the disease in future?
The simplest way to represent this sort of data is to draw a table; we can then pick the relevant numbers out of the table to calculate the required conditional probabilities.
Overall
|       | A   | B   |
|-------|-----|-----|
| lived | 60  | 65  |
| died  | 100 | 165 |
The probability that a randomly chosen person survived, given that they received treatment A, is 60/160 = 0.375.
The probability that a randomly chosen person survived, given that they received treatment B, is 65/230 ≈ 0.283.
So a randomly chosen person given treatment A was more likely to survive than a randomly chosen person given treatment B. Looks like we'd better give people treatment A.
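If you'd rather let a computer do the arithmetic, here is a minimal Python sketch; the counts are just the ones from the table above:

```python
# Overall survival rates, using the counts from the table above.
lived = {"A": 60, "B": 65}
total = {"A": 160, "B": 230}

for treatment in ("A", "B"):
    rate = lived[treatment] / total[treatment]
    print(f"P(survived | treatment {treatment}) = {rate:.3f}")
# P(survived | treatment A) = 0.375
# P(survived | treatment B) = 0.283
```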
However, since we were given a breakdown of the data by gender, let's see whether treatment A is better for both genders, or whether it gets all of its advantage from one or the other.
Women
|       | A  | B  |
|-------|----|----|
| lived | 40 | 15 |
| died  | 20 | 5  |
The probability that a randomly chosen woman survived, given that she received treatment A, is 40/60 ≈ 0.67.
The probability that a randomly chosen woman survived, given that she received treatment B, is 15/20 = 0.75.
So it looks like treatment B is better for women. I guess that means treatment A must be much better for men, in order to be better overall. Let's take a closer look.
Men
|       | A  | B   |
|-------|----|-----|
| lived | 20 | 50  |
| died  | 80 | 160 |
The probability that a randomly chosen man survived, given that he received treatment A, is 20/100 = 0.2.
The probability that a randomly chosen man survived, given that he received treatment B, is 50/210 ≈ 0.238.
So a randomly chosen man was more likely to survive if given treatment B than treatment A. What is going on here?
Treatment A, which seemed better in the overall data, was worse for both men and women when considered separately.
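To see the whole reversal at a glance, here is a short Python sketch that recomputes all six rates from the tables above:

```python
# (lived, total) counts for each treatment, in each partition of the same data.
data = {
    "overall": {"A": (60, 160), "B": (65, 230)},
    "women":   {"A": (40, 60),  "B": (15, 20)},
    "men":     {"A": (20, 100), "B": (50, 210)},
}

for group, counts in data.items():
    rate_a = counts["A"][0] / counts["A"][1]
    rate_b = counts["B"][0] / counts["B"][1]
    better = "A" if rate_a > rate_b else "B"
    print(f"{group:7}: A = {rate_a:.3f}, B = {rate_b:.3f} -> {better} looks better")
# overall: A = 0.375, B = 0.283 -> A looks better
# women  : A = 0.667, B = 0.750 -> B looks better
# men    : A = 0.200, B = 0.238 -> B looks better
```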
This, in essence, is Simpson's Paradox: partitioning data can reverse the correlations present in the aggregated data. Why does this happen? Essentially for two reasons. First, the treatments were given to different numbers of people - treatment A was used much less often than treatment B in the example data. Second (and probably more importantly), the aggregation hides a confounding variable. Treatment B was much more likely to be given to men than to women, and men are much less likely than women to survive the disease; this obviously makes treatment B look worse in the aggregated data.
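To see the arithmetic behind this, note that each treatment's overall survival rate is just a weighted average of its per-gender rates, with the weights given by that treatment's gender mix:

P(survived | A) = (20/100)·(100/160) + (40/60)·(60/160) = 60/160 = 0.375
P(survived | B) = (50/210)·(210/230) + (15/20)·(20/230) = 65/230 ≈ 0.283

Treatment B's average is dragged down by the men's low survival rate, since 210 of its 230 patients were men, while treatment A's average is propped up by the women, who made up 60 of its 160 patients.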
So, you might think, we've sorted things out. Gender was the missing variable, and we now know that we can safely give everyone treatment B. Well, if I were writing the exercises for the course I teach on, I would have included the following follow-up question.
Yet Another Variable
It turns out that gender wasn't the only data collected about the patients. For the men, we also noted whether they had any family history of heart disease. Of the men given treatment A, 80 had a family history of heart disease; 10 of these survived. Of the men given treatment B, 55 had a family history of heart disease; 5 of these survived. The data now break down as follows:
History of heart disease
|       | A  | B  |
|-------|----|----|
| lived | 10 | 5  |
| died  | 70 | 50 |
No history of heart disease
|       | A  | B   |
|-------|----|-----|
| lived | 10 | 45  |
| died  | 10 | 110 |
This time I will leave the calculations as an exercise for the reader (the snippet below will check your answers) but, as you can see, things have changed again. We can keep playing this game all day.
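Here is a minimal Python sketch with the counts from the two tables above:

```python
# (lived, total) counts for the men, split by family history of heart disease.
strata = {
    "history":    {"A": (10, 80), "B": (5, 55)},
    "no history": {"A": (10, 20), "B": (45, 155)},
}

for stratum, counts in strata.items():
    rate_a = counts["A"][0] / counts["A"][1]
    rate_b = counts["B"][0] / counts["B"][1]
    print(f"{stratum:10}: A = {rate_a:.3f}, B = {rate_b:.3f}")
# history   : A = 0.125, B = 0.091
# no history: A = 0.500, B = 0.290
```

Treatment A comes out ahead in both strata, even though treatment B was ahead for the men as a whole.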
Which data to use?
This leaves us with the important question: which data should we use when making our decisions? Given a randomly chosen person, it looks like treatment A is better than treatment B. But any randomly chosen person is either a man or a woman, and whichever they are, treatment B is better than treatment A. And if the randomly chosen person is a man, we could ask him whether he has a family history of heart disease, and whichever answer he gives, we will prefer to give him treatment A.
It may appear that the partitioned data always give a better answer than the aggregated data. Unfortunately, this just isn't true. I made up the numbers in the previous example five minutes ago in order to reverse the correlation in the original exercise. Similarly, for just about any given set of data, you can find some partition which reverses the apparent correlation. How are we to decide which partitions are useful? If someone tells us that women born under Aries, Leo or Sagittarius do better with treatment A, as do those born under the Earth, Air and Water signs, would we really be willing to switch treatments?
As you might expect, Judea Pearl has an answer to this problem (in chapter 6 of [1]). If we draw the relevant causal networks, we can formally decide which variables are confounders, and so which partitions we should use (he quotes a further famous example in which it is shown that you might want to use different versions of the same data depending on how they were acquired!), but that's another post for another time (and probably for someone better acquainted with Pearl than I am). In the meantime, we should take Simpson's Paradox as a further warning of the dangers of drawing causal conclusions from data without understanding where the causes come from.
In Real Life
I'll finish with a famous real-life example. A study published in 1975 [2] showed that 44% of male applicants to graduate programmes at Berkeley were being accepted, whereas only 35% of female applicants were. This looked like a pretty serious problem, so the authors decided to take a closer look, to try to see which departments in particular were most guilty of discrimination.
As you'll be expecting by now, what they found was that most of the departments were not biased at all; in fact, more departments were biased in favour of women than in favour of men! The confounding variable they found was that women were applying to more competitive departments than men... of course, as we've seen, it's just possible that something else was hiding in the data.
There are several other real-life examples; you can find a few in the Wikipedia article on Simpson's Paradox. Batting averages are a common toy example: it's possible for one player to have a better average than another in every season of his career, and yet a worse average overall. Similar phenomena are not particularly unusual in medical data - treatments which are given to patients with more serious illnesses are always going to look worse in aggregate data. One of my personal favourite examples is that countries which put fluoride in the water have significantly more people who require false teeth than countries which don't. As usual, there's a hidden variable lurking.
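To make the batting-average example concrete, here is a sketch with made-up numbers (the hits and at-bats are invented purely to produce the reversal; real examples with the same shape are in the Wikipedia article):

```python
# Invented (hits, at_bats) for two players across two seasons.
seasons = {
    "season 1": {"player 1": (4, 10),   "player 2": (35, 100)},
    "season 2": {"player 1": (25, 100), "player 2": (2, 10)},
}

career = {"player 1": [0, 0], "player 2": [0, 0]}
for season, stats in seasons.items():
    for player, (hits, at_bats) in stats.items():
        career[player][0] += hits
        career[player][1] += at_bats
        print(f"{season}, {player}: {hits / at_bats:.3f}")

for player, (hits, at_bats) in career.items():
    print(f"career,   {player}: {hits / at_bats:.3f}")
# player 1 bats better in both seasons (.400 vs .350, .250 vs .200),
# but worse over his career (.264 vs .336) - the same trick as the treatments:
# his best season came in far fewer at-bats.
```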
References:
[1] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000 (2nd edition 2009).
[2] P. J. Bickel, E. A. Hammel and J. W. O'Connell (1975). "Sex Bias in Graduate Admissions: Data from Berkeley". Science 187 (4175): 398–404.
Good post, thanks. One comment:
First, I assume you mean "aggregated", otherwise this statement doesn't make sense.
Second, I don't believe you. I say it's always smarter to use the partitioned data than the aggregate data. If you have a data set that includes the gender of the subject, you're always better off building two models (one for each gender) instead of one big model. Why throw away information?
There is a nugget of truth to your claim, which is that sometimes the partitioning strategy becomes impractical. To see why, consider what happens when you first partition on gender, then on history of heart disease. The number of partitions jumps from two to four, meaning there are fewer data samples in each partition. When you add a couple more variables, you will have more partitions than data samples, meaning that most partitions will be empty.
So you don't always want to do as much partitioning as you plausibly could. Instead, you want to figure out how to combine single-partition statistics corresponding to each condition (gender, history, etc.) into one large predictive model. This can be attacked with techniques like AdaBoost or MaxEnt.
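The combinatorial point above is easy to see with a back-of-the-envelope sketch (assuming, for illustration, that every extra variable is binary):

```python
# Number of cells when stratifying on k binary variables,
# compared against the 390 patients in the worked example.
n_patients = 390
for k in range(1, 10):
    cells = 2 ** k
    print(f"{k} variables -> {cells:3} cells, "
          f"~{n_patients / cells:.1f} patients per cell")
# By k = 9 there are 512 cells - more cells than patients - so most
# cells are empty and the per-cell rates are meaningless.
```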
The OP's assertion is true. Stratifying on certain variables can introduce bias.
Consider that you have a cohort of initially healthy men, and you are trying to quantify the causal relationship between an exposure (e.g. eating hamburgers) and an outcome (e.g. death). You have also measured a third variable, which is angina pectoris (cardiovascular disease).
Assume that the true underlying causal structure, which you are unaware of, is that hamburgers cause cardiovascular disease, which subsequently causes death.
Now look at what happens if you stratify on cardi...