Simpson's Paradox

bentarm

This is my first attempt at an elementary statistics post, which I hope is suitable for Less Wrong. I am going to present a discussion of a statistical phenomenon known as Simpson's Paradox. This isn't a paradox, and it wasn't actually discovered by Simpson, but that's the name everybody uses for it, so it's the name I'm going to stick with. Along the way, we'll get some very basic practice at calculating conditional probabilities.

A worked example

The example I've chosen is an exercise from a university statistics course that I have taught on for the past few years. It is by far the most interesting exercise in the entire course, and it goes as follows:

You are a doctor in charge of a large hospital, and you have to decide which treatment should be used for a particular disease. You have the following data from last month: there were 390 patients with the disease. Treatment A was given to 160 patients of whom 100 were men and 60 were women; 20 of the men and 40 of the women recovered. Treatment B was given to 230 patients of whom 210 were men and 20 were women; 50 of the men and 15 of the women recovered. Which treatment would you recommend we use for people with the disease in future?

The simplest way to represent this sort of data is to draw a table; we can then pick the relevant numbers out of the table to calculate the required conditional probabilities.

Overall

	A	B
lived	60	65
died	100	165

The probability that a randomly chosen person survived if they were given treatment A is 60/160 = 0.375

The probability that a randomly chosen person survived if they were given treatment B is 65/230 = 0.283

So a randomly chosen person given treatment A was more likely to survive than a randomly chosen person given treatment B. Looks like we'd better give people treatment A.

However, since we were given a breakdown of the data by gender, let's look and see if treatment A is better for both genders, or if it gets all of its advantage from one or the other.

Women

	A	B
lived	40	15
died	20	5

The probability that a randomly chosen woman survived given that they were given treatment A is 40/60 = 0.67

The probability that a randomly chosen woman survived given that they were given treatment B is 15/20 = 0.75

So it looks like treatment B is better for women. Guess that means treatment A must be much better for men, in order to be better overall. Let's take a closer look.

Men

	A	B
lived	20	50
died	80	160

The probability that a randomly chosen man survived given that they were given treatment A is 20/100 = 0.2

The probability that a randomly chosen man survived given that they were given treatment B is 50/210 = 0.238

So a randomly chosen man was more likely to survive if given treatment B than treatment A. What is going on here?

Treatment A, which seemed better in the overall data, was worse for both men and women when considered separately.
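The reversal is easy to check mechanically. The following is a minimal Python sketch, not part of the original exercise, that recomputes each conditional probability straight from the counts given above. The names `counts` and `recovery_rate` are illustrative choices, nothing more.

```python
# (recovered, treated) for each (gender, treatment) cell, taken from the exercise.
counts = {
    ("men", "A"): (20, 100),
    ("men", "B"): (50, 210),
    ("women", "A"): (40, 60),
    ("women", "B"): (15, 20),
}


def recovery_rate(treatment, gender=None):
    """Estimate P(recovered | treatment), optionally also conditioning on gender."""
    cells = [
        cell
        for (g, t), cell in counts.items()
        if t == treatment and (gender is None or g == gender)
    ]
    recovered = sum(r for r, _ in cells)
    treated = sum(n for _, n in cells)
    return recovered / treated


for group in (None, "women", "men"):
    rate_a = recovery_rate("A", group)
    rate_b = recovery_rate("B", group)
    winner = "A" if rate_a > rate_b else "B"
    print(f"{group or 'overall':<8} A={rate_a:.3f}  B={rate_b:.3f}  better: {winner}")
```

Run as written, this reports A as better overall and B as better within each gender, matching the three tables.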

This, in essence, is Simpson's Paradox: partitioning data can result in a reversal of the correlations present in the aggregated data. Why does this happen? Well, essentially for two reasons. Firstly, the treatments were given to different numbers of people: treatment A was used much less often than treatment B in the example data. Secondly (and probably more importantly), the aggregation is hiding a confounding variable. Treatment B was much more likely to be given to men than to women, and men are much less likely than women to survive the disease; this obviously makes treatment B look worse in the aggregated data.
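To make the role of the gender mix concrete, here is a hedged Python sketch. It writes each aggregated rate as a weighted average of the per-gender rates, then re-weights both treatments to the same reference mix (the 310 men and 80 women in the whole data set). That second step is usually called direct standardization; it is shown only as an illustration and it assumes gender is the only confounder, which the rest of the post calls into question. All function names are made up for this example.

```python
# Per-treatment strata from the tables: gender -> (recovered, treated).
strata = {
    "A": {"men": (20, 100), "women": (40, 60)},
    "B": {"men": (50, 210), "women": (15, 20)},
}

# Gender mix of all 390 patients, used as a common reference population.
population = {"men": 100 + 210, "women": 60 + 20}


def share_of_men(treatment):
    """Fraction of the patients on this treatment who were men."""
    men = strata[treatment]["men"][1]
    total = sum(n for _, n in strata[treatment].values())
    return men / total


def crude_rate(treatment):
    """Aggregated rate: per-gender rates weighted by the treatment's own mix."""
    total = sum(n for _, n in strata[treatment].values())
    return sum((r / n) * (n / total) for r, n in strata[treatment].values())


def standardized_rate(treatment):
    """Per-gender rates re-weighted to the same reference mix for both treatments."""
    total = sum(population.values())
    return sum(
        (r / n) * (population[g] / total)
        for g, (r, n) in strata[treatment].items()
    )


for t in ("A", "B"):
    print(
        f"{t}: men={share_of_men(t):.1%} "
        f"crude={crude_rate(t):.3f} standardized={standardized_rate(t):.3f}"
    )
```

The crude rates favour A, but once both treatments are scored against the same mix of men and women, B comes out ahead.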

So, you might think, we've sorted things out. Gender was the missing variable, and we now know that we can safely give everyone treatment B. Well, if I were writing the exercises for the course I teach on, I would have included the following follow-up question.

Yet Another Variable

It turns out that gender wasn't the only data that were collected about the patients. For the men, we also noted whether they had any family history of heart disease. Of the men given treatment A, 80 had a family history of heart disease, and 10 of these survived. Of the men given treatment B, 55 had a family history of heart disease, and 5 of these survived. The data now break down as follows:

History of heart disease

	A	B
lived	10	5
died	70	50

No history of heart disease

	A	B
lived	10	45
died	10	110

This time I will leave the calculations as an exercise to the reader but, as you can see, things have changed again. We can keep playing this game all day.
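For anyone who would rather check the arithmetic by machine, here is a small Python sketch under the same counts. It first confirms that the two new tables add back up to the "Men" table, then compares the treatments within each stratum. The dictionary layout is just one convenient choice.

```python
# Men only: (recovered, treated) by family history of heart disease and treatment.
men = {
    ("history", "A"): (10, 80),
    ("history", "B"): (5, 55),
    ("no history", "A"): (10, 20),
    ("no history", "B"): (45, 155),
}

# Sanity check: the two strata add back up to the "Men" table.
for treatment, expected in (("A", (20, 100)), ("B", (50, 210))):
    recovered = sum(r for (_, t), (r, _) in men.items() if t == treatment)
    treated = sum(n for (_, t), (_, n) in men.items() if t == treatment)
    assert (recovered, treated) == expected

rates = {cell: r / n for cell, (r, n) in men.items()}
for stratum in ("history", "no history"):
    a, b = rates[(stratum, "A")], rates[(stratum, "B")]
    print(f"{stratum:<10} A={a:.3f}  B={b:.3f}  better: {'A' if a > b else 'B'}")
```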

Which data to use?

This leaves us with the important question, which data should we use when making our decisions? Given a randomly chosen person, it looks like treatment A is better than treatment B. But any randomly chosen person is either a man or a woman, and whichever they are, treatment B is better than treatment A. But let's say the randomly chosen person is a man, then we could ask them whether or not they have a family history of heart disease and whichever answer they give, we will prefer to give them treatment A.

It may appear that the partitioned data always give a better answer than the aggregated data. Unfortunately, this just isn't true. I made up the numbers in the previous example five minutes ago in order to reverse the correlation in the original exercise. Similarly, for just about any given set of data, you can find some partition which reverses the apparent correlation. How are we to decide which partitions are useful? If someone tells us that women born under Aries, Leo or Sagittarius do better with treatment A, as do those born under the Earth, Air and Water signs, would we really be willing to switch treatments?

As you might expect, Judea Pearl has an answer to this problem (in chapter 6 of [1]). If we draw the relevant causal networks, we can formally decide which variables are confounding and so which partitions we should use (he quotes a further famous example in which it is shown that you might want to use different versions of the same data depending on how they were acquired!), but that's another post for another time (and probably for someone better acquainted with Pearl than I am). In the meantime we should take Simpson's Paradox as a further warning of the dangers of drawing causal conclusions from data without understanding where the causes come from.

In Real Life

I'll finish with a famous real-life example. In 1975, a study was published [2] which demonstrated that 44% of male applicants for graduate programmes at Berkeley were being accepted, whereas only 35% of female applicants were. This was obviously a pretty serious problem, so the authors decided to have a closer look, to try and see which departments in particular were most guilty of discrimination.

As you'll be expecting by now, what they found was that not only were most of the departments not biased at all, in fact, there were more which were biased in favour of women than there were in favour of men! The confounding variable that was found was that women were applying for more competitive departments than men... of course, as we've seen, it's just possible that something else was hiding in the data.

There are several other real-life examples. You can find a few in the Wikipedia article on Simpson's Paradox. Batting averages are a common toy example. It's possible for one player to have a better average than another every season for his entire career, and a worse average overall. Similar phenomena are not particularly unusual in medical data - treatments which are given to patients with more serious illnesses are always going to look worse in aggregate data. One of my personal favourite examples is that countries which put fluoride in the water have significantly more people who require false teeth than those which don't. As usual, there's a hidden variable lurking.

References:

[1] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press (2000; 2nd edition 2009).

[2] P.J. Bickel, E.A. Hammel and J.W. O'Connell (1975). "Sex Bias in Graduate Admissions: Data From Berkeley". Science 187 (4175): 398–404.

I'd really like to see the follow-up on how to decide which data to actually use. Right now, it's pretty unsatisfactory and I'm left quite confused.

(Unless this was an elaborate plot to get me to read Judea Pearl, whose book I just picked up, in which case, gratz.)

The short of it, having read a few of Pearl's papers and taken a lecture with him, is that you build causal networks including every variable you think of and then use physical assumptions to eliminate some edges from the fully connected (assumption free) graph.

With this partially connected causal graph, Pearl identifies a number of structures which allow you to estimate correlations where all identified confounding variables are corrected for (which can be interpreted as causation under the assumptions of your graph).

Oftentimes, it seems like these methods only serve to show you just how bad a situation "estimating causation" actually is, but it's possible to design experiments (or get lucky, or make strong assumptions) so as to turn them into useful tools.

Clearly one could split a data set using basically any possible variable, but most are obviously wrong. (That is to say, they lack explanatory power, and are actually irrelevant.) To attempt to simplify, then, if you understand a system, or have a good hypothesis, it is frequently easier to pick variables that should be important, and gather further data to confirm.

This is made explicit in removing connections from the graph. The more "obviously" "wrong" connections you sever, the more powerful the graph becomes. This is potentially harmful, though, since like assigning 0 probability weight to some outcome, once you sever a connection you lose the machinery to reason about it. If your "obvious" belief proves incorrect, you've backed yourself into a room with no escape. Therefore, test your assumptions.

This is actually a huge component of Pearl's methods since his belief is that the very mechanism of adding causal reasoning to probability is to include "counterfactual" statements that encode causation into these graphs. Without counterfactuals, you're sunk. With them, you have a whole new set of concerns but are also made more powerful.

It's also really, really important to dispute that "one could split a data set using basically any possible variable". While this is true in principle, Pearl made/confirmed some great discoveries by his causal networks which helped to show that certain sets of conditioning variables will, when selected together, actively mislead you. Moreover, without using counterfactual information encoded in a causal graph, you cannot discover which variables these are.

Finally, I'd just like to suggest that picking a good hypothesis and coming to understand a system are undoubtedly the hardest parts of acquiring knowledge, involving creativity, risk, and some of the most developed probabilistic arguments. Actually making comparisons between competing hypotheses, such that you end up with a good model and know what "should be important", is the tough part, fraught with the possibility of failure.

Wholeheartedly agree, but as I said, I'm not confident that I understand Pearl well enough to be the one to write it.

I was first exposed to this in the context of baseball batting averages. I'll relate that example in case it helps someone.

(For those unfamiliar with baseball, a player's "batting average" is the number of hits the player has made divided by the number of hit attempts, also known as "at-bats." Ruth and Gehrig were stars in the early 20th century. Ruth did enjoy a drink, and Gehrig never missed a game. The rest is made up.)

In 1927, Gehrig (injured but playing every game) and Ruth (on a months-long drinking tear, therefore sitting out lots of games while staggering through the rest) both performed terribly at the plate during the first half of the season: By the All-Star break, Ruth's batting average was a pitiful .190, while Gehrig's was only slightly less anemic at .200.

During the second half of the season, a dried-out Ruth and healthy Gehrig tore up the league, batting a torrid .390 and .400 respectively.

Yet despite the fact that Gehrig's batting average exceeded Ruth's in each half of the season, Ruth's average over the entire season was greater than Gehrig's. How can this be?

Answer: Remember all of those games that a hung-over Ruth sat out during the first half? The result was that far fewer of Ruth's at-bats occurred during the dismal first half of the season than during the torrid second half. As a result, Ruth's overall season average was determined to a greater extent by his second half performance. Gehrig's season average, by contrast, was midway between his averages for the two halves. Here are the numbers:

Ruth: 19 hits / 100 at-bats = .190 in 1st half, 78/200 = .390 in 2nd half, 97/300 = .323 overall

Gehrig: 40/200 = .200 in 1st half, 80/200 = .400 in 2nd half, 120/300 = .300 overall
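The same bookkeeping can be sketched in a few lines of Python. This is only an illustration using the made-up hits and at-bats above; note that Gehrig's two halves add up to 400 at-bats, so his season figure works out as 120/400 = .300.

```python
# (hits, at-bats) per half-season, from the made-up example above.
ruth = {"first half": (19, 100), "second half": (78, 200)}
gehrig = {"first half": (40, 200), "second half": (80, 200)}


def average(hits, at_bats):
    """Batting average: hits divided by at-bats."""
    return hits / at_bats


def season(player):
    """Pool hits and at-bats over both halves before dividing."""
    hits = sum(h for h, _ in player.values())
    at_bats = sum(ab for _, ab in player.values())
    return hits, at_bats


for half in ruth:
    print(f"{half}: Ruth {average(*ruth[half]):.3f}, Gehrig {average(*gehrig[half]):.3f}")

print(f"season: Ruth {average(*season(ruth)):.3f}, Gehrig {average(*season(gehrig)):.3f}")
```

Gehrig leads in each half, yet Ruth leads over the season, because two thirds of Ruth's at-bats fall in the strong second half while Gehrig's are split evenly.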

I really appreciate the post. I wasn’t consciously aware of the phenomenon, and had no idea of its name. Still, I feel I didn’t quite get it after reading. I think I expected a “what to do about it” section that was missing. But then I got to Wikipedia, and the very simple picture at the top of the page completely dissolved the confusion in a way the numeric examples didn’t.

I seem to have a very visual mechanism for understanding stuff (which is kind of strange because my visual memory sucks). I guess the point is that visuals are very helpful for promoting understanding in at least some people. Which isn’t news, but it’s probably useful to be reminded of it.

I myself tend to rely too much on visual aids, forgetting that people who are not me often don’t interpret them as easily as I do.

I just realized that Eliezer doesn’t use visuals in the sense of graphics in his posts, but I almost always “get” them very quickly. I think his “story-examples” do the job. (When reading a story, I tend to visualize it, sort of “how would this look in a movie”, which might be why they work for me.)

Hmmm. This makes me think of something.

You know that example that Eliezer gives in the Fun Theory sequence, about how solving a Rubik's cube will be fun a few times, and then you might move on to solving the general formula for a Rubik's cube of nxnxn... and once you've solved that formula, then solving a specific Rubik's cube will be boring.

Perhaps learning follows a similar pattern, in that retention is improved by first learning a specific solution to a specific problem, and then finding the general solution to the problem set.

You know that example that Eliezer gives in the Fun Theory sequence, about how solving a Rubik's cube will be fun a few times, and then you might move on to solving the general formula for a Rubik's cube of nxnxn... and once you've solved that formula, then solving a specific Rubik's cube will be boring.

Although of course actual observation of humans seems to disagree. People move on to practising for speed, competing and solving the cube blindfolded after making a brief glance.

Sure, but you're getting a different, mostly unrelated kind of fun out of it. Solving a Rubik's cube is a challenge in puzzle-solving and a little math; speed-solving and blind-solving are challenges in manual dexterity and spatial memorisation. In many ways you're playing two different games, just using the same tool.

It's like winning at Civilization versus recreating as accurate a copy of a given historical empire as possible.

It seems learning follows the pattern more strongly than Rubik's cube-solving does. People (generally) don't practice the same solution to a problem over and over again to get faster at it; they tend to learn more general methods that include the specific problem. The idea is only nebulous; I need to think it over more.

It seems learning follows the pattern more strongly than Rubik's cube-solving does.

Definitely. And when it comes to the Rubik's cube I personally tackled it as a learning problem more than a practical skill - so closer to how Eliezer used it in the example. I learned how to solve the cube in general then moved on. I saved my competitive skill acquisition for martial arts and laser tag. :)

Technical-meta: For future-proofing and accessibility, please replace the images of tables in your article with HTML tables.

I'd love to, and would have done this originally had I been able to figure out how - how do I alter the HTML in the post manually? (Or, alternatively, how do I tell the editor to put a table in for me)

There is a button in the formatting toolbar labeled "HTML".

thanks, can't believe I missed that... fixed

Great post. Simply introduces an important phenomenon to be aware of when reasoning about things in an easily understandable way. I'd love to see more articles like this.

Good post, thanks. One comment:

It may appear that the partitioned data always give a better answer than the segregated data. Unfortunately, this just isn't true.

First, I assume you mean "aggregated", otherwise this statement doesn't make sense.

Second, I don't believe you. I say it's always smarter to use the partitioned data than the aggregate data. If you have a data set that includes the gender of the subject, you're always better off building two models (one for each gender) instead of one big model. Why throw away information?

There is a nugget of truth to your claim, which is that sometimes the partitioning strategy becomes impractical. To see why, consider what happens when you first partition on gender, then on history of heart disease. The number of partitions jumps from two to four, meaning there are fewer data samples in each partition. When you add a couple more variables, you will have more partitions than data samples, meaning that most partitions will be empty.
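How quickly the partitions run out of data is simple to tabulate. The sketch below is a rough illustration, assuming every variable is a yes/no split and using the 390 patients from the post as the sample size; variables with more levels would exhaust the data faster. The function name is invented for the example.

```python
def cells_after_partitioning(num_binary_variables):
    """Number of partitions when splitting on every one of k yes/no variables."""
    return 2 ** num_binary_variables


patients = 390  # size of the data set in the post's example

for k in range(1, 11):
    cells = cells_after_partitioning(k)
    print(
        f"{k:2d} variables -> {cells:4d} partitions, "
        f"{patients / cells:7.2f} patients per partition on average"
    )

# Smallest number of binary variables for which partitions outnumber patients.
first_sparse = next(
    k for k in range(1, 64) if cells_after_partitioning(k) > patients
)
print(f"partitions outnumber patients from {first_sparse} variables onwards")
```

Well before the partitions outnumber the patients, the average partition already holds too few people to estimate a recovery rate with any confidence.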

So you don't always want to do as much partitioning as you plausibly could. Instead, you want to figure out how to combine single partition statistics corresponding to each condition (gender, history, etc.) into one large predictive model. This can be attacked with techniques like AdaBoost or MaxEnt.

Second, I don't believe you. I say it's always smarter to use the partitioned data than the aggregate data. If you have a data set that includes the gender of the subject, you're always better off building two models (one for each gender) instead of one big model. Why throw away information?

Because, as Von Neumann was supposed to have said, "with four parameters I can fit an elephant, and with five I can make him wiggle his trunk." Unless your data is good enough to support the existence of the other factors, or you have other data available that does so, a model you fit to the lowest-level data is likely to capture more noise than reality.

Right, so the challenge is to incorporate as much auxiliary information as possible without overfitting. That's what AdaBoost does - if you run it for T rounds, the complexity of the model you get is linear in T, not exponential as you would get from fitting the model to the finest partitions.

This is in general one of the advantages of Bayesian statistics in that you can split the line between aggregate and separated data with techniques that automatically include partial pooling and information sharing between various levels of the analysis. (See pretty much anything written by Andrew Gelman, but Bayesian Data Analysis is a great book to cover Gelman's whole perspective.)

The OP's assertion is true. Stratifying on certain variables can introduce bias.

Consider that you have a cohort of initially healthy men, and you are trying to quantify the causal relationship between an exposure (e.g. eating hamburgers) and an outcome (e.g. death). You have also measured a third variable, which is angina pectoris (cardiovascular disease).

Assume that the true underlying causal structure, which you are unaware of, is that hamburgers cause cardiovascular disease, which subsequently causes death.

Now look at what happens if you stratify on cardiovascular disease: In the stratum consisting of men who don’t have cardiovascular disease, you will find no cases of death. This will lead you to conclude that in men who don’t have cardiovascular disease, eating hamburgers does not cause death. This is false, as eating hamburgers will cause them to develop cardiovascular disease and then die.

What you have done in this situation, is stratify on a mediator, thereby “blocking” the pathway running through it. There are also many other situations in which adjusting for a variable introduces bias, but it gets more complicated from here.
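A toy cohort makes the blocking effect visible. The counts below are hypothetical, invented purely for illustration, and are built so that death happens only by way of cardiovascular disease, as in the scenario just described. This is a sketch of the idea, not an analysis of any real data.

```python
# Hypothetical cohort, invented for illustration: number of men by
# (eats hamburgers, develops cardiovascular disease, dies).
# Death occurs only via cardiovascular disease, as in the scenario above.
cohort = {
    (True, True, True): 60,
    (True, False, False): 40,
    (False, True, True): 10,
    (False, False, False): 90,
}


def death_rate(eats_hamburgers, has_cvd=None):
    """P(death | hamburgers), optionally within one cardiovascular stratum."""
    selected = {
        key: n
        for key, n in cohort.items()
        if key[0] == eats_hamburgers and (has_cvd is None or key[1] == has_cvd)
    }
    total = sum(selected.values())
    deaths = sum(n for (_, _, died), n in selected.items() if died)
    return deaths / total


print("unstratified:", death_rate(True), "vs", death_rate(False))
for cvd in (False, True):
    print(f"CVD={cvd}:", death_rate(True, cvd), "vs", death_rate(False, cvd))
```

Unstratified, hamburger eaters die far more often. Within either stratum the two groups have identical death rates, so the entire effect disappears once the mediator is held fixed.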

For further information on this I suggest reading an upcoming book called “Causal Inference”, by James Robins and Miguel Hernan, who taught me this material. The first ten chapters are available for free online at http://www.hsph.harvard.edu/faculty/miguel-hernan/files/hernanrobins_v1.10.9.pdf .

Second, I don't believe you. I say it's always smarter to use the partitioned data than the aggregate data. If you have a data set that includes the gender of the subject, you're always better off building two models (one for each gender) instead of one big model. Why throw away information?

If you believe the OP's assertion

Similarly, for just about any given set of data, you can find some partition which reverses the apparent correlation

then it is demonstrably false that your strategy always improves matters. Why do you believe that your strategy is better?

Partitioning may reverse the correlation or it may not; either way, it provides a more accurate model.

Let's do this formally. Let R, G, and T be the three variables of interest in the OP's example, corresponding to Recovery, Gender, and Treatment. Then the goal is to obtain a model of the probability of R, given T and maybe G. My assertion is that a model of the form P(R|G,T) is always going to be more accurate than a model of the form P(R|T) alone - you can't gain anything by throwing away the G variable. The accuracy can be measured in terms of the log-likelihood of the data given the model.

It is actually tautologically true that P(R|G,T) will provide a higher log-likelihood than P(R|T). The issue raised by RobinZ is that P(R|G,T) might overfit the data: the accuracy improvement achieved by including G might not justify the increase in model complexity. That will certainly happen if naive modeling methods are used, but there are ways to incorporate multiple information sources without overfitting.
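The in-sample claim can be checked on the post's own numbers. The sketch below fits each model by maximum likelihood, with one recovery rate per group, and compares the log-likelihoods. It is an illustration only: it says nothing about held-out data, so it does not address the overfitting concern, and the function names are invented for the example.

```python
import math

# (recovered, treated) for each (gender, treatment) cell from the post's example.
counts = {
    ("men", "A"): (20, 100),
    ("men", "B"): (50, 210),
    ("women", "A"): (40, 60),
    ("women", "B"): (15, 20),
}


def bernoulli_log_likelihood(recovered, treated, p):
    """Log-likelihood of `recovered` successes in `treated` trials at rate p."""
    return recovered * math.log(p) + (treated - recovered) * math.log(1 - p)


def log_likelihood(group_key):
    """Fit one recovery rate per group by maximum likelihood, then score every
    cell of the data under the rate fitted to its group."""
    totals = {}
    for cell, (recovered, treated) in counts.items():
        key = group_key(cell)
        r0, n0 = totals.get(key, (0, 0))
        totals[key] = (r0 + recovered, n0 + treated)

    total = 0.0
    for cell, (recovered, treated) in counts.items():
        group_recovered, group_treated = totals[group_key(cell)]
        rate = group_recovered / group_treated
        total += bernoulli_log_likelihood(recovered, treated, rate)
    return total


ll_treatment_only = log_likelihood(lambda cell: cell[1])     # P(R|T)
ll_gender_and_treatment = log_likelihood(lambda cell: cell)  # P(R|G,T)
print(f"P(R|T):   {ll_treatment_only:.2f}")
print(f"P(R|G,T): {ll_gender_and_treatment:.2f}")
```

The finer model can never score lower on the data it was fitted to, since the coarser model is the special case in which both genders share a rate.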

Partitioning may reverse the correlation or it may not; either way, it provides a more accurate model.

Usually. But, partitioning reduces the number of samples within each partition, and can thus increase the effects of chance. This is even worse if you have a lot of variables floating around that you can partition against. At some point it becomes easy to choose a partition that purely by coincidence is apparently very predictive on this data set, but that actually has no causal role.

The issue raised by RobinZ is that P(R|G,T) might overfit the data: the accuracy improvement achieved by including G might not justify the increase in model complexity.

Exactly.

My assertion is that a model of the form P(R|G,T) is always going to be more accurate than a model of the form P(R|T) alone - you can't gain anything by throwing away the G variable.

That's all true (modulo the objection about overfitting). However, there is the case where T affects G which in turn affects R. (Presumably this doesn't apply when T = treatment and G = gender). If what we're interested in is the effect of T on R (irrespective of which other variables 'transmit' the causal influence) then conditioning on G may obscure the pattern we're trying to detect.

(Apologies for not writing the above paragraph using rigorous language, but hopefully the point is obvious enough.)

Let's say the only data we'd collected were gender and whether or not the patient's birthday was a Tuesday. Do you really think there is something to be gained from building four separate models now?

More seriously, if you collect enough information, then purely by chance there will be some partitioning of the data which gives the wrong conclusion.

I don't think we disagree on anything important here - the main point is that you need to be careful when choosing which partitions of the data you use - arbitrarily partitioning along every available divide is not optimal.

PS - thanks for the typo correction, I really need to learn to proofread...

Don't partition on things that are caused by the factor you want to analyze.

I can see this blog is going to be a very fine addition to my reading material. When I've absorbed all of this data, I will take over the world but for now I silently read and plot. {Wringing hands with glee and chuckling quietly.} That reminds me. I need to get a good night's sleep one of these days. Anyways, excellent EXCELLENT blog. Kudos!

When I've absorbed all of this data, I will take over the world

Take a number and get in line.

This isn't a paradox, and it wasn't actually discovered by Simpson

Blasphemy!

Nice. I'd never heard of Simpson's paradox before, and I don't think this sort of statistical-interpretation glitch would have occurred to me as a possibility to stay aware of in real-world cases (e.g., the Berkeley study), despite understanding the underlying math, although it's obvious once pointed out.

New tool in the toolbox: yay! Thank you.

This post is very similar to a comment I made back in November. I love the concept and have been noticing it a lot. Other examples of "paradoxes" that may be interesting to LW: Braess's, Sen's.

Isn't Sen's paradox essentially the same as the Prisoners' dilemma?

Thanks for the links cousin it! Great reads.

Re: prase's reply: The Prisoners' dilemma is a legitimate dilemma. No matter how many times I read the page on Sen's paradox I can't interpret it as anything remotely sensical.

I kept editing this post again and again as I boiled down the problem (it's harder to explain something wrong than something correct), and I think I've got it down to one sentence:

If you just look at sorted lists of preferences without any comparative weights of given preferences, you're going to get paradoxes. Nash equilibrium exists because of weights. Sen's paradox does not exist because it precludes weights. If Bob just wants Alice to see a movie and doesn't much care about his own viewing of the film either way, and Alice just wants to spend time with Bob and doesn't much care if it's by watching a movie or not, then there's no paradox until a social scientist walks in the room.

This is a great explanation of something I found very confusing until now, thanks for the post. However, "The example I've chosen an exercise from a university statistics course" crashed my language parser for a few seconds.

Randomization of test subjects...

I've had this in the back of my mind for the past week, and finally put my finger on how this problem is solved in most experimental sciences. Sorry if I've overlooked part of the discussion, but the typical solution to this problem is to randomly assign subjects to the two groups in the experiment. That way, it is very unlikely that some underlying variable will distort the outcome in the way that the sex of the subjects did in the above example, where the women were concentrated in the A group and men in the B group.

Of course, you can't always randomize assignment to the control and treatment group, but you could in the example given (testing a medical intervention).
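A quick simulation shows what random assignment buys. This is a hedged sketch: it takes the 310 men and 80 women from the post's example, shuffles them with an arbitrary fixed seed, and splits them into two equal arms, which is only one of several ways randomisation could be done.

```python
import random

# The 390 patients from the post's example, labelled only by gender.
patients = ["man"] * 310 + ["woman"] * 80

# Gap in the share of women between arms, as the treatments were actually given.
observed_gap = abs(60 / 160 - 20 / 230)

rng = random.Random(0)  # fixed seed (arbitrary) so the sketch is repeatable
shuffled = patients[:]
rng.shuffle(shuffled)

# Randomised assignment: the first half gets A, the second half gets B.
half = len(shuffled) // 2
arm_a, arm_b = shuffled[:half], shuffled[half:]


def share_of_women(arm):
    """Fraction of the patients in this arm who are women."""
    return arm.count("woman") / len(arm)


randomised_gap = abs(share_of_women(arm_a) - share_of_women(arm_b))
print(f"gap in share of women, as observed: {observed_gap:.3f}")
print(f"gap in share of women, randomised:  {randomised_gap:.3f}")
```

Randomisation does not guarantee identical arms, but it makes a gender imbalance as large as the one in the original data very unlikely, and the same holds for variables nobody thought to record.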

Something is wrong with the numbers here:

The probability that a randomly chosen man surived given that they were given treatment A is 40/100 = 0.2

thanks for pointing this out. Fixed.

Check the chart- the 40 is a typo. It should be 20/100=0.2

Oh, right. Somehow I was expecting it to be 40 and 0.4. Now it makes sense.

How are we to decide which partitions are useful? If someone tells us that women born under Aries, Leo or Sagittarius do better with treatment A, as do those born under the Earth, Air and Water signs, would we really be willing to switch treatments?

Assume I have not heard of Simpson's Paradox, have no more time to research and must make a decision now.

Am I justified in not switching treatments using the reasoning that I don't want Astrology to have any substance to it, and it must not be allowed to have any, so I'm going to wishful think this data away and therefore ignore the evidence I have (as I understand it)?

Or am I more rational to say "I will accept the evidence I have as far as I understand it and switch treatments, even though I expect there is something else going on which is nothing to do with Astrology, but I have no time to find out what that is"?

The second has a better thought process but leads to a worse conclusion due to lack of understanding or lack of information, but the first one is based on a worse thought process which could lead to much worse outcomes in future if it is kept up.

Well, an important question to ask is how the data were generated. If the only thing we know about each patient is whether they were male or female and whether they were born under a Fire sign, and being born under a Fire sign seems to have some explanatory power, then by all means go for it. As Dave suggests below - it is perfectly possible that the astrological data is hiding some genuine phenomenon.

However, if someone collected every possible piece of astrological data, and tried splitting the patients along every one of the 2^11 possible partitions of the twelve star signs, you would not be surprised to find that at least one of them displayed this sort of behaviour.
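The count of two-way splits can be confirmed by brute force. The sketch below enumerates them in Python; it counts the trivial split (all twelve signs in one group) as well, which is what makes the total exactly 2^11.

```python
from itertools import combinations

signs = [
    "Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo",
    "Libra", "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces",
]

# A two-way split is determined by the group containing the first sign, so
# fix that sign and choose any subset of the remaining eleven to join it.
first, rest = signs[0], signs[1:]
splits = [
    (first,) + extra
    for size in range(len(rest) + 1)
    for extra in combinations(rest, size)
]

trivial = sum(1 for group in splits if len(group) == len(signs))
print(f"{len(splits)} two-way splits, of which {trivial} is trivial")
```

Each of those splits is a separate chance for a coincidental pattern to turn up, which is why trying them all and reporting the one that "works" is so unconvincing.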

I think the key message is that you shouldn't be making causal inferences from correlational conclusions unless you have some good reason to do so.

The urge to infer causation from correlation must be powerful. We can easily spot errors of unwarranted causal inferences, apparently from overtraining the recognition of certain patterns, but as soon as the same caveat is expressed in a novel way, we have to work to apply the principle to novelties of form. Simpson's Paradox seems not just the bearer of the message that you shouldn't make automatic causal inferences from mere correlation; it is an explanation of why that inference is invalid. A blind correlation 1) doesn't screen out confounds, and 2) might screen out the causal factor.

It seems that we've learned part 1 well, but the complete explanation for the possibility that correlations hide causes includes part 2. It seems part 2 is harder. While we've all learned to spot instances of part 1, we still founder on part 2. We're inclined to think partitioning the data can't make the situation epistemically worse, but it can by screening out the wrong variable, that is, the causal variable.

So in the real-life example, we don't find it so counter-intuitive that data about the success rates of men and women fail to prove discrimination when you don't control for the confounds. But we do stumble when it goes the other way. If we had data showing that women do better than men for the competitive positions as well as the easy positions, we would still find it hard to see that this doesn't prove that women do better than men overall.

I think the key message is that you shouldn't be making causal inferences from correlational conclusions

I've sorted out what I was thinking a bit more. I was not saying "am I justified in believing that the alignment of stars and planets is the cause here", what I was saying is:

If someone tells us that women born under Aries, Leo or Sagittarius do better with treatment A, as do those born under the Earth, Air and Water signs, would we really be willing to switch treatments?

Yes we should be willing to act in a way that appears to support astrology - this paragraph is supporting wisdom as the opposite of stupidity, or decision making by fear of public embarrassment.

It might even lead to worse outcomes in the current case, if it turns out that the reason Water signs do better with treatment A in this data set is that the assignment of subjects to treatments in the study was in some way related to their date of birth.

If I have good reasons to believe that factor X doesn't cause events of class Y, and I have data that seems to demonstrate that factor X is causing an event of class Y in one particular case, and I don't have the time to explore that data further, I ought to take seriously the theory that the causation is not what it seems to be.

I found this to be a really helpful visualization tool, that lets you set the parameters and see the effects: http://vudlab.com/simpsons/

1) The total number of Gehrig's at-bats for the season in my previous comment should have been 400, not 300 (and, sorry for the duplication)

2) I haven't seen many attempts to actually answer the question in the posting. I'll stick out my neck, after making a couple of simplifying assumptions: ASSUMING that the mortality frequencies are reflective of the true underlying probabilities, and that the assignment of treatments to patients was otherwise random, I'd use treatment A on men regardless of whether they have a history of heart disease, and treatment B on women. (In the real world, where these assumptions don't necessarily hold, I'd have to think a lot harder about the unreliability of the smaller sample sizes, and of course I'd try to find out all I could about further confounding factors, the rules that were used for treatment selection, potential mechanisms underlying the gender and history effects, etc.) Critical comments invited.

Critical comments invited.

My only criticism is that you never really explain why you would do as you say. Perhaps it is as Daniel Burfoot said here, that for each person you are simply using the best information that you have?

Does anyone know an example where the result is intuitive to most people? The sort of thing where you tell someone the Monty Hall problem with a million doors before they think about the case with 3.

There is an example above with mortality for people with/without heart disease based on eating hamburgers; if you partition by having heart disease, you falsely would infer that in some of the population, hamburgers aren't bad for you. Obviously, the real issue is that actually the hamburgers are what initially caused the heart disease.

I am sure there are better examples, but this one will suffice. If anyone gives it some thought and comes up with a better one, that would be great.

I would guess it would be something with an obvious confounding variable that you've forgotten to factor out. e.g. the "fluoride causes false teeth" example is pretty good for that, but is maybe not quite intuitive enough for what you're looking for.

Don't partition on things that are caused by the factor you want to analyze.