bentarm comments on Simpson's Paradox - Less Wrong

68 Post author: bentarm 12 January 2011 11:01PM

Comments (58)

Comment author: Daniel_Burfoot 13 January 2011 04:37:12AM 3 points

Good post, thanks. One comment:

"It may appear that the partitioned data always give a better answer than the segregated data. Unfortunately, this just isn't true."

First, I assume you mean "aggregated", otherwise this statement doesn't make sense.

Second, I don't believe you. I say it's always smarter to use the partitioned data than the aggregate data. If you have a data set that includes the gender of the subject, you're always better off building two models (one for each gender) instead of one big model. Why throw away information?

There is a nugget of truth to your claim, which is that sometimes the partitioning strategy becomes impractical. To see why, consider what happens when you first partition on gender, then on history of heart disease. The number of partitions jumps from two to four, meaning there are fewer data samples in each partition. When you add a couple more variables, you will have more partitions than data samples, meaning that most partitions will be empty.
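To make the partition blow-up concrete, here is a minimal sketch of the arithmetic. The function names and the 1000-patient sample size are hypothetical, chosen only for illustration, and every variable is assumed to be binary (as gender and history of heart disease are treated above).

```python
def partition_count(levels_per_variable):
    """Number of cells produced by partitioning on every variable at once."""
    count = 1
    for levels in levels_per_variable:
        count *= levels
    return count


def expected_samples_per_cell(n_samples, levels_per_variable):
    """Average number of data samples per cell (many cells may hold fewer)."""
    return n_samples / partition_count(levels_per_variable)


if __name__ == "__main__":
    n_samples = 1000  # hypothetical study size
    for k in (1, 2, 5, 10, 12):
        binary_variables = [2] * k  # assumption: all variables are binary
        cells = partition_count(binary_variables)
        average = expected_samples_per_cell(n_samples, binary_variables)
        print(f"{k:2d} variables -> {cells:5d} partitions, "
              f"{average:8.2f} samples per partition on average")
```

With two binary variables there are four partitions, as in the example above. With ten there are 1024, which already exceeds the 1000 hypothetical samples, so at least some partitions must be empty.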

So you don't always want to do as much partitioning as you plausibly could. Instead, you want to figure out how to combine single-partition statistics corresponding to each condition (gender, history, etc.) into one large predictive model. This can be attacked with techniques like AdaBoost or MaxEnt.

Comment author: bentarm 14 January 2011 01:09:49AM 2 points

Let's say the only data we'd collected were gender and whether or not the patient's birthday was a Tuesday. Do you really think there is something to be gained from building four separate models now?

More seriously, if you collect enough information, then purely by chance there will be some partitioning of the data which gives the wrong conclusion.
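A small simulation can illustrate this. It is only a sketch under stated assumptions: the treatment genuinely helps (60% versus 50% recovery, numbers invented for the example), and each patient also carries a set of irrelevant coin-flip attributes in the spirit of "born on a Tuesday". The function then lists the attributes for which at least one subgroup makes the treatment look worse than no treatment. All names and parameters here are hypothetical.

```python
import random


def recovery_rate(rows):
    """Fraction of recovered patients, or None if the group is empty."""
    if not rows:
        return None
    return sum(r["recovered"] for r in rows) / len(rows)


def misleading_attributes(n_patients=200, n_attributes=50, seed=0):
    """Return indices of irrelevant attributes with a misleading subgroup.

    Assumption: the treatment truly raises recovery from 0.5 to 0.6, and
    every attribute is an independent fair coin flip unrelated to outcome.
    """
    rng = random.Random(seed)
    patients = []
    for _ in range(n_patients):
        treated = rng.random() < 0.5
        p_recover = 0.6 if treated else 0.5
        patients.append({
            "treated": treated,
            "recovered": rng.random() < p_recover,
            "attrs": [rng.random() < 0.5 for _ in range(n_attributes)],
        })

    misleading = []
    for j in range(n_attributes):
        for value in (False, True):
            cell = [p for p in patients if p["attrs"][j] == value]
            treated_rate = recovery_rate([p for p in cell if p["treated"]])
            control_rate = recovery_rate([p for p in cell if not p["treated"]])
            if treated_rate is None or control_rate is None:
                continue  # cannot compare when one arm is empty
            if treated_rate < control_rate:
                misleading.append(j)  # this subgroup points the wrong way
                break
    return misleading


if __name__ == "__main__":
    found = misleading_attributes()
    print(f"{len(found)} of 50 irrelevant attributes have a misleading subgroup")
```

How many attributes turn up depends on the seed and the sample size. The point is only that the count need not be zero even though none of the attributes has any bearing on recovery.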

I don't think we disagree on anything important here - the main point is that you need to be careful when choosing which partitions of the data you use - arbitrarily partitioning along every available divide is not optimal.

PS - thanks for the typo correction, I really need to learn to proofread...