gwern comments on LW Women: LW Online - Less Wrong
Not quite. (Saving assumptions for the end of the comment.) If a female got a 499 on the Math SAT, then my estimate of her real score is centered on 499. If she scores a 532, then my estimate is centered on 530; a 600, on 593; an 800, on 780. A 20-point penalty is bigger than a 7-point penalty, but 780 is bigger than 593. So if by "it" you mean "math", that's not the right way to look at it; but if by "it" you mean "that particular score", then yes.
Note that this should also be done to male scores, with the appropriate means and standard deviations. (The std difference was smaller than I remembered it being, so the mean effect will probably dominate.) Males scoring 499, 532, 600, and 800 would be estimated as actually getting 501, 532, 596, and 784. So at the 800 level, the relative penalty for being female would only be 4 points, not the 20 it first appears to be.
Note that I'm pretending that the score is from 2012, that the SAT is normally distributed with the means and variances reported here, and that the standard error of measurement is 30; I'm multiplying Gaussian distributions as discussed here. The 2nd and 3rd assumptions are good near the middle but weak at the ends; the calculation done at 800 is almost certainly incorrect, because we can't tell the difference between a 3-sigma and a 4-sigma mathematician, both of whom would most likely score 800. We could correct for that by integrating, but that's too much work for a brief explanation. Note also that because the test has a max and min score, the normal distribution is truncated and the reported standard deviations probably underestimate the underlying ones, so the effect would probably be more pronounced with a better test.
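The Gaussian-product estimate above can be sketched numerically. A minimal sketch: the 2012 means (499 female, 532 male) and the SEM of 30 are from the comment, while the population SDs of roughly 113 and 119 are assumed values chosen to approximately reproduce the quoted estimates.

```python
def shrink(score, pop_mean, pop_sd, sem=30.0):
    """Posterior mean after multiplying the population prior
    N(pop_mean, pop_sd^2) by the measurement likelihood N(score, sem^2).
    The weight w on the observed score is pop_var / (pop_var + sem_var)."""
    w = pop_sd**2 / (pop_sd**2 + sem**2)
    return pop_mean + w * (score - pop_mean)

# Female (mean 499, assumed SD 113) and male (mean 532, assumed SD 119):
for score in (499, 532, 600, 800):
    print(score, round(shrink(score, 499, 113)), round(shrink(score, 532, 119)))
```

With these assumed SDs the female column comes out 499, 530, 593, 780 and the male column 501, 532, 596, 784, matching the numbers quoted in the comments.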
Another way to think about this is that a 2.25 sigma male mathematician will score 800, but a 2.66 sigma female mathematician is necessary to score 800, and >2.25 sigmas are 12 out of a thousand, whereas >2.66 sigmas are 4 out of a thousand.
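The tail frequencies above follow directly from the standard normal distribution; a quick check using only the standard library (the 2.25 and 2.66 sigma thresholds come from the assumed population figures in the earlier sketch):

```python
from math import erfc, sqrt

def upper_tail(z):
    """P(Z > z) for a standard normal Z, via the complementary error function."""
    return 0.5 * erfc(z / sqrt(2))

print(round(1000 * upper_tail(2.25)))  # about 12 per thousand
print(round(1000 * upper_tail(2.66)))  # about 4 per thousand
```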
This isn't necessary if the prior comes from data that includes the individual in question, and is practically unnecessary in cases where the individual doesn't appreciably change the distribution. Enough females take the SAT that one more female scorer won't move the mean or std enough to be noticeable at the precision that they report it.
In the writing example, where we're dealing with a long tail, it's not clear how to deal with the sampling issues. You'd probably make an estimate for the current individual under consideration using only the historical data as your prior, and then incorporate them into the historical data for the next individual under consideration; but you might instead include them before doing the estimation. I'm sure there's a statistician who's thought about this much longer and more rigorously than I have.
I'm not sure you're using the right numbers for the variability. The material I'm finding online indicates that '30 points with 67% confidence' is not the meaningful number; what matters is the correlation r between two administrations of the SAT: the percentage of regression to the mean is 100*(1-r).
The 2011 SAT test-retest reliabilities are all around 0.9 (the math section is 0.91-0.93), so that's 10%.
Using your female math mean of 499, a female score of 800 would be regressed to 800 - ((800 - 499) * 0.1) = 769.9. Using your male math mean of 532, then a male score of 800 would regress down to 800 - ((800 - 532) * 0.1) = 773.2.
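The linear correction used in those two calculations can be written out directly (the reliability of 0.9 and the two means are the figures quoted above):

```python
def regress_to_mean(score, pop_mean, reliability):
    """Shrink an observed score toward the population mean by the
    fraction (1 - reliability) of its distance from the mean."""
    return score - (score - pop_mean) * (1 - reliability)

print(round(regress_to_mean(800, 499, 0.9), 1))  # 769.9 (female mean)
print(round(regress_to_mean(800, 532, 0.9), 1))  # 773.2 (male mean)
```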
Hmm. You're right that test-retest reliability typically refers to a correlation coefficient, and I was using the standard error of measurement. I'll edit the grandparent to use the correct terms.
I'm not sure I agree with your method, because it seems odd to me that the standard deviation doesn't affect the magnitude of the regression-to-the-mean effect. It seems like you could calculate the test-retest reliability coefficient from the population mean, population std, and measurement-error std; there might then be different reliability coefficients for male and female test-takers, and that would probably be the simpler way to calculate it.
Well, it delivers reasonable numbers; it seems to me that one ought to employ reliability somehow; it's supported by the two links I gave; and it makes sense to me. The standard deviation doesn't come into it because we've already singled out a specific datapoint: we're not asking how many test-takers will hit 800 (where the standard deviation would be very important) but, given that a test-taker has hit 800, how far they will fall back.
Now that I've run through the math, I agree with your method. Supposing the measurement error is independent of the score (which can't be true because of the bounds, and in general probably isn't true), we can calculate the reliability coefficient as (pop var)/(pop var + measurement var) = .93 for women and .94 for men. The resulting formulas are exactly the same, and the difference between the numbers I calculated and the numbers you calculated comes from our differing estimates of the reliability coefficient.
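The claimed equivalence is easy to verify: with r = pop_var/(pop_var + measurement_var), the linear regression-to-the-mean correction and the Gaussian-posterior estimate are algebraically identical. The population SDs of 113 and 119 and the SEM of 30 are the assumed figures from the earlier sketch.

```python
def reliability(pop_sd, sem=30.0):
    """Reliability coefficient under independent Gaussian measurement error."""
    return pop_sd**2 / (pop_sd**2 + sem**2)

def linear_estimate(score, pop_mean, r):
    return score - (score - pop_mean) * (1 - r)

def posterior_estimate(score, pop_mean, pop_sd, sem=30.0):
    w = pop_sd**2 / (pop_sd**2 + sem**2)
    return pop_mean + w * (score - pop_mean)

print(round(reliability(113), 2), round(reliability(119), 2))  # 0.93 0.94
for score in (499, 600, 800):
    assert abs(linear_estimate(score, 499, reliability(113))
               - posterior_estimate(score, 499, 113)) < 1e-9
```

Both formulas reduce to pop_mean + r * (score - pop_mean), which is why only the estimated reliability coefficient, not the method, produces differing numbers.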
In general, the reliability coefficient doesn't take into account extra distributional knowledge. If you knew that scores were power-law distributed in the population but the test error were normally distributed, for example, then you would want to calculate the posterior the long way: with the population data as your prior distribution and the measurement distribution as your likelihood, the posterior being the renormalized product of the two. I don't think a linear correction based on the reliability coefficient would get that right, but I haven't worked it out to show the difference.
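A hypothetical illustration of that point, computing the posterior numerically on a grid. All parameters here are made up for illustration: a Pareto (power-law) prior with minimum 200 and exponent 3, Gaussian measurement error with SD 30, and an observed score of 800.

```python
import math

xm, alpha, sem, observed = 200.0, 3.0, 30.0, 800.0
grid = [xm + i * 0.1 for i in range(18001)]  # true-score grid, 200 .. 2000

# Pareto prior density and Gaussian likelihood of the observed score:
prior = [alpha * xm**alpha / x**(alpha + 1) for x in grid]
like = [math.exp(-0.5 * ((observed - x) / sem) ** 2) for x in grid]

# Posterior = renormalized product of prior and likelihood:
post = [p * l for p, l in zip(prior, like)]
post_mean = sum(x * w for x, w in zip(grid, post)) / sum(post)

# Linear correction using r = prior_var / (prior_var + sem^2):
zp = sum(prior)
pm = sum(x * p for x, p in zip(grid, prior)) / zp
pv = sum((x - pm) ** 2 * p for x, p in zip(grid, prior)) / zp
r = pv / (pv + sem**2)
linear = observed - (observed - pm) * (1 - r)

print(round(post_mean, 1), round(linear, 1))
```

With this heavy-tailed prior the exact posterior mean stays much closer to the observed 800 than the linear reliability correction does, so the two methods genuinely come apart once the prior isn't Gaussian.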
That makes sense, but I think the SAT is constructed, like IQ tests, to be normally rather than power-law distributed, so in this case we get away with a linear correction like reliability.