gwern comments on LW Women: LW Online - Less Wrong
Comments (590)
I'm not sure you're using the right numbers for the variability. The material I'm finding online indicates that '30 points with 67% confidence' is not the meaningful number; what matters is simply the correlation r between 2 administrations of the SAT, and the percent of regression is 100*(1-r).
The 2011 SAT test-retest reliabilities are all around 0.9 (the math section is 0.91-0.93), so that's 10%.
Using your female math mean of 499, a female score of 800 would be regressed to 800 - ((800 - 499) * 0.1) = 769.9. Using your male math mean of 532, then a male score of 800 would regress down to 800 - ((800 - 532) * 0.1) = 773.2.
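The arithmetic above can be reproduced with a short Python sketch. The function name `regress_to_mean` is made up for illustration, and the reliability of 0.9 is the rounded figure used in this comment, not an exact published value.

```python
def regress_to_mean(observed, group_mean, reliability):
    """Expected retest score: pull the observed score toward the
    group mean by the fraction (1 - reliability)."""
    return observed - (observed - group_mean) * (1 - reliability)

# Means from the comment above; reliability rounded to 0.9.
female = regress_to_mean(800, 499, 0.9)
male = regress_to_mean(800, 532, 0.9)
print(round(female, 1), round(male, 1))
```

With a reliability of 1.0 the function returns the observed score unchanged, and with 0.0 it returns the group mean, which are the two limiting cases one would expect.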
Hmm. You're right that test-retest reliability typically refers to a correlation coefficient, and I was using the standard error of measurement. I'll edit the grandparent to use the correct terms.
I'm not sure I agree with your method because it seems odd to me that the standard deviation doesn't impact the magnitude of the regression to the mean effect. It seems like you could calculate the test-retest reliability coefficient from the population mean, population std, and standard measurement error std, and there might be different reliability coefficients for male and female test-takers, and then that'd probably be the simpler way to calculate it.
Well, it delivers reasonable numbers, it seems to me that one ought to employ reliability somehow, it is supported by the two links I gave, and it makes sense to me: standard deviation doesn't come into it because we've already singled out a specific datapoint. We're not asking how many test-takers will hit 800 (where standard deviation would be very important) but, given that a test-taker has hit 800, how far they will fall back.
Now that I've run through the math, I agree with your method. Supposing the measurement error is independent of score (which can't be true because of the bounds, and in general probably isn't true), we can calculate the reliability coefficient by (pop var)/(pop var + measurement var)=.93 for women and .94 for men. The resulting formulas are the exact same, and the difference between the numbers I calculated and the numbers you calculated comes from our differing estimates of the reliability coefficient.
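As a rough sketch of that calculation in Python: the formula is the one quoted above, (pop var)/(pop var + measurement var), under the same assumption that measurement error is independent of score. The function name and the example inputs (a population standard deviation of 110 and a measurement error standard deviation of 30) are illustrative placeholders, not the figures used in this thread.

```python
def reliability_from_variances(pop_sd, measurement_sd):
    """Reliability coefficient as pop var / (pop var + measurement var),
    assuming measurement error is independent of score."""
    pop_var = pop_sd ** 2
    measurement_var = measurement_sd ** 2
    return pop_var / (pop_var + measurement_var)

# Illustrative inputs only (hypothetical, not taken from the thread).
r = reliability_from_variances(110, 30)
print(round(r, 3))
```

A wider population spread with the same measurement error gives a higher coefficient, which is how the standard deviation enters the regression estimate indirectly.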
In general, the reliability coefficient doesn't take into account extra distributional knowledge. If you knew that scores were power-law distributed in the population but the test error were normally distributed, for example, then you would want to calculate the posterior the long way: with the population data as your prior distribution and the measurement distribution as your likelihood ratio distribution, and the posterior is the renormalized product of the two. I don't think that using a linear correction based on the reliability coefficient would get that right, but I haven't worked it out to show the difference.
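"The long way" can be sketched on a grid of candidate true scores: multiply the prior weight at each point by the likelihood of the observed score, renormalize, and take the mean. Everything concrete here is an assumption for illustration: the grid from 0 to 1000, the prior mean of 500 and standard deviation of 110, the measurement error standard deviation of 30, and the observed score of 700. The sketch only checks the normal-prior case, where the result should match the linear correction; any other prior (a power law, say) could be passed in as a different list of weights.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior_mean(grid, prior, observed, measurement_sd):
    """Posterior mean of the true score: prior times normal likelihood,
    renormalized over the grid."""
    weights = [p * normal_pdf(observed, t, measurement_sd)
               for t, p in zip(grid, prior)]
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total

# Hypothetical setup: normal prior on true scores, normal measurement error.
grid = list(range(0, 1001))
normal_prior = [normal_pdf(t, 500, 110) for t in grid]
long_way = posterior_mean(grid, normal_prior, 700, 30)

# Linear correction with the reliability implied by the same variances.
r = 110 ** 2 / (110 ** 2 + 30 ** 2)
linear = 500 + r * (700 - 500)
print(round(long_way, 1), round(linear, 1))
```

With a normal prior the two numbers agree closely, so the linear correction loses nothing in that case; the grid version is the one to reach for when the prior has some other shape.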
That makes sense, but I think the SAT is constructed like IQ tests to be normally rather than power-law distributed, so in this case we get away with a linear correction like the one based on reliability.