Decius comments on Problems in Education - Less Wrong
So, "systematic bias or selection effect or regression to the mean" can result in average properly normed IQ scores increasing by 8 points? Doesn't the normalizing process (when done properly) force the average score to remain constant?
What normalizing process? You mean the one the paid psychometricians go through years before any specific test is purchased by researchers like the ones doing the Pygmalion study? Yeah, I suppose so, but that's irrelevant to the discussion.
Right: because the entire population going up half a SD in a year isn't unusual at all, and the test purchased for use in this study was normalized the way one would expect, despite producing results that would be impossible if it had been normed that way.
...'entire population'?
Alright, I have to admit I have no idea what test you are now referring to. I thought we were discussing the Pygmalion results, in which a small sample of elementary school students turned in increased IQ scores, which could be explained by a number of well-known and perfectly ordinary processes.
But it seems like you're talking about something else entirely, and may be thinking of country-level Flynn effects or something; I have no idea what.
The PitC (Pygmalion in the Classroom) study showed an 8-point IQ increase in the control group. You offered those three explanations and said they explained why that wasn't particularly unusual, and my understanding of normed IQ tests is that they are expected to remain constant over short time spans.
Over the general average population when tested once, yes. But the control group is neither general nor average nor the population nor tested once.
If the control group isn't at least representative, there is a different methodological flaw. And given that there is apparently a significant increase in scores on the first retest (and presumably a diminishing increase at some point; the expected result of taking the test very many times isn't to become the highest scorer ever), if the confounding factor of prior IQ tests wasn't measured, there is an unaccounted-for confounder.
I'm still trying to figure out what questions to ask before I dig up as much primary source as I can.

- Is "points of normed IQ" the right thing to measure? That would imply that going from an IQ of 140 to 152 is as large a gain as going from 94 to 106.
- Is raw score the right thing to measure? That would imply that going from answering 75% of the questions correctly to 80% is as large a gain as going from 25% to 30%.
- Is the percentage decrease in incorrect answers the correct metric? Then 75% -> 80% would be the same as 25% -> 40%.
- Or the percentage increase in correct answers? Then 25% -> 30% (a 20% increase) would be equivalent to 75% -> 90%.
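To make the comparison concrete, here is a minimal sketch (Python; the two before/after score pairs are the hypothetical ones from the questions above, not data from the study) computing the raw-score-based candidate metrics:

```python
def gain_metrics(before, after):
    """Candidate gain metrics; `before` and `after` are fractions of
    questions answered correctly on the two test administrations.
    (Normed IQ points can't be computed from raw scores alone.)"""
    incorrect_before = 1 - before
    incorrect_after = 1 - after
    return {
        "raw-score points gained": after - before,
        "% decrease in incorrect answers": (incorrect_before - incorrect_after) / incorrect_before,
        "% increase in correct answers": (after - before) / before,
    }

# A hypothetical high scorer and low scorer, as in the questions above.
for before, after in [(0.75, 0.80), (0.25, 0.40)]:
    print(f"{before:.0%} -> {after:.0%}: {gain_metrics(before, after)}")
```

On the "decrease in incorrect answers" metric the two gains come out equal (20% each), while the other metrics rank the two students differently, which is exactly why the choice of metric matters.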
I'm still reluctant to accept class grades and state-mandated graduation test scores as measuring primarily intelligence or even mastery of the material, rather than the specific skill of taking the test. That makes my error bars larger than those of someone who does accept them as accurate measurements of something important.
No, usually in these cases you will be using an effect size like Cohen's d: expressing the difference in standard deviations (on the raw score) between the two groups. You can convert it back to IQ points if you want; if you discover a d of 1.0, that's boosting scores by 1 standard deviation which is usually defined as something like 15 IQ points, and so on.
So if you have your standard paradigmatic experiment (an equal number of controls and experimentals, the two groups starting with exactly the same mean IQ and standard deviation of scores), you'd do your intervention, retest IQ, and your effect size would be (mean IQ of the higher-scoring group - mean IQ of the lower-scoring group) / (pooled standard deviation of experimentals & controls).
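A minimal sketch of that computation, assuming the raw test scores are just lists of numbers (the pooled-SD formula is the standard one for Cohen's d, and 15 points per SD is the convention mentioned above):

```python
import statistics

def cohens_d(experimental, control):
    """Cohen's d: the difference of group means divided by the pooled
    standard deviation of both groups' raw scores."""
    n1, n2 = len(experimental), len(control)
    v1, v2 = statistics.variance(experimental), statistics.variance(control)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(experimental) - statistics.mean(control)) / pooled_sd

def d_to_iq_points(d, points_per_sd=15):
    """Convert an effect size back to IQ points, using the usual
    convention that 1 SD of IQ is defined as 15 points."""
    return d * points_per_sd
```

So a discovered d of 1.0 converts back to 15 IQ points, as above.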
Some of the things this approach buys you: effect sizes are the sine qua non of meta-analyses, so by thinking in effect sizes you can more easily run a meta-analysis yourself if you want (like my own dual n-back meta-analysis of a widely-touted intervention which is supposed to increase IQ), you can interpret meta-analyses better, and you can draw on previous meta-analyses as priors. For example: Jaeggi et al 2008 found n-back had an effect size on IQ of something like d=0.8. If one had seen one of the psychology-wide compilations of previous meta-analyses, one would know that replicated & verified effect sizes that large are rare in every area of psychology, and so it was highly likely that their result was being overstated somehow. As indeed it turned out to be: the overstatement was due to the use of passive control groups, and the current best estimate is closer to half that size, d=0.4.
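As a sketch of the pooling step itself: a fixed-effect meta-analysis just weights each study's effect size by the inverse of its variance (the d values and standard errors below are made up for illustration, not real n-back results):

```python
def pool_fixed_effect(studies):
    """Fixed-effect meta-analysis: weight each study's effect size d by
    1/SE^2 and return the pooled estimate and its standard error."""
    weights = [1 / se ** 2 for _, se in studies]
    pooled_d = sum(w * d for (d, _), w in zip(studies, weights)) / sum(weights)
    pooled_se = (1 / sum(weights)) ** 0.5
    return pooled_d, pooled_se

# Hypothetical (d, SE) pairs: one small optimistic study, two larger ones.
studies = [(0.8, 0.30), (0.35, 0.15), (0.42, 0.20)]
print(pool_fixed_effect(studies))  # pooled d is pulled toward the more precise studies
```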
If IQ is the main cause of getting high class grades and passing cutoffs on tests and being able to learn test-taking skills (like learning any other skill), then couldn't the tests be measuring all of them simultaneously?
... For some reason I thought the first test was used to distribute pretest performance evenly between the two groups. Aren't the control and experimental groups supposed to be as close to identical as possible, partly so the analysis can identify which subgroups, if any, responded differently from the others? If an intervention showed significantly different results for tall people than for short people, then a follow-up study of that intervention stratified by height may be indicated.
That's carryover from a different branch, sorry.
Ideally, yes, but if you shuffle people around, you're not necessarily doing yourself any favors. (I think. This seems to be related to an old debate in experimental design going back to Gosset and Fisher over 'balanced' versus 'randomized' designs, which I don't understand very well.)
This is part of the randomized vs balanced design debate. Suppose tall people did better, but you just randomly allocated people; with a small sample like, say, 10 total and 5 in each group, you would often wind up with very different numbers of tall people in your control and experimental groups (e.g. a 4-1 split of the 5 tall people), and now that imbalance may be driving the difference. If you were using a large sample like 5000 people, then you'd expect the random allocation to be very even between the two groups of 2500.
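A quick simulation of that point (a sketch; "tall" is just a binary label here, and the group sizes are the ones from the example):

```python
import random

def tall_share_gap(n_total, n_tall, trials):
    """Randomly allocate n_total people (n_tall of them tall) into two
    equal halves; return the average absolute gap between the two
    halves' shares of tall people."""
    half = n_total // 2
    people = [1] * n_tall + [0] * (n_total - n_tall)
    total_gap = 0.0
    for _ in range(trials):
        random.shuffle(people)
        share_a = sum(people[:half]) / half
        share_b = sum(people[half:]) / half
        total_gap += abs(share_a - share_b)
    return total_gap / trials

print(tall_share_gap(10, 5, 20_000))      # ~0.29: lopsided splits like 4-1 are common
print(tall_share_gap(5000, 2500, 2_000))  # ~0.01: the two halves are nearly identical
```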
If you specify in advance that tall people are a possible factor, you can try to 'balance' the groups with additional steps: for example, you might randomize short people as usual, but block pairs of tall people - if heads, the guy on the left goes in the experimental group and the guy on the right in the control group; if tails, the other way around - where by definition you get an even split of tall people (and maybe 1 guy left over). This is fine, sensible, and an efficient use of your sample, and if you're testing additional hypotheses like 'tall people score better, even on top of the intervention', you'll take appropriate measures like increasing your sample size to reach your desired statistical power / alpha parameters. No problems there.
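A minimal sketch of that blocking scheme (a hypothetical helper, not from any particular library):

```python
import random

def blocked_assignment(short_people, tall_people):
    """Randomize short people individually; assign tall people in pairs,
    one coin flip per pair, so they split evenly across the two groups."""
    experimental, control = [], []
    for person in short_people:
        (experimental if random.random() < 0.5 else control).append(person)
    random.shuffle(tall_people)
    # One coin flip per pair: heads sends the left person to experimental.
    for left, right in zip(tall_people[0::2], tall_people[1::2]):
        if random.random() < 0.5:
            experimental.append(left)
            control.append(right)
        else:
            experimental.append(right)
            control.append(left)
    # With an odd number of tall people, the one left over is randomized as usual.
    if len(tall_people) % 2:
        (experimental if random.random() < 0.5 else control).append(tall_people[-1])
    return experimental, control
```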
But any post hoc analysis can be abused. If after you run your study you decide to look at how tall people did, you may have an unbalanced split driving any result, you're increasing how many hypotheses you're testing, and so on. Post hoc analyses are untrustworthy and suspicious; here's an example where a post hoc analysis was done: http://lesswrong.com/lw/68k/nback_news_jaeggi_2011_or_is_there_a/
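One way to see why post hoc subgroup hunting is suspicious is to simulate it under the null. This sketch (assumptions: pure-noise scores with no real effect anywhere, ten post hoc subgroups each tested at a nominal ~5% level) shows how often at least one subgroup "finds" an effect:

```python
import random
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / (va / len(a) + vb / len(b)) ** 0.5

def any_posthoc_hit(subgroup_size=50, n_subgroups=10, crit=1.96):
    """One null experiment: every score is pure noise, but we test
    n_subgroups post hoc splits; True if any looks 'significant'."""
    for _ in range(n_subgroups):
        scores = [random.gauss(0, 1) for _ in range(subgroup_size)]
        exp, ctl = scores[: subgroup_size // 2], scores[subgroup_size // 2 :]
        if abs(welch_t(exp, ctl)) > crit:
            return True
    return False

trials = 2_000
rate = sum(any_posthoc_hit() for _ in range(trials)) / trials
print(rate)  # roughly 1 - 0.95**10 ~= 0.40, far above the nominal 5%
```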