Measuring the outcome is good, but I see a problem with the original data. How do you know who is really Green and who is really Blue?
By their self-reports, right?
Well, I see a problem here. What if someone insists on self-describing as a Blue, but most Blues disagree with him and say he is completely confused about what Blue-ness is? -- I know the definition of Blue is not exact, but it at least roughly corresponds to something in idea-space, and a person can get it wrong and self-identify as a Blue despite being somewhere else. (Perhaps somewhere beyond both the typical Blue and Green areas, so the person self-identifies as a Blue simply because they use Blue as a synonym for non-Green.) -- If other people fail to recognize such a person as a Blue, is it really their fault?
The question is not exactly "whom to blame?", but rather "if we use noisy inputs and then get noisy outputs, does it tell us anything beyond the fact that there was noise in the input?"
(To be specific, I remember someone in the ideological test saying that they self-identify as both Christian and Atheist. And it was 1 person in 13, so that has a non-trivial impact on the results. I don't think that a majority of either Christians or Atheists would agree that an opinion like this is a valid representation of their views. So how exactly should guessing or not guessing this person's self-description influence the ratings? And should it influence the ratings if the same person were forced to choose only one of the descriptions?)
What if someone insists on self-describing as a Blue, but most Blues disagree with him and say he is completely confused about what Blue-ness is?
Sometimes there are many tinges of Blue. And for almost every tinge you pick, most other Blues will claim people of that tinge are not really Blue. (Religious and ideological movements get like this a lot.) But Greens have no problem classifying people as Blue and non-Blue, so it's not a wholly useless concept.
I recently gave a talk at Chicago Ideas Week on adapting Turing Tests to have better, less mindkill-y arguments, and this is the précis for folks who would prefer not to sit through the video (which is available here).
Conventional Turing Tests check whether a programmer can build a convincing facsimile of a human conversationalist. The test has turned out to reveal less about machine intelligence than human intelligence. (Anger is really easy to fake, since fights can end up a little more Markov chain-y, where you only need to reply to the most recent rejoinder and can ignore what came before). Since normal Turing Tests made us think more about our model of human conversation, economist Bryan Caplan came up with a way to use them to make us think more usefully about our models of our enemies.
After Paul Krugman disparaged Caplan's brand of libertarian economics, Caplan challenged him to an ideological Turing Test, where both players would be human, but would be trying to accurately imitate each other. Caplan and Krugman would each answer questions about their true beliefs honestly, and then would fill out the questionnaire again in persona inimici - trying to guess the answers given by the other side. Caplan was willing to bet that he understood Krugman's position well enough to mimic it, but Krugman would be easily spotted as a fake!Caplan.
Krugman didn't take him up on the offer, but I've run a couple iterations of the test for my religion/philosophy blog. The first year, some of the most interesting results were the proxy variables people were using, which weren't as strong indicators as the judges thought. (One Catholic coasted through to victory as a faux atheist, since many of the atheist judges thought there was no way a Christian would appreciate the webcomic SMBC.)
The trouble was, the Christians did a lot better, since it turned out I had written boring, easy-to-guess questions for the true and faux atheists. The second year, I wrote weirder questions, and the answers were a lot more diverse and surprising (and a number of the atheist participants called out each other as fakes or just plain wrong, since we'd gotten past the shallow questions from year one, and there's a lot of philosophical diversity within atheism).
The exercise made people get curious about what their opponents actually thought and why. It helped people spot incorrect stereotypes of an opposing side and faultlines they'd been ignoring within their own. Personally (and according to other participants), it helped me argue less antagonistically. Instead of just trying to find enough of a weak point to discomfit my opponent, I was trying to build up a model of how they thought, and I needed their help to do it.
Taking a calm, inquisitive look at an opponent's position might teach me that my position is wrong, or has a gap I need to investigate. But even if my opponent is just as wrong as ze seemed, there's still a benefit to me. Having a really detailed, accurate model of zer position may help me show them why it's wrong, since now I can see exactly where it rasps against reality. And even if my conversation isn't helpful to them, it's interesting for me to see what they were missing. I may be correct in this particular argument, but the odds are good that I share the rationalist weak-point that is keeping them from noticing the error. I'd like to be able to see it more clearly so I can try and spot it in my own thought. (Think of this as the shift from "How the hell can you be so dumb?!" to "How the hell *can* you be so dumb?")
When I get angry, I'm satisfied when I beat my interlocutor. When I get curious, I'm only satisfied when I learn something new.