Douglas_Knight comments on Vegetarianism Ideological Turing Test Results - Less Wrong
I think the first of these claims is a little too pessimistic, and the second may be too.
Here are some comments made by one of the judges (full disclosure: it was me) at the time. "I found these very difficult [...] I had much the same problem [sc. that pretty much every entry felt >50% credible]. [...] almost all my estimates were 40%-60% [...] I fear that this one [...] is just too difficult." I'm pretty sure (though of course memory is deceptive) that I would not have said that I thought myself "decently able to discern genuine writing from fakery". ("Almost all" was too strong, though, if I've correctly guessed which row in the table is mine. Four of my estimates were 70%. One was 99% but that's OK because that was my own entry, which I recognized. The others were all 40-60%. Incidentally, I got two of my four 70% guesses right and two wrong, and four of my eight 40%/60% guesses right and four wrong.)
On the second, I remark that judge 14 (full disclosure: this was definitely not me) scored better than +450 and got only two of the 13 entries wrong. The probability of any given judge getting 11/13 or better by chance is about 1%. [EDITED to add: As Douglas_Knight points out, it would be better to say 10/12 because judge 14 guessed 50% for one entry.] In a sample of 53 people you'll get someone doing this well just by chance a little over half the time. But wait, the two wrong ones were both 60/40 judgements, and judge 14 had a bunch of 70s and 80s and one 90 as well, all of them correct. With judge 14's probability assignments and random actual results, simulation (I'm too lazy to do it analytically) says that as good a logarithmic score happens only about 0.3% of the time. To figure out exactly what that says about the overall results we'd need some kind of probabilistic model for how people assign their probabilities or something, and I'm way too lazy for that, but my feeling is that judge 14's results are good enough to suggest genuinely better-than-chance performance.
If anyone wants to own up to being judge 14, I'd be extremely interested to hear what they have to say about their mental processes while judging.
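The second calculation above can be sketched roughly as follows. This uses exact enumeration of every right/wrong pattern rather than the simulation mentioned in the comment, which is feasible because there are only 13 entries. Two things here are assumptions and not taken from the results table: the scoring rule (log base 2 of the probability given to the true answer, relative to a 50% baseline) and the list of confidences, which is a hypothetical stand-in matching only the rough description "two wrong 60s, a bunch of 70s and 80s, one 90, one 50". The names `log_score` and `chance_tail` are made up for illustration.

```python
from itertools import product
from math import log2


def log_score(confidences, outcomes):
    """Sum of log2(p / 0.5), where p is the probability given to what actually happened.

    ASSUMPTION: this scoring rule is a stand-in; the test's real rule may differ.
    """
    return sum(
        log2((c if hit else 1.0 - c) / 0.5)
        for c, hit in zip(confidences, outcomes)
    )


def chance_tail(confidences, observed_outcomes):
    """Fraction of equally likely right/wrong patterns scoring at least as well as observed."""
    target = log_score(confidences, observed_outcomes)
    patterns = list(product([True, False], repeat=len(confidences)))
    # Small tolerance so patterns that tie the observed score are counted despite rounding.
    as_good = sum(log_score(confidences, p) >= target - 1e-9 for p in patterns)
    return as_good / len(patterns)


# HYPOTHETICAL confidences, not judge 14's real row:
# two 60% guesses (wrong), five 70s, four 80s, one 90 (all right), one 50% punt.
confidences = [0.6, 0.6] + [0.7] * 5 + [0.8] * 4 + [0.9] + [0.5]
observed = [False, False] + [True] * 11

p_tail = chance_tail(confidences, observed)
print(f"chance of a log score at least this good: {p_tail:.4%}")
```

The figure this prints depends entirely on the confidences fed in, so it should not be expected to reproduce the 0.3% quoted above unless the real row from the table is substituted.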
The judge in row 14 did not get 11/13, but 10/12, having punted on #8 by assigning 50%. This affects at least your first calculation.
Good catch. But it's the second calculation that I find more interesting.
There is also a fair chance that that judge recognized at least one of their own entries... 9/11?
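For the counts discussed in this thread (11/13, 10/12, and possibly 9/11), here is a minimal sketch of the first calculation, assuming each guess is an independent fair coin flip and that the 53 judges are independent of one another. Both are simplifying assumptions, and the function names `tail_at_least` and `any_of` are made up for illustration.

```python
from math import comb


def tail_at_least(k, n):
    """P(at least k right out of n) when every guess is a fair coin flip."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


def any_of(p, judges):
    """P(at least one of `judges` independent judges does this well), each with chance p."""
    return 1 - (1 - p) ** judges


JUDGES = 53  # sample size quoted in the thread

for k, n in [(11, 13), (10, 12), (9, 11)]:
    p = tail_at_least(k, n)
    print(f"{k}/{n} or better: {p:.2%} per judge, "
          f"{any_of(p, JUDGES):.0%} for at least one of {JUDGES} judges")
```

Per judge, this gives about 1.1% for 11/13 and about 1.9% for 10/12, so dropping the 50% entry makes the raw hit count noticeably less surprising, and 9/11 less surprising still.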