every single judge thought themselves decently able to discern genuine writing from fakery. The numbers suggest that every single judge was wrong.
I think the first of these claims is a little too pessimistic, and the second may be too.
Here are some comments made by one of the judges (full disclosure: it was me) at the time. "I found these very difficult [...] I had much the same problem [sc. that pretty much every entry felt >50% credible]. [...] almost all my estimates were 40%-60% [...] I fear that this one [...] is just too difficult." I'm pretty sure (though of course memory is deceptive) that I would not have said that I thought myself "decently able to discern genuine writing from fakery". ("Almost all" was too strong, though, if I've correctly guessed which row in the table is mine. Four of my estimates were 70%. One was 99% but that's OK because that was my own entry, which I recognized. The others were all 40-60%. Incidentally, I got two of my four 70% guesses right and two wrong, and four of my eight 40%/60% guesses right and four wrong.)
On the second, I remark that judge 14 (full disclosure: this was definitely not me) scored better than +450 and got only two of the 13 entries wrong. The probability of any given judge getting 11/13 or better by chance is about 1%. [EDITED to add: As Douglas_Knight points out, it would be better to say 10/12 because judge 14 guessed 50% for one entry.] In a sample of 53 people you'll get someone doing this well just by chance a little over half the time. But wait, the two wrong ones were both 60/40 judgements, and judge 14 had a bunch of 70s and 80s and one 90 as well, all of them correct. With judge 14's probability assignments and random actual results, simulation (I'm too lazy to do it analytically) says that as good a logarithmic score happens only about 0.3% of the time. To figure out exactly what that says about the overall results we'd need some kind of probabilistic model for how people assign their probabilities or something, and I'm way too lazy for that, but my feeling is that judge 14's results are good enough to suggest genuinely better-than-chance performance.
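For anyone who wants to check the by-chance figures, here is a minimal Python sketch. It assumes each guess is an independent fair coin flip and that the 53 judges are independent of one another; the function names `tail_prob` and `at_least_one` are made up for illustration. It covers both ways of counting judge 14's record (11/13 and 10/12).

```python
from math import comb


def tail_prob(n: int, k: int) -> float:
    """P(at least k correct out of n) when every guess is a fair coin flip."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


def at_least_one(p: float, judges: int) -> float:
    """P(at least one of `judges` independent judges does that well by chance)."""
    return 1 - (1 - p) ** judges


if __name__ == "__main__":
    # Assumption: 53 independent judges, as reported in the post.
    for n, k in [(13, 11), (12, 10)]:
        p = tail_prob(n, k)
        print(f"{k}/{n} or better: {p:.4f}; "
              f"at least one of 53 judges: {at_least_one(p, 53):.2f}")
```

This only counts right and wrong guesses. It ignores how confident each guess was, which is the point the logarithmic score picks up.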
If anyone wants to own up to being judge 14, I'd be extremely interested to hear what they have to say about their mental processes while judging.
As Douglas_Knight points out, it's only 10/12, a probability of ~0.016. In a sample of ~50 we should see about one person at that level of accuracy or inaccuracy, which is exactly what we see. I'm no more inclined to give #14 a medal than I am to call #43 a dunce. See the histogram I stuck on to the end of the post for more intuition about why I see these extreme results as normal.
I absolutely will fess up to exaggerating in that sentence for the sake of dramatic effect. Some judges, such as yourself, were MUCH less wrong. I hope you don't mind me outing y...
Back in August I ran a Caplan Test (or more commonly an "Ideological Turing Test") both on Less Wrong and in my local rationality meetup. The topic was diet, specifically: Vegetarian or Omnivore?
If you're not familiar with Caplan Tests, I suggest reading Palladias' post on the subject or reading Wikipedia. The test I ran was pretty standard: thirteen blurbs were presented to the judges, each selected by the toss of a coin to be from either a vegetarian or an omnivore, and also randomly selected to be either genuine or an impostor trying to pass themselves off as the alternative. My main contribution, which I haven't seen in previous tests, was asking judges for a credence (a probability) instead of a simple "I think they're X".
I originally chose vegetarianism because I felt like it's an issue which splits our community (and particularly my local community) pretty well. A third of test participants were vegetarians, and according to the 2014 census, only 56% of LWers identify as omnivores.
Before you see the results of the test, please take a moment to say aloud how well you think you can do at predicting whether someone participating in the test was genuine or a fake.
.
.
.
.
.
.
.
.
.
.
.
.
.
If you think you can do better than chance you're probably fooling yourself. If you think you can do significantly better than chance you're almost certainly wrong. Here are some statistics to back that claim up.
I got 53 people to judge the test: 43 from LessWrong and 10 from my local group. Averaging across the entire group, 51.1% of judgments were correct. If my chi-squared math is correct, the p-value under the null hypothesis (that judges do no better than chance) is 57% on this data. (Note that this includes people who judged an entry as 50%. If we don't include those folks the success rate drops to 49.4%.)
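As a rough check on that p-value, here is a minimal sketch of a chi-squared goodness-of-fit test against a fair coin. The counts are assumptions, not taken from the data file: it supposes all 53 judges rated all thirteen entries (689 judgments) and reconstructs the number correct from the quoted 51.1%. The function name `chi2_pvalue_fair_coin` is hypothetical.

```python
from math import erfc, sqrt


def chi2_pvalue_fair_coin(correct: int, total: int) -> float:
    """Chi-squared goodness-of-fit p-value against a 50/50 null (1 degree of freedom)."""
    expected = total / 2
    wrong = total - correct
    chi2 = ((correct - expected) ** 2 + (wrong - expected) ** 2) / expected
    # With one degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(x / 2)), so no stats library is needed.
    return erfc(sqrt(chi2 / 2))


# Assumption: every one of the 53 judges rated all 13 entries.
TOTAL = 53 * 13
# Assumption: number correct reconstructed from the reported 51.1%.
CORRECT = round(0.511 * TOTAL)

if __name__ == "__main__":
    print(f"p-value: {chi2_pvalue_fair_coin(CORRECT, TOTAL):.2f}")
```

Under those assumed counts the result lands close to the 57% quoted above. How a 50% judgment was scored as right or wrong is not specified here, so the real counts may differ slightly.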
In retrospect, this seemed rather obvious to me. Vegetarians aren't significantly different from omnivores. Unlike a religion or a political party there aren't many cultural centerpieces to diet. Vegetarian judges did no better than omnivore judges, even when judging vegetarian entries. In other words, in this instance the minority doesn't possess any special powers for detecting other members of the in-group. This test shows null results; the thing that distinguishes vegetarians from omnivores is not familiarity with the other side's arguments or culture, at least not to the degree that we can distinguish at a glance.
More interesting, in my opinion, than the null results were the results I got on the calibration of the judges. Back when I asked you to say aloud how good you'd be, what did you say? Did the last three paragraphs seem obvious? Would it surprise you to learn that not a single one of the 53 judges held their guesses to a confidence band of 40%-60%? In other words, every single judge thought themselves decently able to discern genuine writing from fakery. The numbers suggest that every single judge was wrong.
(The flip-side to this is, of course, that every entrant to the test won! Congratulations rationalists: signs point to you being able to pass as vegetarians/omnivores when you try, even if you're not in that category. The average credibility of an impostor entry was 59%, while the average credibility of a genuine response was 55%. No impostors got an average credibility below 49%.)
Using the logarithmic scoring rule for the calibration game we can measure the error of the community. The average judge got a score of -543. For comparison, a judge that answered 50% ("I don't know") to all questions would've gotten a score of 0. Only eight judges got a positive score, and only one had a score higher than 100 (consistent with random chance). This is actually one area where Less Wrong should feel good. We're not at all calibrated... but for this test at least, the judges from the website were much better calibrated than my local community (who mostly just lurk). If we separate the two groups we see that the average score for my community was -949, while LW had an average of -448. Given that I restricted the choices to multiples of 10, a random selection of credences gives an average score of -921.
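To make the scoring concrete, here is a minimal sketch of one common form of the logarithmic rule. The exact formula is an assumption: it scores each entry as 100 × log2(2p), where p is the probability the judge gave to what turned out to be true, chosen because it gives 0 for a 50% answer as described above. The names `entry_score` and `judge_score` and the example credences are hypothetical, and this sketch is not claimed to reproduce the totals quoted in this post.

```python
from math import log2


def entry_score(credence_genuine: float, is_genuine: bool) -> float:
    """Assumed rule: 100 * log2(2p), with p the probability given to the truth."""
    p_truth = credence_genuine if is_genuine else 1 - credence_genuine
    return 100 * log2(2 * p_truth)


def judge_score(credences: list[float], truths: list[bool]) -> float:
    """Total score for one judge across all entries."""
    return sum(entry_score(c, t) for c, t in zip(credences, truths))


if __name__ == "__main__":
    # Hypothetical judge and outcomes, for illustration only.
    truths = [True, False, True, False]
    cautious = [0.6, 0.4, 0.4, 0.6]
    bold = [0.9, 0.1, 0.1, 0.9]
    print("cautious:", round(judge_score(cautious, truths)))
    print("bold:    ", round(judge_score(bold, truths)))
```

Both hypothetical judges get two of four right, but the bold one loses far more on the misses than it gains on the hits. That asymmetry is why overconfidence drags scores so far below zero.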
In short, the LW community didn't prove to be any better at discerning fact from fiction, but it was significantly less overconfident. More de-biasing needs to be done, however! The next time you think of a probability to reflect your credence, ask yourself "Is this the sort of thing that anyone would know? Is this the sort of thing I would know?" That answer will probably be "no" a lot more than it feels like from the inside.
Full data (minus contact info) can be found here.
Those of you who submitted a piece of writing that I used, or who judged the test and left their contact information: I will be sending out personal scores very soon (probably by this weekend). Deep apologies regarding the delay on this post. I had a vacation in late August and it threw off my attention to this project.
EDIT: Here's a histogram of the identification accuracy.
EDIT 2: For reference, here are the entries that were judged.