Back in August I ran a Caplan Test (more commonly known as an "Ideological Turing Test") both on Less Wrong and in my local rationality meetup. The topic was diet, specifically: Vegetarian or Omnivore?

If you're not familiar with Caplan Tests, I suggest reading Palladias' post on the subject or reading Wikipedia. The test I ran was pretty standard: thirteen blurbs were presented to the judges, each selected by the toss of a coin to be from either a vegetarian or an omnivore, and also randomly selected to be either genuine or an impostor trying to pass themselves off as the alternative. My main contribution, which I haven't seen in previous tests, was asking for a credence/probability instead of a simple "I think they're X".

I originally chose vegetarianism because I felt like it's an issue which splits our community (and particularly my local community) pretty well. A third of test participants were vegetarians, and according to the 2014 census, only 56% of LWers identify as omnivores.

Before you see the results of the test, please take a moment to say aloud how well you think you can do at predicting whether someone participating in the test was genuine or a fake.

.

.

.

.

.

.

.

.

.

.

.

.

.

If you think you can do better than chance you're probably fooling yourself. If you think you can do significantly better than chance you're almost certainly wrong. Here are some statistics to back that claim up.

I got 53 people to judge the test: 43 from LessWrong and 10 from my local group. Averaging across the entire group, 51.1% of judgments were correct. If my chi-squared math is correct, the p-value for the null hypothesis (that judges do no better than chance) is 57% on this data. (Note that this includes people who judged an entry as 50%. If we don't include those folks the success rate drops to 49.4%.)
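For anyone who wants to check the chi-squared figure, here is a minimal sketch of a one-degree-of-freedom goodness-of-fit test against pure guessing. The judgment count is an assumption: 53 judges times 13 entries gives 689, but the post does not state the exact number of judgments scored, so `total` and `correct` below are illustrative rather than taken from the data.

```python
import math


def chi_square_p_value(correct: int, total: int) -> float:
    """P-value for the null 'every judgment is a 50/50 guess' (chi-square, 1 df)."""
    expected = total / 2
    incorrect = total - correct
    chi2 = ((correct - expected) ** 2 + (incorrect - expected) ** 2) / expected
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)), so no stats library is needed.
    return math.erfc(math.sqrt(chi2 / 2))


# ASSUMPTION: 53 judges x 13 entries; the real count may differ.
total = 53 * 13
# ASSUMPTION: reconstructed from the reported 51.1% success rate.
correct = round(0.511 * total)

p = chi_square_p_value(correct, total)
print(f"{correct}/{total} correct, p = {p:.2f}")
```

Under these assumed counts the p-value comes out well above any usual significance threshold, which is in line with the null result described above.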

In retrospect, this seemed rather obvious to me. Vegetarians aren't significantly different from omnivores. Unlike a religion or a political party there aren't many cultural centerpieces to diet. Vegetarian judges did no better than omnivore judges, even when judging vegetarian entries. In other words, in this instance the minority doesn't possess any special powers for detecting other members of the in-group. This test shows null results; the thing that distinguishes vegetarians from omnivores is not familiarity with the other side's arguments or culture, at least not to the degree that we can distinguish at a glance.

More interesting, in my opinion, than the null results were the results I got on the calibration of the judges. Back when I asked you to say aloud how good you'd be, what did you say? Did the last three paragraphs seem obvious? Would it surprise you to learn that not a single one of the 53 judges held their guesses to a confidence band of 40%-60%? In other words, every single judge thought themselves decently able to discern genuine writing from fakery. The numbers suggest that every single judge was wrong.

(The flip-side to this is, of course, that every entrant to the test won! Congratulations rationalists: signs point to you being able to pass as vegetarians/omnivores when you try, even if you're not in that category. The average credibility of an impostor entry was 59%, while the average credibility of a genuine response was 55%. No impostors got an average credibility below 49%.)

Using the logarithmic scoring rule for the calibration game we can measure the error of the community. The average judge got a score of -543. For comparison, a judge that answered 50% ("I don't know") to all questions would've gotten a score of 0. Only eight judges got a positive score, and only one had a score higher than 100 (consistent with random chance). This is actually one area where Less Wrong should feel good. We're not at all calibrated... but for this test at least, the judges from the website were much better calibrated than my local community (who mostly just lurk). If we separate the two groups we see that the average score for my community was -949, while LW had an average of -448. Given that I restricted the choices to multiples of 10, a random selection of credences gives an average score of -921.
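To make the scoring concrete, here is a rough sketch of a logarithmic rule in which a 50% answer scores 0. The exact formula is an assumption: `100 * log2(2p)` is the form commonly used in calibration games, and the post does not spell out which variant was used. The option set below is also hypothetical (extremes clamped to 1% and 99%, no 50% option), since the actual form choices are not listed here.

```python
import math


def log_score(credence: float, is_true: bool) -> float:
    """Points for one judgment under the ASSUMED rule 100 * log2(2 * p_truth)."""
    p_truth = credence if is_true else 1.0 - credence
    return 100.0 * math.log2(2.0 * p_truth)


def random_baseline(options, n_questions: int) -> float:
    """Expected total score when each credence is drawn uniformly from
    `options` and each statement is equally likely to be true or false."""
    per_question = sum(
        0.5 * log_score(c, True) + 0.5 * log_score(c, False) for c in options
    ) / len(options)
    return n_questions * per_question


# HYPOTHETICAL option set, not confirmed by the post.
options = [0.01, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.90, 0.99]

print(log_score(0.5, True))            # "I don't know" earns nothing either way
print(log_score(0.9, False))           # confident and wrong costs a lot
print(round(random_baseline(options, 13)))
```

With this hypothetical option set and 13 entries, the random baseline lands close to the -921 quoted above. That is suggestive, but it should not be read as confirmation of how the figure was actually computed.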

In short, the LW community didn't prove to be any better at discerning fact from fiction, but it was significantly less overconfident. More de-biasing needs to be done, however! The next time you think of a probability to reflect your credence, ask yourself "Is this the sort of thing that anyone would know? Is this the sort of thing I would know?" That answer will probably be "no" a lot more than it feels like from the inside.

Full data (minus contact info) can be found here.

Those of you who submitted a piece of writing that I used, or who judged the test and left their contact information: I will be sending out personal scores very soon (probably by this weekend). Deep apologies regarding the delay on this post. I had a vacation in late August and it threw off my attention to this project.

EDIT: Here's a histogram of the identification accuracy. 

[Figure: histogram of identification accuracy]

EDIT 2: For reference, here are the entries that were judged.

Interesting update/realization I just had:

I was one of the people putting down "40%" or "60%" type answers - despite also thinking "man, I'm really not sure here." But "40%/60%" feels like a number that's "sufficiently" uncertain to represent a rough mental state of "not really sure, but if I had to pick I'd go with meat-eater." When in fact if anything I should have been putting down 49%/51% at best, because seriously it was genuinely hard to tell.

I also remember the essays being very weird and throwing me off a bit. Like, in the normal population, vegetarians say "animals can feel pain and we shouldn't hurt or kill them" and meat eaters say "we're at the top of the food chain and it's the natural order of things for us to eat."

On Less Wrong, vegetarians say things like "animals feel pain, therefore factory farming is evil but also we probably should destroy all natural habitat so that wild animals can't exist or suffer. Also maybe we should just eradicate all life on earth to prevent suffering." And meat eaters say "I look forward to tasty cost effective meat grown in test tubes but right now I'm too lazy and don't care that much. Also, probably better to have slightly-better farming conditions but then continue creating billions of chickens for a net hedonic gain."

It was pretty clear that any lessons I took on identifying vegetarians/meat-eaters from this test were pretty localized.

I should have been putting down 49%/51% at best

But we didn't have that option!

I suspect that at least some judges (including me, though I'm reconstructing rather than actually recalling my thought processes) (1) used 40/60 to indicate "meh, scarcely any idea but I lean this way rather than that" and then (2) felt like they had to use 30/70 for opinions one notch stronger, even though if evaluating them in a vacuum they might have chosen something more like 40% or 60% to represent them.

(In my case, at least, this doesn't make me look much better; even aside from the fact that that isn't how you should assign probabilities, I got exactly half of my eight 40/60 judgements right, and also exactly half of my four 30/70 judgements. I suppose that means I'm consistent, but not in a good way.)

In retrospect I ought to have included options closer to 50%. I didn't expect that they'd be so necessary! You are absolutely right, though.

A big part of LessWrong, I think, is learning to overcome our mental failings. Perhaps we can use this as a lesson that the best judge writes down their credence before seeing the options, then picks the option that is the best match to what they wrote. I know that I, personally, try (and often fail) to use this technique when doing multiple-choice tests.

Yup.

I do a similar thing when contemplating buying things like books: I don't look at the price, then I ask myself "How much would I pay to have this?", then I check the price. (And then, if it's a book, I too often buy the damn thing anyway even though the price is higher than the one I decided on. Perfectly rational action is, alas, beyond me.)

Looking at the price gives you information about how valuable and/or rare the thing is, which may in turn affect what price you are willing to pay in a way that cannot be captured by guessing a price beforehand.

Of course this can be gamed by sellers in specific situations, but sellers who did that every single time would gain a reputation for low-information pricing, which would limit how many sellers actually do that.

Looking at the price gives you information

Yes, but in the cases I have in mind (e.g., buying a book) it gives very little extra information. If I've already had a good look inside a book to assess how interesting the subject matter is, how clearly things are explained, how well written the prose is, how competent the typography is, what sort of paper it's printed on, etc., knowing the price will tell me basically nothing more about how valuable the book will be to me.

I would not encourage this technique for goods whose present and future price are largely driven by scarcity and whose value to you is substantially affected by your prospects of selling them in the future. But if you're buying such goods, you should be getting scarcity information from other places besides the current asking price of the particular instance now before you.

every single judge thought themselves decently able to discern genuine writing from fakery. The numbers suggest that every single judge was wrong.

I think the first of these claims is a little too pessimistic, and the second may be too.

Here are some comments made by one of the judges (full disclosure: it was me) at the time. "I found these very difficult [...] I had much the same problem [sc. that pretty much every entry felt >50% credible]. [...] almost all my estimates were 40%-60% [...] I fear that this one [...] is just too difficult." I'm pretty sure (though of course memory is deceptive) that I would not have said that I thought myself "decently able to discern genuine writing from fakery". ("Almost all" was too strong, though, if I've correctly guessed which row in the table is mine. Four of my estimates were 70%. One was 99% but that's OK because that was my own entry, which I recognized. The others were all 40-60%. Incidentally, I got two of my four 70% guesses right and two wrong, and four of my eight 40%/60% guesses right and four wrong.)

On the second, I remark that judge 14 (full disclosure: this was definitely not me) scored better than +450 and got only two of the 13 entries wrong. The probability of any given judge getting 11/13 or better by chance is about 1%. [EDITED to add: As Douglas_Knight points out, it would be better to say 10/12 because judge 14 guessed 50% for one entry.] In a sample of 53 people you'll get someone doing this well just by chance a little over half the time. But wait, the two wrong ones were both 60/40 judgements, and judge 14 had a bunch of 70s and 80s and one 90 as well, all of them correct. With judge 14's probability assignments and random actual results, simulation (I'm too lazy to do it analytically) says that as good a logarithmic score happens only about 0.3% of the time. To figure out exactly what that says about the overall results we'd need some kind of probabilistic model for how people assign their probabilities or something, and I'm way too lazy for that, but my feeling is that judge 14's results are good enough to suggest genuinely better-than-chance performance.
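The by-chance figures here can be checked with a short binomial calculation. This is a sketch under the simplifying assumption that every judgment is an independent fair coin flip; the function names are illustrative, and it deliberately ignores the credence-weighted simulation, which would need judge 14's actual answers.

```python
from math import comb


def exactly(k: int, n: int) -> float:
    """P(exactly k correct out of n) when each guess is a fair coin flip."""
    return comb(n, k) / 2 ** n


def tail_at_least(k: int, n: int) -> float:
    """P(at least k correct out of n) under the same coin-flip assumption."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


def chance_someone_does(k: int, n: int, judges: int) -> float:
    """P(at least one of `judges` independent guessers reaches k of n)."""
    return 1.0 - (1.0 - tail_at_least(k, n)) ** judges


print(f"11/13 or better, one judge: {tail_at_least(11, 13):.4f}")
print(f"10/12 or better, one judge: {tail_at_least(10, 12):.4f}")
print(f"exactly 10/12, one judge:   {exactly(10, 12):.4f}")
print(f"expected judges at 11/13+:  {53 * tail_at_least(11, 13):.2f}")
print(f"at least one of 53 judges:  {chance_someone_does(11, 13, 53):.2f}")
```

Printing both the expected number of such judges and the probability of at least one is deliberate: the two are easy to conflate, and they give noticeably different numbers at this sample size.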

If anyone wants to own up to being judge 14, I'd be extremely interested to hear what they have to say about their mental processes while judging.

As Douglas_Knight points out, it's only 10/12, a probability of ~0.016. In a sample of ~50 we should see about one person at that level of accuracy or inaccuracy, which is exactly what we see. I'm no more inclined to give #14 a medal than I am to call #43 a dunce. See the histogram I stuck on to the end of the post for more intuition about why I see these extreme results as normal.

I absolutely will fess up to exaggerating in that sentence for the sake of dramatic effect. Some judges, such as yourself, were MUCH less wrong. I hope you don't mind me outing you as one of the people who got a positive score, and that's a reflection of you being better calibrated. That said, if you say "I'm 70% confident" four times, and only get it right twice, that's evidence that you were still (slightly) overconfident when you thought "decently able to discern genuine writing from fakery".

I'm #43 and I'll accept my dunce cap. I responded just after I began lurking here. I remember having little confidence in my responses and yet I apparently answered as if I did. I really have no insight into why I answered this way. My cringeworthy results reinforce to me the importance of sticking around and improving my thinking.

that's a reflection of you being better calibrated

Or, of course, just lucky. If you aren't giving #14 a medal, you shouldn't be giving me one either. (Though, as it happens, I have some reason to think my calibration is pretty good.) And yes, I was still slightly overconfident, and my intention in what I wrote above was to make it clear that I recognize that.

The judge in row 14 did not get 11/13, but 10/12, having punted on #8 by assigning 50%. This affects at least your first calculation.

Good catch. But it's the second calculation that I find more interesting.

There is also a fair chance that that judge recognized at least one of their own entries... 9/11?

Possible confounding variable: people who don't think they can tell the difference might be less likely to do the test. I remember Brienne doing something similar on Facebook. I read a couple of entries, spent a while agonizing about whether I thought they were slightly more likely to be genuine or fake, got bored/frustrated and gave up.

One thing that surprised me when looking at the data is that omnivores appear to have done slightly better at getting the answers 'right' (as determined by a simple greater or less than 50% comparison). I would have thought the vegetarians would do better, as they would be more familiar with the in-group terminology. That said, I have no clue if the numbers are even significant given the size of the group, so I wouldn't read too much into it. (Apologies in advance for awful formatting)

Number 'correct'   1   2   3   4   5   6   7   8   9  10   Grand Total
Omnivore           1   0   1   5   3   8   7   3   0   1            29
Vegetarian         0   0   2   1   5   4   2   0   0   0            14

You're right, but I'm pretty confident that the difference isn't significant. We should probably see it as evidence that rationalist omnivores are about as capable as rationalist vegetarians.

If we look at average percent of positive predictions (predictions that earn more than 0 points):

Omnivores: 51%

Vegetarians: 46%

If we look at non-negative predictions (counting 50% predictions):

Omnivores: 52%

Vegetarians: 49%

This is a very good point, and I ought to have mentioned it in the post. The point remains about overconfidence, however. Those who did decide to try (even given that it was hard) didn't have the mental red-flag that perhaps their best try should be saying "I don't know" with or without walking away.

I think you should distinguish between "average score across judges is close to 50%" and "every single judge is close to 50%". I suspect the latter is not true, as pointed out in one of the other comments.

Every judge being close to 50% would be bizarre. If I flip 13 coins 53 times I would expect that many of those sets of 13 will stray from the 6.5/13 expected ratio. The big question is whether anyone scored high enough or low enough that we can say "this wasn't just pure chance".

Yes, I agree, I meant the (unobserved) probability that each judge gets a given question correct (which will of course differ from the observed fraction of the time each judge is correct). But it appears that at least one judge may have done quite well (as gjm points out). I don't think that the analysis done so far provides much evidence about how many judges are doing better than chance. It's possible that there just isn't enough data to make such an inference, but one possible thing you could do is to plot the p-values in ascending order and see how close they come to a straight line.
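A rough sketch of that sorted p-value check might look like the following. It assumes a one-sided binomial p-value per judge and compares each sorted value to the uniform quantile it would roughly match if everyone were guessing. The `scores` list is made-up illustrative data, not the real results, and because binomial p-values are discrete, the fit to a straight line would only ever be approximate.

```python
from math import comb


def p_value(correct: int, n: int) -> float:
    """One-sided p-value: P(at least `correct` of n) under fair-coin guessing."""
    return sum(comb(n, i) for i in range(correct, n + 1)) / 2 ** n


def sorted_p_values_vs_uniform(scores):
    """Pair each sorted p-value with its expected uniform quantile.

    `scores` is a list of (correct, n_judged) tuples, one per judge.
    Under the null, the pairs should fall roughly on the line y = x.
    """
    ps = sorted(p_value(correct, n) for correct, n in scores)
    m = len(ps)
    return [(p, (rank + 1) / (m + 1)) for rank, p in enumerate(ps)]


# HYPOTHETICAL per-judge results, for illustration only.
scores = [(6, 13), (7, 13), (10, 12), (5, 13), (8, 13), (3, 13)]

for observed, expected in sorted_p_values_vs_uniform(scores):
    print(f"observed p = {observed:.3f}   uniform quantile = {expected:.3f}")
```

A cluster of observed p-values sitting well below their quantiles at the low end would hint that some judges beat chance. Points hugging the diagonal would fit the "everyone is guessing" reading.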

I really appreciate the effort that this took and I think this is an interesting and valuable result which will influence my actions in the future.

After reading the first part of your post, I decided that I would have a 55% chance of correctly predicting the positions of Less Wrongers but a 90% chance of correctly predicting the position of an average sample of Americans. I hereby pat myself on the back for my correct calibration on the prior claim, and wonder how far off my latter claim might be. A lot of the arguments that I encounter in daily life (e.g. bar-b-ques where I refrain from eating meat) are pretty crummy.

According to the PM I got, I had the most credible vegetarian entry, and it was ranked as much more credible than my actual (meat-eating) beliefs. I'm not sure how I feel about that.

Impostor entries were generally more convincing than genuine responses. I chalk this up to impostors trying harder to convince judges.

But who knows? Maybe you were a vegetarian in a past life! ;)

Doesn't surprise me, but I lean towards the opposite reasoning. I think the majority of people understand vegetarian/vegan arguments, so the imposters don't have any kind of disadvantage - but vegetarian/vegan people likely think the majority of people don't understand those arguments (or else why wouldn't they arrive at the same conclusions), which results in a miscalibration about how to represent their beliefs to people.

ETA: Likewise the reverse.

On a test this hard you didn't even have an option between 50 and 60%? Weak!

Or rather, allow them to be weaker!

This is basically what I expected with an ITT on this sort of opinion, and is the reason I didn't bother to participate.