Summary

CFAR included 5 questions on the 2012 LW Survey, adapted from the heuristics and biases literature and each based on a different cognitive bias or reasoning error.  LWers, on the whole, showed less bias than is typical in the published research (on all 4 questions where this was testable), but did show clear evidence of bias on 2-3 of those 4 questions.  Further, those with closer ties to the LW community (e.g., those who had read more of the sequences) showed significantly less bias than those with weaker ties (on 3 of the 4-5 questions where that was testable).  These results all held when controlling for measures of intelligence.


METHOD & RESULTS

Being less susceptible to cognitive biases or reasoning errors is one sign of rationality (see the work of Keith Stanovich & his colleagues, for example).  You'd hope that a community dedicated to rationality would be less prone to these biases, so I selected 5 cognitive biases and reasoning errors from the heuristics & biases literature to include on the LW survey.  There are two possible patterns of results which would point in this direction:

  • high scores: LWers show less bias than other populations that have answered these questions (like students at top universities) 
  • correlation with strength of LW exposure: those who have read the sequences (or have been around LW a long time, have high karma, attend meetups, make posts) score better than those who have not. 

The 5 biases were selected in part because they can be tested with everyone answering the same questions; I also preferred biases that haven't been discussed in detail on LW.  On some questions there is a definitive wrong answer and on others there is reason to believe that a bias will tend to lead people towards one answer (so that, even though there might be good reasons for a person to choose that answer, in the aggregate it is evidence of bias if more people choose that answer).

This is only one quick, rough survey.  If the results are as predicted, that could be because LW makes people more rational, or because LW makes people more familiar with the heuristics & biases literature (including how to avoid falling for the standard tricks used to test for biases), or because the people who are attracted to LW are already unusually rational (or just unusually good at avoiding standard biases).  Susceptibility to standard biases is just one angle on rationality.  Etc.

Here are the question-by-question results, in brief.  The next section contains the exact text of the questions, and more detailed explanations.

Question 1 was a disjunctive reasoning task, which had a definitive correct answer.  Only 13% of undergraduates got the answer right in the published paper that I took it from.  46% of LWers got it right, which is much better but still a very high error rate.  Accuracy was 58% for those high in LW exposure vs. 31% for those low in LW exposure.  So for this question, that's: 
1. LWers biased: yes 
2. LWers less biased than others: yes 
3. Less bias with more LW exposure: yes 

Question 2 was a temporal discounting question; in the original paper about half the subjects chose money-now (which reflects a very high discount rate).  Only 8% of LWers did; that did not leave much room for differences among LWers (and there was only a weak & nonsignificant trend in the predicted direction). So for this question: 
1. LWers biased: not really 
2. LWers less biased than others: yes 
3. Less bias with more LW exposure: n/a (or no) 

Question 3 was about the law of large numbers.  Only 22% got it right in Tversky & Kahneman's original paper. 84% of LWers did: 93% of those high in LW exposure, 75% of those low in LW exposure.  So: 
1. LWers biased: a bit 
2. LWers less biased than others: yes 
3. Less bias with more LW exposure: yes 

Question 4 was based on the decoy effect (aka asymmetric dominance, aka the attraction effect), though it was missing a control condition.  I don't have numbers from the original study (and there is no correct answer) so I can't really answer 1 or 2 for this question, but there was a difference based on LW exposure: 57% vs. 44% selecting the less bias-related answer. 
1. LWers biased: n/a 
2. LWers less biased than others: n/a 
3. Less bias with more LW exposure: yes 

Question 5 was an anchoring question.  The original study found an effect (measured by slope) of 0.55 (though it was less transparent about the randomness of the anchor; transparent studies with other questions have found effects around 0.3 on average).  For LWers there was a significant anchoring effect but it was only 0.14 in magnitude, and it did not vary based on LW exposure (there was a weak & nonsignificant trend in the wrong direction). 
1. LWers biased: yes 
2. LWers less biased than others: yes 
3. Less bias with more LW exposure: no 

One thing you might wonder: how much of this is just intelligence?  There were several questions on the survey about performance on IQ tests or SATs.  Controlling for scores on those tests, all of the results about the effects of LW exposure held up nearly as strongly.  Intelligence test scores were also predictive of lower bias, independent of LW exposure, and those two relationships were almost the same in magnitude.  If we extrapolate the relationship between IQ scores and the 5 biases to someone with an IQ of 100 (on either of the 2 IQ measures), they are still less biased than the participants in the original study, which suggests that the "LWers less biased than others" effect is not based solely on IQ.

 

MORE DETAILED RESULTS

There were 5 questions related to strength of membership in the LW community which I standardized and combined into a single composite measure of LW exposure (LW use, sequence reading, time in community, karma, meetup attendance); this was the main predictor variable I used (time per day on LW also seems related, but I found out while analyzing last year's survey that it doesn't hang together with the others or associate the same way with other variables).  I analyzed the results using a continuous measure of LW exposure, but to simplify reporting, I'll give the results below by comparing those in the top third on this measure of LW exposure with those in the bottom third.

There were 5 intelligence-related measures which I combined into a single composite measure of Intelligence (SAT out of 2400, SAT out of 1600, ACT, previously-tested IQ, extra credit IQ test); I used this to control for intelligence and to compare the effects of LW exposure with the effects of Intelligence (for the latter, I did a similar split into thirds).  Sample sizes: 1101 people answered at least one of the CFAR questions; 1099 of those answered at least one LW exposure question and 835 of those answered at least one of the Intelligence questions.  Further details about method available on request.

Here are the results, question by question.

Question 1: Jack is looking at Anne, but Anne is looking at George. Jack is married but George is not. Is a married person looking at an unmarried person?

  • Yes 
  • No 
  • Cannot be determined 

This is a "disjunctive reasoning" question, which means that getting the correct answer requires using "or".  That is, it requires considering multiple scenarios.  In this case, either Anne is married or Anne is unmarried.  If Anne is married then married Anne is looking at unmarried George; if Anne is unmarried then married Jack is looking at unmarried Anne.  So the correct answer is "yes".  A study by Toplak & Stanovich (2002) of students at a large Canadian university found that only 13% correctly answered "yes" while 86% answered "cannot be determined" (2% answered "no").

On this LW survey, 46% of participants correctly answered "yes"; 54% chose "cannot be determined" (and 0.4% said "no").  Further, correct answers were much more common among those high in LW exposure: 58% of those in the top third of LW exposure answered "yes", vs. only 31% of those in the bottom third.  The effect remains nearly as big after controlling for Intelligence (the gap between the top third and the bottom third shrinks from 27% to 24% when Intelligence is included as a covariate).  The effect of LW exposure is very close in magnitude to the effect of Intelligence; 60% of those in the top third in Intelligence answered correctly vs. 37% of those in the bottom third.

original study: 13% 
weakly-tied LWers: 31% 
strongly-tied LWers: 58% 


Question 2: Would you prefer to receive $55 today or $75 in 60 days?

This is a temporal discounting question.  Preferring $55 today implies an extremely (and, for most people, implausibly) high discount rate, is often indicative of a pattern of discounting that involves preference reversals, and is correlated with other biases.  The question was used in a study by Kirby (2009) of undergraduates at Williams College (with a delay of 61 days instead of 60; I took it from a secondary source that said "60" without checking the original), and based on the graph of parameter values in that paper it looks like just under half of participants chose the larger later option of $75 in 61 days.
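For concreteness, here is a rough back-of-the-envelope calculation (mine, not from Kirby's paper) of the annualized rate of return that a money-now chooser turns down:

    # Implied annualized rate of return from waiting for $75 in 60 days instead of
    # taking $55 now (an illustration; the original study used a 61-day delay)
    principal <- 55
    future    <- 75
    days      <- 60
    (future / principal)^(365 / days) - 1   # about 5.6, i.e. roughly a 560% annual rate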

LW survey participants almost uniformly showed a low discount rate: 92% chose $75 in 61 days.  This is near ceiling, which didn't leave much room for differences among LWers.  For LW exposure, top third vs. bottom third was 93% vs. 90%, and this relationship was not statistically significant (p=.15); for Intelligence it was 96% vs. 91% and the relationship was statistically significant (p=.007).  (EDITED: I originally described the Intelligence result as nonsignificant.)

original study: ~47% 
weakly-tied LWers: 90% 
strongly-tied LWers: 93% 


Question 3: A certain town is served by two hospitals. In the larger hospital, about 45 babies are born each day. In the smaller one, about 15 babies are born each day. Although the overall proportion of girls is about 50%, the actual proportion at either hospital may be greater or less on any day. At the end of a year, which hospital will have the greater number of days on which more than 60% of the babies born were girls? 

  • The larger hospital 
  • The smaller hospital 
  • Neither - the number of these days will be about the same 

This is a statistical reasoning question, which requires applying the law of large numbers.  In Tversky & Kahneman's (1974) original paper, only 22% of participants correctly chose the smaller hospital; 57% said "about the same" and 22% chose the larger hospital.

On the LW survey, 84% of people correctly chose the smaller hospital; 15% said "about the same" and only 1% chose the larger hospital.  Further, this was strongly correlated with strength of LW exposure: 93% of those in the top third answered correctly vs. 75% of those in the bottom third.  As with #1, controlling for Intelligence barely changed this gap (shrinking it from 18% to 16%), and the measure of Intelligence produced a similarly sized gap: 90% for the top third vs. 79% for the bottom third.

original study: 22% 
weakly-tied LWers: 75% 
strongly-tied LWers: 93%


Question 4: Imagine that you are a doctor, and one of your patients suffers from migraine headaches that last about 3 hours and involve intense pain, nausea, dizziness, and hyper-sensitivity to bright lights and loud noises. The patient usually needs to lie quietly in a dark room until the headache passes. This patient has a migraine headache about 100 times each year. You are considering three medications that you could prescribe for this patient. The medications have similar side effects, but differ in effectiveness and cost. The patient has a low income and must pay the cost because her insurance plan does not cover any of these medications. Which medication would you be most likely to recommend? 

  • Drug A: reduces the number of headaches per year from 100 to 30. It costs $350 per year. 
  • Drug B: reduces the number of headaches per year from 100 to 50. It costs $100 per year. 
  • Drug C: reduces the number of headaches per year from 100 to 60. It costs $100 per year. 

This question is based on research on the decoy effect (aka "asymmetric dominance" or the "attraction effect").  Drug C is obviously worse than Drug B (it is strictly dominated by it) but it is not obviously worse than Drug A, which tends to make B look more attractive by comparison.  This is normally tested by comparing responses to the three-option question with a control group that gets a two-option question (removing option C), but I cut a corner and only included the three-option question.  The assumption is that more-biased people would make similar choices to unbiased people in the two-option question, and would be more likely to choose Drug B on the three-option question.  The model behind that assumption is that there are various reasons for choosing Drug A and Drug B; the three-option question gives biased people one more reason to choose Drug B but other than that the reasons are the same (on average) for more-biased people and unbiased people (and for the three-option question and the two-option question).

Based on the discussion on the original survey thread, this assumption might not be correct.  Cost-benefit reasoning seems to favor Drug A (and those with more LW exposure or higher intelligence might be more likely to run the numbers).  Part of the problem is that I didn't update the costs for inflation - the original problem appears to be from 1995 which means that the real price difference was over 1.5 times as big then.

I don't know the results from the original study; I found this particular example online (and edited it heavily for length) with a reference to Chapman & Malik (1995), but after looking for that paper I see that it's listed on Chapman's CV as only a "published abstract".

49% of LWers chose Drug A (the one that is more likely for unbiased reasoners), vs. 50% for Drug B (which benefits from the decoy effect) and 1% for Drug C (the decoy).  There was a strong effect of LW exposure: 57% of those in the top third chose Drug A vs. only 44% of those in the bottom third.  Again, this gap remained nearly the same when controlling for Intelligence (shrinking from 14% to 13%), and differences in Intelligence were associated with a similarly sized effect: 59% for the top third vs. 44% for the bottom third.

original study: ?? 
weakly-tied LWers: 44% 
strongly-tied LWers: 57% 


Question 5: Get a random three digit number (000-999) from http://goo.gl/x45un and enter the number here. 

Treat the three digit number that you just wrote down as a length, in feet. Is the height of the tallest redwood tree in the world more or less than the number that you wrote down? 

What is your best guess about the height of the tallest redwood tree in the world (in feet)?


This is an anchoring question; if there are anchoring effects then people's responses will be positively correlated with the random number they were given (and a regression analysis can estimate the size of the effect to compare with published results, which used two groups instead of a random number).

Asking a question with the answer in feet was a mistake which generated a great deal of controversy and discussion.  Dealing with unfamiliar units could interfere with answers in various ways so the safest approach is to look at only the US respondents; I'll also see if there are interaction effects based on country.

The question is from a paper by Jacowitz & Kahneman (1995), who provided anchors of 180 ft. and 1200 ft. to two groups and found mean estimates of 282 ft. and 844 ft., respectively.  One natural way of expressing the strength of an anchoring effect is as a slope (change in estimates divided by change in anchor values), which in this case is 562/1020 = 0.55.  However, that study did not explicitly lead participants through the randomization process like the LW survey did.  The classic Tversky & Kahneman (1974) anchoring question did use an explicit randomization procedure (spinning a wheel of fortune; though it was actually rigged to create two groups) and found a slope of 0.36.  Similarly, several studies by Ariely & colleagues (2003) which used the participant's Social Security number to explicitly randomize the anchor value found slopes averaging about 0.28.

There was a significant anchoring effect among US LWers (n=578), but it was much weaker, with a slope of only 0.14 (p=.0025).  That means that getting a random number that is 100 higher led to estimates that were 14 ft. higher, on average.  LW exposure did not moderate this effect (p=.88); looking at the pattern of results, if anything the anchoring effect was slightly higher among the top third (slope of 0.17) than among the bottom third (slope of 0.09). Intelligence did not moderate the results either (slope of 0.12 for both the top third and bottom third).  It's not relevant to this analysis, but in case you're curious, the median estimate was 350 ft. and the actual answer is 379.3 ft. (115.6 meters).

Among non-US LWers (n=397), the anchoring effect was slightly smaller in magnitude compared with US LWers (slope of 0.08), and not significantly different from the US LWers or from zero.

original study: slope of 0.55 (0.36 and 0.28 in similar studies) 
weakly-tied LWers: slope of 0.09 
strongly-tied LWers: slope of 0.17 


If we break the LW exposure variable down into its 5 components, every one of the five is strongly predictive of lower susceptibility to bias.  We can combine the first four CFAR questions into a composite measure of unbiasedness, by taking the percentage of questions on which a person gave the "correct" answer (the answer suggestive of lower bias).  Each component of LW exposure is correlated with lower bias on that measure, with r ranging from 0.18 (meetup attendance) to 0.23 (LW use), all p < .0001 (time per day on LW is uncorrelated with unbiasedness, r=0.03, p=.39).  For the composite LW exposure variable the correlation is 0.28; another way to express this relationship is that people one standard deviation above average on LW exposure got 75% of CFAR questions "correct" while those one standard deviation below average got 61% "correct".  Alternatively, focusing on sequence-reading, the accuracy rates were:

75%    Nearly all of the Sequences (n = 302) 
70%    About 75% of the Sequences (n = 186) 
67%    About 50% of the Sequences (n = 156) 
64%    About 25% of the Sequences (n = 137) 
64%    Some, but less than 25% (n = 210) 
62%    Know they existed, but never looked at them (n = 19) 
57%    Never even knew they existed until this moment (n = 89) 

Another way to summarize: on 4 of the 5 questions (all but question 4 on the decoy effect) we can make comparisons to the results of previous research, and in all 4 cases LWers were much less susceptible to the bias or reasoning error.  On 1 of the 5 questions (question 2 on temporal discounting) there was a ceiling effect which made it extremely difficult to find differences within LWers; on 3 of the other 4, LWers with a strong connection to the LW community were much less susceptible to the bias or reasoning error than those with weaker ties.


REFERENCES
Ariely, Loewenstein, & Prelec (2003), "Coherent Arbitrariness: Stable demand curves without stable preferences" 
Chapman & Malik (1995), "The attraction effect in prescribing decisions and consumer choice" 
Jacowitz & Kahneman (1995), "Measures of Anchoring in Estimation Tasks" 
Kirby (2009), "One-year temporal stability of delay-discount rates" 
Toplak & Stanovich (2002), "The Domain Specificity and Generality of Disjunctive Reasoning: Searching for a Generalizable Critical Thinking Skill" 
Tversky & Kahneman (1974), "Judgment under Uncertainty: Heuristics and Biases"

COMMENTS

I think the first question was either discussed in the sequences, or in a post sometime a while back. This makes the result for that question far less convincing, although the overall data still definitely shows a correlation.

Maybe there should be a box to check for "I have seen this problem before," so we could toss out those answers.

Maybe there should be a box to check for "I have seen this problem before," so we could toss out those answers.

Good idea. I'll plan on doing that on future surveys.

Pretty sure I skipped the problems I had seen before, FWIW.

As I pointed out on the LW survey discussion thread, the anchoring question was much more closely related to the "what is your height in centimeters?" question than to the random number.

I've redone the comparison with all US responders who also gave a height (n=468). Not only is the correlation much stronger (r = -0.35 versus r = 0.08), but (in a sense) the effect is greater as well: while the random number slope is 0.15 here (+100 random number means +15 feet guessed), the height slope is -11.5 (+1 centimeter of height means -11.5 feet guessed).

I don't know what this says about bias. But interestingly, this effect is almost entirely killed by significant exposure to LW: among the (n=82) of the responders tested above who also had a karma score of at least 500, the correlation between height and estimate is -0.02, which is negligible. Among the (n=137) responders with a karma score of at least 100, the correlation is -0.064, and the slope of the effect is only -2.2 feet/centimeter.

[This comment is no longer endorsed by its author]

I just left a comment about this on the other thread. In brief, this height correlation seems to be driven by a single outlier who listed their own height as 13 cm and the height of the tallest redwood as 10,000 ft.

Well, crap.

Edit: this comment should probably be downvoted to -4 if anyone cares.

I realise I answered Question 1 (the marriage one) incorrectly. This is because I did not think of married/unmarried as exhaustive:

  • unmarried : has never been married
  • married : is married right now
  • divorced : was married, no longer married due to divorce
  • widow : was married, no longer married due to the partner's death

The intent of the question was to consider unmarried = all possibilities other than married.

Is this a language issue? Wikipedia indicates there's at least some controversy on the use of "unmarried"

edit : Ok, no, that's actually controversy on the use of "single".

Did you specifically think at the time "well, if 'married' and 'unmarried' were the only two possibilities, then the answer to the question would be 'yes' -- but Anne could also be divorced or a widow, in which case the answer would be 'no,' so I have to answer 'not enough information'"?

Not accusing you of dishonesty -- if you say you specifically thought of all that, I'll believe you -- but this seems suspiciously like a counter-factual justification, which I say only because I went through such a process. My immediate response on learning that I got the answer wrong was "well, 'unmarried' isn't necessarily coextensive with ~married,'" except then I realized that nothing like this occurred to me when I was actually answering, and that if I had thought in precisely these terms, I would have answered 'yes' and been quite proud of my own cleverness.

Regardless, for any potential future purposes, this problem could be addressed by changing "is a married person looking at an unmarried person?" to "is a married person looking at someone who is not married?" Doesn't seem like there's any reasonable ambiguity with the latter.

I recognise that it might be counter-factual justification. If I had explicitly wondered if "married/unmarried" were or were not exhaustive possibilities, I would have realised that the intent of the question was to treat them as exhaustive possibilities. The actual reasoning as I remember was "Only one of these people is known to be married, they are looking at someone of undetermined marital status". The step from "undetermined marital status" to "either married or unmarried" was not made, and, if you had asked me at the time, I might well have answered "could be divorced or something? .... wait wait of course the intent is to consider married/unmarried as exhaustive possibilities".

I am pretty sure that if the question had been

Three coins are lying on top of each other. The bottom coin lies heads-up, the top coin lies tails-up. Does a heads-up coin lie underneath a tails-up coin?

I would have answered correctly, probably because it pattern-matches in some way to "maths problem", where such reasoning is to be expected (not to say that such reasoning isn't universally applicable).

I want to say that I was thinking this too, but I think that is revisionist history on my part.

The only controversy I could find in that article was 'Some unmarried people object to describing themselves by a simplistic term "single".'

But certainly, the word-problem interpretation was also tested there. If the word-problem interpretation part of the test was harder and more sensitive than the test of disjunctive reasoning, the data would actually tell us about word-problem interpretation, not disjunctive reasoning :P

I meant specifically

Some unmarried people object to describing themselves by a simplistic term "single", and often other options are given, such as "divorced", "widowed", widow or widower, "cohabiting", "civil union", "domestic partnership" and "unmarried partners".

People saying this obviously aren't satisfied with a simple married/unmarried dichotomy.

(in looking this up I wasn't trying to obtain arguments against the way in which the question was posed, I just wanted to know if "unmarried" in English carries different connotations than it does in my native language. )

I just wanted to know if "unmarried" in English carries different connotations than it does in my native language. )

Apparently yes, it does.

People saying this obviously aren't satisfied with a simple married/unmarried dichotomy.

On the contrary, people referred to in that article aren't satisfied with a married/single dichotomy.

Apparently yes, it does.

Indeed. Though different dictionaries give both meanings, the Dutch bureau for statistics uses exclusively the "ongehuwd (literally: unmarried) = has never been married" meaning.

Less bias in the sense that they can answer these kinds of questions well. I'm uncertain how well rationality in these cases correlates with what we really care about, but it is clearly not super great.

These questions test for your ability to think carefully and apply knowledge of specific biases in a situation where you're prompted to "think rationally" and are being scored for not falling for tricks. If you're stuck asking survey questions, I'd ask much more personal questions like "how smart/rational are you compared to the LW distribution" (though there are problems here too).

I've met quite a few people through LW, and some were impressively rational in a very applied-to-real-life fashion. However, there's another group of about similar size that are terribly irrational when it comes to real life - engaging in motivated cognition and blind to that fact because of motivated cognition. Yet these people would probably ace this test, since they're quite good at not being explicitly and obviously wrong (they're good at looking smart). I'd guess this is more of a strength of identity effect than a "effective in real life" effect.

Here's a step-by-step process of how I did my analysis. This should include enough detail to get feedback or for others to do replications.

Step 1: I picked out the 17 relevant variables: 5 regarding LW exposure (LessWrongUse, KarmaScore, Sequences, TimeinCommunity, Meetups), 5 regarding intelligence (IQ, SATscoresoutof1600, SATscoresoutof2400, ACTscoreoutof36, IQTest), 6 from the CFAR questions (CFARQuestion1, CFARQuestion2, CFARQuestion3, CFARQuestion4, CFARQuestion5, CFARQuestion7), and Country.

Step 2: I cleaned up the data

  • removing or fixing non-numerical entries (e.g., on the anchoring question "~500" became 500, "100 Ft" became 100, "100m" became 328, and "we use metric in canada :)" became blank)
  • removing impossible responses (e.g., 1780 on SATscoresoutof1600)

Step 3: I coded categorical variables to make them numerical and transformed skewed continuous variables to make them close to normally distributed:

  • LessWrongUse: coded 1,2,3,4,5, where 1="I lurk, but never registered an account" and 5="I've posted in Main", and treated as a continuous variable 
  • KarmaScore: took ln(karma + 1) 
  • Sequences: coded 1,2,3,4,5,6,7, where 1="Never even knew they existed until this moment" and 7="Nearly all of the Sequences", and treated as a continuous variable 
  • TimeinCommunity: took sqrt 
  • Meetups: coded 0,1, and treated as a continuous variable 

  • CFARQuestion1: coded "Yes" as 1, other options as 0 
  • CFARQuestion2: coded "$75 in 60 days" as 1, other option as 0 
  • CFARQuestion3: coded "The smaller hospital" as 1, other options as 0 
  • CFARQuestion4: coded "Drug A" as 1, other options as 0 
  • CFARQuestion5: I created 2 new versions, ln(anchor+1) and sqrt 
  • CFARQuestion7: I created a new version, taking the sqrt 
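As a rough illustration (the original analysis was done in JMP, so this is not the code that was actually used), the recodings above might look like this in R; the CSV column names are the ones listed in Step 1, the answer strings follow the codings just described, and the new variable names are my own:

    lw <- read.csv("2012.csv")
    # transformed continuous predictors (coercion handles factor/character columns)
    lw$KarmaLog <- log(as.numeric(as.character(lw$KarmaScore)) + 1)
    lw$TimeSqrt <- sqrt(as.numeric(as.character(lw$TimeinCommunity)))
    # 0/1 "correct" codings for the first four CFAR questions
    lw$CFAR1correct <- as.numeric(lw$CFARQuestion1 == "Yes")
    lw$CFAR2correct <- as.numeric(lw$CFARQuestion2 == "$75 in 60 days")
    lw$CFAR3correct <- as.numeric(lw$CFARQuestion3 == "The smaller hospital")
    lw$CFAR4correct <- as.numeric(lw$CFARQuestion4 == "Drug A")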

Step 4: I removed a few outliers and wildly implausible responses on the continuous variables:

  • 3 on IQTest: 3, 18, and 66
  • 2 on height of tallest redwood in ft: 1 and 10,000

Step 5: I created composite scales of LW exposure, Intelligence, and Accuracy Rate.

For LW exposure, I verified that the 5 variables (after transformation) were positively correlated with each other (they were), standardized each variable (giving it mean=0 and stdev=1), and then averaged them together. People with missing data on some of the questions get the average of the questions that they did answer.
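A minimal sketch of that standardize-and-average step (continuing from the recoding sketch above; the coded component names here are placeholders):

    # z-score each transformed component, then average whatever items each respondent answered
    z <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
    components <- data.frame(
      use    = z(lw$LessWrongUseCoded),   # placeholder names for the 1-5 / 1-7 /
      seq    = z(lw$SequencesCoded),      # 0-1 codings described in Step 3
      karma  = z(lw$KarmaLog),
      time   = z(lw$TimeSqrt),
      meetup = z(lw$MeetupsCoded)
    )
    lw$LWexposure <- rowMeans(components, na.rm = TRUE)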

For Intelligence, I verified that the 5 variables were positively correlated with each other (they were) and standardized each variable (giving it mean=0 and stdev=1). I took advantage of the fact that 432 people responded to 2 or more of the questions to put them on the same scale. This probably isn't the best way to do it, but what I did was to run a principal components analysis. All 5 variables (unsurprisingly) loaded onto the same factor, so I saved that factor with imputation and called it "Intelligence". This gave a score to everyone who answered at least one of the 5 questions, and was blank for those who answered none of them.

For Accuracy Rate, I averaged together the scores on CFAR questions 1-4. For each person who answered at least 1 of the questions, this gave the percent of the questions that they got "correct" (out of the ones that they answered).

Step 6: I analyzed the first 4 CFAR questions (and the overall Accuracy Rate). To find the percent of LWers who got the "correct" answer to each, I just looked at the mean. To test if LW exposure predicted "correct" responses, I ran logistic regressions with LW exposure as the independent variable and each question as the dependent variable (for Accuracy Rate, I did a linear regression). To test if Intelligence predicted "correct" responses, I repeated these analyses with Intelligence as the independent variable. To test if these effects were independent, I repeated the analyses as multiple regressions with both LW exposure and Intelligence as IVs.
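In R terms, the Step 6 models are simply the following (a sketch using the composite and 0/1 variables from the earlier sketches, with Intelligence and AccuracyRate as placeholder names for the Step 5 composites, rather than the actual JMP setup):

    # does LW exposure predict a "correct" answer on question 1? (logistic regression)
    summary(glm(CFAR1correct ~ LWexposure, data = lw, family = binomial))
    # same, with Intelligence as the predictor
    summary(glm(CFAR1correct ~ Intelligence, data = lw, family = binomial))
    # are the two effects independent? (both predictors at once)
    summary(glm(CFAR1correct ~ LWexposure + Intelligence, data = lw, family = binomial))
    # Accuracy Rate (the average of questions 1-4) uses ordinary linear regression
    summary(lm(AccuracyRate ~ LWexposure + Intelligence, data = lw))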

Every analysis gave unambiguous results (effects that were highly significant or not close), and to make them more presentable I turned LW exposure into a categorical variable by splitting it into thirds, setting the cutoffs manually (n's of 389-390-389). I calculated the mean on each CFAR question for each group, and reported that (skipping the middle group), verifying that it matched the pattern of the analysis with the continuous variable. I did the same with Intelligence (n's 303-285-289). To calculate the reported "gaps while controlling for Intelligence", I ran multiple regressions with Intelligence as a continuous predictor variable and LW exposure as a categorical variable, and looked at the least squares means for the 3 levels of LW exposure.

Step 7: For the anchoring question, I looked only at participants who selected "United States" as their Country (due to issues with units). To test if there is an anchoring effect I ran a linear regression predicting the answer (question 7) based on the anchor value (question 5). First I looked at which transformations of the variables were closest to normally distributed; for question 5 that fell somewhere in between the original anchor value and sqrt(anchor) and for question 7 it fell in between sqrt(estimate) and ln(estimate+1). So I ran 4 linear regressions using each combination of those variables; they gave very similar results (statistically significant with similar R^2). Then I repeated the analysis with the most easily interpretable version of the variables, the un-transformed ones, and it gave a similar result (with very slightly lower R^2) so I reported that one (since it had the most natural interpretation, a slope which can be compared to previous research).

To test if LW exposure moderated the strength of the anchoring effect, I ran a multiple regression predicting the estimate based on three continuous variables: anchor, LW exposure, and their interaction (anchor x LW exposure). The interaction effect (which would indicate whether strongly-tied LWers showed a weaker anchoring effect) was nonsignificant. I repeated this analysis with the more-normally-distributed transformed variables to be sure, and confirmed the result. I also repeated it with Intelligence in place of LW exposure, and found the same result. For reporting the results, I just ran a simple linear regression predicting estimates from anchor values, separately for the high LW exposure third and the low LW exposure third (and the same with Intelligence).
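A sketch of those Step 7 models (CFARQuestion5 and CFARQuestion7 are the column names from Step 1, assumed to be numeric after the Step 2 clean-up; LWexposure is the composite from the earlier sketch):

    us <- subset(lw, Country == "United States")
    # anchoring effect: slope of redwood estimates on the random anchor value
    summary(lm(CFARQuestion7 ~ CFARQuestion5, data = us))
    # moderation: does the anchor x LW-exposure interaction term reach significance?
    summary(lm(CFARQuestion7 ~ CFARQuestion5 * LWexposure, data = us))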

Step 8: To test if the difference between LWers and the general public is due to intelligence differences, I looked at the relationship between Intelligence and accuracy on each of the first 4 CFAR questions and extrapolated to estimate the accuracy rate of a LWer with an IQ of 100. First, I created a fake participant with a reported IQ of 100 and an IQTest of 100, and made sure they had an imputed score on the "Intelligence" composite. Then I re-ran the regression analyses (from Step 6) predicting "correct" answers for the first 4 CFAR questions based on the continuous Intelligence variable, and saved the predicted values. These are the estimates of the probability that a LWer would get that question "correct", given their Intelligence score. I looked at these values for the fake participant; they were still noticeably larger than the accuracy rates in the original study. I repeated this analysis two more times, using reported IQ and IQTest as the predictor variable instead of using the composite Intelligence score, and got similar results.
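For example, using self-reported IQ alone (the simplest of the three variants just described), the extrapolation looks roughly like this:

    # predicted probability of a "correct" answer on question 1 for a hypothetical
    # respondent with IQ = 100, from a logistic regression on self-reported IQ
    m <- glm(CFAR1correct ~ IQ, data = lw, family = binomial)
    predict(m, newdata = data.frame(IQ = 100), type = "response")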

Step 9: I ran a few other analyses which are not part of the main storyline. I looked separately at the components of the composite predictor variables (LW Exposure & Intelligence) to see if they similarly predicted accuracy rates (they generally did). I looked for interaction effects between LW Exposure & Intelligence in predicting accuracy rates (including LW Exposure, Intelligence, and their interaction LW Exposure x Intelligence as predictor variables) - there were no significant effects. I looked at the anchoring question for the non-US LWers; first running an analysis only on the non-US group which paralleled my analysis on the US group, and then running an analysis with a US vs. non-US variable, anchor value, and their interaction as predictors of the estimate. I also ran a few other analyses in response to comments.

You should provide the source code for your analysis. It annoys me when people present results without the code that created it - if you were writing a new software library, would you just distribute precompiled binaries to everyone?

I don't have source code, since I did my analyses through the GUI of the software package that I learned in my grad school stats classes (JMP). I've been wanting to switch to R, but have been putting it off because of the transition costs.

I can write out a step by step summary of what I did; I'll try to have that posted tomorrow.

I see. Hard to believe that anyone would use a stats package which doesn't provide some sort of reproducible script or set of commands, but I guess if it's what you know...

I'd like to know about correlations between questions: P(correctX|correctY), for all levels of exposure, and correcting for IQ if possible.

I remember having answered the first four questions correctly, so I was about to measure how exceptional I was by multiplying the correct-answer rates for each (assuming strong LW exposure). That would be 58% × 93% × 93% × 57% = 29%, or above the top third. Yay! (After finding out that my IQ is most probably below the median here, I needed the ego boost.)

But I realized just before I actually performed the calculation that I was assuming independence between answers, as if the different biases studied there were not at all correlated. Which, when I think about it, seems to be a quite ridiculous assumption. But we have the data. What does it say?

Starting simpler:

22% of people who answered all 4 questions got all 4 "correct." Although question 4 (decoy effect) doesn't really have a correct answer on the individual level (it's only designed to be informative in the aggregate, like the anchoring question). 39% of those who answered the first 3 got all 3 correct.

Breaking that down by LW exposure:

High LW exposure: 32% got 4/4, 52% got 3/3.
Med LW exposure: 19% got 4/4, 38% got 3/3.
Low LW exposure: 14% got 4/4, 26% got 3/3.

Breaking the high LW exposure group down by Intelligence (keeping in mind that sample sizes are getting smaller here):

High LW exposure and top third Intelligence: 42% got 4/4, 61% got 3/3.
High LW exposure and mid third Intelligence: 29% got 4/4, 56% got 3/3.
High LW exposure and bottom third Intelligence: 27% got 4/4, 50% got 3/3.
High LW exposure and Intelligence left blank: 26% got 4/4, 35% got 3/3.

That should cover the question that motivated you. I could give some further breakdowns but there are a lot of them.

Could someone explain the reasoning behind answer A being the correct choice in Question 4? My analysis was to assume that, since 30 migraines a year is still pretty terrible (for the same reason that the difference in utility between 0 and 1 migraines per year is larger than the difference between 10 and 11), I should treat the question as asking "Which option offers more migraines avoided per unit money?"

Option A: $350 / 70 migraines avoided = $5 per migraine avoided
Option B: $100 / 50 migraines avoided = $2 per migraine avoided

And when I did the numbers in my head I thought it was obvious that the answer should be B. What exactly am I missing that led the upper tiers of LWers to select option A?

You're answering the wrong question. "Which of these fixes more migraines per dollar" is a fast and frugal heuristic, but it doesn't answer the question of which you should purchase.

(If the first three wheels of your new car cost $10 each, but the fourth cost $100, you'd still want to shell out for the fourth, even though it gives you less wheel per dollar.)

In this case, the question you should be asking is, "is it worth another $250 a year to prevent another 20 migraines?", which is overwhelmingly worth it at even a lower-class time-money tradeoff rate. (The inability to do anything for several hours costs them more than $12.50 in outcomes, not to mention the agony.)

I think it's a pretty questionable assumption that the utility difference between 0 and 1 migraines a year is significantly greater than that between 10 and 11. Both are infrequent enough not to be a major disruptor of work, and also infrequent enough that the subject is used to the great majority of their time being non-migraine time.

Headaches avoided per unit money isn't a very good metric; by that measure, a hypothetical medicine D which prevents one headache per year, and costs a dollar, would be superior to medicines A-C. But medicine D leaves the patient nearly as badly off as they were to start with. A patient satisfied with medicine D would probably be satisfied with no medicine at all.

The metric I used to judge between A and B was to question whether, once the patient has already paid $100 to reduce their number of headaches from 100 to 50, they would still be willing to buy a further reduction of 60 hours of headaches at a rate of about 4.16 dollars per headache-hour. My answer was indeterminate, depending on assumptions about income, but I chose "yes" because I would have to assume very strict money constraints before the difference between A and B stops looking like a good deal.

The costs that get paid per prevented migraine in option B are irrelevant. The value of a prevented migraine isn't determined by the price that you pay to prevent a migraine.

The difference between option A and option B is that A prevents 20 additional migraines at a cost of $250. This means $12.50 per migraine. What kind of migraine are we talking about? One that lasts about 3 hours and involves intense pain, nausea, dizziness, and hyper-sensitivity to bright lights and loud noises.

$4.16 per hour of suffering a migraine is lower than minimum wage. Normal minimum wage happens at a time that you can schedule in advance. You can't schedule your migraines in advance. They are likely to happen during the times when you have the most stress.

Being occupied with the migraine, however, isn't the only thing. Intense pain also matters. You don't want people suffering intense pain without good reason. Letting someone else suffer intense pain is morally akin to torture if you are a utilitarian.

The logic behind the question is that there is no correct answer, but Option B is more likely to be reflective of the decoy effect.

Consider these 3 decisions:

Decision 1: 
  • Option A: $350 / 70 migraines avoided 
  • Option B: $100 / 50 migraines avoided 

Decision 2: 
  • Option A: $350 / 70 migraines avoided 
  • Option B: $100 / 50 migraines avoided 
  • Option C: $100 / 40 migraines avoided 

Decision 3: 
  • Option A: $350 / 70 migraines avoided 
  • Option B: $100 / 50 migraines avoided 
  • Option D: $500 / 70 migraines avoided 

There is no right or wrong answer to Decision 1, but if you choose A on Decision 1 then you should also choose A on Decisions 2 & 3 (since A & B are the same options, and C & D are clearly not better options). Similarly, if you choose B on Decision 1, you should choose B on Decisions 2 & 3. So responses to all 3 questions should be the same.

But it turns out that people who get Decision 2 are more likely to choose B than if they'd gotten Decision 1, because the presence of C (which is easily comparable to B, and clearly worse) makes B look better. And people who get Decision 3 are more likely to choose A than if they'd gotten Decision 1, because the presence of D (which is easily comparable to A, and clearly worse) makes A look better. This is called the decoy effect (or the attraction effect, or asymmetric dominance).

The ideal way to test this would be to divide people into three groups, and give each group one of these 3 decisions. But, failing that (with everyone taking the same survey), we can also just give everyone Decision 2 and guess that, if one subset of people is more likely than another to choose B, then they are more susceptible to the decoy effect. (Or, we could just give everyone Decision 3 and guess that, if a subset of people is more likely to choose A, then they are more susceptible to the decoy effect.) It's not a perfect design, but it is evidence.

(It's also possible that the difference arose because people were using the reasoning that Nick_Tarleton describes in his comment, in which case the question was tapping into something different than what it was designed to test.)


Look at it on the margin: A costs $250 more than B, and prevents 20 migraines. That could be a good deal.

There's no reason to look at the ratio for each treatment. Note that doing so would recommend B over A even if A cost 35 cents and B cost 10 cents; that can't be right.

[This comment is no longer endorsed by its author]

I vaguely recall doing math on problem #2, and figuring "$20 in 60 days = $.33 dollar per day = not worth the time it takes to think about it." It looks like most people did some different math; what does that math look like?

On the anchoring question, I recommend putting a "click here to automatically generate a random number" button instead of a link to an external site. I'm pretty sure I read ahead and realized what the number would be used for, and I bet many others did, also.

I think anytime we are given a random number and then are told to give a numerical estimate, it'd be obvious to most LWers that it's testing anchoring bias.

Agree-- my point was that I was able to guess a height before seeing the random number, hence it wasn't a good test.

Ah, okay. That makes more sense.

My math was "do I need any more money than I have or could borrow in the next 60 days, no, ok I'll take the higher amount". I suppose this heuristic fails if the higher amount is higher by less than the interest rate I could earn over 60 days, but short term interest rates are effectively 0 right now.


75>55

Really? Would you prefer $55 now or $75 ten minutes before the heat death of the universe?

(Edit: Point being, of course, that whatever the correct math is, it needs to factor in time somehow...)

(Edit 2: Or maybe that was the joke and I didn't get it...)


75/55 in two months intuitively looks like a huge return, with no work required.

I guess you want to know about at the math in the general case? I'm not familiar with that, sorry.

I agree with your reasoning. I think it's sort of silly to suppose that selecting $55 implying a high discount rate is necessarily less rational. If someone offered you $1.00 today or $1.03 tomorrow, it would be a very strange person who decides that they would prefer the $1.03, even though that's a 4,848,172% rate of return. There are actually disutilities involved in keeping track of future cash inflows, even if small. But almost no one, even the triumphantly rich with very low marginal utilities associated with additional dollars, would prefer $55 million dollars now to $75 million dollars in 60 days. Perhaps drug addicts and the terminally ill would.

Of course, if someone said "Occasionally throughout your life we will surprise you with either $55 dollars, or $75 dollars 60 days later than the days we would have surprised you with $55," it would then be silly to choose the $55 dollar scenario, because in that case there are no disutilities.

Interestingly, the survey data actually doesn't seem to support the notion that those with higher incomes are more likely to take the money now. In fact, the data suggest that the wealthiest are slightly less likely to take the money today. For five quintiles, the P($55 now) is 8.4% for <$7,770, 6.2% for <$25K, 6.8% for <$50K, 6.4% for <$83K, and 5.0% for the top quintile. This is with N=593. This doesn't include any analysis for potentially confounding factors.

I'm puzzled by your use of the word "actually" in the last paragraph. It sounds as if you're saying you'd expect people with higher incomes to choose less money now over more money later. If so, why? I'd expect the reverse, for multiple reasons (and not only because apparently that's what the survey shows).

I was thinking that wealthier people would have a lower level of utility associated with the additional marginal dollars they could get from waiting, so transaction costs and other disutilities would override the utility offered by having a greater number of dollars. I said "actually" only because this thinking is sort of in line with what I was discussing in my first paragraph.

I would think that if our hypothetical wealthy person was rich enough to not care all that much about an additional $20 million, they wouldn't care enough about the initial $55 million to pick the short-term option that netted them less money.

If the utility of the dollars goes down, doesn't the utility loss from transaction costs also go down? (Because if you care less about the money you can worry less about it. Extreme case: If I offer to give you $0 in a week, you don't need to waste any time or effort keeping track of when I'm supposed to pay you.)

People get wealthy by preferring money in the future over getting utility right now.

There's a bunch of good psychology research that associates picking money now instead of more money later with low willpower and thus less earning capacity.

I'd pick the $1.03, so long as it was in the form of an electronic funds transfer and not more pennies to clutter up my pockets. I guess I probably qualify as a very strange person though?

If you get rid of every possible transaction cost, even implicit costs of the inconvenience of dealing with the transaction and currency, then I would agree that many people would take the $1.03. In fact, I would too. It's only that these costs are neglected as not being real that I see a problem with. Utility of money just isn't that simple. For example, some people might actually prefer that the money be handed to them than that it be transferred to their bank account even at the same moment, because if they have found money in-hand, they won't feel guilty about using it to purchase something "for themselves" rather than paying bills. In fact, someone might prefer $50 handed to them in cash over $55 transferred to and immediately available in their bank account. Is that irrational, or are they just satisfying their utility function in light of their cognitive limits to control their feelings of guilt?

In some ways people want it to be answered in an ideal scenario, as if it's a physics problem, but I don't think that's how most people answer questions like this. Most people read the question, imagine a particular scenario (or set of scenarios, maybe), and answer for that scenario. If you want people to answer in an ideal scenario than the question is underspecified. Also, it's not clear to me that you can make it an ideal scenario and retain the effect in question, because the more people look at it like a physics problem with a right answer, the less likely they are to answer in a way that's not in line with their behavior in normal circumstances which you are trying to predict.

I figured that ~36% discount over two months would be way too high.

The survey wasn't timed - so maybe those more "into" the site put more time and effort into answering the questions. So: I don't think much in the way of conclusions about bias can be drawn.

Looking, like before, at the number of missing answers (which seems like an awfully good proxy for how much time one puts into the survey), the people who were right on the first question answered 1 more question, but that small difference doesn't reach statistical significance:

R> lw <- read.csv("2012.csv")
R> lw$MissingAnswers <- apply(lw, 1, function(x) sum(sapply(x, function(y) is.na(y) || as.character(y)==" ")))
R> right <- lw[as.character(lw$CFARQuestion1) == "Yes",]$MissingAnswers
R> wrong <- lw[as.character(lw$CFARQuestion1) == "no" | as.character(lw$CFARQuestion1) == "Cannot be determined",]$MissingAnswers
R> t.test(right, wrong)
        Welch Two Sample t-test

    data:  right and wrong
    t = -1.542, df = 942.5, p-value = 0.1234
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -2.3817  0.2858
    sample estimates:
    mean of x mean of y
        16.69     17.74

(I'm not going to look at the other questions unless someone really wants me to, since the first question is the one that would benefit the most from some extended thought, unlike the others.)

Out of curiosity, I looked at what a more appropriate logistic regression would say (using this guide); given the categorical variable of the question answer, can one predict how many survey entries were missing/omitted (as a proxy for time investment)? The numbers and method are a little different from a t-test, and the result is a little less statistically significant, but as before there's no real relationship*:

R> lw <- read.csv("2012.csv")
R> lw$MissingAnswers <- apply(lw, 1, function(x) sum(sapply(x, function(y) is.na(y) || as.character(y)==" ")))
R> lw <- lw[as.character(lw$CFARQuestion1) != " " & !is.na(as.character(lw$CFARQuestion1)),]
R> lw <- data.frame(lw$CFARQuestion1, lw$MissingAnswers)
R> summary(glm(lw.CFARQuestion1 ~ lw.MissingAnswers, data = lw, family = "binomial"))

Deviance Residuals:
   Min      1Q  Median      3Q     Max
 -1.17   -1.12   -1.05    1.23    1.41

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)        0.00111    0.12214    0.01     0.99
lw.MissingAnswers -0.00900    0.00607   -1.48     0.14

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1366.6  on 989  degrees of freedom
Residual deviance: 1364.4  on 988  degrees of freedom
AIC: 1368

Number of Fisher Scoring iterations: 3

* a note to other analyzers: it's really important to remove null answers/NAs because they'll show relationships all over the place. In this example, if you leave NAs in for the CFARQuestion1 field, you'll wind up getting a very statistically significant relationship - because every CFARQuestion left NA by definition increases MissingAnswers by 1! And people who didn't answer that question probably didn't answer a lot of other questions, so the NA respondents enable a very easy reliable prediction of MissingAnswers...

How do you get this nice box for the code? What's the magic command that you have to tell the Wiki?

Markdown's code syntax is to indent each line by >=4 spaces; LW's implementation is subtly broken since it strips all the internal indentation, and another gotcha is that you can't have any trailing whitespace or lines will be combined in a way you probably don't want.

MediaWiki syntax is entirely different and partially depends on what extensions are enabled.

That seems like a rather post hoc explanation.