Are these cognitive biases, biases?

Kaj_Sotala

Continuing my special report on people who don't think human reasoning is all that bad, I'll now briefly present some studies which claim that phenomena other researchers have considered signs of faulty reasoning aren't actually that. I found these from Gigerenzer (2004), which I in turn found when I went looking for further work done on the Take the Best algorithm.

Before we get to the list - what is Gigerenzer's exact claim when he lists these previous studies? Well, he's saying that minds aren't actually biased, but may make judgments that seem biased in certain environments.

Table 4.1 Twelve examples of phenomena that were first interpreted as "cognitive illusions" but later revalued as reasonable judgments given the environmental structure. [...]

The general argument is that an unbiased mind plus environmental structure (such as unsystematic error, unequal sample sizes, skewed distributions) is sufficient to produce the phenomenon. Note that other factors can also contribute to some of the phenomena. The moral is not that people would never err, but that in order to understand good and bad judgments, one needs to analyze the structure of the problem or of the natural environment.

On to the actual examples. Of the twelve examples referenced, I've included three for now.

The False Consensus Effect

Bias description: People tend to imagine that everyone responds the way they do. They tend to see their own behavior as typical. The tendency to exaggerate how common one’s opinions and behavior are is called the false consensus effect. For example, in one study, subjects were asked to walk around on campus for 30 minutes, wearing a sign board that said "Repent!". Those who agreed to wear the sign estimated that on average 63.5% of their fellow students would also agree, while those who disagreed estimated 23.3% on average.

Counterclaim (Dawes & Mulford, 1996): The correctness of reasoning is not estimated on the basis of whether or not one arrives at the correct result. Instead, we look at whether people reach reasonable conclusions given the data they have. Suppose we ask people to estimate whether an urn contains more blue balls or red balls, after allowing them to draw one ball. If one person first draws a red ball, and another person draws a blue ball, then we should expect them to give different estimates. In the absence of other data, you should treat your own preferences as evidence for the preferences of others. Although the actual mean for people willing to carry a sign saying "Repent!" probably lies somewhere between the two estimates given, these estimates are quite close to the one-third and two-thirds estimates that would arise from a Bayesian analysis with a uniform prior distribution of belief. A study by the authors suggested that people do actually give their own opinion roughly the right amount of weight.
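To make the one-third and two-thirds figures concrete, here is a minimal sketch of the calculation, assuming the only data point is your own decision and the prior over the proportion of agreeing students is uniform. The function name rule_of_succession is my own label, not something taken from the paper.

```python
from fractions import Fraction


def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Posterior mean of a proportion under a uniform Beta(1, 1) prior."""
    return Fraction(successes + 1, trials + 2)


# The single observation is your own choice about wearing the sign.
estimate_if_agreed = rule_of_succession(successes=1, trials=1)
estimate_if_refused = rule_of_succession(successes=0, trials=1)

print(estimate_if_agreed, estimate_if_refused)  # 2/3 1/3
```

The reported estimates of 63.5% and 23.3% can be compared against these two values. The sketch says nothing about how much other knowledge of their fellow students the participants actually had.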

Overconfidence / Underconfidence

Bias description: Present people with binary yes/no questions. Ask them to specify how confident they are, on a scale from .5 to 1, that they got the answer correct. The mean subjective probability x assigned to the correctness of general knowledge items tends to exceed the proportion of correct answers c, x - c > 0; people are overconfident. The hard-easy effect says that people tend to be underconfident on easy questions, and overconfident on hard questions.

Counterclaim (Juslin, Winman & Olsson 2000): The apparent overconfidence and underconfidence effects are caused by a number of statistical phenomena, such as scale-end effects, linear dependency, and regression effects. In particular, the questions in the relevant studies have been selectively drawn in a manner that is unrepresentative of the actual environment, which throws off the participants' estimates of their own accuracy. Define a "representative" item sample as one coming from a study containing explicit statements that (a) a natural environment had been defined and (b) the items had been generated by random sampling of this environment. Define any studies that didn't describe how the items had been chosen, or that explicitly describe a different procedure, as having a "selected" item sample. A survey of several studies contained 95 independent data points with selected item samples and 35 independent data points with representative item samples, where "independence" means different participant samples (i.e. all data points were between subjects).

For studies with selected item samples, the mean subjective probability was .73 and the actual proportion correct was .64, indicating a clear overconfidence effect. However, for studies with representative item samples, the mean subjective probability was .73 and the proportion correct was .72, indicating close to no overconfidence. The over/underconfidence effect of nearly zero for the representative samples was also not a mere consequence of averaging: for the selected item samples, the mean absolute bias was .10, while for the representative item samples it was .03. Once scale-end effects and linear dependency are controlled for, the remaining hard-easy effect is rather modest.
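As a quick check of the arithmetic, the over/underconfidence score from the bias description (x - c) can be computed directly from the four means quoted above. This is only a sketch: the function name is mine, and the mean absolute bias figures (.10 and .03) cannot be reproduced here because they require the per-study data points.

```python
def overconfidence(mean_confidence: float, proportion_correct: float) -> float:
    """Signed over/underconfidence score: positive means overconfident."""
    return mean_confidence - proportion_correct


selected = overconfidence(mean_confidence=0.73, proportion_correct=0.64)
representative = overconfidence(mean_confidence=0.73, proportion_correct=0.72)

print(f"selected items:       {selected:+.2f}")        # +0.09
print(f"representative items: {representative:+.2f}")  # +0.01
```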

What does the "representative" sample mean? If I understood correctly: Imagine that you know that 30% of the people living in a certain city are black, and 70% are white. Next you're presented with questions where you have to guess whether a certain inhabitant of the city is black or white. If you don't have any other information, you know that consistently guessing "white" in every question will get you 70% correct. So when the questionnaire also asks you for your confidence, you say that you're 70% certain for each question.

Now, assuming that the survey questions had been composed by randomly sampling from all the inhabitants of the city (a "representative" sampling), then you would indeed be correct about 70% of the time and be well-calibrated. But assume that instead, all the people the survey asked about live in a certain neighborhood, which happens to be predominantly black (a "selected" sampling). Now you might have only 40% right answers, while you indicated a confidence of 70%, so the researchers behind the survey mark you as overconfident.
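The same example can be written out as a small calculation. This is a sketch under the assumptions of the example only: the respondent always guesses the majority group and always states 70% confidence, and the two item samples are made-up lists with 70% and 40% majority members. The labels "majority" and "minority" are my own stand-ins for the two groups.

```python
def always_guess_majority(sample: list[str]) -> list[bool]:
    """Mark each answer correct if the person belongs to the city-wide majority."""
    return [person == "majority" for person in sample]


def score(correct: list[bool], stated_confidence: float) -> tuple[float, float]:
    """Return (accuracy, overconfidence) for one respondent."""
    accuracy = sum(correct) / len(correct)
    return accuracy, stated_confidence - accuracy


confidence = 0.70  # matches the city-wide base rate

# Items sampled at random from the whole city (hypothetical data).
representative_items = ["majority"] * 70 + ["minority"] * 30
# Items all drawn from one atypical neighborhood (hypothetical data).
selected_items = ["majority"] * 40 + ["minority"] * 60

rep_accuracy, rep_over = score(always_guess_majority(representative_items), confidence)
sel_accuracy, sel_over = score(always_guess_majority(selected_items), confidence)

print(f"representative: accuracy={rep_accuracy:.2f}, overconfidence={rep_over:+.2f}")
print(f"selected:       accuracy={sel_accuracy:.2f}, overconfidence={sel_over:+.2f}")
```

The respondent's reasoning is identical in both cases. Only the way the items were chosen differs, and that alone moves the overconfidence score from zero to about +0.30.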

Availability Bias

Bias description: We estimate the probability of events based on how easily examples of them are recalled, not based on their actual frequency. Tversky & Kahneman conducted a classic study where participants were given five consonants (K, L, N, R, V), and were asked to estimate whether the letter appeared more frequently as the first or the third letter of a word. Each was judged by the participants to occur more frequently as the first letter, even though all five actually occur more frequently as the third letter. This was assumed to be because words starting with a particular letter are more easily recalled than words that have a particular letter in the third position.

Counterclaim (Sedlmeier, Hertwig & Gigerenzer 1998): Not only does the only replication of Tversky & Kahneman's result seem to be a single one-page article, it seems to be contradicted by a number of studies suggesting that memory is often (though not always) excellent in storing the frequency information from various environments. In particular, several authors have documented that participants' judgments of the frequency of letters and words generally show a remarkable sensitivity to the actual frequencies. The one previous study that did try to replicate the classical experiment failed to do so. It used Tversky & Kahneman's five consonants, all more frequent in the third position, and also five other consonants that were more frequent in the first position. All five consonants that appear more often in the first position were judged to do so; three of the five consonants that appear more frequently in the third position were also judged to do so.

The classic article did not specify a mechanism for how the availability heuristic might work. The current authors considered four different mechanisms. Availability by number states that if asked for the proportion in which a certain letter occurs in the first versus in a later position in words, one produces words with this letter in the respective positions and uses the produced proportion as an estimate for the actual proportion. Availability by speed states that one produces single words with the letter in this position, and uses the time ratio of the retrieval times as an estimate of the actual proportion. The letter class hypothesis notes that the original sample was atypical; most consonants (12 of 20) are in fact more frequent in the first position. This hypothesis assumes that people know whether consonants or vowels are more frequent in which position, and default to that knowledge. The regressed frequencies hypothesis assumes that people do actually have a rather good knowledge of the actual frequencies, but that the estimates are regressed towards the mean: low frequencies are overestimated and large frequencies underestimated.
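To illustrate what "regressed towards the mean" implies, here is a minimal sketch using a simple linear shrinkage model. The model, the slope of 0.5 and the center of 0.5 are assumptions I picked for illustration; they are not the parameters the authors fitted.

```python
def regressed_estimate(true_proportion: float,
                       slope: float = 0.5,
                       center: float = 0.5) -> float:
    """Shrink the true proportion toward `center`; slope < 1 means regression."""
    return center + slope * (true_proportion - center)


for true_p in (0.2, 0.5, 0.8):
    estimate = regressed_estimate(true_p)
    print(f"true={true_p:.2f}  estimate={estimate:.2f}  error={estimate - true_p:+.2f}")
```

With any slope between 0 and 1, proportions below the center come out overestimated and proportions above it come out underestimated, while the ordering of the letters is preserved.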

After two studies made to calibrate the predictions of the availability hypotheses, three main studies were conducted. In each, the participants were asked whether a certain letter was more frequent in the first or second position of all German words. They were also asked about the proportions of each letter appearing in the first or second position. Study one was a basic replication of the Tversky & Kahneman study, albeit with more letters. Study two was designed to be favorable to the letter class hypothesis: each participant was only given one letter whose frequency to judge instead of several. It was thought that participants may have switched away from a letter class strategy when presented with multiple consonants and vowels. Study three was designed to be favorable to the availability hypotheses, in that the participants were made to first produce words with the letters O, U, N and R in the first and second position (90 seconds per letter) before proceeding as in study one. Despite two of the studies having been explicitly constructed to be favorable to the other hypotheses, the predictions of the regressed frequency hypothesis had the best match to the actual estimates in all three studies. Thus it seems that people are capable of estimating letter frequencies, although in a regressed form.

The authors propose two different explanations for the discrepancy of results with the classic study. One is that the corpus used by Tversky & Kahneman only covers words at least three letters long, but English has plenty of one- and two-letter words. The participants in the classic study were told to disregard words with fewer than three letters, but it may be that they were unable to properly do so. Alternatively, it may have been caused by the use of an unrepresentative sample of letters: had the authors used only consonants that are more frequent in the second position, then they too would have reported that the frequency of those letters in the first position is overestimated. However, a consideration of all the consonants tested shows that the frequency of those in the first position is actually underestimated. This disagrees with the interpretation by Tversky & Kahneman, and implies a regression effect as the main cause.
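The unrepresentative-sample explanation can be sketched numerically. Everything below is hypothetical: the first-position shares are invented numbers chosen so that most letters are more frequent in the first position, and the shrinkage model is an assumption of mine rather than the authors' fitted model.

```python
from statistics import mean


def regressed_estimate(true_share: float, slope: float = 0.5, center: float = 0.5) -> float:
    """Estimate that is shrunk toward `center` (assumed regression model)."""
    return center + slope * (true_share - center)


# Hypothetical share of each letter's occurrences that fall in the first position.
first_position_share = {
    "letter_1": 0.30, "letter_2": 0.35, "letter_3": 0.60,
    "letter_4": 0.70, "letter_5": 0.75, "letter_6": 0.80,
}


def mean_error(letters) -> float:
    """Mean signed error (estimate minus truth) over the given letters."""
    return mean(regressed_estimate(first_position_share[name]) - first_position_share[name]
                for name in letters)


# Only letters that are rarer in the first position, like the classic five consonants.
selected = [name for name, share in first_position_share.items() if share < 0.5]

print(f"selected letters: {mean_error(selected):+.3f}")              # positive: overestimated
print(f"all letters:      {mean_error(first_position_share):+.3f}")  # negative: underestimated
```

The same regressed estimator looks like it overestimates first-position frequency on the selected subset and underestimates it on the full set, which is the pattern described in the paragraph above.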

EDIT: This result doesn't mean that the availability heuristic would be a myth, of course. It is, AFAIK, true that e.g. biased reporting in the media will throw off people's conceptions of what events are the most likely. But one probably wouldn't be too far from the truth if they said that in that case, the brain is still computing relative frequencies correctly, given the information at hand - it's just that the media reporting is biased. The claim that there are some types of important information for which the mind has particular difficulty assessing relative frequencies correctly, though, doesn't seem to be as supported as is sometimes claimed.

Regarding the "Repent" example: as conformists, human beings are more likely to make particular decisions (like wearing a "Repent" sign) if they believe others would do the same. So instead of framing this study as showing that "sign-wearing volunteers overestimate the probability others would volunteer", one could flip the implied causality and say "people who think others would volunteer are more likely to volunteer themselves", a much more banal claim. One could test the effect by re-running the experiment on self-identified nonconformists, or using behaviors for which conformity is not believed to play a big role. I predict the False Consensus Effect discovered in those settings would be much weaker.

The blue/red ball analogy is good food for thought, but there are way too many differences between it and the "Repent" study for the numerical similarity to be considered anything more than a coincidence. Our approximations of other people's behavior are much more elaborate than making a naive inference based on a sample of one.

In the first example,

A study by the authors suggested that people do actually give their own opinion roughly the right amount of weight.

Does that mean "roughly as much weight as a Bayesian calculation with a uniform prior would"? As if the subjects had never looked at other people before? Doesn't sound very encouraging.

The second example was a bit tricky for me to parse right now. The third one, however, stunned me. So the availability heuristic is a myth? Can our resident experts chime in now, please?

"Roughly the right amount of weight" may have been a miswording on my part - they didn't provide any calculation of what would have been the ideal Bayesian weight to put on your own opinion, as compared to the weight the participants put. However, there was a consensus effect, and the subjects were relatively accurate in predicting how others would behave. I do admit that my grasp of statistics isn't the strongest in the world, so I had to go by what the authors verbally reported.

Judges' beliefs that others responded the same way they do was positively, not negatively, related to accuracy — whether this relationship was evaluated within people across people within items, or across people across items. In addition, in the across people across items analysis (who is more accurate than whom?), optimal weighting of own response was generally positive, which we interpreted as contrary to the assertion that people "overweight" their own response.

As for the third study - well, it depends on how you interpret the availability heuristic. It is, AFAIK, true that e.g. biased reporting in the media will throw off people's conceptions of what events are the most likely. But one probably wouldn't be too far from the truth if they said that in that case, the brain is still computing relative frequencies correctly, given the information at hand - it's just that the media reporting is biased. The claim that there are some types of important information for which the mind has particular difficulty assessing relative frequencies correctly, though, doesn't seem to be as supported as is sometimes claimed.

No need to be scared of statistics! This part:

In the absence of other data, you should treat your own preferences as evidence for the preferences of others... the one-third and two-thirds estimates that would arise from a Bayesian analysis with a uniform prior distribution of belief

refers to the rule of succession.

Oh, I did get that part. The bit I didn't entirely follow was when the authors had a longer discussion of different calculated phi values regarding the connection between the measured consensus effect and the participants' accuracy in the study. For one, I didn't recognize the term "phi" - Wikipedia implied that it might be the result of a chi-square test, which we did cover in the statistics 101 course I've taken, but it might have been too long ago as I'm not sure of how exactly that test applies in this case or how the phi value should be interpreted.

The whole "Repent!" sign experiment strikes me as very strange. Repent, Sinners? Repent, Deniers?

Knowing where and when the question was posed would be hugely informative. On a religious-affiliated campus, one would expect more people who buy into repenting for sins, therefore justifying the higher estimate.

Not the point of the post, I know, but this experiment by itself is poking at my brain.

For example, in one study, subjects were asked to walk around on campus for 30 minutes, wearing a sign board that said "Repent!". Those who agreed estimated that on average 63.5% of their fellow students would also agree, while those who disagreed estimated 23.3% on average.

Huh? I think you're missing a sentence. Agreed with what? Repenting?

Agreed to walk around on campus with a repent sign, presumably.

Indeed. Edited the post to clarify.

Regarding availability, it had always seemed a bit strange to me that people would estimate words that start with those letters as more frequent than words with them in the third position. A list of rhyming, similar sounding words (vane, mane, cane, lane, line, fine, sine, mine, bone, cone, hone, lone, for example) seems at the very least just as easily recalled, just as available as a list of words with the same starting letter. Maybe this is just a poor test of the heuristic (I believe there are several other demonstrations of it in JUU).

The regressed frequencies hypothesis assumes that people do actually have a rather good knowledge of the actual frequencies, but that the estimates are regressed towards the mean: low frequencies are underestimated and large frequencies overestimated.

Should this read low frequencies are overestimated and large frequencies underestimated?

...yes, it should. Thank you, fixed.

In some ways it doesn't even matter if we are biased and suffer for it.

It might be that our brains only have enough processing power to keep us from seriously disadvantaging ourselves by causing driving accidents, tripping over, stabbing ourselves with knives or committing social faux pas. Anything left over for solving psychologists' games has to be considered a bonus.

E.T. Jaynes has also argued in this direction although I can't give you the exact location right now, but it is in Probability Theory: The Logic Of Science.

I tend to think of the availability bias more in the sense that our thoughts and actions are constrained by the familiar or by individual examples. When our awareness of some particular thing is raised, we tend to view that thing as more important or more probable, even though this isn't necessarily justified in the modern world. One example would be the Columbine shooting; while there was no change in technical feasibility (if anything, it got harder), school shootings became much more common, likely in part because people realized, "Hey, I could do that!" Or when someone quits smoking because a relative gets lung cancer - they already knew there was a risk, but now that they know what lung cancer is, they actually realize they should do something. Other (hypothetical) examples abound.

Estimating letter position in one's native language does seem like a very limited application of this heuristic; also, in the case of the 5-letter study, adding additional letters may improve judgement, since while you're thinking of all those other letters, you think of words with some of the letters in the third position. Or, you just think, "Probably about half are first letters and half are third letters; these five seem more like first letters, so the rest are probably third letters." But the availability heuristic really seems to extend well beyond this limited application, so arguing its non-existence through this evidence is unconvincing, at least for me. As one can see from the Wikipedia article, there's a bit more evidence for it than just letter-position studies.

As I replied to cousin_it below (and have now edited to the article), yes, this certainly doesn't mean that the availability heuristic would be nonexistent.

Define a "representative" item sample as one coming from a study containing explicit statements that (a) a natural environment had been defined and (b) the items had been generated by random sampling of this environment.

Can you elaborate on what this actually means in practice? It doesn't make much sense to me, and the paper you linked to is behind a paywall.

(It doesn't make much sense because I don't see how you could rigorously distinguish between a "natural" or "unnatural" environment for human decision-making. But maybe they're just looking for cases where experimenters at least tried, even without rigor?)

If I understood the paper correctly, the following situation would be analogous. (I'll have to recheck it tomorrow to make sure this example does match what they're actually saying - it's too late here for me to do it now.)

Imagine that you know that 30% of the people living in a certain city are black, and 70% are white. Next you're presented with questions where you have to guess whether a certain inhabitant of the city is black or white. If you don't have any other information, you know that consistently guessing "white" in every question will get you 70% correct. So when the questionnaire also asks you for your confidence, you say that you're 70% certain for each question.

Now, assuming that the survey questions had been composed by randomly sampling from all the inhabitants of the city (a "representative" sampling), then you would indeed be correct about 70% of the time and be well-calibrated. But assume that instead, all the people the survey asked about live in a certain neighborhood, which happens to be predominantly black (a "selected" sampling). Now you might have only 40% right answers, while you indicated a confidence of 70%, so the researchers behind the survey mark you as overconfident.

Of course, in practice this is a bit more complicated as people don't only use the ecological base rate but also other information that they happen to have at hand, but since the other information acts to modify their starting base rate (the prior), the same logic still applies.

This result doesn't mean that the availability heuristic would be a myth, of course. It is, AFAIK, true that e.g. biased reporting in the media will throw off people's conceptions of what events are the most likely. But one probably wouldn't be too far from the truth if they said that in that case, the brain is still computing relative frequencies correctly, given the information at hand - it's just that the media reporting is biased.

It doesn't seem fair to say that humans are, given the information, arriving at justifiable conclusions, when humans are also the ones giving the information. "Humans believe the right things given the information," doesn't seem as persuasive when you attach "Humans skew information when passing it to make it seem more likely than it is."

As someone who is just learning about how incredibly biased I can be... even when I am trying my hardest not to be biased... I love this stuff.

Of course, the one thing that I have learned is best for eliminating bias is to have yourself a selection of people from which one may poll for an answer. I guess that one would need to make certain that the correct people were chosen???

Well, he's saying that minds aren't actually biased, but may make judgments that seem biased in certain environments.

I'm guessing that if "certain environments" are significant enough to the survival of the brain, the claim that the brain is biased will still be valid.

I wouldn't be surprised at all if it turns out that biases are in fact moving targets; human brains may have mechanisms for correcting for biases over periods of time that are slower than the current rate of change, but much faster than evolution (say, on the order of generations).

I'm guessing that if "certain environments" are significant enough to the survival of the brain, the claim that the brain is biased will still be valid.

Certainly. But then, all learning algorithms are biased.

Although the actual mean for people willing to carry a sign saying "Repent!" probably lies somewhere in between of the estimates given, these estimates are quite close to the one-third and two-thirds estimates that would arise from a Bayesian analysis with a uniform prior distribution of belief.

Huh? (Many more assumptions missing.)
