# Simultaneous Overconfidence and Underconfidence

*Follow-up to this and this on my personal blog. Prep for this meetup. Cross-posted on my blog.*

Eliezer talked about cognitive bias, statistical bias, and inductive bias in a series of posts, only the first of which made it directly into the LessWrong sequences as currently organized (unless I've missed them!). Inductive bias helps us leap to the right conclusion from the evidence, if it captures good prior assumptions. Statistical bias can be good or bad, depending in part on the bias-variance trade-off. Cognitive bias refers only to obstacles which prevent us from thinking well.

Unfortunately, as we shall see, psychologists can be *quite inconsistent* about how cognitive bias is defined. This created a paradox in the history of cognitive bias research. One well-researched and highly experimentally validated effect was conservatism, the tendency to give estimates too middling, or probabilities too near 50%. This relates especially to integration of information: when given evidence relating to a situation, people tend not to take it fully into account, as if they are stuck with their prior. Another highly validated effect was overconfidence, relating especially to calibration: when people give high subjective probabilities like 99%, they are typically wrong far more often than 1% of the time.

In real-life situations, these two findings contradict each other, because there is no clean distinction between information integration tasks and calibration tasks. A person's subjective probability is always, in some sense, the integration of the information they've been exposed to. In practice, then, when should we expect other people to be under- or overconfident?

## Simultaneous Overconfidence and Underconfidence

The conflict was resolved in an excellent paper by Ido Erev et al., which showed that it's the result of how psychologists did their statistics. Essentially, one group of psychologists defined bias one way, and the other defined it another way. The results are not really contradictory; they are measuring different things. In fact, you can find underconfidence or overconfidence in ***the same data sets*** by applying the different statistical techniques; it has little or nothing to do with the differences between information integration tasks and probability calibration tasks. Here's my rough drawing of the phenomenon (apologies for my hand-drawn illustrations):

Overconfidence here refers to probabilities which are more extreme than they should be, here illustrated as being further from 50%. (This baseline makes sense when choosing from two options, but won't always be the right baseline to think about.) Underconfident subjective probabilities are associated with more extreme objective probabilities, which is why the slope tilts up in the figure. The overconfident line similarly tilts down, indicating that the subjective probabilities are associated with less extreme objective probabilities. Unfortunately, if you don't know how the lines are computed, this means less than you might think. Erev et al. show that these two regression lines can be derived from just one data set. I found the paper easy and fun to read, but I'll explain the phenomenon in a different way here by relating it to the concept of statistical bias and to the tails coming apart.

## The Tails Come Apart

Everyone who has read Why the Tails Come Apart will likely recognize this image:

The idea is that even if X and Y are highly correlated, the most extreme X values and the most extreme Y values will differ. I've labelled the difference the "curse" after the optimizer's curse: if you optimize a criterion X which is merely correlated with the thing Y you actually want, you can expect to be disappointed.

Applying the idea to calibration, we can say that the most extreme subjective beliefs are almost certainly not the most extreme on the objective scale. That is: a person's most confident beliefs are almost certainly overconfident. A belief is not likely to have worked its way up to the highest peak of confidence by merit alone. It's far more likely that some merit and some error in reasoning combined to yield high confidence. This sounds like the calibration literature, which found that people are generally overconfident. What about underconfidence? By a symmetric argument, the points with the most extreme *objective* probabilities are not likely to be the same as those with the highest *subjective* belief; errors in our thinking are much more likely to make us underconfident than overconfident in those cases.

This argument tells us about extreme points, but not about the overall distribution. So, how does this explain simultaneous overconfidence and underconfidence? To understand that, we need to understand the statistics which psychologists used. We'll use averages rather than maximums, leading to a "soft version" which shows the tails coming apart gradually, rather than only at extreme ends.
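The extreme-point argument can be checked with a small simulation. This is only a sketch under assumptions that are not in the original post: X and Y are standard normal with a correlation of 0.9, and "extreme" means the top 200 of 20,000 points. The function names are made up for illustration.

```python
import math
import random


def correlated_pairs(n, rho, seed=0):
    """Draw n (x, y) pairs, each standard normal, with correlation rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((x, y))
    return pairs


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def tails_report(pairs, top):
    """Compare the `top` largest points by x with the `top` largest by y."""
    top_by_x = sorted(pairs, key=lambda p: p[0], reverse=True)[:top]
    top_by_y = sorted(pairs, key=lambda p: p[1], reverse=True)[:top]
    return {
        "mean x of top-x points": mean(x for x, _ in top_by_x),
        "mean y of top-x points": mean(y for _, y in top_by_x),
        "mean y of top-y points": mean(y for _, y in top_by_y),
        "mean x of top-y points": mean(x for x, _ in top_by_y),
        "points in both top sets": len(set(top_by_x) & set(top_by_y)),
    }


report = tails_report(correlated_pairs(n=20_000, rho=0.9), top=200)
for label, value in report.items():
    print(f"{label}: {value:.2f}")
```

If X plays the role of subjective confidence and Y the objective support, the points selected for extreme X should average a less extreme Y, and the points selected for extreme Y should average a less extreme X, even though the two are strongly correlated overall.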

## Statistical Bias

Statistical bias is defined through the notion of an estimator. We have some quantity we want to know, X, and we use an estimator to guess what it might be. The estimator will be some calculation which gives us our estimate, which I will write as X^. An estimator is derived from noisy information, such as a sample drawn at random from a larger population. The difference between the estimator and the true value, X^-X, would ideally be zero; however, this is unrealistic. We expect estimators to have error, but *systematic* error is referred to as bias.

Given a particular value for X, the bias is defined as the *expected* value of X^-X, written E_{X}(X^-X). An *unbiased estimator* is an estimator such that E_{X}(X^-X)=0 for any value of X we choose.

Due to the bias-variance trade-off, unbiased estimators are not the best way to minimize error in general. However, statisticians still love unbiased estimators. It's a nice property to have, and in situations where it works, it has a more objective feel than estimators which use bias to further reduce error.
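A standard textbook illustration of this trade-off, not taken from the original post, is estimating the variance of normally distributed data. Here the unknown X is the true variance (fixed at 1.0), and X^ is computed from a sample of five points. Dividing the sum of squared deviations by n-1 gives an unbiased estimator; dividing by n+1 is biased downward but should have lower mean squared error. The sample size, trial count, and names below are illustrative choices.

```python
import random


def estimator_errors(true_x=1.0, n=5, trials=20_000, seed=0):
    """Collect the error X^ - X for two variance estimators over many samples."""
    rng = random.Random(seed)
    sd = true_x ** 0.5
    errors = {"divide by n-1": [], "divide by n+1": []}
    for _ in range(trials):
        sample = [rng.gauss(0.0, sd) for _ in range(n)]
        centre = sum(sample) / n
        sum_sq = sum((v - centre) ** 2 for v in sample)
        errors["divide by n-1"].append(sum_sq / (n - 1) - true_x)
        errors["divide by n+1"].append(sum_sq / (n + 1) - true_x)
    return errors


def summarize(errs):
    """Return (bias, mean squared error) for a list of errors X^ - X."""
    bias = sum(errs) / len(errs)
    mse = sum(e * e for e in errs) / len(errs)
    return bias, mse


results = {name: summarize(errs) for name, errs in estimator_errors().items()}
for name, (bias, mse) in results.items():
    print(f"{name}: bias {bias:+.3f}, mean squared error {mse:.3f}")
```

The averaged error for the n-1 version should sit near zero, which is what unbiasedness means for a fixed X, while the n+1 version trades a systematic underestimate for a smaller typical error.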

Notice that the definition of bias takes a *fixed X*; that is, it fixes the quantity which we *don't* know. Given a fixed X, the unbiased estimator's average value will equal X. This is a picture of bias which can only be evaluated "from the outside"; that is, from a perspective in which we can fix the unknown X.

A more inside-view approach to statistical estimation is to consider a fixed body of *evidence*, and make the estimator equal the average value of the *unknown* given that evidence. This is exactly inverse to unbiased estimation:

In the image, we want to estimate unknown Y from observed X. The two variables are correlated, just like in the earlier "tails come apart" scenario. The average-Y estimator tilts down because *good estimates tend to be conservative*: because I only have partial information about Y, I want to take into account what I see from X but also pull toward the average value of Y to be safe. On the other hand, *unbiased estimators tend to be overconfident*: the effect of X is exaggerated. For a fixed Y, the average Y^ is supposed to equal Y. However, for fixed Y, the X we will get will lean toward the mean X (just as for a fixed X, we observed that the average Y leans toward the mean Y). Therefore, in order for Y^ to be high enough, it needs to pull up sharply: middling values of X need to give more extreme Y^ estimates.
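The two estimators can be compared numerically. The sketch below assumes, purely for illustration, that X and Y are standard normal with correlation 0.8. Under that assumption the average Y for a given X is 0.8·X, and the linear estimator that is unbiased for every fixed Y is X/0.8. The function names and the choice to look at values near 1.5 are hypothetical.

```python
import math
import random

RHO = 0.8  # assumed correlation between X and Y, chosen only for illustration


def draw(n, rho=RHO, seed=1):
    """Draw n (x, y) pairs, each standard normal, with correlation rho."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.gauss(0.0, 1.0)
        x = rho * y + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        data.append((x, y))
    return data


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def average_y_estimate(x, rho=RHO):
    """Expected Y for this X: slope below 1, so it pulls toward the mean."""
    return rho * x


def unbiased_estimate(x, rho=RHO):
    """Linear estimate whose average equals Y for any fixed Y: slope above 1."""
    return x / rho


data = draw(200_000)

# Outside view: fix the unknown Y near 1.5 and average each estimate.
fixed_y = [(x, y) for x, y in data if 1.4 < y < 1.6]
true_y = mean(y for _, y in fixed_y)
unbiased_at_fixed_y = mean(unbiased_estimate(x) for x, _ in fixed_y)
average_y_at_fixed_y = mean(average_y_estimate(x) for x, _ in fixed_y)

# Inside view: fix the evidence X near 1.5 and average the unknown Y.
fixed_x = [(x, y) for x, y in data if 1.4 < x < 1.6]
actual_y = mean(y for _, y in fixed_x)
unbiased_at_fixed_x = mean(unbiased_estimate(x) for x, _ in fixed_x)
average_y_at_fixed_x = mean(average_y_estimate(x) for x, _ in fixed_x)

print(f"fixed Y: true {true_y:.2f}, unbiased {unbiased_at_fixed_y:.2f}, "
      f"average-Y {average_y_at_fixed_y:.2f}")
print(f"fixed X: actual Y {actual_y:.2f}, unbiased {unbiased_at_fixed_x:.2f}, "
      f"average-Y {average_y_at_fixed_x:.2f}")
```

With Y held fixed, the unbiased estimator should match the true value on average while the average-Y estimator looks too timid. With X held fixed, the roles reverse: the average-Y estimator matches what actually happens, and the unbiased estimator overshoots.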

If we superimpose this on top of the tails-come-apart image, we see that this is something like a generalization:

## Wrapping It All Up

The punchline is that these two different regression lines are exactly what yields simultaneous underconfidence and overconfidence. The studies on conservatism took the *objective* probability as the independent variable, and graphed people's subjective probabilities as a function of that. The natural next step is to take the average subjective probability per fixed objective probability. This will tend to show underconfidence due to the statistics of the situation.

The studies on calibration, on the other hand, took the *subjective* probabilities as the independent variable, graphing the average proportion correct as a function of that. This will tend to show overconfidence, even with *the same data* that shows underconfidence in the other analysis.
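Both analyses can be run on one simulated data set. This is a rough sketch, not the model from the paper and not the Julia simulation mentioned below: it assumes that objective and subjective probabilities are correlated normal variables on a log-odds scale (correlation 0.8, spread 2.0), and that each event then happens with its objective probability. All names and parameter values are illustrative.

```python
import math
import random


def sigmoid(t):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-t))


def simulate_judgements(n=100_000, rho=0.8, spread=2.0, seed=0):
    """Return (objective, subjective, happened) triples for n judgements."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z_obj = rng.gauss(0.0, 1.0)
        z_subj = rho * z_obj + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        objective = sigmoid(spread * z_obj)
        subjective = sigmoid(spread * z_subj)
        happened = rng.random() < objective  # outcome follows the objective chance
        rows.append((objective, subjective, happened))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


rows = simulate_judgements()

# Conservatism-style analysis: select on the objective probability,
# then average the subjective probabilities.
high_objective = [r for r in rows if r[0] >= 0.9]
objective_level = mean(o for o, _, _ in high_objective)
avg_subjective = mean(s for _, s, _ in high_objective)

# Calibration-style analysis: select on the subjective probability,
# then measure how often the event actually happened.
high_subjective = [r for r in rows if r[1] >= 0.9]
stated_confidence = mean(s for _, s, _ in high_subjective)
hit_rate = mean(1.0 if h else 0.0 for _, _, h in high_subjective)

print(f"objective {objective_level:.2f} vs average subjective {avg_subjective:.2f}")
print(f"stated confidence {stated_confidence:.2f} vs hit rate {hit_rate:.2f}")
```

The first comparison should look like underconfidence (subjective probabilities less extreme than the objective ones) and the second like overconfidence (a hit rate below the stated confidence), although both are computed from the same rows.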

From an individual's standpoint, the overconfidence is the real phenomenon. Errors in judgement tend to make us overconfident rather than underconfident *because errors make the tails come apart*: if you select our most confident beliefs, it's a good bet that they have only mediocre support from evidence, even if, generally speaking, our level of belief is highly correlated with how well-supported a claim is. Because the tails come apart gradually, we can expect that the higher our confidence, the larger the gap between that confidence and the level of factual support for that belief.

This is *not* a fixed fact of human cognition pre-ordained by statistics, however. It's merely what happens due to random error. Not all studies show systematic overconfidence, and in a given study, not all subjects will display overconfidence. Random errors in judgement will tend to create overconfidence as a result of the statistical phenomena described above, but systematic correction is still an option.

*I've also written a simple simulation of this. Julia code is here. If you don't have Julia installed or don't want to install it, you can run the code online at JuliaBox.*

## Comments (6)

The images seem to be missing.

If this is a draft, you could have posted it as a draft and reviewed/edited it there. Note that it is possible to comment on a draft.

First line of article currently says:

We also use the word bias when we speak about how funding sources can bias a researcher.

Isn't this just cognitive bias on part of the researcher?

No, the term cognitive bias suggests that someone is picking actions that are not correct according to his utility function. If a person is corrupt and acts in his interests that's no cognitive bias in the sense I understand the term to be used in psychology.

I agree, it's essentially different. It has to do with misaligned goals (AKA perverse incentives), similar to the principal-agent problem and the concept of lost purposes. This is all related to the tails-come-apart phenomenon as well, in that tails-come-apart is like a statistical version of the statement "you cannot serve two masters".