The Optimizer's Curse and How to Beat It

lukeprog

16th Sep 2011

The best laid schemes of mice and men
Go often askew,
And leave us nothing but grief and pain,
For promised joy!

- Robert Burns (translated)

Consider the following question:

A team of decision analysts has just presented the results of a complex analysis to the executive responsible for making the decision. The analysts recommend making an innovative investment and claim that, although the investment is not without risks, it has a large positive expected net present value... While the analysis seems fair and unbiased, she can’t help but feel a bit skeptical. Is her skepticism justified?¹

Or, suppose Holden Karnofsky of charity-evaluator GiveWell has been presented with a complex analysis of why an intervention that reduces existential risks from artificial intelligence has astronomical expected value and is therefore the type of intervention that should receive marginal philanthropic dollars. Holden feels skeptical about this 'explicit estimated expected value' approach; is his skepticism justified?

Suppose you're a business executive considering n alternatives whose 'true' expected values are μ₁, ..., μ_n. By 'true' expected value I mean the expected value you would calculate if you could devote unlimited time, money, and computational resources to making the expected value calculation.² But you only have three months and $50,000 with which to produce the estimate, and this limited study produces estimated expected values for the alternatives V₁, ..., V_n.

Of course, you choose the alternative i* that has the highest estimated expected value V_i*. You implement the chosen alternative, and get the realized value x_i*.

Let's call the difference x_i* - V_i* the 'postdecision surprise'.³ A positive surprise means your option brought about more value than your analysis predicted; a negative surprise means you were disappointed.

Assume, too kindly, that your estimates are unbiased, and suppose you use this decision procedure many times, for many different decisions. It seems reasonable to expect that on average you will receive the estimated expected value of each decision you make in this way. Sometimes you'll be positively surprised, sometimes negatively surprised, but on average you should get the estimated expected value for each decision.

Alas, this is not so; on average your outcome will be worse than what you predicted, even if your estimates were unbiased!

Why?

...consider a decision problem in which there are k choices, each of which has true estimated [expected value] of 0. Suppose that the error in each [expected value] estimate has zero mean and standard deviation of 1, shown as the bold curve [below]. Now, as we actually start to generate the estimates, some of the errors will be negative (pessimistic) and some will be positive (optimistic). Because we select the action with the highest [expected value] estimate, we are obviously favoring overly optimistic estimates, and that is the source of the bias... The curve in [the figure below] for k = 3 has a mean around 0.85, so the average disappointment will be about 85% of the standard deviation in [expected value] estimates. With more choices, extremely optimistic estimates are more likely to arise: for k = 30, the disappointment will be around twice the standard deviation in the estimates.⁴

This is "the optimizer's curse." See Smith & Winkler (2006) for the proof.
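The numbers in the quoted passage can be checked with a quick simulation. What follows is a minimal sketch, not anything from Russell & Norvig or Smith & Winkler: the function name `mean_disappointment`, the trial count, and the choice of Gaussian errors are my own assumptions for illustration. Every option has a true value of 0, each estimate is that value plus zero-mean noise with standard deviation 1, and we record how far the winning estimate sits above the truth.

```python
import random
import statistics


def mean_disappointment(k, trials=20000, sigma=1.0, seed=0):
    """Average amount by which the highest of k estimates overshoots the truth.

    Assumed setup (illustrative only): every option has true value 0, and each
    estimate is the true value plus independent Gaussian noise with standard
    deviation sigma, so each individual estimate is unbiased.
    """
    rng = random.Random(seed)
    true_value = 0.0
    gaps = []
    for _ in range(trials):
        estimates = [rng.gauss(true_value, sigma) for _ in range(k)]
        # We always pick the option with the highest estimate, so the
        # disappointment is that estimate minus the true value.
        gaps.append(max(estimates) - true_value)
    return statistics.fmean(gaps)


if __name__ == "__main__":
    for k in (1, 3, 30):
        print(k, round(mean_disappointment(k), 2))
```

With a single option the average gap should hover near zero, since nothing is being selected. With k = 3 and k = 30 it should land close to the 0.85 and roughly 2 standard deviations mentioned in the quote, even though no individual estimate is biased.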

The Solution

The solution to the optimizer's curse is rather straightforward.

...[we] model the uncertainty in the value estimates explicitly and use Bayesian methods to interpret these value estimates. Specifically, we assign a prior distribution on the vector of true values μ = (μ₁, ..., μ_n) and describe the accuracy of the value estimates V = (V₁, ..., V_n) by a conditional distribution V|μ. Then, rather than ranking alternatives based on the value estimates, after we have done the decision analysis and observed the value estimates V, we use Bayes’ rule to determine the posterior distribution for μ|V and rank and choose among alternatives based on the posterior means...

The key to overcoming the optimizer’s curse is conceptually very simple: treat the results of the analysis as uncertain and combine these results with prior estimates of value using Bayes’ rule before choosing an alternative. This process formally recognizes the uncertainty in value estimates and corrects for the bias that is built into the optimization process by adjusting high estimated values downward. To adjust values properly, we need to understand the degree of uncertainty in these estimates and in the true values.⁵
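To make the adjustment concrete, here is a rough sketch under the simplest assumptions I can think of; it is not Smith & Winkler's general model. I assume each true value μ_i is drawn independently from a normal prior, and each estimate V_i is μ_i plus independent normal noise with a known standard deviation. In that case the posterior mean is the prior mean plus a fraction of the gap between the estimate and the prior mean, and the fraction shrinks as the estimate gets noisier. The function names, the particular noise levels, and the use of the true value μ as a stand-in for the realized value are all assumptions for illustration.

```python
import random
import statistics


def posterior_mean(v, sigma, prior_mean, prior_sd):
    """Posterior mean of mu given estimate v (normal prior, normal noise).

    weight is the share of trust given to the estimate: close to 1 for a
    precise estimate, close to 0 for a very noisy one.
    """
    weight = prior_sd**2 / (prior_sd**2 + sigma**2)
    return prior_mean + weight * (v - prior_mean)


def average_surprise(sigmas, prior_mean=0.0, prior_sd=1.0, trials=20000, seed=0):
    """Return (naive, bayes): mean of (true value - predicted value) for the pick.

    naive ranks options by the raw estimates and predicts the raw estimate;
    bayes ranks by posterior means and predicts the posterior mean.
    The true value mu stands in for the realized value (an assumption).
    """
    rng = random.Random(seed)
    naive, bayes = [], []
    for _ in range(trials):
        mus = [rng.gauss(prior_mean, prior_sd) for _ in sigmas]
        vs = [rng.gauss(mu, s) for mu, s in zip(mus, sigmas)]
        posts = [posterior_mean(v, s, prior_mean, prior_sd)
                 for v, s in zip(vs, sigmas)]
        i = max(range(len(vs)), key=lambda n: vs[n])        # naive choice
        j = max(range(len(posts)), key=lambda n: posts[n])  # Bayesian choice
        naive.append(mus[i] - vs[i])
        bayes.append(mus[j] - posts[j])
    return statistics.fmean(naive), statistics.fmean(bayes)


if __name__ == "__main__":
    # Three options whose estimates have different (assumed) noise levels.
    print(average_surprise([0.5, 1.0, 2.0]))
```

Under these assumptions the naive average surprise should come out clearly negative, while the Bayesian one should sit near zero. Note also that the two procedures can pick different options: a high but very noisy estimate gets pulled most of the way back toward the prior, so it may lose to a slightly lower but more precise one.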

To return to our original question: Yes, some skepticism is justified when considering the option before you with the highest expected value. To minimize your prediction error, treat the results of your decision analysis as uncertain and use Bayes' Theorem to combine its results with an appropriate prior.

Notes

¹ Smith & Winkler (2006).

² Lindley et al. (1979) and Lindley (1986) talk about 'true' expected values in this way.

³ Following Harrison & March (1984).

⁴ Quote and (adapted) image from Russell & Norvig (2009), pp. 618-619.

⁵ Smith & Winkler (2006).

References

Harrison & March (1984). Decision making and postdecision surprises. Administrative Science Quarterly, 29: 26–42.

Lindley, Tversky, & Brown (1979). On the reconciliation of probability assessments. Journal of the Royal Statistical Society, Series A, 142: 146–180.

Lindley (1986). The reconciliation of decision analyses. Operations Research, 34: 289–295.

Russell & Norvig (2009). Artificial Intelligence: A Modern Approach, Third Edition. Prentice Hall.

Smith & Winkler (2006). The optimizer's curse: Skepticism and postdecision surprise in decision analysis. Management Science, 52: 311–322.

Comments

Brickman

I'm trying to figure out why, from the rules you gave at the start, we can assume that box 60 has more noise than the other boxes with variance of 20. You didn't, at the outset of the problem, say anything about what the values in the boxes actually were. I would not, taking this experiment, have been surprised to see a box labeled "200", with a variance of 20, because the rules didn't say anything about values being close to 50, just close to A. Well, I would've been surprised with you as a test-giver, but it wouldn't have violated what I understood the rules to be and I wouldn't have any reason to doubt that box was the right choice.

The box with 60 stands out among the boxes with high variance, but you did not say that those boxes were generated with the same algorithm and thus have the same actual value. In fact you implied the opposite. You just told me that 60 was an estimate of its expected value, and 37 was an estimate of one of the other boxes' expected values. So I would assign a very high probability to it being worth more than the box labeled 37. I understand that the variance is being effectively applied twice to go from the number on the box to the real number of coins (a real number of 45 could make an estimate anywhere from 25 to 65, but if it hit 25 I'd be assigning the real number a lower bound of 5 and if it hit 65 I'd be assigning the real number an upper bound of 85, which is twice that range). (Actually, for that reason I'm not sure your algorithm really means there's a variance of 20 from what you state the expected value to be, but I don't feel like doing all the math to verify that, since it's tangential to the message I'm hearing from you or what I'm saying.) But that doesn't change the average, or keep the range of values that my box labeled 60 could really contain from being higher than the range the box labeled 37 could really contain, to the best of my knowledge. Both are most likely to fall within a couple coins of the center of that range, with the highest probability concentrated on the exact number.

If the boxes really did contain different numbers of coins, or we just didn't have reason to assume that they don't contain different numbers, the box labeled 60 is likely to contain more coins than that 50/1 box did. It is also capable of undershooting 50 by ten times as much if unlucky, so if for some reason I absolutely cannot afford to find less than 50 coins in my box the 50/1 box is the safer choice--but if I bet on the 60/20 box 100 times and you bet on the 50/1 box 100 times, given the rules you set out in the beginning, I would walk away with 20% more money.

Or am I missing some key factor here? Did I misinterpret the lesson?

Manfred

Or am I missing some key factor here? Did I misinterpret the lesson?

The key factor is that the 60,20 box is not in isolation - it is the top box, and so not only do you expect it to have more "signal" (gold) than average, you also expect it to have more noise than average.

You can think of the numbers on the boxes as drawn from a probability distribution. If there was 0 noise, this probability distribution would just be how the gold in the boxes was distributed. But if you add noise, it's like adding two probability distributions together. I...
