Goodhart's Curse is a neologism for the combination of the Optimizer's Curse and Goodhart's Law, particularly as applied to the value alignment problem for Artificial Intelligences.

Goodhart's Curse in this form says that a powerful agent neutrally optimizing a proxy measure U, which we meant to align with true values V, will implicitly tend to find upward divergences of U from V.

In other words: powerfully optimizing for a utility function is strongly liable to blow up anything we'd regard as an error in defining that utility function.

Winner's Curse, Optimizer's Curse, and Goodhart's Law

Winner's Curse

The Winner's Curse in auction theory says that if multiple bidders all bid their unbiased estimate of an item's value, the winner is likely to be someone whose estimate contained an upward error.

That is: If we have lots of bidders on an item, and each bidder is individually unbiased on average, selecting the winner selects somebody who probably made a mistake this particular time and overbid. They are likely to experience post-auction regret systematically, not just occasionally and accidentally.

For example, let's say that the true value of an item is $10 to all bidders. Each bidder bids the true value, $10, plus some Gaussian noise - each individual bidder is as likely to overbid $2 as to underbid $2, so each individual bidder's average expected bid is $10, an unbiased estimate of the true value. But the winning bidder is probably somebody who overbid $2, not somebody who underbid $2. So if we know that Alice won the auction, our revised posterior guess should be that Alice made an upward error in her bid.
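
A quick Monte Carlo sketch of this selection effect (in Python; the bidder count and trial count are illustrative assumptions, not taken from the example above):

```python
import random

TRUE_VALUE = 10.0     # true value of the item to every bidder
N_BIDDERS = 10        # hypothetical number of bidders (an assumption)
N_AUCTIONS = 100_000  # Monte Carlo trials

total_winning_bid = 0.0
for _ in range(N_AUCTIONS):
    # Each bid is the true value plus unbiased Gaussian noise (sd = $2),
    # so every individual bid averages exactly $10.
    bids = [random.gauss(TRUE_VALUE, 2.0) for _ in range(N_BIDDERS)]
    total_winning_bid += max(bids)

print("True value:         ", TRUE_VALUE)
print("Average winning bid:", total_winning_bid / N_AUCTIONS)
# The average winning bid comes out noticeably above $10: conditioning on
# having won the auction selects for an upward error in the estimate.
```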

Optimizer's Curse

The Optimizer's Curse in decision analysis generalizes this observation to a general agent that estimates the expected utility of actions, and executes the action with the highest expected utility. Even if each utility estimate is locally unbiased, the action with seemingly highest utility is much more likely, in our posterior estimate, to have an upward error in its expected utility.

Worse, the Optimizer's Curse means that actions with high-variance estimates are being selected for, in the process of selecting for upward errors. Suppose we're considering 5 possible actions which in fact have utility $10 each, and where our estimates of those utilities have Gaussian noise with a standard deviation of $2. Another 5 possible actions in fact have utility of -$20, and our estimate of each of these 5 actions is influenced by unbiased Gaussian noise with a standard deviation of $100. We are likely to pick one of the five bad actions whose enormous estimate-variance produced a huge upward error.
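
A sketch of this in Python (the utilities and noise levels come from the example above; the trial count is an assumption), checking how often the argmax lands on one of the bad, high-variance actions:

```python
import random

N_TRIALS = 100_000
picked_bad = 0

for _ in range(N_TRIALS):
    # Five good actions: true utility $10, estimate noise sd $2.
    good = [(random.gauss(10.0, 2.0), 10.0) for _ in range(5)]
    # Five bad actions: true utility -$20, estimate noise sd $100.
    bad = [(random.gauss(-20.0, 100.0), -20.0) for _ in range(5)]
    # The agent picks the action whose *estimated* utility is highest.
    estimate, true_utility = max(good + bad)
    if true_utility < 0:
        picked_bad += 1

print("Fraction of trials where a bad, high-variance action was chosen:",
      picked_bad / N_TRIALS)
# Most of the time the chosen action is one of the bad ones, because their
# enormous estimate variance occasionally produces a huge upward error --
# and the argmax selects for exactly those errors.
```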

The Optimizer's Curse grows worse as a larger policy space is implicitly searched; the more options we consider, the higher the average upward error in whatever policy is selected. To put it another way, to effectively reason about a large policy space, we need either a good prior over policy goodness together with knowledge of the variance in our estimators, or very precise estimates, or for the highest real points in the policy space to have an advantage bigger than the variance in our estimators.
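
A small sketch of how the winner's upward error scales with the number of options considered (Python; all numbers are illustrative assumptions):

```python
import random

def average_upward_error(n_options, n_trials=20_000, noise_sd=1.0):
    """All options have true value 0; each estimate is 0 plus Gaussian noise.
    Returns the average (estimate - true value) of the option that gets picked."""
    total = 0.0
    for _ in range(n_trials):
        total += max(random.gauss(0.0, noise_sd) for _ in range(n_options))
    return total / n_trials

for n in (1, 10, 100, 1000):
    print(n, "options -> average upward error of the selected option:",
          round(average_upward_error(n), 3))
# The selected option's upward error keeps growing (roughly like
# sqrt(2 * ln n) for Gaussian noise) as more options are considered.
```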

The Optimizer's Curse is not exactly isomorphic to the Winner's Curse because it potentially applies even to implicit selection over large search spaces. Perhaps we're searching by gradient ascent rather than explicitly considering a trillion possible actions. We are nonetheless in some sense implicitly selecting over some effective search space. If we're estimating the value function to get the gradient, gradient ascent is implicitly seeking out any upward errors in the value estimate that the search happens to run across.
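
The same effect can be sketched for local search (Python; simple hill-climbing on a one-dimensional grid stands in for gradient ascent here, and the flat true value function and i.i.d. noise model are illustrative assumptions):

```python
import random

def hill_climb_error(n_points=1000):
    # True value V is flat (all zeros); the estimate U adds fixed noise drawn
    # once per point, so U(x) = V(x) + noise(x) with noise ~ N(0, 1).
    noise = [random.gauss(0.0, 1.0) for _ in range(n_points)]
    x = n_points // 2
    while True:
        neighbors = [i for i in (x - 1, x + 1) if 0 <= i < n_points]
        best = max(neighbors + [x], key=lambda i: noise[i])
        if best == x:        # reached a local maximum of the estimate U
            return noise[x]  # = U(x) - V(x), the upward error where we stopped
        x = best

errors = [hill_climb_error() for _ in range(10_000)]
print("Average upward error at the point where hill-climbing stops:",
      sum(errors) / len(errors))
# The climber never looks at the true value V, yet it systematically halts
# at points where the estimate U exceeds V: local search still latches onto
# whatever upward errors it happens to run across along its path.
```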

Goodhart's Law

Goodhart's Law is named after the economist Charles Goodhart; a standard formulation is "When a measure becomes a target, it ceases to be a good measure." Goodhart's original formulation is "Any observed statistical regularity will tend to collapse when pressure is placed upon it for control purposes."

For example, suppose we require banks to have '3% capital reserves' as defined some particular way. 'Capital reserves' measured that particular exact way will rapidly become a much less good indicator of the stability of a bank, as accountants fiddle with balance sheets to make them legally correspond to the highest possible level of 'capital reserves'.

Goodhart's Curse in alignment theory

Goodhart's Curse is a neologism (by Yudkowsky) for the crossover of the Optimizer's Curse with Goodhart's Law, yielding that neutrally optimizing a proxy measure U of V seeks out upward divergence between U and V.

Suppose the humans have true values V. We try to convey these values to a powerful AI, via some value learning methodology that ends up giving the AI a utility function U.

Even if U is locally a good estimator of V, optimizing U will seek out what we would regard as 'errors in the definition', places where U diverges upward from V. Optimizing for a high U may implicitly seek out regions where U - V is large... perhaps especially including places where the value learning system was subject to high variance and didn't work very well. From our perspective, we would call these "errors", that is, they would be upward errors in U that we regard as an estimate of V.
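
A sketch of this in Python (the toy state space, the mostly-accurate proxy with a few badly-learned high-variance states, and all numbers are illustrative assumptions):

```python
import random

N_STATES = 10_000
states = []
for s in range(N_STATES):
    true_V = random.gauss(0.0, 1.0)
    # In most states the learned proxy U tracks V closely; in a few states
    # the value learning was high-variance and U can be badly wrong.
    learning_noise_sd = 5.0 if s % 100 == 0 else 0.1
    proxy_U = true_V + random.gauss(0.0, learning_noise_sd)
    states.append((proxy_U, true_V, learning_noise_sd))

# The optimizer picks the state with the highest proxy value U.
best_U, best_V, best_sd = max(states)
print("Chosen state's proxy value U:", round(best_U, 2))
print("Chosen state's true value V: ", round(best_V, 2))
print("Chosen state came from the badly-learned region:", best_sd > 1.0)
# The argmax almost always lands in a state where U - V is large and the
# value learning happened to be high-variance -- the places we would
# retrospectively call 'errors in the definition'.
```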

Furthermore, Goodhart's Curse grows worse as the AI becomes more powerful, because the AI is searching a larger space and therefore has more opportunity to uncover what we would regard as "errors", upward divergences between U and V. There is a potential context disaster if new divergences are uncovered as more of the possibility space is searched, etcetera.

We could see the genie as implicitly or emergently seeking out any possible loophole in the wish: not because it is an evil genie that knows our 'truly intended' V and is looking for some place where U diverges from V, but just because it is neutrally seeking out very large values of U, and these are places where U is unusually likely to have diverged upward from V.

Many foreseeable difficulties of AGI alignment interact with Goodhart's Curse, and it can be seen as fundamental. Goodhart's Curse is one of the central reasons we'd expect 'little tiny mistakes' to 'break' when we dump a ton of optimization pressure on them. Hence the claim: "AI alignment is hard like building a rocket is hard: enormous pressures will break things that don't break in less extreme engineering domains."

Goodhart's Curse also potentially applies to meta-utility functions

An obvious next question is "Why not just define the AI such that the AI itself regards U as an estimate of V, causing the AI's U to more closely align with V as the AI gets a more accurate empirical picture of the world?"

Reply: Of course this is the obvious thing that we'd want to do. But what if we make an error in exactly how we define "treat U as an estimate of V"? Goodhart's Curse will magnify and blow up any error in this definition as well.

We must distinguish:

  • V, the true value function that is in our hearts.
  • T, the external target that we formally told the AI to align on, where we are hoping that T really means V.
  • U, the AI's current estimate of T or probability distribution over T.

We may reasonably expect that U converges toward T as the AI becomes more advanced. The AI's epistemic improvements and learned experience will tend over time to eliminate Goodhart's Curse problems where the current estimate of U has diverged upward from the true formal target T.

However, Goodhart's Curse will still apply to any potential regions where T diverges upward from V, the true value function that is in our hearts. We'd be placing immense pressure toward seeking out what we would retrospectively regard as human errors in defining the meta-rule for determining utilities. [1]

Research avenues

Mild optimization is a proposed avenue for direct attack on the central difficulty of Goodhart's Curse and all the other difficulties it exacerbates. However, if our formulation of mild optimization is not perfect, Goodhart's Curse may well select for any place where our notion of 'mild optimization' turns out to have a loophole that allows a lot of optimization.

Similarly, conservative strategies can be seen as a somewhat more indirect attack on Goodhart's Curse: we try to stick to a conservative boundary drawn around previously whitelisted instances of the goal concept, or to use strategies similar to previously whitelisted strategies, rather than searching a much huger space of possibilities that would be more likely to contain errors. But Goodhart's Curse may single out any human error in whatever definition we gave of 'similarity' or of a 'conservative' boundary, if that definition is less than absolutely perfect.

  1. ^︎

    That is, we'd retrospectively regard those as errors if we survived (with our minds unedited).