Goodhart's Curse is a neologism for the combination of the Optimizer's Curse and Goodhart's Law, particularly as applied to the value alignment problem for Artificial Intelligences.

Goodhart's Curse in this form says that a powerful agent neutrally optimizing a proxy measure U, which we hoped to align with true values V, will implicitly seek out upward divergences of U from V.

In other words: powerfully optimizing for a utility function is strongly liable to blow up anything we'd regard as an error in defining that utility function.

Winner's Curse, Optimizer's Curse, and Goodhart's Law

Winner's Curse

The Winner's Curse in auction theory says that if multiple bidders all bid their unbiased estimate of an item's value, the winner is likely to be someone whose estimate contained an upward error.

That is: If we have lots of bidders on an item, and each bidder is individually unbiased on average, selecting the winner selects somebody who probably made a mistake this particular time and overbid. They are likely to experience post-auction regret systematically, not just occasionally and accidentally.

For example, let's say that the true value of an item is $10 to all bidders. Each bidder bids the true value, $10, plus some Gaussian noise. Each individual bidder is as likely to overbid by $2 as to underbid by $2, so each individual bidder's expected bid is $10; individually, their bid is an unbiased estimator of the true value. But the winning bidder is probably somebody who overbid by $2, not somebody who underbid by $2. So if we know that Alice won the auction, our revised guess should be that Alice made an upward error in her bid.
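
The auction above can be simulated directly. This is a minimal sketch, not a model of any real auction: the noise standard deviation of $2, the number of bidders, and the function name `average_winning_bid` are assumptions made for illustration.

```python
import random

def average_winning_bid(n_bidders, true_value=10.0, noise_sd=2.0,
                        n_auctions=10_000, seed=0):
    """Average highest bid when every bid is true_value plus Gaussian noise.

    noise_sd=2.0 is an assumed value; the text only says bids are noisy.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_auctions):
        bids = [rng.gauss(true_value, noise_sd) for _ in range(n_bidders)]
        total += max(bids)  # the winner is whoever bid highest
    return total / n_auctions

# One bidder: "winning" involves no selection, so the bid stays unbiased.
single = average_winning_bid(n_bidders=1)
# Ten bidders: selecting the maximum selects for upward errors.
crowd = average_winning_bid(n_bidders=10)

print(f"average winning bid, 1 bidder:   {single:.2f}")
print(f"average winning bid, 10 bidders: {crowd:.2f}")
```

With a single bidder the average stays close to the true value of $10; with ten bidders the average winning bid comes out well above it, even though no individual bidder is biased.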

Optimizer's Curse

The Optimizer's Curse in decision analysis generalizes this observation to an agent that estimates the expected utility of actions, and executes the action with the highest expected utility. Even if each utility estimate is locally unbiased, the action with seemingly highest utility is more likely, in our posterior estimate, to have an upward error in its expected utility.

Worse, the Optimizer's Curse means that actions with high-variance estimates are selected for. Suppose we're considering 5 possible actions which in fact have utility $10 each, and our estimates of those 5 utilities are Gaussian-noisy with a standard deviation of $2. Another 5 possible actions in fact have utility of -$20, and our estimate of each of these 5 actions is influenced by unbiased Gaussian noise with a standard deviation of $100. We are likely to pick one of the bad five actions whose enormously uncertain value estimates happened to produce a huge upward error.
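
The ten-action scenario above is small enough to check by simulation. The sketch below uses the utilities and standard deviations from the paragraph; the function names, trial count, and seed are illustrative assumptions.

```python
import random

# (true utility, standard deviation of our estimate) for each action,
# taken from the scenario in the text.
ACTIONS = [(10.0, 2.0)] * 5 + [(-20.0, 100.0)] * 5

def choose_action(rng):
    """Draw one noisy estimate per action and pick the highest estimate."""
    estimates = [(rng.gauss(true_u, sd), true_u) for true_u, sd in ACTIONS]
    best_estimate, true_u = max(estimates)  # compares by estimate first
    return best_estimate, true_u

def run_optimizer(n_trials=10_000, seed=0):
    rng = random.Random(seed)
    results = [choose_action(rng) for _ in range(n_trials)]
    frac_bad = sum(1 for _, true_u in results if true_u < 0) / n_trials
    mean_true = sum(true_u for _, true_u in results) / n_trials
    mean_est = sum(est for est, _ in results) / n_trials
    return frac_bad, mean_true, mean_est

frac_bad, mean_true, mean_est = run_optimizer()
print(f"fraction of trials choosing a -$20 action: {frac_bad:.2f}")
print(f"mean estimated utility of chosen action:   {mean_est:.1f}")
print(f"mean true utility of chosen action:        {mean_true:.1f}")
```

In most trials the chosen action is one of the high-variance bad ones, so the estimated utility of the chosen action looks excellent while its true utility is negative on average.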

The Optimizer's Curse grows worse as a larger policy space is implicitly searched; the more options we consider, the higher the average error in whatever policy is selected. To reason effectively about a large policy space, we need at least one of the following: a good prior over policy goodness together with knowledge of the variance in our estimators; very precise estimates; noise that is mostly correlated across options, with little uncorrelated noise; or an advantage for the highest real points in the policy space that is bigger than the uncertainty in our estimates.
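
The growth of the selected error with the size of the search space can be sketched as follows. As a simplifying assumption, every option has the same true value of 0 and independent Gaussian noise with standard deviation 1, so the highest estimate is exactly the error of the selected option. The function name and option counts are illustrative.

```python
import random

def mean_selected_error(n_options, n_trials=500, noise_sd=1.0, seed=0):
    """Average error of the option with the highest noisy estimate.

    All options are assumed to have true value 0, so the maximum
    estimate equals the upward error of whichever option is selected.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        total += max(rng.gauss(0.0, noise_sd) for _ in range(n_options))
    return total / n_trials

errors = {n: mean_selected_error(n) for n in (10, 100, 1000)}
for n, err in errors.items():
    print(f"{n:>5} options: mean error of selected option = {err:.2f}")
```

The selected error keeps rising as options are added, although no individual estimate has become any less accurate.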

The Optimizer's Curse is not exactly similar to the Winner's Curse because the Optimizer's Curse potentially applies to implicit selection over large search spaces. Perhaps we're searching by gradient ascent rather than explicitly considering each element of an exponentially vast space of possible policies. We are still implicitly selecting over some effective search space, and this method will still seek out upward errors. If we're imperfectly estimating the value function to get the gradient, then gradient ascent is implicitly following and amplifying any upward errors in the estimator.

Goodhart's Law

Goodhart's Law is named after the economist Charles Goodhart. A standard formulation is "When a measure becomes a target, it ceases to be a good measure." Goodhart's original formulation is "Any observed statistical regularity will tend to collapse when pressure is placed upon it for control purposes."

For example, suppose we require banks to have '3% capital reserves' as defined some particular way. 'Capital reserves' measured that particular exact way will rapidly become a much less good indicator of the stability of a bank, as accountants fiddle with balance sheets to make them legally correspond to the highest possible level of 'capital reserves'.

Decades earlier, IBM once paid its programmers per line of code produced. If you pay people per line of code produced, the "total lines of code produced" will have even less correlation with real productivity than it had previously.

Goodhart's Curse in alignment theory

Goodhart's Curse is a neologism (by Yudkowsky) for the crossover of the Optimizer's Curse with Goodhart's Law, yielding that neutrally optimizing a proxy measure U of V seeks out upward divergence of U from V.

Suppose the humans have true values V. We try to convey these values to a powerful AI, via some value learning methodology that ends up giving the AI a utility function U.

Even if U is locally an unbiased estimator of V, optimizing U will seek out what we would regard as 'errors in the definition', places where U diverges upward from V. Optimizing for a high U may implicitly seek out regions where U - V is high; that is, places where V is lower than U. This may especially include regions of the outcome space or policy space where the value learning system was subject to great variance; that is, places where the value learning worked poorly or ran into a snag.
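
A toy version of this claim can be simulated. The sketch assumes, purely for illustration, that each candidate outcome has a true value V drawn from a standard Gaussian and that the proxy is U = V plus independent Gaussian error of the same size; none of these distributions come from the text, and the function names are hypothetical.

```python
import random

def goodhart_trial(rng, n_candidates, v_sd=1.0, error_sd=1.0):
    """Pick the candidate with the highest proxy U and report its true V."""
    candidates = []
    for _ in range(n_candidates):
        v = rng.gauss(0.0, v_sd)           # true value (assumed Gaussian)
        u = v + rng.gauss(0.0, error_sd)   # locally unbiased proxy of V
        candidates.append((u, v))
    chosen_u, chosen_v = max(candidates)   # optimize U, not V
    best_v = max(v for _, v in candidates) # what optimizing V would reach
    return chosen_u, chosen_v, best_v

def run_goodhart(n_candidates=1000, n_trials=300, seed=0):
    rng = random.Random(seed)
    trials = [goodhart_trial(rng, n_candidates) for _ in range(n_trials)]
    means = [sum(t[i] for t in trials) / n_trials for i in range(3)]
    return tuple(means)

mean_u, mean_v, mean_best_v = run_goodhart()
print(f"mean U at the U-optimum:    {mean_u:.2f}")
print(f"mean V at the U-optimum:    {mean_v:.2f}")
print(f"mean of the best available V: {mean_best_v:.2f}")
```

Although U - V averages zero over all candidates, at the U-optimum U sits well above V, and the V actually obtained falls short of the best V that was available.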

Goodhart's Curse would be expected to grow worse as the AI became more powerful. A more powerful AI would be implicitly searching a larger space and would have more opportunity to uncover what we'd regard as "errors"; it would be able to find smaller loopholes and blow up more minor flaws. There is a potential context disaster if new divergences are uncovered as more of the possibility space is searched.

We could see the genie as implicitly or emergently seeking out any possible loophole in the wish: Not because it is an evil genie that knows our 'truly intended' V and is looking for some place that V can be minimized while appearing to satisfy U; but just because the genie is neutrally seeking out very large values of U and these are places where it is unusually likely that U diverged upward from V.

Many foreseeable difficulties of AGI alignment interact with Goodhart's Curse. Goodhart's Curse is one of the central reasons we'd expect 'little tiny mistakes' to 'break' when we dump a ton of optimization pressure on them. Hence the claim: "AI alignment is hard like building a rocket is hard: enormous pressures will break things that don't break in less extreme engineering domains."

Goodhart's Curse also potentially applies to meta-utility functions

An obvious next question is "Why not just define the AI such that the AI itself regards U as an estimate of V, causing the AI's U to more closely align with V as the AI gets a more accurate empirical picture of the world?"

Reply: Of course this is the obvious thing that we'd want to do. But what if we make an error in exactly how we define "treat U as an estimate of V"? Goodhart's Curse will magnify and blow up any error in this definition as well.

We must distinguish:

V, the true value function that is in our hearts.

T, the external target that we formally told the AI to align on, where we are hoping that T really means V.

U, the AI's current estimate of T or probability distribution over T.

We may reasonably expect that U converges toward T as the AI becomes more advanced. The AI's epistemic improvements and learned experience will tend over time to eliminate a subclass of Goodhart's Curse: cases where the current estimate U has diverged upward from T, that is, where the uncertain U was selected to be above the true formal target T over which the AI's uncertainty is defined.

However, Goodhart's Curse will still apply to any potential regions where T diverges upward from V, the true value function that is in our hearts. We'd be placing immense pressure toward seeking out what we would retrospectively regard as human errors in defining the meta-rule for determining utilities. ^[1]

Similarly, there is a Bayesian remedy for the Optimizer's Curse in which we have a prior on the expected utilities and we are more skeptical of very high estimates. But a search over a very wide effective space would be expected to blow up any flaws in this prior--seek out any loopholes in the attempted remedy of Goodhart's Curse.
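
One way to picture both halves of this paragraph is a sketch that shrinks each estimate toward a prior before choosing. It reuses the ten-action scenario from the Optimizer's Curse section; the prior (mean 0, standard deviation 15) and the "mistaken" variance beliefs are hypothetical values chosen for illustration, and the shrinkage formula is the standard posterior mean for a Gaussian prior with Gaussian noise.

```python
import random

TRUE_UTILS = [10.0] * 5 + [-20.0] * 5
NOISE_SDS = [2.0] * 5 + [100.0] * 5
PRIOR_MEAN, PRIOR_SD = 0.0, 15.0  # hypothetical prior, not from the text

def posterior_mean(estimate, believed_sd):
    """Gaussian-prior, Gaussian-noise posterior mean of the true utility."""
    weight = PRIOR_SD**2 / (PRIOR_SD**2 + believed_sd**2)
    return PRIOR_MEAN + weight * (estimate - PRIOR_MEAN)

def fraction_bad(believed_sds, n_trials=10_000, seed=0):
    """How often the action with the highest posterior mean is a -$20 one."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(n_trials):
        scores = [
            posterior_mean(rng.gauss(true_u, sd), believed)
            for true_u, sd, believed in zip(TRUE_UTILS, NOISE_SDS, believed_sds)
        ]
        chosen = scores.index(max(scores))
        if TRUE_UTILS[chosen] < 0:
            bad += 1
    return bad / n_trials

# The agent knows how noisy each estimate really is.
correct = fraction_bad(NOISE_SDS)
# The agent wrongly believes every estimate is precise (a flaw in the remedy).
mistaken = fraction_bad([2.0] * 10)

print(f"bad choices with correct variance beliefs:  {correct:.3f}")
print(f"bad choices with mistaken variance beliefs: {mistaken:.3f}")
```

When the variance beliefs are correct, the wild estimates are discounted heavily and the bad actions are almost never chosen. When the beliefs are wrong in this one respect, the remedy does nothing and the search goes straight back to selecting upward errors.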

This is one reason why 'moral uncertainty' may not be a panacea for AI alignment, especially since the AI must be sufficiently morally certain about some things to sometimes act and produce useful outputs. Goodhart's Curse would tend to seek out regions where this "sufficient level of moral certainty" happened to be, from our perspective, misaligned. (Unless this "moral uncertainty" calculation never underestimated the true difficulty-from-our-standpoint in the AI's estimation of the true V; and this conservative overestimate had not been bypassed at any point by programmers who wanted to have the AI occasionally act rather than it always refusing to act from fear of inestimable catastrophes.)

Research avenues

Mild optimization is a proposed avenue for direct attack on the central difficulty of Goodhart's Curse and all the other difficulties it exacerbates. However, if our formulation of mild optimization is not perfect, Goodhart's Curse may well select for any place where our notion of 'mild optimization' turns out to have a loophole that allows a lot of optimization.

Similarly, conservative strategies can be seen as a somewhat more indirect attack on Goodhart's Curse--we try to stick to a conservative boundary drawn around previously whitelisted instances of the goal concept, or to use strategies similar to previously whitelisted strategies, rather than searching a much larger space of possibilities that would be more likely to contain errors. But Goodhart's Curse may single out any human error in whatever way we defined 'similarity' or what constitutes a 'conservative' boundary, if our definition is less than absolutely perfect.

[1] That is, we'd retrospectively regard those as errors if we survived (with our minds unedited).