Another, speculative point:
If V and V' were my utility function and my friend's, my intuition is that an agent optimizing the wrong one of the two would still act fairly robustly. If true, this may support the theory that Goodhart's curse for AI alignment is to a large extent a problem of defending against adversarial examples by learning robust features similar to human ones. Namely, the robust behavior may come about because my friend and I have learned similar robust, high-level features; we just give them different importance.
On "Conditions for Goodhart's curse": It seems like with AI the curse happens mostly when V is defined in terms of some high-level features of the state, which are normally not easily maximized. I.e., V is something like a neural network where is the state.
Now suppose U' is a neural network that outputs the AI's estimate of these features. The AI can then manipulate the state/input to maximize these estimated features. That's just the standard problem of adversarial examples.
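To make that concrete, here's a toy sketch (the linear estimator standing in for U', the saturating V, and plain gradient ascent are all illustrative assumptions of mine, not anything from the post):

```python
# Toy sketch, not a faithful model: a "true" high-level feature V that is
# bounded, a learned linear estimator U' fit on ordinary states, and an
# optimizer that does gradient ascent on U' in input space. The estimate
# U'(s) climbs without bound while the true feature V(s) cannot exceed 1.
import numpy as np

rng = np.random.default_rng(0)
d = 1000            # high-dimensional input/state space
n_train = 200       # fewer training states than dimensions

def V(s):
    # "True" feature: depends only on a few coordinates and saturates,
    # so it is not easily maximized.
    return np.tanh(s[:10].mean())

# Fit the estimator U' on ordinary (on-distribution) states.
S_train = rng.normal(size=(n_train, d))
y_train = np.array([V(s) for s in S_train]) + 0.01 * rng.normal(size=n_train)
w, *_ = np.linalg.lstsq(S_train, y_train, rcond=None)   # min-norm linear fit

def U_hat(s):
    return s @ w

# "Manipulate the state to maximize the estimated feature":
# gradient ascent on U'; for a linear model the input gradient is just w.
s = rng.normal(size=d)
for _ in range(1000):
    s = s + 5.0 * w

print(f"estimated feature U'(s): {U_hat(s):.2f}")   # grows with every step
print(f"true feature V(s):       {V(s):.2f}")       # bounded above by 1
```

The same gap should appear for any estimator that extrapolates badly off-distribution; the linear fit just makes the input gradient trivial to write down.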
So it seems like the conditions we're looking for are generally met in the common setting where adversarial examples work to maximize some loss function. One requirement there is that the input space is high-dimensional.
So why doesn't the 2D Gaussian example go wrong? There are no high-level features to optimize by using the flexibility of the input space.
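For contrast, here is my reading of the 2D Gaussian case (the correlation of 0.8 and the sample size are my choices, not the post's): the proxy and the true value are just two correlated scalars, and there is no input to manipulate, only a finite pool to select from.

```python
# The proxy U and the true value V are jointly Gaussian with correlation 0.8.
# Selecting the sample with the highest U still leaves V high: there is only
# mild regression toward the mean, and no input space to push into a weird regime.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
cov = [[1.0, 0.8],
       [0.8, 1.0]]
U, V = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n).T

best = np.argmax(U)
print(f"selected U: {U[best]:.2f}")
print(f"V at the selected point: {V[best]:.2f}")
print(f"E[V | U=u] = 0.8 * u = {0.8 * U[best]:.2f}")
```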
On the other hand, you don't need a flexible input space to fall prey to the winner's curse. Instead of exploiting the high flexibility of the input space, you exploit the 'high flexibility' of the noise, provided you have many data points. With enough data, the noise will take any possible value, causing the winner's curse. If you care about a feature that is bounded under the real-world distribution while the noise is unbounded, you will find that the most promising-looking data points are mostly maximizing the noise.
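A quick simulation of that situation (the bounded feature and the Gaussian noise are just illustrative choices):

```python
# Winner's curse with unbounded noise: the true feature is bounded in [0, 1],
# the measurement noise is Gaussian and hence unbounded. With many data points,
# the top-ranked observation owes most of its score to noise, so it looks far
# better than it really is.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true = rng.uniform(0.0, 1.0, size=n)             # bounded real-world feature
observed = true + rng.normal(0.0, 1.0, size=n)   # unbounded measurement noise

best = np.argmax(observed)
print(f"best observed score: {observed[best]:.2f}")   # mostly noise
print(f"its true value:      {true[best]:.2f}")       # at most 1
```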
There's a noise-free (i.e. no measurement errors) variant of the winner's curse that suggests another connection to adversarial examples. If you simply have many data points and pick the one that maximizes some outcome measure, you can conceptualize this as evolutionary optimization in the input space. Usually, adversarial examples are generated by following the gradient in the input space; the winner's curse instead uses evolutionary optimization.
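A sketch of that noise-free variant (the directions, pool size, and scales are toy choices of mine): there is no measurement error at all, only a proxy that partly tracks the true feature, and the only optimization is picking the best-looking candidate out of a large pool.

```python
# Noise-free selection as crude "evolutionary" optimization: V and the proxy U'
# are both exact, but the proxy direction only partly overlaps with the
# direction V cares about. Selecting the candidate with the highest proxy score
# still drives U' well above the V that actually results.
import numpy as np

rng = np.random.default_rng(0)
d = 200

v_dir = rng.normal(size=d)
v_dir /= np.linalg.norm(v_dir)                   # direction the true feature cares about
u_dir = v_dir + 0.3 * rng.normal(size=d)
u_dir /= np.linalg.norm(u_dir)                   # proxy direction, partly overlapping

candidates = 3.0 * rng.normal(size=(20_000, d))  # a large pool of data points
V_scores = candidates @ v_dir                    # true feature (exact, no noise)
U_scores = candidates @ u_dir                    # proxy used for selection

best = np.argmax(U_scores)                       # pick the most promising-looking one
print(f"proxy score of the selected candidate: {U_scores[best]:6.2f}")
print(f"true value of the selected candidate:  {V_scores[best]:6.2f}")
print(f"best true value available in the pool: {V_scores.max():6.2f}")
```

Gradient-following and this kind of selection differ only in how they search the input space; the optimization pressure on the proxy is what they have in common.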
Drake Thomas and I believe we have made progress on this problem here.