There are two aspects to the Goodhart problem which are often conflated. One is trivially true for all proxy-true utility pairs; but the other is not.
Following this terminology, we'll say that V is the true goal, and U is the proxy. In the range of circumstances we're used to, U ≈ V; that's what makes U a good proxy. Then the Goodhart problem has two aspects to it:
1. Maximising U does not increase V as much as maximising V would.
2. When strongly maximising U, V starts to increase at a slower rate, and ultimately starts decreasing.
Aspect 1 is a tautology: the best way to maximise V is to... maximise V. Hence maximising U is almost certainly less effective at increasing V than maximising V directly.
But aspect 2 is not a tautology, and need not be true for generic proxy-true utility pairs (U, V). For instance, some pairs have the reverse Goodhart problem:
- When strongly maximising U, V starts to increase at a faster rate, and ultimately starts increasing more than twice as fast as U.
Are there utility functions that have anti-Goodhart problems? Yes, many. If (U, V) has a Goodhart problem, then (U, V′) has an anti-Goodhart problem if V′ = 2U − V.
Then in the range of circumstances we're used to, V′ = 2U − V ≈ 2U − U = U ≈ V. And, as V starts growing slower than U, V′ starts growing faster; when V starts decreasing, V′ starts growing more than twice as fast as U.
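The reflection V′ = 2U − V can be checked numerically. The sketch below is a toy model, not something taken from the post: the scalar x (how hard the proxy is being pushed), the linear proxy, and the quadratic true utility are all assumptions chosen only so that V tracks U for small x and eventually turns down.

```python
# Toy model (an assumption for illustration, not from the post):
# x >= 0 is how hard we push on the proxy; x near 0 is the "typical range".

def U(x):
    """Proxy utility: grows linearly with optimisation pressure."""
    return x

def V(x):
    """True utility with a Goodhart problem: tracks U for small x, peaks at x = 2, then falls."""
    return x - x**2 / 4

def V_anti(x):
    """The reflected utility V' = 2U - V (a different utility, not a derivative)."""
    return 2 * U(x) - V(x)

def rate(f, x, h=1e-6):
    """Central-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (0.1, 1.0, 3.0):
    print(f"x={x}: U={U(x):.3f} V={V(x):.3f} V'={V_anti(x):.3f} "
          f"dV/dx={rate(V, x):.2f} dV'/dx={rate(V_anti, x):.2f}")
```

In this toy model V peaks at x = 2; past that point V is decreasing and V′ grows at more than twice the rate of U, which is the reverse Goodhart behaviour described above.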
Are there more natural utility functions that have anti-Goodhart problems? Yes. One example is a total or average utilitarian who maximises the proxy "do the best for the worst off". In general, if V is your true utility and U is a prioritarian/conservative version of V (eg U = −e^(−V) or U = log(V) or other concave, increasing functions of V) then we have reverse Goodhart behaviour[1].
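As a rough illustration of the concave-proxy case, the sketch below assumes the proxy is U = 1 + log(V), which is one possible rescaling of a logarithm so that U ≈ V near the typical value V = 1 (the scaling the footnote asks for). The function names and the choice of V = 1 as "typical" are hypothetical, picked only for this example.

```python
import math

# Assumed toy setup: the typical range is around V = 1, and the proxy is a
# concave, increasing function of V, shifted so that U is close to V there.

def proxy_from_true(v):
    """Conservative proxy U = 1 + log(V); near V = 1 this is approximately V."""
    return 1.0 + math.log(v)

def true_from_proxy(u):
    """Inverse map: the true utility reached when the proxy is pushed to u."""
    return math.exp(u - 1.0)

def dv_du(u):
    """Rate at which V grows per unit of U; the derivative of exp(u - 1)."""
    return math.exp(u - 1.0)

for u in (0.9, 1.0, 1.1, 2.0, 3.0):
    print(f"U={u:.1f}  V={true_from_proxy(u):.3f}  dV/dU={dv_du(u):.3f}")
```

Under these assumptions, V grows more than twice as fast as U once U exceeds 1 + ln 2, so pushing hard on the conservative proxy yields more true utility than the proxy itself suggests.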
So saying that we expect Goodhart problems (in the second sense) means that we know something special about V (and U). It's not a generic problem for all utility functions, but for the ones we expect to correspond to human preferences.
[1] We also need to scale the proxy U so that V ≈ U on the typical range of circumstances; thus the conservatism of U is only visible away from the typical range.
I ended up using mathematical language because I found it really difficult to articulate my intuitions. My intuition told me that something like this had to be true mathematically, but the fact that you don't seem to know about it makes me consider this significantly less likely.
Yes, but V also happens to be very strongly correlated with most U that are equal to V. That's where you do the cheating. Goodhart's law, as I understand it, isn't a claim about any single proxy-goal pair. That would be equivalent to claiming that "there are no statistical regularities, period". Rather, it's a claim about the nature of the set of all potential proxies.
In a Bayesian language, Goodhart's law sets the prior probability of any seemingly good proxy being a good proxy, which is virtually 0. If you have additional evidence, like knowing that your proxy can be expressed in a simple way using your goal, then obviously the probabilities are going to shift.
And that's how your V and V′ are different. In the case of V, the selection of U is arbitrary. In the case of V′, the selection of U isn't arbitrary, because it was already fixed when you selected V′. But again, if you select a seemingly good proxy U′ at random, it won't be an actually good proxy.