One other issue, which I'm not sure you've touched on, is the fact that variables in the real world are rarely completely independent. That is to say, increasing a variable, say V(1), may in fact lower other variables V(2)...V(N), including one of the 5 "important" variables that are highly weighted. For example, if I value both a clean environment and maximizing my civilization's energy production, I have to balance the fact that maximizing energy production might involve strip-mining a forest or two, lowering the amount of clean environment available to the people.
Secondly, how does this model deal with adversarial agents? One of the reasons that Goodhart's Law is so pervasive in the real world is that the systems it applies to often have an adversarial component. That is to say, there are agents who notice that you are pouring energy into V. In the past, all of this energy would have gone straight into U, but now that agents realize that there is a surplus of energy, they divert some of it to their own ends, reducing or even eliminating the total surplus that goes into U.
Finally, how well does this model deal with the fact that human values might change over time? If the set of 100 things the humans care about changes over time, how does that affect the expectation calculation?
>variables in the real world are rarely completely independent
To some extent, the diminishing returns on investing the agent's "budget" capture this non-independence dynamic (increasing one variable must reduce some other, because there is less budget to go around). More complicated trade-offs seem to be modellable in a similar way.
>Secondly, how does this model deal with adversarial agents?
It doesn't, not really.
>Finally, how well does this model deal with the fact that human values might change over time?
It doesn't; those are more advanced considerations; see eg https://www.lesswrong.com/posts/Y2LhX3925RodndwpC/resolving-human-values-completely-and-adequately
This is an important point, and a useful approach - but I think it addresses only some forms of Goodhart-errors. Specifically, it doesn't address regime change, where the relationship to known important variables changes after optimization, or causal errors, where the actions taken have perverse effects on variables known to be important. (And multiparty Goodhart effects are partly included, but I'm still working on figuring out how to formalize and address them more clearly.)
>where the relationship to known important variables changes after optimization
If you expect that to happen, and include it in the utility, then you'd start getting more conservative optimisation behaviour.
>Another important challenge is to list the different possible variables that humans might care about; in the example above, we were given the list of 1000, but what if we didn't have it? Also, those variables could only go one way - up. What if there were a real-valued variable that the agent suspected humans cared about - but didn't know whether we wanted it to be high or low?
This to me seems the fatal problem with this approach. Although I don't have a proof for it, my suspicion is that there are so many variables that we would not be able to perform the computations you describe to mitigate Goodharting, because the agent would not actually be able to take all the variables into account during optimization.
Goodhart's curse happens when you want to maximise an unknown or uncomputable U, so you choose a simpler proxy V and maximise that instead. The optimisation pressure on V then transforms it into a worse proxy for U, possibly resulting in a very bad outcome from the U-perspective.
Many suggestions for dealing with this involve finding some other formalism, and avoiding expected utility maximisation entirely.
However, it seems that classical expected utility maximisation can work fine, as long as we properly account for all our uncertainty and all our knowledge.
The setup
Imagine that there are 1000+5 different variables that humans might care about; the default value of these variables is 0, and they are all non-negative.
Of these, 5 are known to be variables that humans actually care about, and would like to maximise as much as possible. Of the remaining 1000, an AI agent knows that humans care about 100 of these, and want their values to be high - but it doesn't know which 100.
The agent has a "budget" of 1000 to invest in any variables it chooses; if it invests X in a given variable v, that variable is set to √X. Thus there are diminishing returns for every variable.
Then set V to be the sum of the 5 known variables.
And we get a classic Goodhart scenario: the agent will invest 200 in each of these five variables, setting them to just over 14 (√200 ≈ 14.1), and ignore all the other variables.
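As a concrete illustration, here is a minimal sketch of this naive optimisation (the code, constants, and variable names are my own, not from the original post):

```python
import numpy as np

BUDGET = 1000
N_KNOWN = 5        # variables humans are known to care about
N_UNKNOWN = 1000   # variables humans *might* care about

# Naive proxy: V = sum of the 5 known variables only.
# With diminishing returns v = sqrt(investment), the optimum is to
# split the whole budget equally among the 5 known variables.
investment = np.zeros(N_KNOWN + N_UNKNOWN)
investment[:N_KNOWN] = BUDGET / N_KNOWN   # 200 each
values = np.sqrt(investment)

print(values[:N_KNOWN])        # five values of sqrt(200), about 14.14 each
print(values[N_KNOWN:].sum())  # 0.0 -- every other variable is ignored
```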
It knows we care, but not about what
In that situation, however, we have not incorporated all that we know into the definition of V. We know that there are 100 other variables humans care about. So we can define V as the sum of the 5 known variables plus the sum of the 100 unknown variables the humans care about.
Of course, we don't know what the 100 variables are; but we can compute the expectation of V.
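Since each of the 1000 unknown variables is equally likely to be one of the 100 cared-about ones, linearity of expectation gives each a weight of 1/10 in the expected utility. A minimal sketch of that expectation (my own illustration, not code from the post):

```python
import numpy as np

N_KNOWN, N_UNKNOWN, N_CARED = 5, 1000, 100

def expected_V(investment):
    """Expected utility when each of the 1000 unknown variables has a
    100/1000 chance of being one the humans care about."""
    values = np.sqrt(investment)
    known_part = values[:N_KNOWN].sum()
    # Linearity of expectation: each unknown variable contributes its
    # value weighted by the probability (0.1) that humans care about it.
    unknown_part = (N_CARED / N_UNKNOWN) * values[N_KNOWN:].sum()
    return known_part + unknown_part
```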
In order to make the model more interesting, and introduce more complicated trade-offs, I'll designate some of the 1000 variables as "stiff" variables: these require 40 times more investment to reach the same value. The number of stiff variables varies between 1 and 99 (to ensure that there is always at least one variable the humans care about among the non-stiff variables).
Then we, or an agent, can do classical expected utility maximisation on V:
This graph plots various values against the number of stiff variables. The orange dots designate the values of the 5 known variables; the agent prioritises these, because they are known. At the very bottom we can make out some brown dots; these are the values of the stiff variables, which hover barely above 0: the agent wastes little effort on these.
The purple dots represent the values of the non-stiff variables; the agent does boost them, but because each has only a 1/10 chance of being a variable humans care about, they get less priority. The blue dots represent the expected utility of V given all the other values; it moves from approximately 0.13 to 0.12 as the number of stiff variables rises.
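A sketch of the optimisation behind this graph, exploiting the symmetry between variables of the same type (the reconstruction and the use of scipy.optimize are my own; the post's exact numbers may have come from a different solver):

```python
import numpy as np
from scipy.optimize import minimize

BUDGET, N_KNOWN, N_UNKNOWN, N_CARED, STIFFNESS = 1000, 5, 1000, 100, 40

def expected_V_sym(alloc, n_stiff):
    """Expected utility of V, with a = investment per known variable,
    b = per non-stiff unknown variable, c = per stiff variable."""
    a, b, c = alloc
    p = N_CARED / N_UNKNOWN   # chance an unknown variable is cared about
    return (N_KNOWN * np.sqrt(a)
            + p * (N_UNKNOWN - n_stiff) * np.sqrt(b)
            + p * n_stiff * np.sqrt(c / STIFFNESS))

def optimise(n_stiff):
    # Spend exactly the budget of 1000 across all 1000+5 variables.
    budget = lambda alloc: BUDGET - (N_KNOWN * alloc[0]
                                     + (N_UNKNOWN - n_stiff) * alloc[1]
                                     + n_stiff * alloc[2])
    res = minimize(lambda alloc: -expected_V_sym(alloc, n_stiff),
                   x0=[10.0, 1.0, 0.1],
                   bounds=[(0, None)] * 3,
                   constraints={'type': 'eq', 'fun': budget})
    return res.x  # optimal (a, b, c) for this number of stiff variables
```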
This is better than only maximising the 5 known variables, but still seems sub-optimal: after all, we know that human value is fragile, so we lose a lot by having the stiff variables so low, as they are very likely to contain things we care about.
The utility knows that value is fragile
*We* know that human value is fragile, but that has not yet been incorporated into the utility. The simplest way to do so would be to define V as the minimum of all the variables the humans care about: the 5 known ones and the unknown 100.
Again, V cannot be known by the agent, but its expected value can be calculated. For different numbers of stiff variables, we get the following behaviour:
The purple dots are the values to which the agent sets the 5 known variables and the non-stiff unknown variables. Because V is defined as a minimum, and because at least one of the non-stiff unknown variables must be one the humans care about, the agent will set them all to the same value. The brown dots track the value of the stiff variables, while the blue dots are the expected value of V.
Initially, when there are few stiff variables, the agent invests its effort mainly in the other variables, hoping that humans don't care about any of the stiff ones. As the number of stiff variables increases, so does the probability that humans care about at least one of them. By the time there are 40 stiff variables, it's almost certain that one of them is among the 100 the humans care about; at that point, the agent has to treat V as essentially the minimum of all 1000+5 variables. The values of all variables - and hence the expected utility - then continue to decline as the number of stiff variables increases further, since it becomes more and more expensive to raise all the variables.
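A sketch of the corresponding expected utility when V is the minimum of the cared-about variables (my own reconstruction; the key quantity is the probability that none of the stiff variables is among the 100 cared-about ones):

```python
import numpy as np
from scipy.special import comb

N_KNOWN, N_UNKNOWN, N_CARED, STIFFNESS = 5, 1000, 100, 40

def expected_min_V(alloc, n_stiff):
    """E[min of all cared-about variables], with a = investment per known
    variable, b = per non-stiff unknown variable, c = per stiff variable."""
    a, b, c = alloc
    va, vb, vc = np.sqrt(a), np.sqrt(b), np.sqrt(c / STIFFNESS)
    # Probability that *no* stiff variable is among the 100 cared-about ones.
    p_no_stiff = comb(N_UNKNOWN - n_stiff, N_CARED) / comb(N_UNKNOWN, N_CARED)
    # At least one non-stiff unknown variable is always cared about
    # (n_stiff <= 99 < 100), so vb always appears in the minimum.
    return p_no_stiff * min(va, vb) + (1 - p_no_stiff) * min(va, vb, vc)
```

This expectation can be plugged into the same budget-constrained maximisation as before; since min is not smooth, a derivative-free optimiser may be more reliable than a gradient-based one.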
This behaviour is much more conservative, and much closer to what we'd want the agent to actually be doing in this situation; it does not feel Goodharty at all.
Using expected utility maximisation for Good(hart)
So before writing off expected utility maximisation as vulnerable to Goodhart effects, check whether you've incorporated all the information and uncertainty that you can into the utility function.
The generality of this approach
It is not a problem, for this argument, if the number of variables humans care about is unknown, or if the trade-offs are more complicated than above. A probability distribution over the number of variables, and a more complicated optimal policy, would resolve these. Nor is the strict "min" utility formulation needed; a soft-min (or a mix of soft-min and min, depending on the importance of the variables) would also work, and allow the utility-maximiser to make less conservative trade-offs.
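For instance, a soft-min can be written as a temperature-weighted log-sum-exp (a sketch; the temperature parameter and its default value are my own choices):

```python
import numpy as np

def soft_min(values, temperature=1.0):
    """Smooth lower bound on min(values). As temperature -> 0 it approaches
    the hard minimum; at higher temperatures the gradient is spread across
    all variables, so a maximiser behaves less conservatively."""
    v = np.asarray(values, dtype=float)
    return -temperature * np.log(np.sum(np.exp(-v / temperature)))
```

Replacing the hard min in the expected utility above with this soft-min gives those less conservative trade-offs, while still heavily penalising any variable that falls far behind the rest.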
So, does the method generalise? For example, if we wanted to maximise CEV, and wanted to incorporate my criticisms of it, we could add the criticisms as a measure of uncertainty to the CEV. However, it's not clear how to transform my criticisms into a compact utility-function-style form.
More damningly, I'm sure that people could think of more issues with CEV, if we gave them enough time and incentives (and they might do that as part of the CEV process itself). Therefore we'd need some sort of process that scans for likely human objections to CEV and automatically incorporates them into the CEV process.
It's not clear that this would work, but the example above does show that it might function better than we'd think.
Another important challenge is to list the different possible variables that humans might care about; in the example above, we were given the list of 1000, but what if we didn't have it? Also, those variables could only go one way - up. What if there were a real-valued variable that the agent suspected humans cared about - but didn't know whether we wanted it to be high or low?
We could generate a lot of these variables via a variety of unfolding processes (processes that look back at human minds and use them to estimate what variables matter, and where to look for new ones), but that may be a challenge. Still, something to think about.