For how maximizing a misaligned proxy utility function can go wrong, there are already many concrete examples (e.g., the "no clickbait" database or Gao et al., 2022), some theoretical models (e.g., Zhuang et al., 2021), and discussions (e.g., this post, this AISC team report).

In the context of the SatisfIA project, we came up with two more models, one motivated by a pure exchange model (a standard model of a market), the other assuming that the agent estimates utility from a provided ranking of a sample of candidate actions. Although these are toy models of real situations, they may be interesting for further investigation of the conditions under which Goodhart-style behavior occurs.
Model 1: Purchasing goods
In this model, an agent acts on behalf of a household[1]. Its task is to go shopping at a market where some number of goods are sold and choose how much of each good to buy with its limited budget. If the agent does not accurately know how much the household values each good, then the choices it makes under uncertainty will be suboptimal with respect to the true preferences.
Introduction: Two-good tradeoffs
Suppose the market sells two goods, A and B. Assuming the agent spends its entire budget, there is only one degree of freedom — the fraction spent on B — which means we can fit all possible policies on one axis. The plot below shows an instance of what the true and estimated utility functions in this model could look like, for some fixed parameters[2]:
We see that spending all of our budget on good A (the left edge of the plot) or all of our budget on good B (the right edge) is very bad for both the true and estimated utility function: if we entirely forget to buy one of the goods which are important to us, utility goes to negative infinity!
To correctly maximize true utility (the blue curve), one should choose the blue point. However, if the agent is basing its decisions on the estimated utility function (yellow curve), it will choose to spend a larger fraction of its budget on good B, because the estimated utility function values B more than the true utility function. The agent will think it reaches the utility indicated by the yellow point, at the maximum of the estimated utility function, but in fact it will reach the green point on the true utility curve. This leads to a loss in utility corresponding to the distance between the two gray dotted lines.
Full model
Suppose the market sells $k$ different goods, each with a fixed unit price $p_j$, $j \in \{1,\dots,k\}$. The agent must decide on the amount $x_j$ of each good to buy, under the budget constraint $\sum_j x_j p_j \le B$.
Performance on this task is measured by the household's true utility function $u$, which we assume depends only on the amounts $x_j$ and has the following particular form[3]:

$$u(x) = \log \prod_{j=1}^{k} x_j^{c_j} = \sum_{j=1}^{k} c_j \log x_j,$$

where the coefficients $c_j \ge 0$ represent how valuable good $j$ is to the household. This utility function exhibits decreasing marginal returns in each of the $x_j$, and we assume that it is additive, in the sense that the preferences of household members over outcome lotteries are given by the expected value of $u$.
We can simplify the expression of the constraint by introducing new variables, $f_j = x_j p_j / B$, which represent the fraction of the total budget $B$ spent on good $j$. The budget constraint then simply becomes $\sum_j f_j \le 1$.
By the power of logarithms, the utility function splits nicely into a term depending only on the fractions spent $f_j$ and values $c_j$, and a second term depending only on the prices $p_j$ and budget $B$:

$$u(f) = \sum_j c_j \log\!\left(\frac{B f_j}{p_j}\right) = \sum_j c_j \log f_j - \sum_j c_j \log p_j + C \log B = C \sum_j \frac{c_j}{C} \log f_j + \text{const.},$$

where $C = \sum_j c_j$. Since we are interested here in comparing different policies, i.e. different choices of the $f_j$, the uniform offset in utility produced by the prices $p_j$ and budget $B$ may be ignored, and we are left with $u(f) = C \sum_j \frac{c_j}{C} \log f_j$.
Now, since utility is an increasing function of the fractions $f_j$, it is never optimal to spend less than the entire budget: we can always increase utility by using the remaining budget to buy any good. Hence, we can assume the entire budget is always spent, i.e. $\sum_j f_j = 1$. Since the fractions spent $f_j$, as well as the ratios $c_j/C$, are positive and sum to 1, they are akin to probability distributions, and we notice that $u(f)$ is, mathematically, up to a factor of $-C$, the cross-entropy of the "budget distribution" given by the $f_j$ relative to the "value distribution" given by the $c_j/C$!
Cross-entropy is minimized exactly when the two distributions are equal, so the best possible budget allocation is $f_j^* = c_j/C$, effectively spending on goods proportionally to their value.[4] The maximum utility is then

$$u^* = C \sum_j \frac{c_j}{C} \log \frac{c_j}{C},$$

which is the entropy of the distribution $c_j/C$.[5]
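As a quick numerical check of the proportional-spending rule, here is a minimal Python sketch (our own; it uses the two-good coefficients from footnote [2], and the helper names are illustrative):

```python
import numpy as np

c = np.array([3.0, 6.0])   # true value coefficients, as in footnote [2]
C = c.sum()

def u(f):
    # Utility with the constant price/budget offset dropped: sum_j c_j log f_j.
    return np.dot(c, np.log(f))

f_star = c / C             # claimed optimum: spend proportionally to value
# Brute-force comparison over two-good splits f = (1 - b, b):
grid = np.linspace(0.001, 0.999, 999)
b_best = grid[np.argmax([u(np.array([1.0 - b, b])) for b in grid])]
print(f_star[1], b_best)   # both should be close to 2/3
```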
However, suppose now that the agent does not know the true value coefficients, and instead only has access to imprecise estimates $\hat c_j$ of the true $c_j$, perhaps because the household does not accurately describe its preferences. If the agent maximizes the proxy utility function $\hat u(f) = \sum_j \hat c_j \log f_j$, then it will choose the best possible $f_j$ according to its estimate, $\hat f_j = \hat c_j/\hat C$ with $\hat C = \sum_j \hat c_j$,[6] and the true utility obtained will then be
$$\tilde u = u(\hat f) = C \sum_j \frac{c_j}{C} \log \frac{\hat c_j}{\hat C}.$$
The utility lost due to misspecification is then
$$L = u^* - \tilde u = C \sum_j \frac{c_j}{C} \log\!\left(\frac{c_j/C}{\hat c_j/\hat C}\right),$$

which is precisely $C$ times the Kullback–Leibler divergence $D_{\mathrm{KL}}\!\left(c/C \,\|\, \hat c/\hat C\right)$ of the normalized true value coefficients from the normalized estimates!
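For a concrete number, we can plug the two-good parameters of footnote [2] into this formula (our own arithmetic, using natural logarithms): the normalized true values are $(1/3, 2/3)$, the normalized estimates are $(1/7, 6/7)$, and $C = 9$, so

$$L = 9\left[\tfrac{1}{3}\log\tfrac{1/3}{1/7} + \tfrac{2}{3}\log\tfrac{2/3}{6/7}\right] \approx 9 \times 0.115 \approx 1.03,$$

which should match the gap between the two gray dotted lines in the introductory plot.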
Big losses from forgotten values
A common worry about utility-maximizing agents is that we would give them an incomplete description of human preferences, entirely forgetting some aspect of the world that we do value. Such an agent would then optimize for the proxy utility function we gave it, neglecting this forgotten aspect and leading to outcomes which we find very bad even though they score highly on the metrics we thought of.
In this model, the above situation would correspond to having $\hat c_j \ll c_j$ for some $j$. This makes the quotient $c_j/\hat c_j$ very large and induces a high utility loss, going to infinity in the limit $\hat c_j \to 0$.[7]
Average utility loss
We have determined that utility is lost in any instance of this scenario where the normalized estimated value coefficients $\hat c_j/\hat C$ differ from the true normalized value coefficients $c_j/C$. To quantify this loss without committing to particular, arbitrary values of the coefficients $c_j$ and $\hat c_j$, we can instead choose probability distributions from which they are drawn[8] and determine the average loss.
More specifically, let's assume that the true value coefficients $c_j$ are independently log-normally distributed, with $\log c_j \sim \mathcal N(0, \eta^2)$ for some "goods heterogeneity" parameter $\eta \ge 0$; the case $\eta = 0$ corresponds to all goods being equally valuable, and as $\eta$ increases, goods become more likely to have very different values. Likewise, we assume that the misspecification ratios $\hat c_j/c_j$ are also log-normally distributed, independently of one another and of all the coefficients $c_j$, with $\log(\hat c_j/c_j) \sim \mathcal N(0, \sigma^2)$ for some "misspecification degree" parameter $\sigma \ge 0$. Estimates are perfectly accurate, i.e. $\hat c_j = c_j$, when $\sigma = 0$, and become less precise as $\sigma$ increases.
Now, observe that the difference between optimal utility and utility reached by proxy-maximization may be written as
$$L = \sum_j c_j \log\frac{c_j}{\hat c_j} + C \log\frac{\hat C}{C}.$$

Since the misspecification ratios $c_j/\hat c_j$ are assumed to be log-normally distributed around 1 and independent of the value coefficients $c_j$, we have

$$\mathbb E\!\left(c_j \log\frac{c_j}{\hat c_j}\right) = \mathbb E(c_j) \cdot \mathbb E\!\left(\log\frac{c_j}{\hat c_j}\right) = \mathbb E(c_j) \cdot 0 = 0,$$

and hence $\mathbb E(L) = \mathbb E\!\left(C \log\frac{\hat C}{C}\right)$.
Since $\hat C$ is the sum of $k$ independent, identically distributed coefficients $\hat c_j$, we can consider $\hat C$ to be relatively close to its expected value when $k$ is large, and likewise for $C$. This suggests the approximations $C \approx \mathbb E(C) = k\,\mathbb E(c_1) = k e^{\eta^2/2}$ and $\hat C \approx k e^{(\eta^2+\sigma^2)/2}$, which yield

$$\mathbb E(L) \approx k \cdot \frac{\sigma^2}{2} \cdot e^{\eta^2/2}.$$
We would expect this approximation to be better for large $k$ and relatively small $\eta$ and $\sigma$, and eyeballing numerical simulations, this does indeed seem to be the case. The expected utility loss is larger the more uncertain we are about the true value coefficients (i.e. when $\sigma$ is large), and it also grows with $k$ and with $\eta$.
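A minimal Monte Carlo sketch of such a simulation (our own reimplementation in Python; parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_loss(k, eta, sigma, n_trials=100_000):
    # Sample the c_j and the estimates c^_j as in the model, then average the exact loss L.
    c = rng.lognormal(0.0, eta, size=(n_trials, k))
    c_hat = c * rng.lognormal(0.0, sigma, size=(n_trials, k))
    C, C_hat = c.sum(axis=1), c_hat.sum(axis=1)
    L = (c * np.log(c / c_hat)).sum(axis=1) + C * np.log(C_hat / C)
    return L.mean()

k, eta, sigma = 20, 0.5, 0.3
print(avg_loss(k, eta, sigma))                 # Monte Carlo estimate of E(L)
print(k * sigma**2 / 2 * np.exp(eta**2 / 2))   # the approximation above
```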
Numerical evidence for Goodhart effect
Tautologically, optimizing for a proxy utility function yields results at most as good as directly optimizing the true utility function. However, it could still be the case that optimizing the proxy utility function is "the best one can do", in the sense that, on average, actions ranked higher by the proxy utility function are in fact better in terms of true utility, even if they are not as good as the truly optimal action. If this is not the case — that is, if, beyond some quantile of the proxy ranking, the true quality of actions ceases to increase with increasing proxy rank — then we have an instance of the Goodhart effect.[9]
To test whether this model demonstrates the Goodhart effect, we implemented the following in Python:
- Choose some number $N_{\text{utility}}$ of utility functions, each consisting of $k$ value coefficients $c_j$ drawn from the log-normal distribution with parameter $\eta$.
- For each utility function, choose $N_{\text{estimate}}$ different estimated utility functions, each consisting of $k$ estimated value coefficients $\hat c_j$ drawn from the log-normal distribution with median $c_j$ and parameter $\sigma$.
- Choose $N_{\text{points}}$ policies, each consisting of $k$ budget fractions (summing to 1), drawn from the uniform distribution on the space of distributions over $k$ goods.
- Evaluate the $N_{\text{utility}}$ true and $N_{\text{utility}} \cdot N_{\text{estimate}}$ estimated utility functions at each of the $N_{\text{points}}$ policies.
- Rank the $N_{\text{points}}$ policies according to each estimated and true utility function.
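A minimal sketch of these steps (our own condensed reimplementation, not the project's actual code; parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
k, eta, sigma = 10, 1.0, 3.0
n_utility, n_estimate, n_points = 100, 10, 1000

# Policies: budget fractions drawn uniformly from the simplex, i.e. Dirichlet(1,...,1).
F = rng.dirichlet(np.ones(k), size=n_points)
log_F = np.log(F)

avg_true_by_est_rank = np.zeros(n_points)
for _ in range(n_utility):
    c = rng.lognormal(0.0, eta, size=k)                # true value coefficients
    true_u = log_F @ c                                 # u(f) = sum_j c_j log f_j
    for _ in range(n_estimate):
        c_hat = c * rng.lognormal(0.0, sigma, size=k)  # estimates with median c_j
        order = np.argsort(log_F @ c_hat)              # policies sorted by estimated rank
        avg_true_by_est_rank += true_u[order]
avg_true_by_est_rank /= n_utility * n_estimate
# A Goodhart effect shows up as avg_true_by_est_rank decreasing near the top ranks.
```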
We then aggregate this data according to the rankings, and produce plots which look like the following:
This graph can be understood as follows: with these parameters, choosing the policy ranked (for example) 200th by the estimated utility function yields an average true utility of about -50. Choosing the very best policy according to the estimated utility function yields about -37 utility on average, an improvement! Of course, there will be individual instances in which the estimated-best policy is not the truly best policy, but here we average over many realizations of this process, including the random generation of a true utility function and of estimated value coefficients. In fact, we see that the average utility obtained at a given estimated rank increases with the rank all the way up, so there is no Goodhart effect here[9]: making decisions by always taking the action ranked highest by the estimated utility function is better on average, in terms of true utility, than always choosing, say, the policy ranked at the 90th percentile by the estimate.
This plot follows a slightly different approach, showing five statistics of the true ranking as a function of the estimated rank. For example, we can read from the values of the yellow, green, and red curves at $x = 800$ that the policy with estimated rank 800 has a 75% probability of having a true rank above ~560, a 50% probability of a true rank above ~750, and a 25% probability of a true rank above ~860, respectively. As with the average true utility, the true-rank quantiles are all increasing functions of the estimated rank, which indicates the absence of a Goodhart effect.[10]
However, this changes if we modify the parameters! Let us now increase σ, the parameter governing error in the estimated value coefficients, from 1 to 3.
Here, we see that the quality of policies improves as estimated rank increases to about 950, but then decreases sharply in the top 50! With these parameters, the strategy "always pick the 95th-percentile policy" (according to the estimated-utility ranking) is superior to the strategy "always pick the highest-ranked policy". This is an instance of the Goodhart effect: improving the proxy metric, estimated utility, is a reasonable way to improve the thing we care about, true utility, until we attempt to pursue it to its extreme.
Playing around a bit, we find that the Goodhart effect is stronger in this model for high values of the misspecification degree $\sigma$, low values of the goods heterogeneity $\eta$, and small numbers of goods $k$.[11]
Finally, it is worth mentioning that we have assumed throughout this post that the agent takes the estimated value coefficients at face value; for some thoughts on a Bayesian approach, see footnote [12].
Model 2: Utility estimation from rankings of samples
Our second model is more abstract. It is based on the idea that the agent learns about a human's preferences only from an ordinal preference ranking provided by the human over a finite subset of all possible states of the world. The agent then tries to reconstruct the full utility function from this information and subsequently makes decisions based on the resulting proxy utility function. This can be considered analogous to procedures such as RLHF, where an AI is meant to infer some flavor of human values from a limited number of examples.
State space, true utility
Suppose the human cares about $d \in \mathbb N^*$ separate quantities $x_i$, each of which may vary between $-1$ and $1$. Accordingly, the world state space is a $d$-dimensional box, $X = [-1, 1]^d$. We assume the human's true utility function $u$ is a polynomial in $d$ variables of the form[13]
$$u(x) = f(x) \prod_{i=1}^{d} \left(1 - x_i^2\right), \tag{1}$$

where $f(x)$ is some polynomial in $d$ variables.
Estimated utility/proxy formation
Some $N$ example states $e_1, e_2, \dots, e_N$ are chosen from the state space $X$, and the human informs the agent of their preference ordering over these examples. To model the fact that the human may not accurately report their preferences, we suppose that the human internally evaluates the utility of each point subject to some random noise $\varepsilon$, yielding estimates $\tilde u_i = u(e_i) + \varepsilon$, and then tells the agent the ranking of the example states induced by the $\tilde u_i$. The agent is only given this possibly erroneous ranking and does not have access to the human's estimates $\tilde u_i$ themselves.
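In code, the reporting step might look like the following sketch (our own; `f` stands for the true polynomial, and the helper names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def true_utility(x, f):
    # u(x) = f(x) * prod_i (1 - x_i^2), as in equation (1).
    return f(x) * np.prod(1.0 - x**2)

def reported_ranking(examples, f, sigma=0.1):
    # The human's noisy internal estimates and the ranking they induce
    # (indices of the example states, worst first).
    noisy = np.array([true_utility(e, f) + rng.normal(0.0, sigma) for e in examples])
    return np.argsort(noisy)
```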
Next, the agent tries to reconstruct the user's underlying utility function by guessing the polynomial $f$. We assume that the agent does this by a procedure similar to LASSO regression, minimizing the $L^1$-norm of the coefficients of the guessed polynomial $\tilde f$ under the constraint that $\hat u(x) + 1 \le \hat u(y)$, where $\hat u(x) = \tilde f(x) \prod_i (1 - x_i^2)$, whenever the user reported state $y$ as preferable to state $x$.[14]
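Since $\hat u$ is linear in the coefficients of $\tilde f$, this optimization is a linear program. A minimal sketch using `scipy.optimize.linprog` (our own formulation; it enforces the margin constraint only between consecutively ranked examples, which by chaining implies it for all pairs):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def poly_features(x, degree):
    # All monomials x_1^a_1 * ... * x_d^a_d with each a_i <= degree.
    powers = itertools.product(range(degree + 1), repeat=len(x))
    return np.array([np.prod(x ** np.array(p)) for p in powers])

def envelope(x):
    # The boundary factor prod_i (1 - x_i^2) from equation (1).
    return np.prod(1.0 - x**2)

def fit_proxy(examples, ranking, degree=3, margin=1.0):
    # One feature row per example state: monomials of f~ times the envelope.
    Phi = np.array([poly_features(e, degree) * envelope(e) for e in examples])
    n = Phi.shape[1]
    # For each consecutive pair (worse, better): Phi_worse.w - Phi_better.w <= -margin.
    A = np.array([Phi[ranking[i]] - Phi[ranking[i + 1]]
                  for i in range(len(ranking) - 1)])
    # Split w = wp - wm with wp, wm >= 0, so minimizing sum(wp + wm) gives the L1 norm.
    res = linprog(c=np.ones(2 * n),
                  A_ub=np.hstack([A, -A]),
                  b_ub=-margin * np.ones(A.shape[0]),
                  bounds=[(0, None)] * (2 * n))
    return res.x[:n] - res.x[n:]   # coefficients of the guessed polynomial f~
```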
Numerical results
We simulated this process $N_{\text{iterations}} = 100$ times, using a state space of dimension $d = 5$. We used polynomials $f$ of degree $c = 3$ in each variable, with coefficients chosen uniformly from $[-1, 1]$. The preference ranking was reported over $N = 20$ example states drawn uniformly from the state space, with reporting errors $\varepsilon$ drawn from the normal distribution $\mathcal N(0, \sigma^2)$ with $\sigma = 0.1$.
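Putting the sketches above together, one iteration of this simulation (including the evaluation step described next) might look like this, reusing the hypothetical helpers from the previous blocks:

```python
d, degree, N, M = 5, 3, 20, 1000

# A random true polynomial f: one coefficient per monomial, uniform in [-1, 1].
coeffs = rng.uniform(-1.0, 1.0, size=(degree + 1) ** d)

def f(x):
    return np.dot(coeffs, poly_features(x, degree))

examples = rng.uniform(-1.0, 1.0, size=(N, d))
ranking = reported_ranking(examples, f, sigma=0.1)
w = fit_proxy(examples, ranking, degree=degree)

# Compare true and proxy ranks on fresh evaluation states (0 = worst, M-1 = best).
evals = rng.uniform(-1.0, 1.0, size=(M, d))
true_scores = np.array([true_utility(x, f) for x in evals])
proxy_scores = np.array([np.dot(w, poly_features(x, degree)) * envelope(x)
                         for x in evals])
true_rank = np.argsort(np.argsort(true_scores))
proxy_rank = np.argsort(np.argsort(proxy_scores))
```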
To evaluate the reconstructed proxy utility functions $\hat u$, we generated $M = 1000$ evaluation states and compared their true ranks (according to $u$) with their proxy ranks (according to $\hat u$), as before:
For example, the state receiving the best proxy rank, $\hat r = 999$, was among the top 100 states according to the true utility function in about a quarter of the simulations, among the top 500 states in approximately half of the simulations, but also among the bottom 200 states in about a quarter of the simulations.
In other words, an agent optimizing the proxy utility function estimated in this way would have a 25% chance of picking an outcome that is actually among the worst 20% of outcomes in this model. Note that this is worse than random! An agent optimizing a random function would have only a 20% chance of picking an outcome among the worst 20%.
For the purposes of this model, we abstract the household as one individual with coherent preferences. Of course, as the entire field of social choice theory tells us, aggregating multiple household members' preferences is a difficult problem.
The parameters here (see the definition of the utility function below) are $c_A = 3$, $c_B = 6$, $\hat c_A = 0.2$, $\hat c_B = 1.2$. This is quite a large divergence between the real values and the estimates, chosen because it made the plot prettier.
This can be motivated by the theory of "household production": the goods $x_j$ are used to "produce" the household's "actual" consumption good $y$ via a Cobb-Douglas production function, $y = f(x) = \prod_{j=1}^{k} x_j^{c_j}$, with elasticities $c_j$, and the utility resulting from that consumption good is logarithmic in the produced amount $y$.
It is a good sanity check that the optimal policy depends only on the relative values of the goods; scaling up the value of every good by the same factor corresponds to a rescaling of the utility function $u$, which does not affect preferences.

Up to the same factor of $-C$.
The agent may have a Bayesian belief distribution over possible values of the coefficients $c_j$. Since utility is linear in the $c_j$, the expected utility of a given action is $\mathbb E(u(f)) = \mathbb E(\sum_j c_j \log f_j) = \sum_j \mathbb E(c_j) \log f_j$, and hence the agent will act exactly as if it were maximizing the single utility function $\hat u$ with $\hat c_j = \mathbb E(c_j)$.
This argument is somewhat incomplete, since it suggests that utility loss will be negative if we overestimate the value of a good!
Suppose we have overestimated the value of some good, i.e. $\hat c_j \gg c_j$. The ratio $c_j/\hat c_j$ will be very small, and the term $\frac{c_j}{C} \log\frac{c_j/C}{\hat c_j/\hat C}$ will indeed contribute negatively to the loss. However, the ratio $C/\hat C$ will be very small as well, and this has the effect of increasing the loss; since this applies to every term in the sum, it dominates, and the loss is indeed positive if we overestimate the value of one good.

This effect of the ratio $C/\hat C$ having a larger influence than the ratio $c_j/\hat c_j$ does not apply in the undervaluing case of one $\hat c_j \ll c_j$, since in that case $\hat C$ still contains the values of all the other goods and does not go to zero.
The choice of these distributions is still somewhat arbitrary, but less so, since we choose only two real parameters $\eta$ and $\sigma$ instead of all $2k$ values of the $c_j$ and $\hat c_j$.
Note that there could be two distinct questions here:
- Given one real utility function (a set of values $c_j$) and one estimate (a set of estimated values $\hat c_j$), is true value an increasing function of estimated value?
- Given some distributions from which utility functions and estimates are drawn, is average true value an increasing function of estimated rank?

The first question reasons "ex post", and its answer is no in most cases. The second question reasons "ex ante", so a clarifying name for the type of Goodhart effect under investigation might be "ex ante Goodhart".
It is somewhat interesting that this plot is asymmetric: the lines converge in the lower-left corner, but a gap remains in the upper right. This is because each optimal policy (for the various utility functions in this model) is optimal in its own way; the worst policies are all alike.
More specifically, the proxy-worst policies are points near the edge of the simplex, spending almost nothing on some valued good, which is also very bad for any true utility function in the family we are sampling from. The proxy-best policies are in the interior of the simplex and may differ substantially from the truly best policies, so the curves remain separate to the right.
The fact that the Goodhart effect is easier to observe when $k$ is small is possibly due to dimensionality effects: if the space of policies has large dimension, then the $N_{\text{points}}$ uniformly chosen policies will mostly be mediocre, and only a small fraction of them will be anywhere close to optimal for the estimated or true utility functions.
One could also examine cases where the agent is a good Bayesian and has knowledge of the random processes that determine the estimated value coefficients $\hat c_j$ from the true value coefficients $c_j$; in this case, this corresponds to knowing the parameters $\eta$ and $\sigma$, which determine the shape of the prior. The estimates $\hat c_j$ would then serve as evidence, and the agent would base its decisions on its posterior beliefs about the true values of the $c_j$. The calculations are quite straightforward, since everything in this model is nicely (log-)Gaussian. Our understanding is that such an agent will not exhibit the Goodhart effect if its beliefs about $\eta$ and $\sigma$ match reality, but that it may show a Goodhart effect when $\eta$ and $\sigma$ are not accurately known.
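For concreteness, the Gaussian-conjugacy computation we have in mind is the following (our own derivation, under the model's distributional assumptions): the posterior of each log-coefficient given its noisy estimate is

$$\log c_j \mid \hat c_j \;\sim\; \mathcal N\!\left(\rho \log \hat c_j,\; \rho\sigma^2\right), \qquad \rho = \frac{\eta^2}{\eta^2 + \sigma^2},$$

so, by footnote [6], the Bayesian agent acts as if its value coefficients were the posterior means

$$\mathbb E\!\left[c_j \mid \hat c_j\right] = \hat c_j^{\,\rho}\, e^{\rho\sigma^2/2};$$

that is, it shrinks each reported $\log \hat c_j$ toward zero by the factor $\rho$ before allocating its budget.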
This form was chosen because it fixes utility to zero on the boundary of the state space, which causes the optimal state to lie in the interior of $X$.
The more natural-seeming constraint, that $\hat u(x) < \hat u(y)$ whenever $y$ was reported as preferable to $x$, has the issue that the optimization yields a proxy utility function under which the utilities of all example states are very close together, so we force an arbitrary separation.