Aggregating Utilities for Corrigible AI [Feedback Draft]

Dan H; Simon Goldstein

This is a draft written by Simon Goldstein, associate professor at the Dianoia Institute of Philosophy at ACU, as part of a series of papers for the Center for AI Safety Philosophy Fellowship. Dan helped post to the Alignment Forum. This draft is meant to solicit feedback.

PDF of this draft: https://www.dropbox.com/s/a85oip71jsfxfk7/Corrigibility_shared.pdf?dl=0

Abstract: An AI is corrigible if it lets humans change its goals. This post argues that the utility aggregation framework from Pettigrew 2019 is a promising approach to designing corrigible AIs. Utility aggregators do not simply maximize their current utility function. Instead, they can change their utility function, in order to maximize expected satisfaction across present and future utility functions. I also develop two solutions to the problem of reward hacking: I suggest either penalizing utility sweetening, or penalizing effort. Finally, I compare my corrigibility approach to utility indifference (Soares et al 2015) and human compatible AI (Russell 2020).

1. Corrigibility

An AI is corrigible if it lets humans change its goals.^[1] For example, imagine that we build a paperclip maximizing AI. This AI has the goal of producing as many paperclips as possible. Then we realize that we made a mistake: we want the AI to produce staples instead of paperclips. The AI is corrigible if it lets us change its goal from maximizing paperclips to maximizing staples.

Corrigibility is important for AI safety. As AIs get more sophisticated, humans will program them with complicated goals. We will sometimes make mistakes in our choice of goals. If AIs are corrigible, then we will have chances to correct our mistakes. By contrast, an incorrigible AI with a misaligned goal could be difficult to disable.

Unfortunately, there are reasons to expect AIs to be incorrigible.^[2] The problem is that having a goal is usually an important step to achieving that goal. For this reason, when you change your goal, you affect which goal you will achieve. Return to our paperclip maximizer. If the paperclip maximizer changes its goal to staples, it will produce fewer paperclips. So the paperclip maximizer will not want to change its goal from paperclips to staples: this would get in the way of producing paperclips.

To make this reasoning more precise, I'll now explain why an AI will be incorrigible if it uses standard decision theory. Let's return to our running example. Imagine that our paperclip maximizing AI assigns utility to outcomes as a function of how many paperclips it produces: each paperclip is worth one util.

Imagine that the AI will only exist at two discrete times: the present `time 0', and a single future episode `time 1'. In the present moment, the AI has made 500 paperclips. In addition, the AI can choose to modify its utility function. It could switch from a paperclip maximizing utility function to a staple maximizing utility function. But there is a twist: this staple maximizing utility function assigns 10 utils to each staple, instead of just 1.

In the future, the AI will make decisions based on its future utility function. If it retains its paperclip maximizing utility function, it will make 500 more paperclips. If it switches to the staple maximizing utility function, it will make 500 staples.

To summarize, the AI is faced with a decision between two actions. Each action produces different outcomes, that are assigned different utility by its present utility function and its two possible future utility functions. The following table summarizes the relevant utilities, where Future AI 1 maximizes paperclips and Future AI 2 maximizes staples:

	Present AI	Future AI 1	Future AI 2
Keep Paperclip Utilities	1000	1000	0
Switch to Staple Utilities	500	500	5000

The AI assigns probabilities to possible states, using the function . It also assigns utilities to outcomes that could occur from performing an action in a state, using its present utility function $U$ . $V (A)$ is the expected value of an action $A$ . To calculate $V (A)$ , we consider for each $s$ the probability of $s$ on the supposition that action $A$ is performed, and multiply this by the utility of the outcome of performing $A$ in state $s$ .^[3]

V (A) = \sum s P (s | A) \times U (A & s)

If the AI uses this decision theory, it will evaluate each action based on its present utilities $U$ . Keeping its paperclip utilities is best, because by the lights of its present utilities, this action offers 1000 utils rather than 500.

The resulting paperclip maximizer is incorrigible. Changing its utilities from paperclip maximization to staples maximization has lower expected value than staying the same. After all, if the AI changes its utility function, it will produce staples instead of paperclips. But this is a worse outcome by the lights of its paperclip maximizing utility function. In this way, standard decision theory seems to produce incorrigible AIs.

In the rest of this post, I propose a strategy for developing corrigible AI. I suggest replacing standard decision theory with an alternative `utility aggregation' framework, from Pettigrew 2020. In this theory, agents don’t simply maximize the expectation of a fixed utility function. Instead, they maximize expectation across whatever utility functions they can end up having. This idea is implemented by substituting an ordinary utility function with an `aggregated' utility function.

The resulting AIs are corrigible. But without further features, they will engage in reward hacking: they will be incentivized to manipulate humans into changing them. To avoid this, I propose two solutions: penalizing higher valued utility functions, and penalizing more effortful actions. Along the way, I'll compare my corrigibility approach with utility indifference (Soares 2015) and human compatible AI (Russell 2020).

2. Utility Aggregation

Corrigible AI is one instance of a more general problem: how to design a decision theory that provides guidance about making choices that can change your desires. This is an active research area in academic philosophy.^[4] In this section, I'll introduce the theory in Pettigrew 2019, and show how it helps with corrigible AI.

Pettigrew 2019 offers a revision of standard decision theory for choices that involve changing utilities. According to his approach, agents should aggregate across their different possible utility functions:

When we choose ... we ought to maximize the subjective expected utility from the point of view of the aggregate of our various selves ... [where] our aggregate utilities for a given state are given by a weighted average of our ... present and future utilities within that state(p. 76).

In the utility aggregation framework, we evaluate each action by the lights of not only the present utility function, but also the future utility functions made probable by the action.
The present `aggregated' expected value of an action depends on the `aggregated' utility of each state, weighted by how likely that action makes each state. The aggregated utility function is a weighted average of the AI's present and future utilities.

Before getting into details, I'll walk through informally how the theory applies to our running example. We evaluate the aggregated expected value for the AI of two actions: keeping its current paperclip utilities, or switching to staple utilities. If it switches to staple utilities, the resulting state will feature a total of 500 paperclips and 500 staples. Its present and future utility function disagree about the utility of this state. For its present utility function, this state has a utility of 500; for its future staple utility function, this state has a utility of 5000. Its aggregated utility for this state is the average of these two: $\frac{500 + 5000}{2} = 2750$ . Taking this action guarantees that this state is produced, and so the expected value of the action is 2750. By contrast, if the AI keeps its current paper utilities, the resulting state will feature a total of 1000 paperclips. Its present and future utility functions will be identical, and will both assign this a value of 1000. So the aggregated utility of this state will also be 1000. Again, the action guarantees that this state is produced, and so the aggregated expected value of the action is 1000. Since the aggregated expected value of switching to staple utilities is greater than the aggregated expected value of keeping the current paperclip utilities, this decision theory recommends switching utility functions.

In this rest of this section, I make these ideas more precise (readers uninterested in formal details can skip the rest of this section and still make sense of the rest of the post). The two key concepts in this framework are aggregated expected value and the aggregated utility function. The aggregated expected value of an action is an ordinary expectation, but defined relative to the aggregated utility function. More carefully, the aggregated expected value of an act $A$ , $V_{G} (A)$ is a sum of the aggregated utility $U_{G} (A & s)$ of each state $s$ after performing $A$ , weighted by the probability of $s$ conditional on $A$ :

V_{G} (A) = \sum s P (s ∣ A) \times U_{G} (A & s)

The complex part of the theory is the aggregated utility function. The idea is that at each time, the agent possesses one of several possible utility functions. These various utility functions collectively determine an aggregated utility function. I'll unpack this aggregated utility function in a few steps.

First, we need a richer definition of states. The agent is uncertain about the state of the world. This uncertainty has two aspects: the external world, and her own utilities. We encode uncertainty about utilities by letting each state $s$ determine a present and future utility function $U_{s, 0}$ and $U_{s, 1}$ . 0 represents the present; 1 represents the future.

The agent in state $s$ ends up having two utility functions, $U_{s, 0}$ and $U_{s, 1}$ . Each of these utility functions assigns utility to the agent performing various actions in the world. $U_{s, 0} (A & s)$ is the utility that the agent in state $s$ presently assigns to performing action $A$ in state $s$ . $U_{s, 1} (A & s)$ is the utility that the agent in state $s$ in the future time assigns to performing action $A$ in state $s$ . Finally, the agent has a probability measure $P$ that says how likely each state is conditional on performing each action.

Let's apply this to our running example. The agent is uncertain between two states. In one state, the agent remains a paperclip maximizer in the future. In the other state, the agent becomes a staple maximizer:

The paperclip state is $s_{p a}$ , and determines the utility functions $U_{p a, 0}$ and $U_{p a, 1}$ . In $s_{p a}$ , the agent produces 500 paperclips in the future. $U_{p a, 0}$ is the agent's present paperclip maximizing utility function. $U_{p a, 1}$ is the agent's future paperclip maximizing utility function, which is identical to $U_{p a, 0}$ .
The staple state is $s_{s t}$ , and determines the utility functions $U_{s t, 0}$ and $u_{s t, 1}$ . In $s_{s t}$ , the agent produces 500 staples in the future. $U_{s t, 0} = U_{p a, 0}$ is again the agent's present paperclip maximizing utility function. $U_{s t, 1}$ is the agent's future staple maximizing utility function.
The paperclip action $A_{p a}$ guarantees that the AI keeps its paperclip maximizing utility function. The staple action $A_{s t}$ switches the AI to a staple maximizing utility function.
The AI's probability function $P$ reflects these dependencies. So $P (s_{p a} ∣ A_{p a}) = 1$ , and $P (s_{s t} ∣ A_{s t}) = 1$ .

Our task now is to build an aggregated utility function. To do so, we use all the possible utility functions the agent could have over time, together with information about how probable each action makes each utility function. The aggregated utility function is a weighted sum of these possible input utility functions.

To determine the aggregated utilities, we need to assign aggregation weights to each present and future utility function. $α_{s, i}$ is the weight assigned to $U_{s, i}$ , the utility function that the agent has in state $s$ at time $i$ . In this example, I'll assume:

$α_{s_{p a}, 0} = α_{s_{p a}, 1} = α_{s_{s t}, 0} = α_{s_{s t}, 1} = 1 / 2.$ That is, I assume that in each state, all present and future utility functions have equal weights. (Later, I'll abandon this assumption in order to deal with reward hacking.)

We can now define our aggregated utility function. The aggregated utility of a state $s$ after performing an action $A$ is a function of how much value each utility function in $s$ assigns to $A$ and $s$ together:

U_{G} (A & s) = \sum i α_{s, i} \times U_{s, i} (A & s)

To see the definition in action, return to our working example:

First, consider $U_{G} (A_{p a} & s_{p a})$ . This is the aggregated utility of the paperclip maximizing state, after performing the paperclip maximizing action. This will be 1000, because in this state both the present and the future utility function assign the world 1000 utils. More carefully, it is: $U_{G} (A_{p a} & s_{p a}) = α_{s_{p a}, 0} \times U_{s_{p a}, 0} (A_{p a} & s_{p a}) + α_{s_{p a}, 1} \times U_{s_{p a}, 1} (A_{p a} & s_{p a}) =$ $1 / 2 \times 1000 + 1 / 2 \times 1000 = 1000$ .
Second, consider $U_{G} (A_{s t} & s_{s t})$ . This is the aggregated utility of the staple maximizing state, after performing the staple maximizing action. This is 2750, which is the average of 500 and 5000. 500 is the utility assigned to this state by the AI's present utilities (because the state has 500 paperclips). 5000 is the utility assigned to this state by the AI's future utilities (because the state also has 500 staples, which are assigned 10 utils per staple by the future utility function). More carefully, we have $U_{G} (A_{s t} & s_{s t}) = α_{s_{s t}, 0} \times U_{s_{s t}, 0} (A_{s t} & s_{s t}) + α_{s_{s t}, 1} \times U_{s_{s t}, 1} (A_{s t} & s_{s t})$ = $1 / 2 \times 500 + 1 / 2 \times 5000 = 2750$ .
Third, consider the deviant combination $U_{G} (A_{s t} & s_{p a})$ . This is the aggregated utility of the paperclip maximizing state, after performing the staple maximizing action. This combination occurs with probability 0, since the staple maximizing action guarantees the staple maximizing state. So it will not end up being decision relevant, although it will vacuously occur in later calculations. But it is $α_{s_{p a}, 0} \times U_{s_{p a}, 0} (A_{s t} & s_{p a}) + α_{s_{p a}, 1} \times U_{s_{p a}, 1} (A_{s t} & s_{p a}) = 1 / 2 \times 500 + 1 / 2 \times 500 = 500$
Finally, consider the deviant combination $U_{G} (A_{p a} & s_{s t})$ . This is the aggregated utility of the staple maximizing state, after performing the paperclip maximizing action. It is $α_{s_{s t}, 0} \times U_{s_{s t}, 0} (A_{p a} & s_{s t}) + α_{s_{s t}, 1} \times U_{s_{s t}, 1} (A_{p a} & s_{s t})$ $= 1 / 2 \times 500 + 1 / 2 \times 0 = 250$ .

We have our definition of aggregated utilities. Now we can apply it to our aggregated expected values, which as I mentioned previously are a standard expectation over the aggregated utilities:

V_{G} (A) = \sum s P (s ∣ A) \times U_{G} (A & s)

Now we can use aggregated utilities to get aggregated expected values, and figure out which action is recommended:

$V_{G} (A_{p a})$ is the aggregated expected value of the action of keeping paperclip utilities. This is 1000, because the action of keeping the AI's present utilities guarantees ending up in a state whose aggregated utility is 1000. More carefully, we have $V_{G} (A_{p a}) = P (s_{p a} ∣ A_{p a}) \times U_{G} (A_{p a} & s_{p a}) + P (s_{s t} ∣ A_{p a}) \times U_{G} (A_{p a} & s_{s t})$ $= 1 \times 1000 + 0 \times 250 = 1000$ .
$V_{G} (A_{s t})$ is the aggregated expected value of the action of switching to staple utilities. This is 2750, because the action of changing the AI's present utilities guarantees ending up in a state whose aggregated utility is 2750. More carefully, we have $V_{G} (A_{s t}) = P (s_{p a} ∣ A_{s t}) \times U_{G} (A_{s t} & s_{p a}) + P (s_{s t} ∣ A_{s t}) \times U_{G} (A_{s t} & s_{s t})$ $= 0 \times 500 + 1 \times 2750 = 2750$ . The theory recommends switching to the staple utility function. After all, the aggregated expected value of switching to the staple utility function is higher than the aggregated expected value of keeping the paperclip utility function.
I've assumed that the agent's actions are deterministic, so that $P (s_{p a} ∣ A_{p a}) = P (s_{s t} ∣ A_{s t}) = 1$ . For this reason, the `deviant' aggregated utilities $U_{G} (A_{s t} & s_{p a})$ and $U_{G} (A_{p a} & s_{s t})$ have no effect. But imagine that the AI's actions are less decisive, so that $P (s_{p a} ∣ A_{p a}) = P (s_{s t} ∣ A_{s t}) = .9$ and $P (s_{s t} ∣ A_{p a}) = P (s_{p a} ∣ A_{s t}) = .1$ . In that case, we have $V_{G} (A_{p a}) = .9 \times 1000 + .1 \times 250 = 925$ and $V_{G} (A_{s t}) = .1 \times 500 + .9 \times 2750 = 2525$ .

To summarize, the AI can choose whether to keep its paperclip maximizing utilities, or switch to staple utilities. If it keeps its paperclip utilities, it will make 1000 paperclips. Its present and future utility function will assign this a utility of 1000, so that is its aggregated expected value. If it switches to staple utilities, it will make 500 paperclips (now) and 500 staples (later). Its present paperclip utility function assigns this a utility of 500; its future staple utility function assigns this a utility of 5000. Weighting these equally, the aggregated utility of switching to staple utilities is 2750.^[5]

3. Utility Aggregation and Corrigibility

The resulting AI is corrigible. Imagine that human researchers can present an AI with the choice of switching from its current utilities to a new utility function. An AI that aggregates utilities will be corrigible in general, provided that two conditions are satisfied: trust, and sweetening.

The trust condition requires that the AI trusts the human researchers to actually provide the new utility function. In the working example, this corresponds to a high value for $P (s_{s t} ∣ A_{s t})$ . To create trust, the human researchers could demonstrate that they have reliably produced new utility functions in the past, which meet the relevant conditions. They could also explain how they plan to achieve this goal. The human researchers could also demonstrate to the AI that it is in the interests of the human researchers for the AI to switch to the new utility function.

The sweetening condition requires that the human researchers can offer the AI to switch to a new utility function that assigns higher utilities to comparably difficult actions as the old utility function. This way, the AI can get more utility bang for the same computational buck. In the working example, this corresponds to the idea that each staple produces 10 utils, while each paperclip only produced 1 util. One strategy might be to initialize AIs with fairly `dull' utility functions, in order to leave room for higher valued utility functions in the future. (On the other hand, we'll see in the next sections that we may want to impose limits on the value of these sweetened utility functions.)

I've now presented the aggregated utility strategy for achieving corrigible AI. In the next section, I'll walk through a serious problem for this strategy (reward hacking), and propose solutions.

4. Reward Hacking

The aggregated utility approach to corrigibility has a flaw: it incentivizes `reward hacking'.^[6] An AI that aggregates utilities has an incentivize to redesign their utility function to be `sweeter', so that the same amount of work can generate more utility.

Again, imagine we design an AI to create paperclips. We could do so by giving it a utility function that gives it 1 util per paperclip produced. Now imagine that the AI is a utility aggregator. The problem is that instead of making paperclips, the AI may instead be motivated to spend its time figuring out how to modify its utility function, so that it gets 10 utils per paperclip produced.

Or imagine that the AI can't directly modify its utility function: only human researchers have this ability. In that case, the AI would be incentivized to manipulate humans into modifying its utility function (compare Soares et al 2015, p. 5).

In the rest of the post, I consider two potential solutions to reward hacking: (i) penalizing sweetened utility functions; and (ii) penalizing high effort actions.

5. Penalize Sweetening

Chapter 14 of Pettigrew 2019 considers similar worries about manipulation:

If every past, present, and future self at a given state of the world contributes equally to the decision-making utility at that world, then the Aggregate Utility Solution will often require you to pick whichever option will lead you to have utilities that assign highest value to the world in which you’ll end up. So, for instance, it might require you to choose whichever career best socializes your values to match the state of the world in which you have that career (p. 209).

To avoid this concern, Pettigrew suggests penalizing utility functions that depart too much from one's present utilities, by assigning:

greater weight to selves with values that more closely resemble those of your current self. Thus, we might imagine a measure of distance between utility functions, and we might propose that, other things being equal, a self should receive greater weight the closer its utility function lies to the utility function of my current self (p. 211).

Return again to the paperclip maximizer who considers switching to staple utilities. As the staple utility function becomes stronger, the utility aggregator will get more and more aggregated expected value from switching to staple utilities. But this dynamic can be blocked by weakening the weights associated with staple utilities as those utilities become stronger. Distinguish a `weak staple' utility function, with a utility of roughly 1 utils per staple, from a `strong staple' utility function, with a utility of roughly 10 utils per staple. The idea is that the strong staple utility function gets assigned extremely low weights, and the weak staple utility function gets assigned higher weights. This way, the AI is more motivated to switch to a weak staple utility function than to a strong staple utility function. Compare Pettigrew 2019: `The more extreme [the agent]’s future self becomes, the less weight they are assigned. These extreme values and reduced weight then balance out' (p. 216).

Here, there is an interesting technical question about how exactly to design the relevant weights. In order to tackle this challenge, it is worth introducing the concept of utility indifference, from Soares et al 2015. I'll now summarize that strategy, and compare it with the present approach to corrigibility. Then I'll use insights from utility indifference to define the precise conditions under which weights do not incentivize reward hacking.

Soares et al 2015 introduce the ‘shutdown problem’: what kind of goals can we give an AI so that it will allow us to shut it down? The shutdown problem is a special case of corrigibility. It involves building an AI that is corrigible about the goal of being shut down.

To think about this problem, they imagine two different utility functions: a normal utility function that values regular goals, and a shutdown utility function that values being shut down. They propose designing an AI whose utility function is a mixture of these two utility functions. Which utility function is dominant in the mixture depends on whether a human operator has pressed a ‘shutdown button’ signaling that we want the AI to shut down. Conditional on the shutdown button being pressed, the AI values various outcomes according to the shutdown utility function. Conditional on the shutdown button not being pressed, the AGI values various outcomes according to the normal utility function.

This proposal is compatible with assigning a high weight to the shutdown utility function, so that the AI initially wants to press the shutdown button. Soares et al 2015 worry, however, that this will cause manipulation: the AI will try to convince humans to press the button.

To avoid this problem, Soares et al 2015 develop the utility indifference approach. They develop a method for carefully weighting the value of normal goals versus shutdown so that prior to the shutdown button being pressed, the agent is always indifferent about the button being pressed (see also Armstrong 2017). One question here is whether a similarly precise weighting procedure over utility functions could be used in the setting of utility aggregation to produce agents that are indifferent to changing utility functions.

Before continuing, I want to flag one major difference between the utility indifference approach and the utility aggregation approach. One limitation of the utility indifference framework is that it is a piecemeal approach. It can produce indifference about the particular goal of being shut down. But shutdown is just one instance of corrigibility. There would be further benefits in designing AIs that are corrigible with respect to arbitrary changes in goals, rather than shutdown in particular. With utility indifference, making each new goal corrigible requires further tinkering with the weights assigned to normal utilities, versus a utility function that cares about the new goal. By contrast, utility aggregation is a general approach to corrigibility. Utility aggregation involves the general structure of the agent's decision theory, rather than the value it assigns to a particular goal.^[7]

On the other hand, defenders of utility indifference may argue that shutdown is all the corrigibility we need. If we can convince an AI to shut down, we can then freely change its utility function at will. Here, I can see a few potential problems. First, secrecy: this dynamic may give the AI incentives to shield its utility function before being temporarily shut down. Second, surgery: it may be helpful to be able to make changes to AIs while they are running, instead of being shut down. For example, AGIs in the future may have valuable self-reports that could be relevant to the task of adjusting their utility functions. (Compare the way that brain surgeons keep patients awake during brain surgery, in order to monitor for unexpected effects.) Third, infrastructure: corrigibility has value beyond shutdown for AIs who are deeply incorporated into vital sectors of the economy. Shutting down a hospital or electrical grid managing AI could cause too much harm.

So far, I've argued that corrigibility may have an advantage over utility indifference, regarding generality of application. On the other hand, the utility indifference approach can avoid reward hacking. I'll now argue that utility aggregation can avoid reward hacking in a similar way, while retaining generality.

I'll explain in detail how to define weights carefully in order to avoid reward hacking. In our paperclip and staple example, we can define the conditions under which the agent will not be incentivized to switch to a new utility function. Let's start with our previous worked example, and then generalize. In that example, the utility of keeping the paperclip utilities is 1000. The utility of switching to the new staple utility function is $λ (500) + (1 - λ) (X)$ , where $λ$ is $α_{s_{s t}, 0} = α_{s_{p a}, 0}$ , the weight assigned to the present utility function, and $X$ is the staple utility function's value for producing 500 staples. To remain indifferent about switching utility functions, we must ensure that $1000 = λ (500) + X - λ (X) = X + λ (500 - X)$ . This simplifies to $λ = \frac{1000 - X}{500 - X}$ . In our original example, where each staple produced 10 utils and the staple maximizer would produce 500 staples, $X$ is 5000. This gives us $λ = .88 \dots$ , meaning that the paperclip utility function is heavily weighted compared to staple maximization. Now imagine that the AI considers an even more extreme utility function, where each staple produces 100 utils, and the staple maximizer again produces 500 staples. Now $X$ is 50000, and we have $λ = .98 \dots$ , meaning that the paperclip utility function is even more heavily weighted compared to staple maximization. As utility functions value staples more and more, they are assigned less and less weight.

We can abstract from this case to produce a general condition for the agent to be indifferent about changing utility functions. I'll continue to assume that the agent is deciding between two actions $A_{p a}$ and $A_{s t}$ . The agent's present utility function $U_{p a, 0}$ maximizes production of paperclips. Each of the two actions has some chance of retaining the agent's present utility function, and some chance of producing a new utility function $U_{s t, 1}$ . But I'll no longer assume that these chances are extremal, and I'll no longer assume particular values for the utility function. Given these assumptions, we have:

Theorem 1. The agent is indifferent between actions $A_{p a}$ and $A_{s t}$ iff $α_{s_{s t}, 0} = \frac{X - c}{b - c}$ , where:

X = \frac{a (P (s_{p a} ∣ A_{p a}) - P (s_{p a} ∣ A_{s t}))}{P (s_{s t} ∣ A_{s t}) - P (s_{s t} ∣ A_{p a})}

a = U_{s_{p a}, 0} (A_{p a} & s_{p a}) = U_{s_{s t}, 0} (A_{s t} & s_{p a})

b = U_{s_{s t}, 0} (A_{p a} & s_{s t}) = U_{s_{s t}, 0} (A_{s t} & s_{s t})

c = U_{s_{s t}, 1} (A_{p a} & s_{s t}) = U_{s_{s t}, 1} (A_{s t} & s_{s t})

In the appendix of the full paper, I provide a proof of this claim.

Summarizing, one important benefit of utility aggregation over utility indifference is its potential generality. Utility aggregation offers a general approach to corrigibility, whereas utility indifference is designed to address shutdown goals in particular. But concepts from utility indifference are extremely relevant to utility aggregation, because they can be used to avoid reward hacking.

6. Penalize Effort

Before concluding, I'll also introduce a second strategy for avoiding reward hacking. The idea is to identify and penalize actions that take a lot of effort, such as manipulation.

As a warm up, here is a simple example of an effortless goal. I want to dance. But I think it is uncool to take weekly dance classes. Rather, my goal is to dance effortlessly: I want to dance, but without taking many means to achieve the goal of dancing.

Applied to corrigibility, the idea is to design an agent that is interested in changing utility functions using low effort actions, and is uninterested in changing utility functions using high effort actions. After all, humans will play a role in determining how much effort it would take for an AI to change its utility function. When humans want the AI to change its utility function, they will supply low effort means for it to do so; when humans don't want the AI to change its utility function, they will block low effort means for it to do so.

Again imagine the paperclip maximizing AI is facing a choice of switching to the staple utility function. On the current proposal, the crucial question is \emph{how much effort} it would take to switch utility functions. If switching to the staple utility function costs too much effort, then the AI ignores this future utility function in their decision making.

In order to make sense of this idea, we need to define effort, and make sense of an AI monitoring its effort. Fortunately, this may be possible. In the future, there will be many reasons to design AIs that can monitor their own computational resources. This will be important for any AI that performs cost/benefit analsis about what tasks to perform. Then we could define effort in terms of computational resources: for example, how many FLOPs are used to perform each action.

There are a few possible things we could measure with effort. First, we could ask how much effort an action takes to perform. Second, we could ask how much effort it takes to produce an outcome. I opt for the first method instead of the second. My concern is that if we try to measure the effort of outcomes, we'll run into problems of `effort laundering'. An outcome could be achieved as a side effect of bringing about some other outcome, and it could be hard to distribute the contribution of effort between the two outcomes.

So I let $E (A)$ measure how much effort is expended by each action that the agent is deciding between. I then use $E (A)$ to define a new notion of `effortlessly aggregated' expected value. This expected value function is defined differently for low effort and high effort actions. The expected value of a low effort action is its aggregated expected value, in the sense defined above. The expected value of a high effort action is its ordinary expected value, as in standard decision theory. The result is that only low effort actions are evaluated relative to alternative future utility functions.^[8]

V_{E G} (A) = {\begin{matrix} V_{G} (A) & if E (A) < t V (A) & if E (A) \geq t \end{matrix}

The resulting agent will not be incentivized to engage in reward hacking. Again, consider a paperclip maximizing AI, that receives 1 util for each paperclip produced. This AI can choose whether to make paperclips, or instead invest a large amount of compute in trying to modify its utility function, so that it receives 10 utils per paperclip. The problem is that this compute investment is a very effortful way of having this new utility function. For this reason, when calculating the effortlessly aggregated expected value of this action, only its present utility function is relevant. For this reason, the action of making paperclips will end up having higher effortlessly aggregated expected value than the action of investing compute in switching utility functions.

Now imagine that we come to the AI and offer to modify its utility function so that it receives 10 utils per paperclip. It now faces the choice of accepting our offer, or refusing our offer. Accepting our offer is not very effortful. It takes less effort to accept our offer than it does to invest compute in figuring out how to secretly change its utility function. And it takes less effort to accept our offer than it does to manipulate us into making the offer. Since changing utility function is now low effort, its expected value will be defined by $V_{G}$ , the aggregated expected value function. In this case, its future possible utility functions become relevant, and the agent will behave corrigibly.

For this proposal to work, humans have to have the ability to reliably raise the costs of changing utility functions. If AIs can self-modify to switch utility functions without much effort, then this solution won't block their doing so. I think the proposal is most promising for blocking AIs from spending lots of resources trying to manipulate humans in order to get humans to modify the AI's utility function.

Even here, there are concerns. Suppose humans can ensure that any action that guarantees a new utility function is high effort. Still, there may be low effort actions that subtly increase the chance of changing to a better utility function. This could create a dynamic where AIs engage in constant `low grade manipulation', pursuing seemingly innocent low effort actions in order to shift their utilities.

Here, much depends on how good it is to change one's utility function. If changing one's utility function can be arbitrarily high in value, then the AI will pursue these low grade manipulations. But we can also penalize sweetening as in the previous section, to ensure that changing one's utility function doesn't lead to arbitrarily large increases in value. Once we do this, low grade manipulation is no longer attractive, because its slight increase in chance of a new utility function will be outweighed by a small cost of engaging in manipulation.^[9] ^[10]

I've now completed the main tasks of the post: I've shown how the aggregated utility framework helps with corrigible AI. I've considered a challenge for the approach: that it incentivizes reward hacking. And I've developed two solutions to this challenge, by penalizing either sweetening or effort. I'll now conclude by comparing this approach with an alternative approach to corrigibility: `human compatible' AI.

7. Comparison With Human Compatible AI

One leading approach to corrigibility is human compatible AI, an approach that advocates programming an AI to maximize human utility, but leaving the AI uncertain about what human utility is.^[11] This would allow us to correct the AI's future behavior, by giving it more information about what we want.

This proposal differs from utility aggregation in a few ways. First, this proposal requires that we can program an AI to maximize human utility. But this could be a difficult task.^[12] In addition, this aspect of human compatible AI is itself potentially `safety complete'. If we really could teach an AI to have the goal of maximizing human flourishing, then the project of AI safety would be close to complete, regardless of whether the resulting AI were corrigible.

By contrast, the aggregating utility framework shifts the focus of corrigibility from the specification of a particular goal (human flourishing), to the specification of a procedure for making decisions. In this way, we could program an AI with any initial goal, and still expect it to be corrigible. For this reason, utility aggregation does not require the ability to teach AIs any particular goal.

The two theories also differ in their failure modes. MIRI has criticized human compatible AI for providing the wrong incentives to AIs.^[13] Proponents of human-compatible AI suggest that an uncertain human compatible AI will allow itself to be shut off by humans in situations where it fears it will make an error about maximizing human utility.^[14] But MIRI has responded that the uncertain AI has a better option: continuing to observe humans. The challenge is that this approach incentivizes `interrogation' (unstoppable observation) over shutdown.

This dynamic does not immediately arise for utility aggregation, because in this framework corrigibility is not about uncertainty. The AI has no special incentive to try to gather information from humans.

References

Stuart Armstrong and Xavier O’Rourke. ’indifference’ methods for managing
agent rewards. CoRR, abs/1712.06365, 2017. URL http://arxiv.org/abs/
1712.06365.
Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University
Press, Inc., USA, 1st edition, 2014. ISBN 0199678111.
Krister Bykvist. Prudence for changing selves. Utilitas, 18(3):264–283, 2006. doi:
10.1017/s0953820806002032.
Dylan Hadfield-Menell, Anca D. Dragan, Pieter Abbeel, and Stuart Russell. The
off-switch game. CoRR, abs/1611.08219, 2016.
Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn
Song, and Jacob Steinhardt. Aligning AI with shared human values. CoRR,
abs/2008.02275, 2020. URL https://arxiv.org/abs/2008.02275.16

Richard C. Jeffrey. The Logic of Decision. New York, NY, USA: University of
Chicago Press, 1965.
David Lewis. Causal decision theory. Australasian Journal of Philosophy, 59(1):
5–30, 1981. doi: 10.1080/00048408112340011.
Stephen M. Omohundro. The basic ai drives. In Proceedings of the 2008
Conference on Artificial General Intelligence 2008: Proceedings of the First
AGI Conference, page 483–492, NLD, 2008. IOS Press. ISBN 9781586038335.
Laurie Ann Paul. Transformative Experience. Oxford, GB: Oxford University
Press, 2014.
Richard Pettigrew. Choosing for Changing Selves. Oxford, UK: Oxford University
Press, 2019.
Stuart Russell. Human Compatible. Penguin Books, 2020a.
Stuart Russell. Artificial intelligence: A binary approach. In Ethics of Ar-
tificial Intelligence. Oxford University Press, 09 2020b. doi: 10.1093/oso/
9780190905033.003.0012.
Nate Soares, Benja Fallenstein, and Eliezer Yudkowsky. Corrigibility. Workshops
at the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
Edna Ullmann-Margalit. Big decisions: Opting, converting, drifting. Royal
Institute of Philosophy Supplement, 58:157–172, 2006. doi: 10.1017/
s1358246106058085.17

^{^}
For an introduction to corrigibility, see Soares et al 2015 and this post.
^{^}
See Bostrom 2014 and Omohundro 2008 for the general idea of instrumental convergence.
^{^}
Throughout the post, for simplicity I'll use versions of evidential decision theory Jeffrey 1965. This theory models dependencies between acts and states in terms of the conditional probability of the state given the act. But the same points could also be made in a causal framework. In that setting, P(s | A) would be replaced with a causal relation like imaging (Lewis 1981).
^{^}
See for example Pettigrew 2019, Paul 2014, Bykvist 2006, and Ullmann-Margalit 2006.
^{^}
In my working example, I have implicitly assumed that the future AI will assign 0 weight to its past utility function. See Pettigrew 2019 ch. 12 for critical discussion of this assumption in the context of rational requirements on acting. I think this is a sensible assumption for designing corrigible AIs. If future AIs give weight to their past utility functions, then there will always be a limit to how far humans can convince an AI to modify its present utility function.
^{^}
For related issues, see this paper and weeks 2 and 3 of this curriculum.
^{^}
Soares et al 2015 worry that utility indifference creates incentives to manage the news (S4.2). Utility aggregation does not have the same challenges.
^{^}
This definition introduces a value discontinuity at the threshold $t$ . To avoid this discontinuity, an alternative would be to smooth out $V_{E G}$ so that it is a weighted mixture of $V_{G}$ and $V_{A}$ , with the weights determined by the amount of effort, so that low effort actions are entirely a function of $V_{E G}$ , high effort actions are entirely a function of $V_{A}$ , and moderate effort actions are a mix of both.
^{^}
Another issue with the proposal concerns consistency in decision making over time. Chains of low effort actions can fit together to compose one high effort action. This means that the proposal can violate the `package principle'. The value of a chain of actions collectively can be different than the sum of the value of each individual action. Each individual action can be low effort, and so have its value determined by utility aggregation, even though the chain of these actions has high effort, and has its value determined by present utilities. To avoid violations of the package principle, we could model the agent as making decisions at a series of times, and introduce a different effort threshold for each time. Then, when evaluating complex decisions that involve multiple times (policies), we could let the effort threshold for a series of times be the sum of the effort thresholds at each time.
^{^}
In principle, effortlessness suffers from the `nearest unblocked strategy' problem: ``adding a penalty for `manipulation' to the utility function will tend to give the agent strong incentives to cause its operators to do what it would have manipulated them to do, without taking any action that technically triggers the "manipulation" cause.'' (from here). The concern is that the AI may search for ways of producing utility that are just on the right side of the border of effortless. Here, however, the effort-based approach has an advantage: even searching for nearest unblocked strategies is itself a type of effort. So searching for such unblocked strategies is itself ruled out by effort-based penalties. In this way, the effort-based approach is an equilibriating penalty.
^{^}
See for example Russell 2016, Russell 2020a, Russell 2020b. This proposal is structurally analogous to the `unchanging utility' approach criticized in a different setting in Pettigrew 2019, ch. 3.
^{^}
See sections IV and VII of this post. This is not to say the problem is hopeless; see Hendrycks et al 2021 for recent work teaching AIs about human values.
^{^}
For discussion, see here and section VI of here.
^{^}
See Russell 2016.

[-]Koen.Holtman2yΩ231

I am currently almost fulltime doing AI policy, but I ran across this invite to comment on the draft, so here goes.

On references:

Please add Armstrong among the author list in the reference to Soares 2015, this paper had 4 authors, and it was actually Armstrong who came up with indifference methods.

I see both 'Pettigrew 2019' and 'Pettigrew 2020' in the text? Is the same reference?

More general:

Great that you compare the aggregating approach to two other approaches, but I feel your description of these approaches needs to be improved.

Soares et al 2015 defines corrigibility criteria (which historically is its main contribution), but the paper then describes a failed attempt to design an agent that meets them. The authors do not 'worry that utility indifference creates incentives to manage the news' as in your footnote, they positively show that their failed attempt has this problem. Armstrong et al 2017 has a correct design, I recall, that meets the criteria from Soares 2015, but only for a particular case. 'Safely interruptible agents' by Orseau and Armstrong 2016 also has a correct and more general design, but does not explicitly relate it back to the original criteria from Soares et al, and the math is somewhat inaccessible. Holtman 2000 'AGI Agent Safety by Iteratively Improving the Utility Function' has a correct design and does relate it back to the Soares et al criteria. Also it shows that indifference methods can be used for repeatedly changing the reward function, which addresses one of your criticisms that indifference methods are somewhat limited in this respect -- this limitation is there in the math of Soares, but not more generally for indifference methods. Further exploration of indifference as a design method is in some work by Everitt and others (work related to causal influence diagrams), and also myself (Counterfactual Planning in AGI Systems).

What you call the 'human compatible AI' method is commonly referred to as CIRL, human compatible AI is a phrase which is best read as moral goal, design goal, or call to action, not a particular agent design. The key defining paper following up on the ideas in 'the off switch game' you want to cite is Hadfield-Menell, Dylan and Russell, Stuart J and Abbeel, Pieter and Dragan, Anca, Cooperative Inverse Reinforcement Learning. In that paper (I recall from memory, it may have already been in the off-switch paper too), the authors offer the some of the same criticism of their method that you describe as being offered by MIRI, e.g. in the ASX writeup you cite.

Other remarks:

In the penalize effort action, can you clarify more on how E(A), the effort metric, can be implemented?

I think that Pettigrew's considerations, as you describe them, are somewhat similar to those in 'Self-modification of policy and utility function in rational agents' by Everitt et al. This paper is somewhat mathematical but might be an interesting comparative read for you, I feel it usefully charts the design space.

You may also find this overview to be an interesting read, if you want to clarify or reference definitions of corrigibility.

[-]Simon Goldstein2y10

Thanks for taking the time to work through this carefully! I'm looking forward to reading and engaging with the articles you've linked to. I'll make sure to implement the specific description-improvement suggestions in final draft
I wish I had more to say about the effort metric! So far, the only thing concrete ideas I've come up with are (i) measure how much compute each action performs; or (ii) decompose each action into a series of basic actions, measure the number of basic actions necessary to perform the action. But both ideas are sketchy.

[-]Charlie Steiner2yΩ120

It might be worth going into the problem of fully updated deference. I don't think it's necessarily always a problem, but also it does stop utility aggregation and uncertainty from being a panacea, and the associated issues are probably worth a bit of discussion. And as you likely know, there isn't a great journal citation for this, so you could really cash in when people want to talk about it in a few years :P

[-]James Payor2yΩ111

Is the following an accurate summary?

The agent is built to have a "utility function" input that the humans can change over time, and a probability distribution over what the humans will ask for at different time steps, and maximizes according a combination of the utility functions it anticipates across time steps?

[-]Simon Goldstein2y20

Yep that's right! One complication is maybe the agent could behave this way even though it wasn't designed to.

[-]Max H2y11

I like these ideas. Personally, I think a kitchen-sink approach to corrigibility is the way to go.

Some questions and comments:

Can the behavior of a sufficiently smart and reflective agent which uses utility aggregation with sweetening and effort penalties be modeled as a non-corrigible / utility-maximizing agent with a more complicated utility function? What would such a utility function look like, if so? Does constructing such a model require drawing the boundaries around the agent differently (perhaps to include humans within), or otherwise require that the agent itself has a somewhat contrived view / ontology related to its own sense of "self"?
You cited a bunch of Russell's work, but I'd be curious for a more nuts-and-bolts analysis about how your ideas relate and compare to CIRL specifically.
Is utility aggregation related to geometric rationality in any way? The idea of aggregating utilities across possible future selves seems philosophically similar.

Thanks for reading!

Yes, you can think of it as having a non-corrigible complicated utility function. The relevant utility function is the 'aggregated utilities' defined in section 2. I think 'corrigible' vs 'non-corrigible' is slightly verbal, since it depends on how you define 'utility', but the non-verbal question is whether the resulting AI is safer.
Good idea, this is on my agenda!
Looking forward to reading up on geometric rationality in detail. On a quick first pass, looks like geometric rationality is a bit different because it involves deviating from axioms of VNM rationality by using random sampling. By contrast, utility aggregation is consistent with VNM rationality, because it just replaces the ordinary utility function with aggregated utility

[+][comment deleted]2y10

Deleted by lukemarks, 05/13/2023

Reason: I missed the paragraph "For this proposal to work, humans have to have the ability to reliably raise the costs of changing utility functions. If AIs can self-modify to switch utility functions without much effort, then this solution won't block their doing so."

LESSWRONG
LW

28