Suppose Ann asks robot Bob to buy her favorite fruit. Bob doesn't know what Ann's favorite fruit is, but he knows that Ann likes either apples, bananas, or oranges, each with probability 1/3, and doesn't like the other two fruits at all. At the grocery store, his money can buy 10 apples, 10 bananas, 10 oranges, or a basket with 1 apple, 1 banana, and 1 orange. What should he do (assuming that asking Ann is not an option)?

One way for Bob to approach this dilemma is to imagine that he is playing a game against another robot, Carl, in a world that is exactly like ours except that Carl may follow a different policy than Bob. In this game, both robots choose and execute a policy, and the robot who produced a better outcome according to Ann's utility function wins. If they did equally well, the winner is determined randomly. Bob should execute the mixed strategy Nash equilibrium policy in this game. In our example, this is the policy of buying the basket: it has a 2/3 chance of winning against each of the other pure policies.
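To make this concrete, here is a minimal Python sketch (my own illustration, not from the original post) that enumerates the four pure policies and computes, for each pair, the probability that one beats the other under Ann's three equally likely utility functions, counting randomly broken ties as half a win:

```python
from itertools import permutations

# Ann's possible favorite fruits, each with prior probability 1/3.
fruits = ["apple", "banana", "orange"]
prior = {fruit: 1 / 3 for fruit in fruits}

# Pure policies and the bundle of fruit each one buys.
policies = {
    "10 apples":  {"apple": 10, "banana": 0, "orange": 0},
    "10 bananas": {"apple": 0, "banana": 10, "orange": 0},
    "10 oranges": {"apple": 0, "banana": 0, "orange": 10},
    "basket":     {"apple": 1, "banana": 1, "orange": 1},
}

def win_prob(p, q):
    """Probability that policy p produces a better outcome than policy q,
    counting ties (broken randomly) as half a win."""
    total = 0.0
    for favorite, prob in prior.items():
        up = policies[p][favorite]  # Ann only values her favorite fruit
        uq = policies[q][favorite]
        total += prob * (1.0 if up > uq else 0.5 if up == uq else 0.0)
    return total

for p, q in permutations(policies, 2):
    print(f"P({p} beats {q}) = {win_prob(p, q):.2f}")
# The basket beats each single-fruit policy with probability 2/3,
# so it is the pure maximal-lottery winner in this example.
```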

This game is known as a maximal lottery. It was introduced[1] by G. Kreweras as a social decision scheme: a way to aggregate voters' preferences over candidates. In our case, Ann's possible utility functions are the voters, their prior probabilities are the voter fractions, and policies are the candidates.

Learning utility

Imagine now that the option of asking Ann about her favorite fruit, and then buying 10 of it, becomes available. Clearly this would be the winning policy. More generally, learning about the user's utility function can only help the robot win, so the robot always wants to do so whenever it can be done for free.

Theorem

Suppose we are playing the maximal lottery game, and we believe that the human's utility function is $u_i$ with probability $p_i$. We know there is a switch X, which is initially off, such that if the human has utility function $u_i$, she sets X to on with probability $x_i$. (We assume at least one of the $x_i$ is not 0 and at least one is not 1, or there would be nothing to learn from looking at X.) The player doesn't yet know whether X is on, but can find out for free (as measured by every $u_i$). Suppose that in a game where X is known to be on and the prior belief in utility $u_i$ is $\frac{p_i x_i}{\sum_j p_j x_j}$, the policy $\pi_{on}$ is a winning strategy. Suppose that in a game where X is known to be off and the prior belief in utility $u_i$ is $\frac{p_i (1 - x_i)}{\sum_j p_j (1 - x_j)}$, the policy $\pi_{off}$ is a winning strategy. Then a winning strategy in our game is to take a look at X and execute policy $\pi_{on}$ if X is on and $\pi_{off}$ if X is off.

Proof

Denote the policy "execute $\pi_{on}$ if X is on and $\pi_{off}$ if X is off" by $\pi^*$. For any other policy $\pi'$, the probability of policy $\pi^*$ doing better (counting ties, which are broken randomly, as half a win) is

$$P(\pi^* \text{ beats } \pi') = \sum_i p_i x_i \, P(\pi_{on} \text{ beats } \pi' \mid u_i, X \text{ on}) + \sum_i p_i (1 - x_i) \, P(\pi_{off} \text{ beats } \pi' \mid u_i, X \text{ off}).$$

Since the posterior probability of $u_i$ given that X is on is $\frac{p_i x_i}{\sum_j p_j x_j}$, and $\pi_{on}$ is the winning strategy for such priors on $u_i$,

we have $\sum_i p_i x_i \, P(\pi_{on} \text{ beats } \pi' \mid u_i, X \text{ on}) \geq \frac{1}{2} \sum_j p_j x_j$.

Analogously, $\sum_i p_i (1 - x_i) \, P(\pi_{off} \text{ beats } \pi' \mid u_i, X \text{ off}) \geq \frac{1}{2} \sum_j p_j (1 - x_j)$.

Therefore, the whole sum

$$P(\pi^* \text{ beats } \pi') \geq \frac{1}{2} \sum_j p_j x_j + \frac{1}{2} \sum_j p_j (1 - x_j) = \frac{1}{2} \sum_j p_j = \frac{1}{2}.$$

Hence $\pi^*$ is a winning strategy.
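As a quick numerical sanity check of the algebra above (my own addition, not part of the original proof), the sketch below draws random priors $p_i$, switch probabilities $x_i$, and conditional win probabilities of $\pi_{on}$ and $\pi_{off}$ against some arbitrary rival policy, then verifies that whenever the theorem's hypotheses hold, the combined policy $\pi^*$ beats the rival with probability at least 1/2:

```python
import random

def random_instance(n=3):
    """Random priors p_i, switch probabilities x_i, and conditional win
    probabilities of pi_on / pi_off against a fixed rival policy."""
    raw = [random.random() for _ in range(n)]
    p = [v / sum(raw) for v in raw]
    x = [random.random() for _ in range(n)]
    w_on = [random.random() for _ in range(n)]   # P(pi_on beats rival | u_i, X on)
    w_off = [random.random() for _ in range(n)]  # P(pi_off beats rival | u_i, X off)
    return p, x, w_on, w_off

def hypotheses_hold(p, x, w_on, w_off):
    # pi_on wins under the posterior given X is on; pi_off under the posterior given X is off.
    z_on = sum(pi * xi for pi, xi in zip(p, x))
    z_off = sum(pi * (1 - xi) for pi, xi in zip(p, x))
    on_ok = sum(pi * xi * w for pi, xi, w in zip(p, x, w_on)) >= 0.5 * z_on
    off_ok = sum(pi * (1 - xi) * w for pi, xi, w in zip(p, x, w_off)) >= 0.5 * z_off
    return on_ok and off_ok

def conclusion_holds(p, x, w_on, w_off):
    # P(pi* beats rival) = sum_i p_i [x_i w_on_i + (1 - x_i) w_off_i] >= 1/2.
    total = sum(pi * (xi * won + (1 - xi) * woff)
                for pi, xi, won, woff in zip(p, x, w_on, w_off))
    return total >= 0.5 - 1e-12

checked = 0
for _ in range(100_000):
    instance = random_instance()
    if hypotheses_hold(*instance):
        assert conclusion_holds(*instance)
        checked += 1
print(f"Conclusion held in all {checked} sampled instances satisfying the hypotheses.")
```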

A problem with some approaches to utility elicitation, such as the one in [2], is that the agent can be disincentivized to learn certain truths about the user's utility function. This is known as motivated value selection, and is illustrated by the "cake or death" problem[3].

In the naïve cake or death problem, the agent is equally unsure between two utility functions: one assigns utility 1 to every cake it makes, the other assigns utility 1 to every death it causes. The agent is capable of either baking 1 cake or causing 3 deaths. As I've shown, an agent playing the maximal lottery game will want to ask the user whether cake or death is good.
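A minimal sketch of this calculation (my own illustration; the "ask first" option and its payoffs are assumptions about how asking and then complying would play out):

```python
# Two equally likely utility functions, plus a hypothetical "ask first" policy
# that simply does whatever the user answers.
prior = {"cake is good": 0.5, "death is good": 0.5}
payoff = {  # utility of each policy's outcome under each utility function
    "bake 1 cake":    {"cake is good": 1, "death is good": 0},
    "cause 3 deaths": {"cake is good": 0, "death is good": 3},
    "ask first":      {"cake is good": 1, "death is good": 3},
}

def win_prob(p, q):
    """Probability that p's outcome beats q's, counting ties as half a win."""
    return sum(w * (1.0 if payoff[p][u] > payoff[q][u]
                    else 0.5 if payoff[p][u] == payoff[q][u] else 0.0)
               for u, w in prior.items())

for rival in ("bake 1 cake", "cause 3 deaths"):
    print(f"P(ask first beats {rival}) = {win_prob('ask first', rival):.2f}")
# Asking beats either fixed choice with probability 0.75, so the
# maximal-lottery agent prefers to ask.
```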

In the sophisticated cake or death problem, the agent is capable of either baking 1 cake or causing 1 death. One of these tasks is certain to succeed, and the other is hard and only succeeds 10% of the time, but the agent doesn't know which is which. The knowledge offered to the agent is "The moral task is hard". While this knowledge does not help the agent on its own, the agent also has no reason to avoid learning it. If it thinks there is a chance it will learn which task is hard before trying one, it will want to learn this information.

Cooperative Inverse Reinforcement Learning[4] is another game-based approach to value learning that solves the motivated value selection problem. CIRL treats human-AI interaction as a cooperative game in which the human, but not the AI, knows the utility function, and both players are rewarded for the AI's ability to maximize it. A problem with CIRL is that it assumes the human executes the game-theoretically optimal strategy: it expects the human to be omniscient and able to see through any manipulation. My approach does not need to assume this about the human.

We do assume, however, that human preferences can be expressed by a utility function, and that the correct function has positive probability in the prior. We also depend on a good enough world model and prior over utility functions. For example, if the model believes with high prior probability that the human really hates talking or being watched, the robot would struggle to learn anything.

Manipulation

To evaluate outcomes, we use the human's utility function as it was at the start of the game. If the robot changes the human's utility function during the game through manipulation or neurosurgery, outcomes are still evaluated using the old function. So the robot is not incentivized to change the human's preferences, unless the human wants them changed. Indeed, the robot probably wants to avoid corrupting the human's preferences by accident before it can learn them, since it would then have less access to the function it is optimizing for. A downside is that if the human's preferences change in a natural way during the game, the robot will still be trying to act according to the preferences of her past self. However, we can choose the start and end of the game as we wish, so we can make the episode last no longer than a few days, and then the value misalignment should be small.

Turnoff-corrigibility

If the agent sees the human trying to turn it off, it could reason that this is evidence that its actions are net negative for the human's utility function, and that turning off is the best thing to do. But it could also believe that the human is mistaken. It could think something like: "the human is trying to turn me off because she cannot find her purse and thinks I stole it. She doesn't realize that her purse just fell under the couch. Instead of letting the human turn me off, I should help her find the purse and resolve the misunderstanding". This sounds fine if the robot's reasoning is correct, but could go wrong if the robot is mistaken.

Limit case

What if the agent learns the user's utility function with certainty? Is it going to maximize it? Under maximal lotteries, the answer is no. Instead, the agent maximizes the probability that its policy does at least as well as any other policy.

For example, suppose the user's utility is equal to his wealth, and the robot can either get him $10 with certainty or $1000 with 10% probability. Then the former policy wins, because it does better 90% of the time. I wondered if this could be fixed by using expected utilities instead of outcomes to compare policies. But then the question is: expected by whom? If we use the robot's beliefs to compute expected utilities, this motivates the robot to delude itself into thinking that its plan is good. If we use Ann's beliefs, this motivates the robot to deceive Ann.

If the gamble is repeated many times during the episode, then the winning policy approaches maximizing expected utility. If the whole episode is one gamble, the rule resulting from the maximal lottery is weird. It cannot just be described as risk aversion: a 60% chance of winning $101 is preferred to a certainty of winning $100. Generally, if there are two independent gambles, one with probability x of winning a larger sum of money and one with probability y of winning a smaller sum, the first is preferred iff x > y / (y + 1).
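As a small check (my own sketch, assuming the two gambles are independent and pay nothing when they fail), the comparison and the threshold can be verified directly:

```python
import random

def prefers_larger_gamble(x, y):
    """Maximal-lottery comparison of two independent gambles: win a larger
    sum with probability x, or a smaller sum with probability y."""
    p_larger_better = x             # if the larger sum pays off, it beats whatever the other gamble pays
    p_smaller_better = (1 - x) * y  # larger gamble pays nothing while the smaller one pays off
    return p_larger_better > p_smaller_better

print(prefers_larger_gamble(x=0.10, y=1.0))  # False: $10 for sure beats a 10% shot at $1000
print(prefers_larger_gamble(x=0.60, y=1.0))  # True: 60% at $101 beats $100 for sure

# The decision matches the x > y / (y + 1) threshold (away from the boundary).
for _ in range(10_000):
    x, y = random.random(), random.random()
    if abs(x - y / (y + 1)) > 1e-9:
        assert prefers_larger_gamble(x, y) == (x > y / (y + 1))
```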

So, with respect to the user's utility function, the agent does not satisfy the independence axiom of rational behavior. This makes sense because, for the agent, earning the user's utility is just an instrumental goal. The outcomes for the agent are winning or losing the maximal lottery game. The independence axiom need not hold for instrumental goals, only for terminal ones.

What are your thoughts on this?

Thanks to Justis Mills for feedback on this post.

[1] G. Kreweras. Aggregation of preference orderings. In Mathematics and Social Sciences I: Proceedings of the seminars of Menthon-Saint-Bernard, France (1–27 July 1960) and of Gösing, Austria (3–27 July 1962), pages 73–79, 1965.

[2] U. Chajewska, D. Koller, and R. Parr. Making Rational Decisions Using Adaptive Utility Elicitation. In Proceedings of AAAI 2000. https://www.aaai.org/Papers/AAAI/2000/AAAI00-056.pdf

[3] S. Armstrong. Motivated Value Selection for Artificial Agents. AAAI 2015 workshop paper. https://www.fhi.ox.ac.uk/wp-content/uploads/2015/03/Armstrong_AAAI_2015_Motivated_Value_Selection.pdf

[4] D. Hadfield-Menell, A. Dragan, P. Abbeel, and S. Russell. Cooperative Inverse Reinforcement Learning. In Advances in Neural Information Processing Systems (NIPS), 2016. https://arxiv.org/abs/1606.03137

Comments

Nice post!

I do think there is a way in which this proposal for solving outer alignment "misses a hard bit", or better said "presupposes outer alignment is approximately already solved".

Indeed, the outer objective you propose isn't completely specified. A key missing part that the AGI would need to have internalized is "which human actions constitute evidence of which utility functions". That is, you are incentivizing it to observe actions in order to reduce the space of possible $u_i$s, but haven't specified how to update on these observations. In your toy example with the switch X, of course we could somehow directly feed the $x_i$ into the AI (which amounts to solving the problem ourselves). But in reality we won't always be able to do this: all small human actions will constitute some sort of evidence the AI needs to interpret.

So for this protocol to work, we already need the AI to correctly interpret any human action as information about their utility function. That is, we need to have solved IRL, and thus outer alignment.

But what if we input this under-specified outer objective and let the AI figure out on its own which actions constitute which evidence?

The problem is that it's not enough for the AI to factually know this; it needs to have it internalized as part of its objective, and thus we need the whole objective to be specified from the start. This is analogous to how it's not enough for an AI to factually know "humans don't want me to destroy the world"; we actually need its objective to care about this (maybe through a pointer to human values). Of course your proposal tries to construct that pointer; the problem is that if you provide an under-specified pointer, the AI will fill it in (during training, or whatever) with random elements, pointing to random goals. That is, the AI won't see "performing updates on the $p_i$ in a sensible manner" as part of its goal, even if it eventually learns how to "perform updates on the $p_i$ in a sensible manner" as an instrumental capability (because you didn't specify that from the start).

But this seems like it's assuming the AI is already completely capable when we give it an outer objective, which is not the case. What if we train it on this objective (providing the actual score at the end of each episode) in hopes that it learns to fill the objective in the correct way?

Not exactly. The argument above can be rephrased as "if you train it on an under-specified objective, it will almost surely (completely ignoring possible inner alignment failures) learn the wrong proxy for that objective". That is, it will learn a way to fill in the "update the $p_i$s" gap which scores perfectly in training but is not what we really would have wanted. So you somehow need to ensure it learns the correct proxy without us completely specifying it, which is again a hard part of the problem. Maybe there's some practical reason (about the inductive biases of SGD etc.) to expect training with this outer objective to be more tractable than training with any rough approximation of our utility function (maybe because you can somehow vastly reduce the search space), but that'd be a whole different argument.

On another note, I do think this approach might be applicable as an outer shell to "soften the edges" when we already have an approximately correct solution to outer alignment (that is, when we already have a robust account of what constitutes "updating the $p_i$s", and also a good enough approximation to our utility function to be confident that our pool of approximate $u_i$s contains it). In that sense, this seems functionally very reminiscent of Turner's low-impact measure AUP, which also basically "softens the edges" of an approximate solution by considering auxiliary $u$s. That said, I don't expect both your approaches to coincide in the limit, since his is basically averaging over some $u$s, and yours is doing something very different, as explained in your "Limit case" section.

Please do let me know if I've misrepresented anything :)