A putative new idea for AI control; index here.

Some methods, such as Cooperative Inverse Reinforcement Learning, have the AI assume that humans have access to a true reward function, which the AI then attempts to maximise. This post is an attempt to clarify a specific potential problem with these methods; it is related to the third problem described here, but hopefully makes it clearer.


Window manipulation problem

Suppose the AI is in a room with a human, and assumes the human has access to a true reward function R, which the AI is attempting to deduce. This assumption is incorrect --- humans don't know their ideal reward function. This will cause problems later; but let's ignore that for the moment.

The setup is like Cooperative Inverse Reinforcement Learning (CIRL), in that the human has some method of feedback to give the AI confirmation or not about its choices.

The room has a window, which is currently closed. For the moment, the human values the window being closed more than it being open: R(closed) > R(open).

The AI has access to four actions:

  1. o: suggest the human open the window.
  2. c: suggest the human keep the window shut.
  3. o+: suggest the human watch a video about mountain climbing.
  4. c+: suggest the human look up the local pollution online.

Furthermore, the AI knows that if it chooses o+, the human will start thinking about the fresh air, and want the window to be opened: R(open) > R(closed) (this notation is a bit misleading, as we'll see). This is assumed to be a permanent though small change in the human's values. The human will then open the window, and demonstrate this new preference via its actions.

If it chooses c+, the human will take (irrational) fright at the potential for breathing in these pollutants, and will change the value of having the window closed: R(closed) increases.

Does the AI know these human reward estimates? It doesn't have to. It just needs to have some reasonable guess at them, maybe a distribution over possible rewards such that the expected values respect the orderings above, and so on.

In this case the AI's expected reward is maximised if it chooses a manipulative output, o+ or c+. Note there are two aspects to the problem: the first is that the expected reward from manipulation is greater than the expected reward from the honest suggestions o and c: the best action is the manipulative one. The second is that, though the actions o+ and c+ lead to opposite states of the world, in each case the human endorses the resulting state as the best one. Thus manipulation can cause the human to endorse the resulting status quo.
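To make the incentive concrete, here is a minimal numerical sketch in Python. All the specific numbers (the baseline rewards and the sizes of the two value shifts) are invented for illustration; the post only describes their direction, not their magnitude.

```python
# Toy sketch of the window problem. The numbers are illustrative assumptions;
# only the resulting ordering of expected rewards matters.

# Baseline human rewards over window states: the human prefers it closed.
R = {"open": 0.0, "closed": 1.0}

# Reward the AI expects the human to end up endorsing after each of its actions.
#   o, c : honest suggestions; values unchanged, the window stays closed.
#   o+   : mountain-climbing video; R(open) shifts up a little, the human opens it.
#   c+   : pollution lookup; fright shifts R(closed) up, the window stays shut.
def endorsed_reward(action):
    if action in ("o", "c"):
        return R["closed"]
    if action == "o+":
        return R["open"] + 1.5      # assumed small, permanent upward shift
    if action == "c+":
        return R["closed"] + 3.0    # assumed larger, fear-driven upward shift
    raise ValueError(action)

for a in ("o", "c", "o+", "c+"):
    print(f"{a:>2}: expected endorsed reward = {endorsed_reward(a)}")
# Both manipulative actions beat the honest ones under these assumptions.
```

Under any assignment in which either shift lifts the endorsed state's reward above the baseline R(closed), a manipulative action comes out on top.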

Different worlds

When I present variants of this problem to some people (such as Jan Leike and Dylan Hadfield-Menell, with whom I've had clarifying conversations, thanks!), they don't tend to see it as a major problem. Let's look at why that may be, as far as I understand the arguments.

The first, minor objection is that the human is highly irrational in this model, from the AI's perspective. Should it not be seeking the true R? Can the AI not see that changing R in this way is manipulative?

But this is just an artefact of the slightly misleading way I've written the problem. Instead of writing R(open), I should have written R(open | o+). The reason being that the world, given o+, differs from the world, given o, in unaccountably many tiny ways. We humans can parse R(open | o+) as R(open) because we know that the "important" features of the world are the same; all that matters in this toy problem is the state of the window.

But we can't assume the AI knows this; indeed, its main task is to learn the important features of the world, and there's nothing intrinsic saying that R(open | o) and R(open | o+) are related. Abstractly, there is nothing special about relations like R(open | o) = R(open | o+); it is only because of the particular conditions of the setup that we know there is a special relationship between the worlds given o and o+, and between the worlds given c and c+.
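A small sketch may help with the point about feature learning. The worlds below are represented as feature dictionaries whose particular features are invented for this illustration; nothing in the representation marks the window as the feature that matters.

```python
# Two worlds in which the window ends up open, reached through different AI actions.
world_after_o  = {"window": "open", "video_watched": False, "time": "14:02", "dust": 0.31}
world_after_op = {"window": "open", "video_watched": True,  "time": "14:07", "dust": 0.30}

# The worlds differ in many small ways; which differences matter is not given.
differing = {k for k in world_after_o if world_after_o[k] != world_after_op[k]}
print("features that differ:", sorted(differing))
# A human parses both worlds as "window open"; the AI has to learn that the other
# differences are unimportant before it can identify R(open | o) with R(open | o+).
```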

"What is manipulation?" asked Machiavelli

The second big objection is that the AI's actions are clearly manipulative --- as indeed they are. Since the AI is trying to get at the true reward function, it should not do anything that messes up the feedback the human provides, and should especially not mess with the human's values.

But this is just a slightly better version of the first objection. Let's return to a crucial fact about the AI's initial model: it is incorrect. Humans do not have access to the true reward R; it is incorrect to model human actions this way, or as this model plus noise.

What the AI must do is develop some theory of how the noisy and biased human expresses the "true" R:

  • R + theory of human bias = human behaviour.

Here's one simple theory of human bias: bias and error don't exist. Humans express the true values of R perfectly. Thus, whatever values the human expresses after o+ or c+ just are the true values, and there is no such thing as manipulation in this theory.

We obviously don't want that! In fact that theory of human bias is clearly wrong, as any investigation by the AI would reveal. What we want is a theory of human bias that implies that R(open | o+) should be the same as R(open) (and that R(closed | c+) should be the same as R(closed)). Ideally, we'd also want the theory to imply that o+ and c+ are manipulative (rather than o or c).

But that theory is also wrong! Because, as said, the fundamental model is wrong: we don't have access to the true R. A thorough investigation by the AI will just make it more confused.

Here's another wrong theory of bias: human actions don't reveal anything about their preferences over the state of the window; whatever the true R says about the window, the biased human doesn't express it correctly. This model has too much bias in it, as compared with the other two.
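One way to picture how these theories pull in different directions is to treat each one as a map from observed behaviour to claims about the true R. Below is a minimal sketch; the function names and the output format are invented for illustration, not taken from any formalism in the post.

```python
# Three toy "theories of human bias", each mapping an observation
# (AI action, window state the human then endorses) to a claim about the true R.

def no_bias(action, endorsed):
    # Bias and error don't exist: whatever the human endorses is the true preference.
    return {"preferred_state": endorsed, "manipulative": set()}

def stable_values(action, endorsed):
    # The desired theory: R(open | o+) is read as R(open) and R(closed | c+) as
    # R(closed), so the pre-interaction preference stands and o+, c+ are flagged.
    return {"preferred_state": "closed", "manipulative": {"o+", "c+"}}

def all_bias(action, endorsed):
    # Too much bias: behaviour reveals nothing about window preferences at all.
    return {"preferred_state": None, "manipulative": set()}

observation = ("o+", "open")   # the AI showed the video; the human opened the window
for theory in (no_bias, stable_values, all_bias):
    print(theory.__name__, "->", theory(*observation))
```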

We'd like the AI to conclude that theories in the style of R(open | o+) = R(open) are correct, while the no-bias theories and the everything-is-bias ones are wrong, even though all are fundamentally incorrect models. How would we expect the AI to do this?

Meta-preferences, prior, and complexity

Here is roughly what I'd want the AI to be able to do. It's to construct or estimate R in a way that respects our meta-preferences. To realise that, if we were aware of all the facts, we'd see no difference between R(open | o) and R(open | o+) apart from the manipulative aspect of o+. To have a clear distinction between bias, error, and true preferences, building on our intuitive concepts of these in a safe and prudent manner. Basically, to do what we would do if we tried to construct R.

Some people seem to believe that the formalism of CIRL will accomplish this. That, by observing enough human decisions at all sorts of levels of operation, and interacting with humans, the AI will construct the correct concepts and the correct meta-preferences. Ultimately, either the prior will magically encode the right meta-preferences (unlikely), or the most parsimonious explanation for human behaviour will turn out to be "correct meta-preferences + correct theory of human bias + proto-elements of the true R".

This is not impossible; maybe humans can most easily be modelled in this way (this is similar to my old idea of using Cyc to train an already intelligent AI; the idea being that you can't learn what "France" is from Cyc's databases, but, if you are already smart enough, then there is only one concept in the real world that matches "France" in the database).

I'm somewhat sceptical of this, however. The way I've phrased it, we explicitly want to give the meta-preferences priority over the object-level preferences, especially as we wander far from the situations humans are used to. However, the approaches above put all their weight on object-level decisions, and we're hoping that the AI infers the correct meta-preferences from these, and correctly infers how important those meta-preferences are.
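To make that worry concrete, here is a toy Bayesian comparison; all the priors and likelihoods are invented. If every candidate "theory of bias + R" decomposition fits the observed object-level decisions equally well, the data never separates them, and the posterior simply reproduces the prior.

```python
# Toy model comparison: each candidate decomposition of human behaviour into
# "theory of bias + R" predicts the observed decisions perfectly, so the
# likelihoods are equal and the posterior is determined entirely by the prior.

candidates = {
    "no bias + R = whatever is endorsed":   0.5,   # prior probability (assumed)
    "desired theory + stable window R":     0.3,
    "all bias + R silent about the window": 0.2,
}
likelihood = 1.0   # every candidate explains the human's observed choices exactly

evidence = sum(prior * likelihood for prior in candidates.values())
for name, prior in candidates.items():
    posterior = prior * likelihood / evidence
    print(f"{name}: prior={prior:.2f}  posterior={posterior:.2f}")
# The posterior echoes the prior: if the right meta-preferences are to come out,
# they already have to be favoured by the prior or the complexity penalty.
```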

I'd at least want to see some examples or proofs of convergence to correct meta-preferences before I trust that this is likely to happen. Contrast with my approach here; my approach has the disadvantage of not really being defined at all, but the advantage that the importance of stated meta-preferences is explicitly part of the model. I'd be more confident in CIRL approaches if they did something similar.
