RolfAndreassen comments on AI indifference through utility manipulation - Less Wrong

Post author: Stuart_Armstrong 02 September 2010 05:06PM


Comment author: RolfAndreassen 02 September 2010 09:34:25PM 5 points

This is very fine provided you know which part of the AI's code contains the utility function, and are certain it's not going to be modified. But it seems to me that if you were able to calculate the utility of world-outcomes modularly, then you wouldn't need an AI in the first place; you would instead build an Oracle, give it your possible actions as input, and select the action with the greatest utility. Consequently, if you have an AI, it is because your utility calculation is not a separable piece of code, but some sort of global function of a huge number of inputs and internal calculations. How can you apply a filter to that?
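The Oracle alternative described here amounts to a plain argmax over candidate actions, which only works if utility is a separable, inspectable module. A minimal sketch, under that assumption; the names `predict_outcome` and `utility` are illustrative, not anything defined in the post:

```python
# Hypothetical sketch: if the utility calculation were modular, action
# selection would not need an agent at all -- just score each candidate
# action's predicted outcome and pick the best one.

def oracle_select(actions, predict_outcome, utility):
    """Return the action whose predicted outcome has the highest utility."""
    return max(actions, key=lambda a: utility(predict_outcome(a)))

# Toy usage: outcomes are numbers, utility is identity.
best = oracle_select([1, -2, 3],
                     predict_outcome=lambda a: a * 2,
                     utility=lambda outcome: outcome)
```

The commenter's point is precisely that this separation is unavailable: if utility is a global function of the AI's internal state, there is no `utility` argument to pass in.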

You've assumed away the major difficulty, that of knowing what the AI's utility function is in the first place! If you can simply inspect the utility function like this, there's no need for a filter; you just check whether the utility of outcomes you want is higher than that of outcomes you don't want.

If you know the utility function, you have no need to filter it. If you don't know it, you can't filter it.

Comment author: timtyler 03 September 2010 08:21:51PM 1 point

> But it seems to me that if you were able to calculate the utility of world-outcomes modularly, then you wouldn't need an AI in the first place; you would instead build an Oracle, give it your possible actions as input, and select the action with the greatest utility.

That sounds as though it is just an intelligent machine which has been crippled by being forced to act through a human body.

You suggest that would be better - but how?

Comment author: Stuart_Armstrong 03 September 2010 10:59:39AM 0 points

Good comment.

> You've assumed away the major difficulty, that of knowing what the AI's utility function is in the first place! If you can simply inspect the utility function like this, there's no need for a filter; you just check whether the utility of outcomes you want is higher than that of outcomes you don't want.

Knowing what U is, and figuring out if U will result in outcomes that you like, are completely different things! We have little grasp of the space of possible outcomes; we don't even know what we want, and we can't imagine some of the things that we don't want.

Yes, we do need to have some idea of what U is - or at least something (a simple AI subroutine applying the filter, an AI designing its next self-improvement) has to have some idea. But it doesn't need to understand U beyond what is needed to apply F. And since F is considerably simpler than what U is likely to be...

It seems plausible that F could be implemented by a simple subroutine even across self-improvement.
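The idea that F need not understand U beyond what is needed to apply it can be sketched as a wrapper: F treats U as an opaque callable and only intervenes on outcomes it can recognize. This is an illustrative toy, not the indifference construction from the post; the veto predicate is an assumption standing in for whatever F actually checks:

```python
# Hypothetical sketch: a filter F wraps an opaque utility function U.
# F never inspects U's internals -- it only overrides U's value on
# outcomes matched by a simple predicate, which is why F can be much
# simpler than U itself.

def apply_filter(U, vetoed):
    """Return a filtered utility function: -inf on vetoed outcomes, else U."""
    def filtered(outcome):
        return float('-inf') if vetoed(outcome) else U(outcome)
    return filtered

# Toy usage: U is opaque to F; F only knows how to recognize vetoed outcomes.
U = lambda outcome: outcome ** 2
F_of_U = apply_filter(U, vetoed=lambda outcome: outcome < 0)
```

Because the wrapper is independent of U's implementation, the same few lines could in principle be reapplied to a successor's utility function, which is the sense in which a simple subroutine might survive self-improvement.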