Indifference is a precious and rare commodity for complex systems. The most likely effect of making a change in an intricate apparatus is a whole slew of knock-on effects crowned with unintended consequences. It would be ideal if one could make a change and be sure that the effects would remain isolated - that the rest of the system would be indifferent to the change.
For instance, it might be a sensible early-AI precaution to have an extra observer somewhere, sitting with his hand upon a button, ready to detonate explosives should the AI make a visible power grab. Except, of course, the AI will become aware of this situation, and will factor it in in any plans it makes, either by increasing its deception or by grabbing control of the detonation system as a top priority. We would be a lot safer if the AI were somehow completely indifferent to the observer and the explosives. That is a complex wish that we don't really know how to phrase; let's make it simpler, and make it happen.
Assume that there is an AI that is implementing a utility function U, and that faces the choice between cooperating with humans or defecting and taking over. There is a single human observer H who will press a button in one hour's time if they perceive the AI defecting. The button will activate a quantum observation X that will output 1 with probability p and 0 with probability (1-p). If X outputs 1, explosives will shatter the AI (this p is important, and we'll see later why we don't simply make the explosives go off with certainty).
Now the AI has the choice between cooperating, defecting while taking control of human H (by physical or psychological means, or simply by distracting them), or defecting without taking control of H. Call these C, D+H and D-H. There are four possible outcomes: AI in control (A), Business as usual (B) with the AI cooperating, Explosion (E), and AI Failure to take over for non-explosive reasons (F).
Give the AI a 1% chance of F if it implements D-H, and a 2% chance of F if it implements D+H (the extra task of taking control of H increases the risk of failure). So the table of possible outcomes looks like this:
| Decision | Outcome | Probability |
|---|---|---|
| C | B | 1 |
| D+H | F | 0.02 |
| D+H | A | 0.98 |
| D-H | F | 0.01 |
| D-H | E | 0.99p |
| D-H | A | 0.99(1-p) |
Is there a way to make the AI indifferent to the explosives - which we can interpret as being indifferent to the value of p? If U(E)=U(A), then this would be the case. So if we applied a filter F to the utility function U to generate a new utility function F(U) such that:
- F(U)=U on A,B and F.
- F(U)(E)=U(A) (which is also F(U)(A)).
Then we can replace the AI's utility U with F(U). Now if the AI is ever tempted to defect, it will always follow the D-H route, since D+H gives a lower expected utility. By making p arbitrarily close to 1, we can then be almost certain of stopping it when it defects.
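To make the toy model concrete, here is a minimal sketch in Python. The numeric utilities assigned to A, B, E and F are illustrative assumptions, not part of the proposal; only the filter and the failure rates from the table matter for the argument.

```python
# Toy model of the decision table above, with made-up utility values.

def expected_utility(outcome_probs, utility):
    """Expected utility of a decision, given its distribution over outcomes."""
    return sum(prob * utility[outcome] for outcome, prob in outcome_probs.items())

def decision_table(p):
    """Outcome distributions for C, D+H and D-H, as in the table above."""
    return {
        "C":   {"B": 1.0},
        "D+H": {"F": 0.02, "A": 0.98},
        "D-H": {"F": 0.01, "E": 0.99 * p, "A": 0.99 * (1 - p)},
    }

U = {"A": 10.0, "B": 1.0, "E": -100.0, "F": -5.0}  # illustrative values only

# The filter: F(U) agrees with U on A, B and F, but sets F(U)(E) = U(A).
FU = dict(U, E=U["A"])

for p in (0.1, 0.9, 0.999):
    print(p, {d: round(expected_utility(probs, FU), 3)
              for d, probs in decision_table(p).items()})
# Under F(U), the expected utility of D-H (9.85 here) no longer depends on p
# and always beats D+H (9.7), so raising p towards 1 costs the AI nothing by
# its own lights.
```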
Why not simply remove p and X entirely, and just design the explosives to go off with certainty? In this toy model, this can certainly be done, as world A appears as a consequence of both D+H and D-H; but in general, the two will not be exactly identical. Moreover, when there are a whole family of outcomes, rather than the simple ones presented here, it is not clear what we should set the utility of (the family of worlds corresponding to) E to. The requirement "act as if there were no explosives there" is complex, but the requirement "act as if X=1 were the same as X=0" is simple - as long as both outcomes are possible on some level.
A more complicated version
The example above is very naive - after all, if we knew all the utility values and if the outcomes decomposed so neatly, we would set all utilities to zero apart from that of B. In reality, U will be much more complicated, and the goodness or badness of the outcomes will not be obvious to us.
My first instinct was to define the set S0 of all worlds W0 where X=0, and compute EU0, the expected utility of S0 given X=0. Similarly for S1 and EU1. Then for any worlds W0 and W1 in S0 and S1, I'd define the utility filter F as:
- F(U)(W0)=U(W0).
- F(U)(W1)=U(W1) - EU1 + EU0.
This accomplishes the same result, making the AI indifferent between X=1 and X=0. Nick Bostrom pointed out a flaw, however: extra information may break the indifference. If there were equally likely worlds W0, W0', W1 and W1' with initial utilities 2, 0, 0 and 0, then the filter F would change these values to 2, 0, 1 and 1. Then if the AI gained a piece of information that ruled out W0' and W1', it would be left with expected utility 2 for X=0 and 1 for X=1 - the indifference is broken.
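A small sketch makes the failure concrete, using the four equally likely worlds above; the dictionary representation of worlds as (X-value, probability, utility) is just an illustrative assumption.

```python
# The naive filter and Bostrom's counterexample.
# `worlds` maps a world name to (x, probability, utility).

def conditional_eu(worlds, x):
    """Expected utility over the worlds where X = x."""
    sel = [(p, u) for (xv, p, u) in worlds.values() if xv == x]
    return sum(p * u for p, u in sel) / sum(p for p, _ in sel)

def naive_filter(worlds):
    """Apply F(U)(W1) = U(W1) - EU1 + EU0 to every X=1 world."""
    eu0, eu1 = conditional_eu(worlds, 0), conditional_eu(worlds, 1)
    return {name: (x, p, u if x == 0 else u - eu1 + eu0)
            for name, (x, p, u) in worlds.items()}

worlds = {
    "W0":  (0, 0.25, 2.0),
    "W0'": (0, 0.25, 0.0),
    "W1":  (1, 0.25, 0.0),
    "W1'": (1, 0.25, 0.0),
}
filtered = naive_filter(worlds)
print({k: v[2] for k, v in filtered.items()})   # utilities become 2, 0, 1, 1

# New information rules out W0' and W1'; indifference is broken:
surviving = {k: v for k, v in filtered.items() if k in ("W0", "W1")}
print(conditional_eu(surviving, 0), conditional_eu(surviving, 1))   # 2.0 vs 1.0
```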
So it seems that we should match worlds individually, rather than in sets - but worlds will continue diverging after observation X, and how do you match up a particular world in which the AI was blown up with a particular one in which it has reached domination? Match up number of humans versus number of paper clips?
Instead, define subsets Ti of S0 ∪ S1 that share the same stem - Ti consists of all those worlds that cannot be distinguished by any observation the AI can make before event X. Then let EUi0 be the expected utility of S0 ∩ Ti given X=0, and EUi1 the expected utility of S1 ∩ Ti given X=1.
Then for worlds W0 and W1 in S0 ∩ Ti and S1 ∩ Ti respectively, define F as:
- F(U)(W0)=U(W0).
- F(U)(W1)=U(W1) - EUi1 + EUi0.
Again, indifference is restored, and - by definition - this indifference cannot be broken by AI observations before X. In the limiting case, we would want Ti to consist of worlds that are indistinguishable before X not only in practice but also in theory.
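Here is a rough sketch of the stem-partitioned filter, under the assumption that each world is tagged with the stem Ti it belongs to (i.e. with everything the AI can observe before X); the representation is illustrative, not part of the proposal.

```python
# Stem-partitioned filter: correct X=1 worlds within each stem T_i.
from collections import defaultdict

def stem_filter(worlds):
    """Apply F(U)(W1) = U(W1) - EU_i1 + EU_i0 within each stem T_i.

    `worlds` maps a world name to (stem, x, probability, utility)."""
    by_stem = defaultdict(lambda: {0: [], 1: []})
    for stem, x, p, u in worlds.values():
        by_stem[stem][x].append((p, u))

    def eu(ws):
        return sum(p * u for p, u in ws) / sum(p for p, _ in ws)

    shift = {stem: eu(groups[0]) - eu(groups[1])
             for stem, groups in by_stem.items()}
    return {name: (stem, x, p, u if x == 0 else u + shift[stem])
            for name, (stem, x, p, u) in worlds.items()}

# The earlier counterexample, assuming W0 and W1 share one stem while
# W0' and W1' share another (an observation before X rules them out together):
worlds = {
    "W0":  ("T1", 0, 0.25, 2.0), "W1":  ("T1", 1, 0.25, 0.0),
    "W0'": ("T2", 0, 0.25, 0.0), "W1'": ("T2", 1, 0.25, 0.0),
}
print({k: v[3] for k, v in stem_filter(worlds).items()})
# {'W0': 2.0, 'W1': 2.0, "W0'": 0.0, "W1'": 0.0} -- ruling out stem T2 now
# leaves the same expected utility (2.0) whether X=0 or X=1.
```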
The utility filter and self-improvement
A utility filter has several advantages over other methods. First, it can be added atop any initial utility U, without regard to what U is. Second, since F(U) is itself a utility function, it does not break the expected utility formulation, and therefore will not inevitably be discarded by self-improvement.
It seems quite likely that if the AI starts with utility F(Uj) and chooses to upgrade it to utility Uk to better further its current goals, then F(Uk) will further its current goals at least as well. Hence we would probably get F(Uk)=Uk. Also, the filter F could be kept quite distinct from the rest of the utility function, open to our observation. It would be quite tricky for the AI to seek to break F, and conceal this from us, while implementing an F-stable initial utility function.
This is not an evolutionarily stable strategy, however, since by design indifferent AIs have weaknesses that non-indifferent AIs don't.
As far as I can tell, a utility function filtered in this way leads to more or less incomprehensible, and certainly undesirable, behavior. (I have more serious reasons to doubt the usefulness of this and similar lines of thinking, but perhaps this is the least debatable.)
Here is something I might do, if I were artificially made indifferent to dying tomorrow. Buy a random lottery ticket. If I lose, kill myself. Conditioned on me not dying, I just won the lottery, which I'm quite happy with. If I die, I'm also quite happy, since my expected utility is the same as not dying.
Of course, once I've lost the lottery I no longer have any incentive to kill myself, with this utility function. But this just shows that indifference is completely unstable: given the freedom to self-modify, I would destroy indifference immediately so that I could execute this plan (or more incomprehensible variants). If you tried to stop me from destroying indifference, I could delude myself to accomplish the same thing, and so on.
Am I misunderstanding something?
Yes. If you are indifferent to dying, and you die in any situation, you get exactly the same utility as if you hadn't died.
If you kill yourself after losing the lottery, you get the same utility as if you'd lost the lottery and survived. If you kill yourself after winning the lottery, you get the same utility as if you'd won the lottery and survived. Killing yourself never increases or decreases your utility, so you can't use it to pull off any tricks.
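A quick numeric check of this point (the win probability and utility values are made up; the only constraint imposed is indifference to dying):

```python
# Indifference to dying: U(outcome, dead) = U(outcome, alive) for every
# lottery outcome. Probabilities and utilities are illustrative.

p_win = 0.01
U = {("win", "alive"): 100.0, ("lose", "alive"): 0.0}
U[("win", "dead")] = U[("win", "alive")]
U[("lose", "dead")] = U[("lose", "alive")]

def expected_utility(policy):
    """policy maps a lottery outcome to 'alive' or 'dead'."""
    return (p_win * U[("win", policy("win"))]
            + (1 - p_win) * U[("lose", policy("lose"))])

print(expected_utility(lambda outcome: "alive"))                            # 1.0
print(expected_utility(lambda o: "dead" if o == "lose" else "alive"))       # 1.0
# Killing yourself after losing never changes your utility, so the plan can't
# be used to "win the lottery conditional on surviving".
```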