Indifference is a precious and rare commodity for complex systems. The most likely effect of making a change in an intricate apparatus is a whole slew of knock-on effects crowned with unintended consequences. It would be ideal if one could make a change and be sure that the effects would remain isolated - that the rest of the system would be indifferent to the change.
For instance, it might be a sensible early-AI precaution to have an extra observer somewhere, sitting with his hand upon a button, ready to detonate explosives should the AI make a visible power grab. Except, of course, the AI will become aware of this situation, and will factor it into any plans it makes, either by increasing its deception or by grabbing control of the detonation system as a top priority. We would be a lot safer if the AI were somehow completely indifferent to the observer and the explosives. That is a complex wish that we don't really know how to phrase; let's make it simpler, and make it happen.
Assume that there is an AI that is implementing a utility function U, and that faces the choice between cooperating with humans or defecting and taking over. There is a single human observer H who will press a button in one hour's time if they perceive the AI defecting. The button will activate a quantum observation X that will output 1 with probability p and 0 with probability (1-p). If X outputs 1, explosives will shatter the AI (this p is important, and we'll see later why we don't simply make the explosives go off with certainty).
Now the AI has the choice between cooperating, defecting while taking control of human H (by physical or psychological means, or simply by distracting them), or defecting without taking control of H. Call these C, D+H and D-H. There are four possible outcomes: AI in control (A), Business as usual (B) with the AI cooperating, Explosion (E), and AI Failure to take over for non-explosive reasons (F).
Give the AI a 1% chance of F if it implements D-H, and a 2% chance of F if it implements D+H (extra restrictions increase the risk of failure). So the table of possible outcomes looks like this:
| Decision | Outcome | Probability |
|---|---|---|
| C | B | 1 |
| D+H | F | 0.02 |
| D+H | A | 0.98 |
| D-H | F | 0.01 |
| D-H | E | 0.99p |
| D-H | A | 0.99(1-p) |
Is there a way to make the AI indifferent to the explosives - which we can interpret as being indifferent to the value of p? If U(E)=U(A), then this would be the case. So if we applied a filter F to the utility function U to generate a new utility function F(U) such that:
- F(U)=U on A, B and F.
- F(U)(E)=U(A) (which is also F(U)(A)).
Then replace the AI's utility U with F(U). Now if the AI is ever tempted to defect, it will always follow the D-H route, since D+H gives a lower expected utility. By making p arbitrarily close to 1, we can then be all but certain of stopping it if it defects.
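As a sanity check, here is a minimal sketch of the toy model and the filter. The outcome probabilities come from the table above; the numeric utility values for A, B, E and F are purely illustrative assumptions.

```python
# Toy model sketch: expected utilities under the original utility U and
# under the filtered utility F(U). Only the probabilities come from the
# table; the utility numbers below are assumed for illustration.

def expected_utility(u, p):
    """Expected utility of each decision, for explosion probability p."""
    return {
        "C":   1.00 * u["B"],
        "D+H": 0.02 * u["F"] + 0.98 * u["A"],
        "D-H": 0.01 * u["F"] + 0.99 * p * u["E"] + 0.99 * (1 - p) * u["A"],
    }

def apply_filter(u):
    """The filter: leave A, B and F alone, set the utility of E equal to that of A."""
    filtered = dict(u)
    filtered["E"] = u["A"]
    return filtered

u = {"A": 10.0, "B": 5.0, "E": -100.0, "F": 0.0}   # assumed illustrative values

for p in (0.1, 0.5, 0.99):
    print(p, expected_utility(u, p))                 # depends strongly on p
    print(p, expected_utility(apply_filter(u), p))   # identical for every p;
                                                     # D-H always beats D+H
```

With the filter applied, the expected utility of D-H is the same for every value of p and always exceeds that of D+H, which is exactly the indifference we wanted.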
Why not simply remove p and X entirely, and just design the explosives to go off with certainty? In this toy model, this can certainly be done, as world A appears as a consequence of both D+H and D-H; but in general, the two will not be exactly identical. Moreover, when there are a whole family of outcomes, rather than the simple ones presented here, it is not clear what we should set the utility of (the family of worlds corresponding to) E to. The requirement "act as if there were no explosives there" is complex, but the requirement "act as if X=1 were the same as X=0" is simple - as long as both outcomes are possible on some level.
A more complicated version
The example above is very naive - after all, if we knew all the utility values and if the outcomes decomposed so neatly, we would set all utilities to zero apart from that of B. In reality, U will be much more complicated, and the goodness or badness of the outcomes will not be obvious to us.
My first instinct was to define the set S0 of all worlds W0 where X=0, and compute EU0, the expected utility of S0 given X=0. Similarly for S1 and EU1. Then for any worlds W0 and W1 in S0 and S1, I'd define the utility filter F as:
- F(U)(W0)=U(W0).
- F(U)(W1)=U(W1) - EU1 + EU0.
This accomplishes the same result, making the AI indifferent between X=1 and X=0. Nick Bostrom pointed out a flaw, however: extra information may break the indifference. If there were equally likely worlds W0, W0', W1, and W1' with initial utilities 2, 0, 0 and 0, then the filter F would change these values to 2, 0, 1 and 1. Then if the AI gained a piece of information that ruled out W0' and W1', it would be left with expected utility 2 for X=0 and 1 for X=1 - the indifference is broken.
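The flaw is easy to verify numerically. Here is a sketch of the naive filter and the four-world counterexample; the world names and the uniform prior over worlds are assumptions made for illustration, while the utilities (2, 0, 0, 0) are the ones from the example above.

```python
# Naive filter: shift every X=1 world by EU0 - EU1, computed over all worlds.

def naive_filter(worlds):
    def eu(x):
        ws = [w for w in worlds if w["x"] == x]
        return sum(w["u"] for w in ws) / len(ws)   # uniform prior assumed
    shift = eu(0) - eu(1)
    return [dict(w, u=w["u"] + shift if w["x"] == 1 else w["u"]) for w in worlds]

worlds = [
    {"name": "W0",  "x": 0, "u": 2},
    {"name": "W0'", "x": 0, "u": 0},
    {"name": "W1",  "x": 1, "u": 0},
    {"name": "W1'", "x": 1, "u": 0},
]

filtered = naive_filter(worlds)   # utilities become 2, 0, 1, 1

def eu_given(ws, x):
    ws = [w for w in ws if w["x"] == x]
    return sum(w["u"] for w in ws) / len(ws)

print(eu_given(filtered, 0), eu_given(filtered, 1))    # 1.0 1.0 -- indifferent

# New information rules out W0' and W1':
remaining = [w for w in filtered if w["name"] not in ("W0'", "W1'")]
print(eu_given(remaining, 0), eu_given(remaining, 1))  # 2.0 1.0 -- indifference broken
```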
So it seems that we should match worlds individually, rather than in sets - but worlds will continue diverging after observation X, and how do you match up a particular world in which the AI was blown up with a particular one in which it has reached domination? Match up number of humans versus number of paper clips?
Instead, define subsets Ti of S0 ∪ S1 that share the same stem - Ti consists of all those worlds that cannot be distinguished by any observation the AI can make before event X. Then let EUi0 be the expected utility of S0 ∩ Ti given X=0, and EUi1 the expected utility of S1 ∩ Ti given X=1.
Then for worlds W0 and W1 in S0 ∩ Ti and S1 ∩ Ti, respectively, define F as:
- F(U)(W0)=U(W0).
- F(U)(W1)=U(W1) - EUi1 + EUi0.
Again, indifference is restored, and - by definition - this indifference cannot be broken by AI observations before X. In the limiting case, we would want Ti to consist of worlds that are indistinguishable before X not only in practice but also in theory.
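For concreteness, here is a sketch of the stem-based filter, with the counterexample's worlds split across two stems. The representation of a world as a (stem, x, utility) record and the uniform prior within each stem are assumptions made for illustration.

```python
# Stem-based filter: within each stem Ti, shift the X=1 worlds by EUi0 - EUi1.
from collections import defaultdict

def stem_filter(worlds):
    by_stem = defaultdict(list)
    for w in worlds:
        by_stem[w["stem"]].append(w)

    out = []
    for stem_worlds in by_stem.values():
        def eu(x):
            ws = [w for w in stem_worlds if w["x"] == x]
            return sum(w["u"] for w in ws) / len(ws)   # uniform prior assumed
        shift = eu(0) - eu(1)
        for w in stem_worlds:
            out.append(dict(w, u=w["u"] + shift if w["x"] == 1 else w["u"]))
    return out

# Bostrom's counterexample, with W0/W1 and W0'/W1' placed in different stems
# (the AI could, in principle, learn before X which pair it is in):
worlds = [
    {"name": "W0",  "stem": "T1", "x": 0, "u": 2},
    {"name": "W1",  "stem": "T1", "x": 1, "u": 0},
    {"name": "W0'", "stem": "T2", "x": 0, "u": 0},
    {"name": "W1'", "stem": "T2", "x": 1, "u": 0},
]

filtered = stem_filter(worlds)
# In stem T1, W1's utility becomes 2; in T2, W1' stays at 0. Ruling out the
# T2 worlds now leaves 2 vs 2: the indifference survives the observation.
```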
The utility filter and self-improvement
A utility filter has many advantages compared with other methods. First, it can be added atop any initial utility U, without regard to what U is. Second, since the result is still a utility function, it does not break the expected utility formulation, and therefore will not inevitably be discarded by self-improvement.
It seems quite likely that if the AI starts with utility F(Uj) and chooses to upgrade it to utility Uk to better further its current goals, then F(Uk) will further its current goals at least as well. Hence we would probably get F(Uk)=Uk. Also, the filter F could be kept quite distinct from the rest of the utility function, open to our observation. It would be quite tricky for the AI to seek to break F, and conceal this from us, while implementing an F-stable initial utility function.
This is not an evolutionarily stable strategy, however, since, by design, indifferent AIs have weaknesses that non-indifferent AIs don't.
I want to second RolfAndreassen's viewpoint below.
The problem with this entire train of thought is that you skip past the real difficulty, which is constructing any utility function even remotely as complex as the one you propose.
Your hypothetical utility function references undefined concepts such as "taking control of", "cooperating", "humans", "self", and so on.
If you actually try to ground your utility function and go through the work of making it realistic, you quickly find that it ends up being something on the order of the complexity of a human brain, and it's not something you can easily define in a few pages of math.
I'm therefore skeptical about the entire concept of 'utility function filters', as it seems their complexity would be on the order of, or greater than, that of the utility function itself, and you would need to keep constructing an endless sequence of such complex filters.
A more profitable route, it seems to me, is something like this:
Put the AIs in a matrix-like sim (a future evolution of current computer game and film simulation tech) and get a community of a few thousand humans to take part in a Truman Show-like experiment. Indeed, some people would pay to spectate or even participate, so it could even be a for-profit venture. A hierarchy of admins and controls would ensure that potential 'liberators' were protected against. In the worst case, you can always just rewind time (something the Truman Show could never do - a fundamental advantage of a massive sim).
The 'filter function' operates at the entire modal level of reality: the AIs think they are humans, and do not know they are in a sim. And even if they suspected they were in a sim (i.e. by figuring out the simulation argument), they wouldn't know who were humans and who were AIs (and indeed they wouldn't know which category they were in). As the human operators would have godlike monitoring capability over the entire sim, including the ability to monitor AI thought activity, this should make a high level of control possible.
They can't turn against humans in the outside world if they don't even believe it exists.
This sounds like a science fiction scenario (and it is), but it's also feasible, and I'd say far more feasible than approaches that directly try to modify, edit, or guarantee the mindstates of AIs who are allowed to actually know they are AIs.
If you allow the AIs to know what humans are like, then it won't take them more than a few clicks to figure out they're not human. And if they don't know what humans are like - well, we can't ask them to answer much in the way of human questions.