I just noticed that LessWrong has not yet linked to FHI researcher Stuart Armstrong's brief technical report, Utility Indifference (2010). It opens:
Consider an AI that follows its own motivations. We're not entirely sure what its motivations are, but we would prefer that the AI cooperate with humanity; or, failing that, that we can destroy it before it defects. We'll have someone sitting in a room, their finger on a detonator, ready at the slightest hint of defection.
Unfortunately as has been noted... this does not preclude the AI from misbehaving. It just means that the AI must act to take control of the explosives, the detonators or the human who will press the button. For a superlatively intelligent AI, this would represent merely a slight extra difficulty. But now imagine that the AI was somehow indifferent to the explosives going off or not (but that nothing else was changed). Then if ever the AI does decide to defect, it will most likely do so without taking control of the explosives, as that would be easier than otherwise. By "easier" we mean that the chances of failure are less, since the plan is simpler... recall that under these assumptions, the AI counts getting blown up as an equal value to successfully defecting. How could we accomplish this indifference?
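The argument in that last paragraph is easy to make concrete with a toy expected-utility calculation. All of the outcome labels, probabilities, and utilities below are my own illustrative assumptions, not anything from the report; the only ingredient taken from the quote is that, after the change, the AI values getting blown up exactly as much as a successful defection:

```python
# Toy expected-utility comparison. All numbers are made up for illustration;
# outcomes are: the defection succeeds, the AI is caught and stopped, or the
# explosives go off ("blown_up").

def expected_utility(plan, utilities):
    """Expected utility of a plan: sum over outcomes of P(outcome) * U(outcome)."""
    return sum(p * utilities[outcome] for outcome, p in plan.items())

# Two hypothetical defection plans. Seizing the detonator first removes the
# detonation risk, but the extra step makes the plan more likely to be caught;
# ignoring the detonator is simpler, so it is caught less often, but the
# human may press the button.
plans = {
    "seize detonator first": {"success": 0.90, "caught": 0.10, "blown_up": 0.00},
    "ignore the detonator":  {"success": 0.85, "caught": 0.03, "blown_up": 0.12},
}

# Original utilities: being blown up is as worthless as being caught.
U_original = {"success": 10.0, "caught": 0.0, "blown_up": 0.0}
# After the indifference change: being blown up is revalued to equal success.
U_indifferent = {"success": 10.0, "caught": 0.0, "blown_up": 10.0}

for label, U in [("original", U_original), ("indifferent", U_indifferent)]:
    for name, plan in plans.items():
        print(f"{label:12s} {name:24s} EU = {expected_utility(plan, U):.2f}")
    best = max(plans, key=lambda name: expected_utility(plans[name], U))
    print(f"{label:12s} preferred plan: {best}")
```

Under the original utilities the AI prefers the plan that first neutralizes the detonator; once 'blown up' is revalued to match success, the simpler plan that leaves the detonator alone comes out ahead - which is the behaviour the report is arguing for.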
I certainly agree with the conclusion of the note:

The AI will never have any active desire to push the detonator (barring other reasons, like someone saying "push the detonator, and I will donate $1 to the AI"). And this will not stop the AI from defecting, not at all. It will, however, ensure that while defecting, the detonator will not be a priority - it's effectively just an inert lump from the AI's perspective. So the AI will try and grab the nuclear missiles, or hack the president, or whatever, but the guy in the shed by the explosives is low down on the list. Maybe low down enough that they'll be able to react in time.

Two things about the scheme bother me, though.

'Indifference', remember, is just 'U(S) = U(T)' - that is, a successful defection is worth exactly as much as the explosion going off. This is really problematic depending on how we cash out 'defection'. For example, suppose defection involves taking over everything on Earth, including the detonator - then defection looks better than merely taunting the human into pressing the button, since there are two avenues to 'U(S) = U(T)': defecting successfully, or getting blown up along the way.

Another thought: I can't quite follow all the formalism, but the key to indifference seems to be defining a set of worlds where the detonator setup remains unchanged. Doesn't this encourage the AI to seize control of the detonator to make sure it does not change (i.e. detonate)? (If we expand the set of detonator-worlds to include worlds where it has gone off and killed the AI, that might fix this issue, but adding in an explicit high valuing of successful suicide seems like it would not be very useful...)
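To put the first of those two worries in the same toy terms (still reading 'indifference' as the bare 'U(S) = U(T)' equation, and with a made-up value for cooperating): revaluing 'blown up' upward inflates the expected utility of any defection plan that risks detonation, so a plan that looked worse than cooperating can come out ahead after the correction. That is one way the 'two avenues' problem could bite:

```python
# Same toy utilities as above, now comparing a risky defection plan against
# simply cooperating (the value of cooperating is another made-up number).

def expected_utility(plan, utilities):
    return sum(p * utilities[outcome] for outcome, p in plan.items())

risky_defection = {"success": 0.5, "caught": 0.1, "blown_up": 0.4}
U_cooperate = 6.0

U_original = {"success": 10.0, "caught": 0.0, "blown_up": 0.0}
U_indifferent = {"success": 10.0, "caught": 0.0, "blown_up": 10.0}

for label, U in [("original", U_original), ("indifferent", U_indifferent)]:
    print(f"{label:12s} EU(risky defection) = {expected_utility(risky_defection, U):.1f}"
          f"  vs  EU(cooperate) = {U_cooperate:.1f}")

# With the original utilities the risky plan is worth 5.0, so cooperating (6.0)
# wins; after the correction the very same plan is worth 9.0, because success
# and detonation are now two equally valued routes to the top outcome.
```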