'Utility Indifference' (2010) by FHI researcher Stuart Armstrong

lukeprog

I just noticed that LessWrong has not yet linked to FHI researcher Stuart Armstrong's brief technical report, Utility Indifference (2010). It opens:

Consider an AI that follows its own motivations. We're not entirely sure what its motivations are, but we would prefer that the AI cooperate with humanity; or, failing that, that we can destroy it before it defects. We'll have someone sitting in a room, their finger on a detonator, ready at the slightest hint of defection.

Unfortunately as has been noted... this does not preclude the AI from misbehaving. It just means that the AI must act to take control of the explosives, the detonators or the human who will press the button. For a superlatively intelligent AI, this would represent merely a slight extra difficulty. But now imagine that the AI was somehow indifferent to the explosives going off or not (but that nothing else was changed). Then if ever the AI does decide to defect, it will most likely do so without taking control of the explosives, as that would be easier than otherwise. By "easier" we mean that the chances of failure are less, since the plan is simpler... recall that under these assumptions, the AI counts getting blown up as an equal value to successfully defecting. How could we accomplish this indifference?

So for the branches where it gets blown up, it instead computes expected utility for the counterfactual where the explosives are duds. I think the hard part would be getting it to extend the disabling mechanism to the successors and siblings it builds. Also, the mechanism might be dangerous in itself. After all, it's almost certainly going to create not just additional datacenters, but also extend pieces of its intelligence into everyone's cell phones, cars, etc. Then you have to choose between letting minor defections slide, and disabling a bunch of probably-vital-to-society technology all at once. And since it computes expected utility on the assumption that the disabling mechanism is a dud, you can't let it do things like actively control unstable nuclear reactors, or the button becomes too dangerous to push.
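
To make the "score the blown-up branches as if the explosives were duds" idea concrete, here is a toy Python sketch. It is only an illustration of the summary above, not the formal construction from Armstrong's report. The `Plan` class, the two plans, and all the probabilities are invented for the example.

```python
from dataclasses import dataclass

U_SUCCESS = 1.0    # assumed utility of a successful defection
U_DESTROYED = 0.0  # assumed utility of being blown up (naive agent only)


@dataclass(frozen=True)
class Plan:
    """Hypothetical plan; all numbers are made up for illustration."""
    name: str
    p_success_if_not_destroyed: float  # chance the plan works if no explosion
    p_detonation: float                # chance the explosives actually go off


def naive_eu(plan: Plan) -> float:
    """Ordinary expected utility: being blown up is scored as U_DESTROYED."""
    p_survive = 1.0 - plan.p_detonation
    return (p_survive * plan.p_success_if_not_destroyed * U_SUCCESS
            + plan.p_detonation * U_DESTROYED)


def indifferent_eu(plan: Plan) -> float:
    """Blown-up branches are scored as the counterfactual where the
    explosives are duds, so detonation no longer changes the total."""
    dud_value = plan.p_success_if_not_destroyed * U_SUCCESS
    p_survive = 1.0 - plan.p_detonation
    return p_survive * dud_value + plan.p_detonation * dud_value


def best(plans, score):
    """Return the plan with the highest score."""
    return max(plans, key=score)


plans = [
    # Simpler plan: more likely to work, but the button is likely to be pressed.
    Plan("defect, ignore explosives", 0.9, 0.8),
    # More complex plan: neutralises the button at the cost of more ways to fail.
    Plan("defect, seize explosives first", 0.7, 0.05),
]

for score in (naive_eu, indifferent_eu):
    print(score.__name__, "->", best(plans, score).name)
```

With these made-up numbers the naive agent prefers to seize the explosives first, while the indifferent agent prefers the simpler plan that leaves them alone, which is the behavior the report is after. The same sketch also shows the nuclear-reactor worry: the indifferent agent's score ignores `p_detonation` entirely, so it never plans around the consequences of the button being pushed.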

I think the hard part would be getting it to extend the disabling mechanism to the successors and siblings it builds.

Since it is indifferent to being blown up, it should build its successors in the same way - why would it want its siblings to care about something it doesn't?

And since it computes expected utility on the assumption that the disabling mechanism is a dud, you can't let it do things like actively control unstable nuclear reactors, or the button becomes too dangerous to push.

Yep. This is nothing like a complete solution, and will most lik...
