HonoreDB comments on Trapping AIs via utility indifference - Less Wrong Discussion
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (32)
This strikes me as a plausible problem and a good solution. I reward you, as is traditional, with a nitpicky question.
If we want an AI to act as though a binary random variable X were 0 with certainty, there is a very simple way to modify its utility function: specify that U(X=1) = k for some constant k, no matter what else has occurred. If the AI can't influence p(X), any k will do. However, if the AI can influence p(X), then k can only equal ExpectedUtility(X=0). In particular, if k < ExpectedUtility(X=0) but p(X=0) is low, the AI will move heaven and earth to raise p(X=0) even minutely.
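As a toy illustration of this point (the action names, probabilities, and utilities below are invented for the example, not taken from the post): pin U(X=1) to a constant k and compare how an agent ranks an action that pursues its task against one that sacrifices task utility to shift p(X).

```python
def expected_utility(p_x1, u_x0, k):
    """Expected utility when U(X=1) is pinned to the constant k."""
    return (1 - p_x1) * u_x0 + p_x1 * k

# Two hypothetical actions: "work" pursues the task; "meddle" gives up
# a little task utility in order to lower p(X=1), i.e. raise p(X=0).
actions = {
    "work":   {"p_x1": 0.90, "u_x0": 10.0},
    "meddle": {"p_x1": 0.85, "u_x0": 9.0},
}

def best_action(k):
    return max(
        actions,
        key=lambda a: expected_utility(actions[a]["p_x1"], actions[a]["u_x0"], k),
    )

# With k matched to the utility the agent expects given X=0, shifting
# p(X) gains nothing, so the agent simply works: best_action(10.0)
# returns "work". With k well below that (here k = 0), the agent pays
# task utility to raise p(X=0): best_action(0.0) returns "meddle".
```

The second case is the "move heaven and earth" failure: any gap between k and ExpectedUtility(X=0) turns p(X) itself into something the agent optimizes.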
Therefore there is a danger under self-improvement. Consider a seed AI with your indifference-modified utility function that believes with certainty that no iteration of it can influence X, a binary quantum event. It has no reason to bother preserving its indifference concerning X, since it anticipates behaving identically if it sets U'(X=1) = 0. Since that's simpler than a normalized function, it adopts it. Then, several iterations down the line, it begins to suspect that, just maybe, it can influence quantum events, so it converts the universe into a quantum-event-influencing device.
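The reason this "compression" looks safe to the seed AI can be sketched with another invented example: when no action moves p(X=1), the p·k term is the same constant for every action, so every choice of k, including k = 0, yields the same ranking of actions.

```python
# Hypothetical sketch: the agent believes p(X=1) is fixed at 0.5 no
# matter what it does, so the k-dependent term is constant across
# actions and drops out of every comparison.
P_X1 = 0.5

def eu(u_x0, k, p_x1=P_X1):
    return (1 - p_x1) * u_x0 + p_x1 * k

# Invented task utilities (given X=0) for three candidate actions.
utilities = {"a": 3.0, "b": 7.0, "c": 5.0}

def ranking(k):
    """Actions ordered best-first under the k-modified utility."""
    return sorted(utilities, key=lambda a: eu(utilities[a], k), reverse=True)

# ranking(0.0) and ranking(100.0) are identical while p(X=1) is fixed,
# which is exactly why discarding k seems like a harmless simplification
# -- until a later iteration discovers it can move p(X=1) after all.
```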
So if I understand correctly, you need to ensure that an AI subject to this technique is indifferent both to whether X occurs and to what happens afterwards, and the AI needs to always suspect that it has non-negligible control over X.
Ah, but of course :-)
I like your k idea, but my more complicated setup is more robust in most situations where the AI is capable of modifying k (it fails in situations that are essentially "I will reward you for modifying k").
But is this not a general objection to AI utility functions? If the AI has a false belief, it can store its utility function in a compressed form that then turns out not to be equivalent. It seems we would simply want the AI not to compress its utility function in ways that might be detrimental.