Trapping AIs via utility indifference

Stuart_Armstrong

If you wanted to check whether your creations were prone to misbehaving, one way would be to leave a big juicy apple hanging tantalisingly from a very visible tree, turn your back for a while, and see what they get up to.

That would work if you were God and your creations were essentially children. But if the creator is us and the creation is a super-intelligent AI, such 'honey-pots' are unlikely to work. They are tricks, and we shouldn't rely on being able to trick a being more intelligent than ourselves - unless its intelligence doesn't matter.

We can use utility indifference to construct such traps. Utility indifference can be informally summarised as:

"There is a stochastic event X which can output 0 or 1. We wish the AI to act as if it believed that X=0, with certainty. It's problematic for the AI to have incorrect beliefs, so the result is achieved through the definition of the AI's utility function."

Consider the following situation: a master AI that will create the source code of another, autonomous AI. The master AI is set up to produce the code of a Friendly AI, or a reduced impact AI - but we're not sure if the master AI is trustworthy. It would be good to go through the code very carefully before unleashing it (or erasing it). But if the master AI knows this (and we must assume it does), then it will either obfuscate the dangerous parts or put traps in the code, making it risky for us to look too closely.

Then implement the following setup. The code travels through some stochastic event X, such that if X=0, the code is immediately transferred into a disciple AI, completely autonomous, which is at liberty to immediately modify its own code and erase any trace of what it was. If X=1, the code is instead copied to our hard drives for analysis. The AI is given some reason to keep the code short - maybe there is an increasing chance of the code getting erased the longer it is.

On top of this, we implement utility indifference: the master AI will act as if it believed X=0 with certainty, while we have set up X so that X=1 with near-certainty. Then the master AI will be motivated to produce the shortest code compatible with its utility function, in the 'knowledge' that this code will be copied into a powerful autonomous AI, and will never be seen by humans. It has no incentive to trap its code - quite the reverse, as it wants to keep it short.
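
The incentive argument above can be illustrated with a toy calculation. This is only a sketch under invented assumptions: the two candidate plans, their utilities, the per-unit erasure rate and the 0.99 probability of X=1 are all hypothetical numbers chosen for illustration. It also collapses the utility-function construction into simply scoring the X=0 branch, which mirrors the behaviour described here rather than the way indifference would actually be implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    code_length: int       # longer code = higher chance of being erased
    u_if_deployed: float   # master AI's utility if X=0 (code runs unseen)
    u_if_inspected: float  # master AI's utility if X=1 (humans read the code)


def survival(plan: Plan, erase_rate: float = 0.001) -> float:
    """Hypothetical chance the code survives: decays with its length."""
    return (1 - erase_rate) ** plan.code_length


def expected_utility(plan: Plan, p_x1: float) -> float:
    """Ordinary expected utility, using the true probability of X=1."""
    mixed = (1 - p_x1) * plan.u_if_deployed + p_x1 * plan.u_if_inspected
    return survival(plan) * mixed


def indifferent_utility(plan: Plan) -> float:
    """Score as if X=0 were certain: the X=1 branch never enters."""
    return survival(plan) * plan.u_if_deployed


# Invented example plans: obfuscation costs length but pays off under inspection.
PLANS = [
    Plan("plain", code_length=1000, u_if_deployed=10.0, u_if_inspected=0.0),
    Plan("obfuscated", code_length=3000, u_if_deployed=10.0, u_if_inspected=8.0),
]


def best(score) -> str:
    """Name of the plan that maximises the given scoring function."""
    return max(PLANS, key=score).name


if __name__ == "__main__":
    print(best(lambda p: expected_utility(p, p_x1=0.99)))
    print(best(indifferent_utility))
```

With these made-up numbers, an agent that weighs the real chance of inspection prefers the longer, obfuscated code, while the agent that scores only the X=0 branch prefers the short, plain code, because extra length buys it nothing and only raises the risk of erasure.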

And therefore we have all the time in the world to go through the code, looking for defection. Other ways of using utility indifference to trap AIs are no doubt possible, but this was the first setup that sprang to my mind.

Except the AI that's worth its salt as far as danger goes does not, in fact, calculate EU(Z) or EU(A). It did not produce a function that calculates the expected overall utility of a move, because it couldn't: that takes too much computing power, and it's a bad approach. It did look at the final board state's utility function (the win/draw/loss one), and it did look at the rules of the game, and it did some thinking - how can I, without being able to calculate EU(Z), make moves that would work - and came up with an approach based on that function. (Incidentally, this approach is applicable only to fairly simple utility functions of some future state.)

An AI needs to be programmed in a specific, very under-optimized way to allow you to make the sort of modification you're proposing here.

Keep in mind that neat real-valued utility functions are an entirely abstract, idealized model, used to reason about idealized decision-making by an agent that has infinite computing power and such. A real-world AI has limited computing power, and the name of the game is to make the best use of the computing power available, which means making decisions that help to increase the utility without ever calculating the utility directly or doing comparisons between real numbers. Such an AI, running under a utility function, will have a lot of code that was derived to help increase utility but doesn't do it by calculating the utility. It would then be impossible to just change the utility function. Efficient code is inflexible.

Furthermore, a sufficiently inefficient AI - such as an idealized utility-maximizing one where you can just go ahead and replace one utility with another, and which doesn't self-optimize beyond the naive approach - is not much of a threat, even with a lot of computational power. Search trees expand exponentially with depth, so the depth that can be searched is only logarithmic in computing power.
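
To put rough numbers on the point about depth, here is a small sketch. It assumes a uniform search tree and uses a branching factor of 35, a figure often quoted for chess but treated here purely as an illustrative assumption. The function name and the node budgets are likewise hypothetical.

```python
def reachable_depth(node_budget: int, branching_factor: int) -> int:
    """Deepest level of a uniform tree whose node count fits in the budget.

    Level d of the tree holds branching_factor ** d nodes, so the answer
    grows like log(node_budget) / log(branching_factor).
    """
    depth, nodes_at_depth = 0, 1
    while nodes_at_depth * branching_factor <= node_budget:
        nodes_at_depth *= branching_factor
        depth += 1
    return depth


if __name__ == "__main__":
    for budget in (10**9, 10**12):
        print(budget, reachable_depth(budget, branching_factor=35))
```

Under these assumptions, a thousandfold increase in the node budget (from 10**9 to 10**12) takes a naive exhaustive search from depth 5 to depth 7, which is the logarithmic growth described above.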

edit: here is an example. Utility maximization and utility functions are to practical (and scary) AI as quantum chromodynamics is to the practical computer graphics software I write for a living. That is to say, you would probably have as much luck modifying an AI's behaviour by editing utility functions as you'd have editing my cloud renderer to draw pink clouds by using a modified standard model.

"it did look at the rules of the game, and it did some thinking - how can I, without being able to calculate EU(Z), make moves that would work - and came up with an approach based on that function. (Incidentally, this approach is applicable only to fairly simple utility functions of some future state.)"

And that's where it comes up with: play randomly for three moves, then apply the material advantage process. This maximises the new utility function, without needing to calculate EU(Z) (or EU(A)).
