Now consider an AI that plays chess from first principles. It values victory over a tie, and a tie over a loss; the values can be 1, 0, and -1. But it can only see up to 10 moves ahead, and no victory, tie, or loss occurs that soon. So it thinks and thinks and comes up with a strategy: it constructs a second utility function - material advantage 10 moves ahead - which it can maximize now, and which, when maximized, is likely to lead it to victory, even though the AI does not know exactly how.
Now, if you try to change the utility function for the current move, the AI will reason that your utility function is going to make it lose the game. And you can't pull this trick on the AI's ultimate utility, because the AI does not itself know how a given move will affect its ultimate utility; it never calculated the values you would need to set equal. It did not even bother to calculate the effect of its strategy on its ultimate utility. It simply generated a strategy starting from the ultimate utility (not by trying a zillion strategies and calculating their guessed impact on final utility).
It can be said that the AI optimized away the two real numbers (utilities) and their comparison. Having the goal of maximizing future utility doesn't mean you'll be calculating real numbers to infinite precision and then comparing them, in the idealized-agent way, to pick between moves. You can start from future utility, think about it, and come up with some strategy now that works to maximize future utility, even though that strategy never calculates future utility.
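The chess argument above can be sketched in code. The following toy depth-limited search (hypothetical game representation, toy numbers; not a real chess engine) uses the true utility (win=1, tie=0, loss=-1) whenever a game-ending position lies within the horizon, and otherwise falls back to the proxy utility, material advantage at the horizon:

```python
def horizon_search(state, depth, maximizing, children, terminal_value, material):
    """Depth-limited search: true utility if visible, proxy utility otherwise."""
    outcome = terminal_value(state)       # 1 / 0 / -1, or None if game not over
    if outcome is not None:
        return outcome                    # the ultimate utility is in view: use it
    moves = children(state)
    if depth == 0 or not moves:
        return material(state)            # horizon reached: fall back to the proxy
    values = [horizon_search(s, depth - 1, not maximizing,
                             children, terminal_value, material)
              for s in moves]
    return max(values) if maximizing else min(values)

# Toy two-move game: no terminal outcome is reachable within the horizon,
# so the choice between moves is driven entirely by the material proxy.
kids = {"root": ["a", "b"], "a": [], "b": []}
mat = {"root": 0, "a": 3, "b": -1}
best = horizon_search("root", 1, True,
                      lambda s: kids[s],
                      lambda s: None,     # no win/tie/loss this soon
                      lambda s: mat[s])
```

At no point does the search compute the true utility of the chosen move; the comparison of ultimate utilities has been optimized away, exactly as described.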
I have myself worked as a 'utility-maximizing agent' trying to maximize the accuracy of my computer-graphics software. I typically do so without calculating the impact of the code I write on final accuracy - that would be impractical. I can, however, reason about which actions will make accuracy larger than other actions would - again, more often than not, without calculating two real numbers and then comparing them.
"And you can't pull this trick on the AI's ultimate utility, because the AI does not itself know how a given move will affect its ultimate utility"
You would do it for the AI's ultimate utility (for instance, by making it indifferent to the third move), and let the AI implement the best sub-strategy it could. Then it would take the fact that "the third move is special" into account when designing its material-advantage proxy (easiest way of doing this: make the first three moves irrelevant).
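The "easiest way" mentioned above can be sketched directly (toy numbers, hypothetical representation): redesign the material-advantage proxy so that the first three moves simply cannot influence it.

```python
def proxy_material_advantage(move_deltas):
    """Material-advantage proxy that is indifferent to the first three moves.

    move_deltas[i] is the material change produced by move i+1 (toy numbers).
    """
    return sum(move_deltas[3:])   # moves 1-3 are ignored entirely

# Whatever happens on the first three moves, the proxy is unchanged:
a = proxy_material_advantage([5, -2, 7, 1, 1])
b = proxy_material_advantage([0,  0, 0, 1, 1])
```

Since the proxy is identical under any play of the first three moves, the AI's sub-strategy cannot prefer one opening over another: it is indifferent to them by construction.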
If you wanted to check whether your creations were prone to misbehaving, one way would be to leave a big juicy apple hanging tantalisingly from a very visible tree, turn your back for a while, and see what they get up to.
That would work if you were God and your creations were essentially children. But if the creator is us and the creation is a super-intelligent AI, such 'honey-pots' are unlikely to work. They are tricks, and we shouldn't rely on being able to trick a being more intelligent than ourselves - unless its intelligence doesn't matter.
We can use utility indifference to construct such traps. Utility indifference can be informally summarised as:
"There is a stochastic event X which can output 0 or 1. We wish the AI to act as if it believed, with certainty, that X=0. It is problematic for an AI to have incorrect beliefs, so the result is achieved instead through the definition of the AI's utility function."
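A minimal numeric sketch of this definition (all payoffs hypothetical): the agent's beliefs about X stay correct, but its utility on the X=1 branch is remapped to whatever the matching X=0 branch would pay, so expected utility - and hence its choice of action - depends only on the X=0 outcomes.

```python
def expected_utility(p_x1, u_x0, u_x1):
    """Ordinary expected utility over the stochastic event X."""
    return (1 - p_x1) * u_x0 + p_x1 * u_x1

def indifferent_utility(p_x1, u_x0, u_x1):
    """Utility indifference: the X=1 branch pays out the X=0 value,
    so the agent acts exactly as if X=0 were certain."""
    return (1 - p_x1) * u_x0 + p_x1 * u_x0   # always equals u_x0

# Two actions; X=1 is nearly certain (p = 0.99).
# Action A is best if X=0; action B is best if X=1.
payoffs = {"A": (10, 0), "B": (1, 100)}   # (utility if X=0, utility if X=1)
p = 0.99
standard_choice = max(payoffs, key=lambda a: expected_utility(p, *payoffs[a]))
indiff_choice = max(payoffs, key=lambda a: indifferent_utility(p, *payoffs[a]))
```

With these numbers the ordinary agent picks B (expected utility 99.01 vs 0.1), while the indifferent agent picks A (10 vs 1) - it behaves as if X=0 were certain, without ever holding a false belief about X.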
Consider the following situation: a master AI that will create the source code of another, autonomous AI. The master AI is set up to produce the code of a Friendly AI, or of a reduced-impact AI - but we're not sure the master AI is trustworthy. It would be good to go through the code very carefully before unleashing it (or erasing it). But if the master AI knows this (and we must assume it does), then it will either obfuscate the dangerous parts or put traps in the code, making it risky for us to look too closely.
Then implement the following setup. The code travels through some stochastic event X, such that if X=0, the code is immediately transferred into a disciple AI, completely autonomous, which is at liberty to immediately modify its own code and erase any trace of what it was. If X=1, the code is instead copied to our hard drives for analysis. The AI is given some reason to keep the code short - maybe the chance of the code being erased increases with its length.
On top of this, we implement utility indifference: the master AI will act as if it believed X=0 with certainty, while we have set up X so that X=1 with near-certainty. Then the master AI will be motivated to produce the shortest code compatible with its utility function, in the 'knowledge' that this code will be copied into a powerful autonomous AI, and will never be seen by humans. It has no incentive to trap its code - quite the reverse, as it wants to keep it short.
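The trap can be illustrated with a toy decision problem (all numbers hypothetical): the master AI's utility is scored only on the X=0 branch, where no human ever reads the code, so a trap buys nothing and only costs length.

```python
def master_utility(code):
    """Utility as seen by the indifferent master AI: only the X=0 branch
    (code runs autonomously, unread by humans) is ever scored."""
    value_if_autonomous = 100               # assumed payoff for a working disciple AI
    erasure_penalty = 0.5 * code["length"]  # longer code, higher chance of erasure
    return value_if_autonomous - erasure_penalty  # traps add length, never value

clean = {"length": 40, "trapped": False}
trapped = {"length": 120, "trapped": True}
chosen = max([clean, trapped], key=master_utility)
```

Because the human-inspection branch carries no weight, the anti-human trap is pure dead weight under the length penalty, and the master AI prefers the short, clean code.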
And therefore we have all the time in the world to go through the code, looking for defection. Other ways of using utility indifference to trap AIs are no doubt possible, but this was the first setup that sprang to my mind.