Trapping AIs via utility indifference

Stuart_Armstrong

If you wanted to check whether your creations were prone to misbehaving, one way would be to leave a big juicy apple hanging tantalisingly from a very visible tree, turn your back for a while, and see what they get up to.

That would work if you were God and your creations were essentially children. But if the creator is us and the creation is a super-intelligent AI, such 'honey-pots' are unlikely to work. They are tricks, and we shouldn't rely on being able to trick a being more intelligent than ourselves - unless its intelligence doesn't matter.

We can use utility indifference to construct such traps. Utility indifference can be informally summarised as:

"There is a stochastic event X which can output 0 or 1. We wish the AI act as if it believed that X=0, with certainty. It's problematic for the AI to have incorrect beliefs, so the result is achieved through the definition of the AI's utility function."

Consider the following situation: a master AI that will create the source code of another, autonomous AI. The master AI is set up to produce the code of a Friendly AI, or a reduced impact AI - but we're not sure if the master AI is trustworthy. It would be good to go through the code very carefully before unleashing it (or erasing it). But if the master AI knows this (and we must assume it does), then it will either obfuscate the dangerous parts or put traps in the code, making it risky for us to look too closely.

Then implement the following setup. The code travels through some stochastic event X, such that if X=0, the code is immediately transferred into a disciple AI, completely autonomous, which is at liberty to immediately modify its own code and erase any trace of what it was. If X=1, the code is instead copied to our hard drives for analysis. The AI is given some reason to keep the code short - maybe there is an increasing chance of the code getting erased the longer it is.

On top of this, we implement utility indifference: the master AI will act as if it believed X=0 with certainty, while we have set up X so that X=1 with near-certainty. Then the master AI will be motivated to produce the shortest code compatible with its utility function, in the 'knowledge' that this code will be copied into a powerful autonomous AI, and will never be seen by humans. It has no incentive to trap its code - quite the reverse, as it wants to keep it short.
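As a toy illustration of the incentive described above, here is a minimal Python sketch. It is not the formal construction of utility indifference: it simply models the indifferent master AI as scoring each candidate only by its payoff in the X=0 branch, which is a simplifying assumption. The `Design` class, the two candidate designs, the payoff numbers and the per-unit erasure rate are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    name: str
    length: int           # code length, arbitrary units (hypothetical)
    value_if_run: float   # payoff to the master AI if X=0 (code runs unseen)
    value_if_read: float  # payoff to the master AI if X=1 (humans read the code)


def survival(length: int, erase_rate: float = 0.01) -> float:
    """Chance the code is not erased; longer code is more likely to be erased."""
    return (1 - erase_rate) ** length


def naive_score(d: Design, p_read: float) -> float:
    """Expected payoff for an AI that uses the true probability of X=1."""
    mixed = (1 - p_read) * d.value_if_run + p_read * d.value_if_read
    return survival(d.length) * mixed


def indifferent_score(d: Design) -> float:
    """Payoff for an AI that acts as if X=0 with certainty (toy model)."""
    return survival(d.length) * d.value_if_run


designs = [
    # Short code with no defences: worthless to the AI if humans inspect it.
    Design("plain", length=100, value_if_run=1.0, value_if_read=0.0),
    # Longer code carrying traps/obfuscation aimed at human readers.
    Design("trapped", length=150, value_if_run=1.0, value_if_read=0.8),
]

best_naive = max(designs, key=lambda d: naive_score(d, p_read=0.999))
best_indifferent = max(designs, key=indifferent_score)

print("true beliefs   ->", best_naive.name)
print("acts as if X=0 ->", best_indifferent.name)
```

With these made-up numbers, the AI that uses the true probability of being read prefers the trapped design, while the AI that acts as if X=0 prefers the plain one, because the traps only add length and so raise the chance of erasure.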

And therefore we have all the time in the world to go through the code, looking for defection. Other ways of using utility indifference to trap AIs are no doubt possible, but this was the first setup that sprang to my mind.

I thought of something that might come up before we even get to where we could use this trap.

I'm assuming we have to be reviewing the logs and code while we're coding it, just so that we can avoid things like "ERROR 22-322: Syntax error, expecting one of the following: ;, (, COMPRESS, DATA, FC, FORMCHAR, NLEVELS, NOPRINT, ORDER, PAGE." from basic typos.

And the first time it runs without simply crashing and writing an error to the log, we still have to look at the code, log, and basic output for problems that don't crash the program, like the classic failure "The utility function appears to have been multiplied by negative one."

There are generally options to suppress external logs, and a self-modifying program might add one of them to its code, either by accident or in an attempt to see whether it ran faster. But I would normally consider that a bug in itself, and would kill the program and look at the code and the log from before it went quiet.

The reason is that the log going quiet actually happens fairly often in my programming language, and in many cases the cause is a malformed infinite loop that does not itself generate output.

Something like:

x=0;
do until (x>10);
  *Increment X;
  x=x-1;
end;

This is just a basic error. There's a - sign instead of a + sign in the increment step, so x never becomes greater than 10 and the loop never ends.

So if my program stopped logging or outputting, I wouldn't even be ending it out of "Oh no, it's gone rampant!" I would just assume it was caught in an infinite loop, and would reset it and try to fix it.
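For fixed code, the usual guard against this kind of silent churn is an iteration budget. Below is a minimal Python sketch of that idea, applied to the loop above. The names `run_until` and `ChurnDetected` and the budget of 1000 steps are hypothetical choices for the example, and this says nothing about the harder self-modifying case.

```python
class ChurnDetected(RuntimeError):
    """Raised when a loop uses up its iteration budget without finishing."""


def run_until(done, step, state, max_steps=1000):
    """Apply step(state) until done(state) holds (do-until: the body runs first).

    Instead of spinning silently forever, give up after max_steps iterations
    and report the last state seen.
    """
    for _ in range(max_steps):
        state = step(state)
        if done(state):
            return state
    raise ChurnDetected(
        f"no termination after {max_steps} steps; last state = {state!r}"
    )


# The buggy loop: x moves away from the exit condition x > 10.
try:
    run_until(lambda x: x > 10, lambda x: x - 1, 0)
    outcome = "finished"
except ChurnDetected as exc:
    outcome = f"churn: {exc}"

# The intended loop: x is incremented and the loop exits normally.
fixed = run_until(lambda x: x > 10, lambda x: x + 1, 0)

print(outcome)
print("fixed loop ended with x =", fixed)
```

The buggy version now fails loudly with its last state rather than going quiet, which is the signal I would otherwise have to get by watching the log. A self-modifying program could of course remove or rewrite such a guard, so this only covers the fixed-code case.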

It's as if God turns his back to see if Eve does anything interesting while he's not looking. God realizes that he hasn't heard anything from Eve in 30 minutes, and turns back around... and Eve has gotten herself trapped in a multi-mile-deep pit, and is still digging. Because according to Eve's current brain, you go up by digging, and she needs to go up to take a closer look at the apple, so that she knows exactly what it is God told her not to eat.

I guess another way of expressing what I was considering is that turning your back on the AI might allow you to know what it does when it isn't being looked at, but it also allows it to get caught in useless churning.

So, if we were considering writing a self-modifying AI, is there any way to guard it against this, so that we no longer have to look at the logs just for basic checks against churn? Only then would it be stable enough to consider variants like Stuart_Armstrong's idea, to see if it reacts differently when logged compared to when unlogged. I understand how to do that for fixed code, but I'm not sure how you would do it for self-modifying code.
