You're looking at Less Wrong's discussion board. This includes all posts, including those that haven't been promoted to the front page yet. For more information, see About Less Wrong.

High impact from low impact, continued

2 Post author: Stuart_Armstrong 28 April 2015 12:58PM

A putative new idea for AI control; index here.

The idea of splitting a high impact task between two low-impact AIs has on critical flaw. AI X is aiming for low impact, conditional on ¬Y (the other AI not being turned on, or not outputting a message, or something similar). "Outputting the right coordinates" is one way that X can accomplish its goal. However, there is another way it can do it: "create a robot that will output the right coordinates if ¬Y, and [do something else] if Y."

That's a dangerous situation to be in, especially if we have a more general situation that the "laser aiming at the asteroid". But note that if X does create such a robot, and if ¬Y is actually true, then that robot must be low impact and not dangerous, since that's X's programming. Since X cannot predict all the situations the robot would encounter, the robot is probably generically "safe" and low impact.

Therefore, if the robot behaves the same way under Y and ¬Y, we're good.

How could we achieve that? Well, we could adapt my idea from "restrictions that are hard to hack". If a hypothetical superintelligent AI C observed the output stream from X, could it deduce that Y vs ¬Y was something important in it? If C knew that X was conditioning on ¬Z, but didn't know Z=Y, could it deduce that? That seems like a restriction that we could program into X, as a third component of its utility (the first being the "do what we want" component, the second being the "have a reduced impact conditional on ¬Z" one).

And if we have a "safe" robot, given ¬Y, and the programming of that robot does not (explicitly or implicitly) mention Y or its features, we probably have a safe robot.

The idea still needs to be developed and some of the holes patched, but I feel it has potential.

Comments (0)

There doesn't seem to be anything here.