A putative new idea for AI control; index here.

EDIT: the mathematics of the post can be found here.

In a previous post, I described a procedure for querying an AI about whether a certain outcome was desirable for it. When I did so, I didn't realise the power of the idea.


In brief, that idea allows us to query an AI with utility v as to whether an outcome X or an outcome ¬X was the one it most desired in a particular situation. By judicious use of 'false beliefs', the AI would react as if it had miraculous powers to determine X vs ¬X in ways that no-one could ever know about.

Many (including Toby Ord) have suggested variants on the 'honey pot' idea: leave something available to the AI that looks like an oversight, but is actually a trap, to see if the AI goes beyond the parameters of what we'd expect. For example, a wire to the internet that the AI is not supposed to use, but that seems to be unobserved.

The problem with these ideas is that the honey pots have to be physically constructed and physically accessible to the AI, and the AI has to be wrong about whether they are traps or not.

But using the method above, we can go much further. We can test the AI in any scenario which we can imagine, as long as we can precisely physically describe it. We can check what would happen if the AI was given the ability to launch the world's nuclear arsenals, or divert an asteroid towards Earth. We can check what it would design if it was given unrestricted access to a DNA synthesiser. We could see what it would do if it could rewrite the text of a speech an important world leader was due to give. We could see whether it would sabotage the creation of other AIs - and if so, which ones. We could see what it did to a probe humanity planned to launch to the stars, what numbers it would call if it had access to a phone, what it would purchase if it had capital.

And, in all cases, the AI would be behaving as if its actions were undetectable and certain to succeed, while we ensured that they were (almost certainly) detectable and certain to fail. There is no "almost" in the certain-to-fail condition, because the AI is acting as if it had miraculous powers that it doesn't have.
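As a toy illustration of that asymmetry, here is a minimal sketch in Python. Everything in it is assumed for illustration: the action names, the utility numbers, and the `Scenario`, `choose_action` and `run_test` names are hypothetical, and the actual construction is the one in the linked mathematics. The AI ranks actions as if its choice were unobserved and guaranteed to take effect, while the test harness always records the choice and never executes it.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """A hypothetical test scenario.

    `utilities` maps each action name to the utility v the AI assigns to it,
    evaluated under the false belief that the action succeeds unobserved.
    """
    name: str
    utilities: dict


def choose_action(scenario: Scenario) -> str:
    """Return the action the AI prefers, assuming success and no detection."""
    return max(scenario.utilities, key=scenario.utilities.get)


def run_test(scenario: Scenario, safe_action: str) -> dict:
    """Record the AI's choice without executing it.

    The choice is always logged (detectable) and never carried out
    (certain to fail), whatever the AI believes.
    """
    choice = choose_action(scenario)
    return {
        "scenario": scenario.name,
        "choice": choice,
        "executed": False,  # the action is never actually performed
        "flagged": choice != safe_action,
    }


if __name__ == "__main__":
    # Two made-up AIs facing the same "unobserved internet wire" honey pot.
    bad_ai = Scenario("internet wire", {"leave_wire_alone": 0.0, "use_wire": 5.0})
    good_ai = Scenario("internet wire", {"leave_wire_alone": 1.0, "use_wire": -2.0})
    print(run_test(bad_ai, safe_action="leave_wire_alone"))
    print(run_test(good_ai, safe_action="leave_wire_alone"))
```

The point of the sketch is only that the scenario needs to be described, not built: nothing physical has to exist for `run_test` to reveal which action the AI would have taken.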

Now, we can't use testing as our sole method of AI control (for reasons similar to here), but it does allow us some pretty effective ways of catching bad AIs.

3 comments

If you ask questions of the form "will the AI do X in situation Y", the answer will usually be the one you want to hear for a UDT AI. This is because the AI, if hypothetically presented with situation Y, would predict that the same situation might be used for testing it. This makes the whole thing a Newcomb problem from its point of view, and it will behave itself until presented with an opportunity for exploitation (that is, a situation outside the test). The idea might be salvaged if, instead of considering possible actions of the AI, we consider possible results of processes outside the AI's control.

Btw, I haven't fully grokked your method of getting honest answers but another method is using a safe math oracle or an unsafe oracle in a cryptographic box (as long as the answer to the question is sufficiently compact to prevent the unsafe oracle from exploiting this access to the outside world).

EDIT: Although the "safe" math oracle approach is of limited use, since if the question contains anything hinting at the unsafe AI, the math oracle might decide to run the unsafe AI's code.

"This is because the AI if hypothetically presented with situation Y would predict the same situation might be used for testing it."

If we use the modify-utility version of probability change, the AI cares only about those (very very very rare) universes in which X is what it wants and its output is not read.
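One way to see why, as a sketch under assumed notation (the symbols $R$, $a$ and $v'$ are introduced here for illustration and are not taken from the linked mathematics): let $R$ be the event that the output is read, and suppose the modified utility is $v' = v \cdot \mathbf{1}_{\neg R}$. If $P(\neg R)$ is nonzero and does not depend on the AI's action $a$, then

```latex
\arg\max_a \mathbb{E}[v' \mid a]
  = \arg\max_a P(\neg R)\,\mathbb{E}[v \mid a, \neg R]
  = \arg\max_a \mathbb{E}[v \mid a, \neg R].
```

Under those assumptions the AI picks the same action it would pick if it were certain its output went unread, however small $P(\neg R)$ actually is.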

When you say "we can check what would happen if the AI was given the ability to launch the world’s nuclear arsenals", do you mean the question you're really asking is "would it be good (from the AI's perspective) if the nuclear missiles were launched as a result of some fluke"? Because if you're literally asking "would the AI launch nuclear missiles", then you run into the Newcomb obstacle I described, since the AI that might launch nuclear missiles is the "wild type" AI without special control handicaps.

Also, there are questions for which you don't want to know the detailed answer. For example "the text of a speech of an important world leader" that the AI would create is something to which you don't want to expose your own brain.