A putative new idea for AI control; index here.

Defining truth and accuracy is tricky, so when I've proposed designs for things like Oracles, I've either used a very specific and formal question, or an indirect criterion for truth.

Here I'll try to get a more direct system, in which an AI tells the human the truth about a question in a way that the human understands.


The basic idea is simple. The first AI wishes to communicate certain facts to a second AI, but has to use the human as an intermediary. The first AI talks to the human, and then the human talks with the second AI. If the facts are to be accurate, the human has to understand them.

Of course, this will immediately fail for a variety of reasons. Even if neither AI hacks the human directly, the human doesn't have to understand anything to serve as a conduit. So the rest of this post is dedicated to making that basic idea much more secure. If you just want to have the final idea and avoid the justifications, then you can skip directly to the end.

The knowledge to pass on

The human could be said to have a good understanding of a subject if they are able to answer a large number of questions about it. But having a long conversation with the second AI is tiring, among other things: a human simply won't be able to answer ten million questions about what they understand, at least not well.

So the idea could be to have the second AI ask a single question, randomly chosen from a large collection of questions, and have the human give a single yes or no answer. Both AIs are then graded on the accuracy of the answer.
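As a minimal sketch of this single-question scheme, assuming the question pool is a finite list and that the true answers can be checked after the fact: the names `run_single_question_round`, `average_accuracy`, `truth` and `human_answer` are hypothetical and chosen only for illustration.

```python
import random
from typing import Callable, Sequence, Tuple

Question = str  # assumption: a question can be represented as a string


def run_single_question_round(
    questions: Sequence[Question],
    truth: Callable[[Question], bool],
    human_answer: Callable[[Question], bool],
    rng: random.Random,
) -> Tuple[Question, int]:
    """Draw one question at random; return it with the score (1 or 0).

    Both AIs are assumed to receive this same score.
    """
    question = rng.choice(questions)
    score = int(human_answer(question) == truth(question))
    return question, score


def average_accuracy(
    questions: Sequence[Question],
    truth: Callable[[Question], bool],
    human_answer: Callable[[Question], bool],
) -> float:
    """Fraction of the whole pool that the human would answer correctly."""
    correct = sum(human_answer(q) == truth(q) for q in questions)
    return correct / len(questions)


# Toy pool: "is n even?" for n = 0..9. The toy human is right for n < 8
# and always says "no" for larger n, so they are wrong only on n = 8.
toy_questions = [str(n) for n in range(10)]
toy_truth = lambda q: int(q) % 2 == 0
toy_human = lambda q: int(q) % 2 == 0 if int(q) < 8 else False

asked, score = run_single_question_round(
    toy_questions, toy_truth, toy_human, random.Random(0)
)
print(average_accuracy(toy_questions, toy_truth, toy_human))  # 0.9
```

The human answers only one question, but because that question is drawn at random, the score they can expect is their accuracy over the whole pool.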

The answer...

What does the second AI want to do? It basically wants to dissect the human (figuratively or literally), suck out all their knowledge, and infer what the first AI was trying to communicate. Note that this doesn't require the human to understand anything, just that the second AI can infer what the first one wanted by taking the human apart.

So instead of being allowed unrestricted access to the human, the second AI formulates its (boolean) question without interference; the human then chooses yes or no, and that choice is counted as the answer.

...and the question

We could have a list of actual questions, but that makes the second AI redundant, and means that the first AI will optimise for certain answers from a set of questions, not for genuine understanding.

Instead, the second AI gets a technical boolean question about some distribution. Since the human will be deciding yes or no, and the AI will not be able to interact with them, it's in the second AI's interest to formulate a question that allows the human to understand the technical boolean question they're being asked and then, using their private information, to answer it correctly.

The information from the first AI

What distribution will the second AI be asking a boolean question about? This can be anything, but I'm thinking of some conditional probability distribution along the lines of "what will the world look like if X happens?". This needs to be formally defined, but the AI's communication with the human is informal, and up to the AI. As long as the AI can't hack the human in the third sense, it needs to make them understand the distribution as best it can.

Then the second AI is given a randomly chosen boolean technical question about this distribution. Since the first AI doesn't know which question will be asked, it's in its interest to ensure the human has a good general understanding of the distribution.
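One way to make this incentive explicit, as a sketch under assumptions that are not in the setup above: suppose the technical questions form a finite set $Q$ sampled uniformly, let $e$ be the first AI's explanation, $t(q)$ the true answer to $q$, and $a_H(q \mid e)$ the human's final yes/no answer (the second AI's rephrasing is suppressed in the notation). The reward $R$ is 1 for a correct answer and 0 otherwise.

```latex
\mathbb{E}[R(e)] \;=\; \frac{1}{|Q|} \sum_{q \in Q} \Pr\bigl[a_H(q \mid e) = t(q)\bigr]

% If an explanation secures correct answers only on a subset S of Q and
% leaves the human at chance (probability 1/2, an assumption) elsewhere:
\mathbb{E}[R(e_S)] \;=\; \frac{|S| + \tfrac{1}{2}\,\bigl(|Q| - |S|\bigr)}{|Q|}
```

Under these assumptions the second expression grows with $|S|$, so an explanation tailored to a few questions scores worse in expectation than one that gives the human a broad grasp of the distribution.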

Making it all counterfactual

It would be very inefficient to have the human shuffle back and forth between AIs every time the first AI gives an answer. Instead, we could use the actual setup one time in a thousand or so, and let the first AI be motivated to get the correct answer in those one-in-a-thousand cases.
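A minimal sketch of this counterfactual arrangement, assuming the full human-in-the-loop procedure can be wrapped in a single callable that returns the score: `run_episode`, `run_full_protocol` and `audit_probability` are hypothetical names, and the default of 0.001 simply mirrors the "one time in a thousand" figure.

```python
import random
from typing import Callable, Optional


def run_episode(
    run_full_protocol: Callable[[], int],
    rng: random.Random,
    audit_probability: float = 0.001,
) -> Optional[int]:
    """Run the full two-AI setup only on a small random fraction of episodes.

    Returns the score from the full setup when it is run, and None otherwise
    (the human just receives the first AI's explanation and nothing is graded).
    """
    if rng.random() < audit_probability:
        return run_full_protocol()
    return None


# Toy usage: a stand-in procedure in which the human always answers correctly.
rng = random.Random(0)
results = [run_episode(lambda: 1, rng) for _ in range(10_000)]
graded = [r for r in results if r is not None]
```

The first AI is only ever scored on the graded episodes, so its motivation in this sketch comes entirely from those rare cases.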

In summary

The whole setup can be summarised by this graphic:

There are some subtle questions about what distribution the first AI should be using versus the questions the second one can ask so as to guarantee genuine human understanding of issues the human cares about. There are also some issues about how the setup fits into siren worlds, and other vulnerabilities. But I'll defer that analysis to another time.
