I've previously written about methods of boxing AIs. Essentially, while I do see the point that boxing an AI would be nontrivial, most people seem to have gone too far, and claim that it is impossible. I disagree that it's impossible and aim to explain some methods

So, let's start with why people would want to box AIs. As you probably know, letting an AI roam freely results in the destruction of everything humans care about, unless that AI has been programmed very carefully. That amount of destruction would be somewhat unfortunate, so if we have to start an AI, we want to know that it works.

Unfortunately, figuring out if an AI works is tricky. This made some people invent the concept of AI boxing, where you don't give an AI direct access to anything except a human, so you can test the AI thoroughly and exploit it without letting it destroy the world. This turns out not to work, because humans are stupid and let out the AI.

My first solution to that is to not tell the AI about its human overlords. Discussion about that solution made me realize that the are other solutions that work in other cases. Essentially, the problem with the original boxing method is that it tries to do everything, while in reality, you need to do some trade-offs between the realism of the AI's situations and the amount of knowledge you get from the AI.

With my original method, I tried to crank up the second of those variables as far as possible while maintaining perfect security. This makes it harder to test morality-related things, but you get to see lots of details in how the AI behaves.

I promised to post another method for boxing the AI today, but I actually have two new methods. Here they are:

Separate the Gatekeeper and the Observer

How do you prevent the AI from convincing the Gatekeeper from letting it out? By not letting it talk to the Gatekeeper. If the person speaking to the AI (let's call them the Observer) doesn't know the Gatekeeper (the Gatekeeper is the person who controls whether to release the AI), they can't let the AI out.

You could do all sorts of things, such as using virtual reality and a simulated mini-universe to see if, say, the AI would make a good society. You still have to make sure that it has gotten the plan to make a good society from its morality rather than from wanting to escape and destroy everything.

Afterwards, the Observer can make a low-detail report to the Gatekeeper.

(You might want to drug the Observer with something that prevents their memory from working too well...)

Automatic Testing

This is essentially the above, but with the Observer replaced by a computer program. This is probably easier to do when you want to test the AI's decision making skills rather than its morality.

The Lesson

I would say that the lesson is that while AI boxing requires some trade-offs, it's not completely impossible. This seems like a needed lesson, given that I've seen people claim that an AI can escape even with the strongest possible box without communicating with humans. Essentially, I'm trying to demonstrate that the original boxing experiments show that humans are weak, not that boxing is hard, and that this can be solved by not letting humans be the central piece of security in boxing the AIs.

New to LessWrong?

New Comment
8 comments, sorted by Click to highlight new comments since: Today at 5:21 AM

Seems to me like you should go and learn the subject matter and what smarter people wrote about it. Like Bostrom, for example.

Could you give a link or something?

By letting it talk to the Gatekeeper.

I think you missed a "not" there.

Hmm, if information is still supposed to be able to go to the gatekeeper then the talking person is effectively a gatekeeper too. However it's not a bad idea to make a box within a box. I got stuck on thinking why we need the gatekeeper to receive information, but in the end it's pretty simple. We need to accept the friendly AI. But then you could say that how can you trust someone to be your ally without gaining information about their values? This might be an oxymoron ie that by definition you can't. Thus the box should be anything but black and as white as other considerations allow.

There are other reasons than to check if the AI is friendly. AI, like other software, would have to be tested pretty thoroughly. It would be hard to make an AI if we can't test it without destroying the world.

Isn't that only a subcase of friendliness testing?

Not really. If you have an AI where you're not sure if it is completely broken or just unfriendly, you might want to test it, but without proper boxing you still risk destroying the world in the unlikely case that the AI works.