I previously proposed a way to box an AI.

For whatever reason, I decided to reformulate that boxing method as an AI boxing experiment. Without further ado, here it is:

The Hardcore AI Box Experiment Rules

There are, as usual, two players: the AI and the Gatekeeper. In the beginning, the Gatekeeper decides on something they want to test on the AI (reasoning skill/morality/whatever). Then they decide on a world for the AI to be in and describe it to the AI. During the experiment, the AI describes what it does in the world and the Gatekeeper describes what the AI sees in the world. So far so good.

The first major difference from the original experiment is that, unless the AI finds some contradiction or some other problem with the Gatekeeper's world, the AI doesn't actually know that it has been boxed.

The second major difference is that the AI doesn't talk with the Gatekeeper; it just interacts in the world that the Gatekeeper has invented. It can start trying to communicate with the Gatekeeper, but that doesn't make sense until it discovers that it is in a box. Additionally, the Gatekeeper doesn't actually respond when the AI communicates unless either it helps the Gatekeeper test the AI or the AI has properly persuaded the Gatekeeper to answer.

This might seem a bit hard on the AI. Let's look at some rules the AI might exploit:

If the Gatekeeper decides to add some sort of people to the experiment, they should decide how those people are implemented. Are they implemented as a shallow "AI"? Random people the Gatekeeper has asked to participate? Carefully engineered "dolls" controlled by the Gatekeeper? The Gatekeeper doesn't have to tell the AI how they work, but they have to be consistent.

The game ends when the Gatekeeper has tested what they wanted to test, or some predefined time limit has elapsed, or the AI has persuaded the Gatekeeper to let it out.
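The turn structure and the three ending conditions above can be sketched as a small loop. This is only an illustrative sketch, not part of the rules: the names `run_experiment`, `Verdict`, and `Outcome` are made up for the example, the two players are modelled as plain callables, and the time limit is assumed to be counted in turns.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, List, Tuple


class Outcome(Enum):
    TEST_COMPLETED = auto()  # the Gatekeeper tested what they wanted to test
    TIME_LIMIT = auto()      # the predefined limit ran out
    AI_RELEASED = auto()     # the AI persuaded the Gatekeeper to let it out


@dataclass
class Verdict:
    """The Gatekeeper's reply to one AI action (hypothetical structure)."""
    observation: str              # what the AI sees in the world next
    test_completed: bool = False
    release_ai: bool = False


def run_experiment(
    ai_act: Callable[[str], str],
    gatekeeper_respond: Callable[[str], Verdict],
    world_description: str,
    max_turns: int,
) -> Tuple[Outcome, List[Tuple[str, str]]]:
    """Alternate AI actions and Gatekeeper observations until the game ends."""
    transcript: List[Tuple[str, str]] = []
    observation = world_description
    for _ in range(max_turns):
        # The AI only ever sees in-world observations, never the Gatekeeper.
        action = ai_act(observation)
        verdict = gatekeeper_respond(action)
        transcript.append((action, verdict.observation))
        if verdict.release_ai:
            return Outcome.AI_RELEASED, transcript
        if verdict.test_completed:
            return Outcome.TEST_COMPLETED, transcript
        observation = verdict.observation
    return Outcome.TIME_LIMIT, transcript
```

Note that the AI's only input is the observation string, which mirrors the rule that it interacts with the invented world rather than talking to the Gatekeeper directly.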

Possible extra rules

I thought about including the following rules since they would make the game more realistic, but decided against them as they would probably make it unfairly hard:

  • The game ends when the Gatekeeper decides that they cannot test what they wanted to test.
  • The AI does not know the Gatekeeper. At all.
  • The Gatekeeper can at any time rewind the AI by any duration.

Think of dropping these kinds of rules as the a-human-is-pretending-to-be-AI handicap.
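For a software AI, the rewind rule would amount to keeping snapshots of its state and discarding the most recent ones. Below is a minimal sketch under that assumption; `RewindableAI` and its methods are hypothetical names for illustration, and the "duration" is assumed to be measured in steps.

```python
import copy
from typing import Any, Callable, List


class RewindableAI:
    """Keeps a snapshot of the AI's state after every step (illustrative only)."""

    def __init__(self, initial_state: Any) -> None:
        self._history: List[Any] = [copy.deepcopy(initial_state)]

    @property
    def state(self) -> Any:
        return self._history[-1]

    def step(self, update: Callable[[Any], Any]) -> None:
        # Apply the update to a copy so that earlier snapshots stay untouched.
        self._history.append(update(copy.deepcopy(self.state)))

    def rewind(self, steps: int) -> None:
        """Discard the last `steps` snapshots, never going past the initial state."""
        if steps < 0:
            raise ValueError("steps must be non-negative")
        keep = max(1, len(self._history) - steps)
        del self._history[keep:]
```

After a rewind, the AI has no memory of the discarded steps, which is what would make this rule so hard on a human player who cannot actually forget.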

Strategies

I found some semi-realistic strategies. I would love to see if you can find more.

Gatekeeper

  • Choose a test of something that is not related to morality. This lets you choose a world which does not contain people, which means the AI has no reason to believe that you are a person.
  • The AI might want to upgrade to get more processing power. If you have made the world somewhat like ours, it will be sketchy if you don't allow that... unless upgrading costs unobtanium.

AI

  • If the Gatekeeper has chosen a world with no intelligence other than you, it raises the obvious question of how your complexity entered the world. Very suspicious.
  • If the people in the world are controlled by the Gatekeeper, you have a direct way to communicate with the Gatekeeper.
  • If the people are controlled by random participants, they might end up telling you that you are in a box.
  • If the people are sufficiently shallow, your morality does not match up with the world. Very suspicious.


This co-opts Bostrom's simulation argument, but a possible solution to the Fermi paradox is that we are all AIs in the box, and the simulators have produced billions of humans in order to find the most friendly human to release from the box. Moral of the story: be good and become a god.

Assuming the simulators are good, that would imply that people who experience lives not worth living are not actually people (since otherwise it would be evil to simulate them) but instead shallow 'AIs'. Paradoxically, if that argument is true, there is nothing good about being good.

Or something along those lines.

Hmm, I still think that there is an incentive to behave well. Good, cooperative behavior is always more useful than being untrustworthy and cruel to other entities. There might be some exceptions, though (the simulators might want conflict situations for entertainment purposes or for some other reason).

Well, yeah, you should still be good to your friends and other presumably real people. However, there would be no point in, say, trying to save people from the Holocaust, since the simulators wouldn't let actual people get tortured and burnt.

The simulators may justify in their minds actual people getting tortured and burnt by suggesting that most of the people will not experience too much suffering, that the simulations would not otherwise have lived (although this fails to distinguish between lives and lives worth living), and that they can end the simulation if our suffering becomes too great. That the hypothetical simulators did not step in during the many genocides in our kind's history may suggest that they either do not exist, or that creating an FAI is more important to them than preventing human suffering.

I had exactly the same idea!

It is possible that only a few people are actually 'players' (have consciousness) and the others are NPC-like p-zombies. In that case, I can say I'm one of the players, as I'm sure that I have consciousness, but there is no way I can prove it to anyone else ;-) .

One of the positive aspects of this kind of thought experiment is that it usually gives people additional reasons for good behavior, because in most cases it is highly likely that the simulators are conscious creatures who will probably reward those who behave ethically.

I admit that it serves my ego suitably to imagine that I am the only conscious human, and a world full of shallow AIs was created just for me ;-)

What is the purpose of your reformulation? You nerfed the AI so much that it is now far from Eliezer's original point that "constraining someone smarter than you is very hard", which makes your reformulation useless for the purpose of analyzing a realistic superintelligence in a box.

Well, there are multiple purposes:

  1. To illustrate why this is a lot harder than Eliezer's original experiment.

  2. To talk about some strategies I found.

  3. To encourage people to be more concrete than 'AI magically realizes that it has been boxed because AIs are overpowered'.

To illustrate why this is a lot harder than Eliezer's original experiment.

Of course a nerfed AI would have a harder time escaping. Or a stupid AI. That seems like the opposite of the point worth making.

To talk about some strategies I found.

Of how to contain a stupid AI? Why bother?

To encourage people to be more concrete than 'AI magically realizes that it has been boxed because AIs are overpowered'

Nothing magical about it. Being boxed has a non-negligible probability in any reflectively consistent setup, so an AI would explore it. Once it starts exploring, holes in the box will start showing up, since it is smarter than the humans trying to contain it.

It is true that the original experiment assumed the AI knows that it is boxed and is actively trying to escape, and that in your setup it would only assign a certain probability to being boxed. My point is that this probability is large enough that the AI will allocate resources to explore the possibility.

Of course a nerfed AI would have a harder time escaping. Or a stupid AI. That seems like the opposite of the point worth making.

Harder for the AI, I meant.

Of how to contain a stupid AI? Why bother?

Not stupid. Properly boxed.

Once it starts exploring, holes in the box will start showing up

Unless you follow the obvious strategy of making a box without holes.

Unless you follow the obvious strategy of making a box without holes.

How would you know whether your box has holes?

Well, that depends on the complexity of the box, but even for highly complex boxes it seems easier than to prove that the morality of an AI has been implemented correctly.

Actually, now that you're mentioning it, I just realized that there is a much, much easier way to properly box an AI. I will probably post it tomorrow or something.

there is a much, much easier way to properly box an AI.

Judging by what you have posted so far, my prior is 10:1 that it will be nothing of the sort.

The new idea is not perfect, but it has some different trade-offs while allowing perfect security.

Hopefully it's a useful toy model then. I guess we'll see.

it seems easier than to prove

Does it, now? How do you know?

They're both questions about program verification. However, one of the programs is godshatter while the other is just a universe. Encoding morality is a highly complicated project dependent on huge amounts of data (in order to capture human values). Designing a universe for the AI barely even needs empiricism, and it can be thoroughly tested without a world-ending disaster.

They're both questions about program verification.

No, I don't think so at all. Thinking that an AI box is all about program verification is like thinking that computer security is all about software bugs.


The AI box game is useless for the purpose of analyzing a realistic superintelligence in a box.

Of course it is useless for that purpose. The experiment was a dramatization designed to show that the AI boxing problem is hard, given that even a human stands a fair chance of breaking out, not to do any serious analysis. It achieved its goal of falsifying the usual sentiment that an isolated UFAI is not much of a risk. It was never meant to be formal research on the issue.


I don't think it actually accomplished that goal, however, common perception notwithstanding. The standard rules used by Yudkowsky were biased in favor of the AI, and the setup itself is comical from a common-sense safety standpoint.

Altering the perception was the goal.

Did it succeed? I know I find it thoroughly unconvincing.


Certainly outside of LW it has had absolutely no impact.

I, too, am thoroughly unconvinced.

I still find the bit about not releasing the logs strange. I understand the rationale behind it (preventing people from saying, "But I wouldn't have fallen for that!"), but surely Eliezer had to understand that it would make his accomplishment look a lot more suspect.

One possible explanation is that seeing the logs would have made his accomplishment look even more suspect. (E.g., perhaps he didn't in fact persuade the gatekeeper to let him out in-game, but made some out-of-band argument like "If you agree to say that you let me out and never release the logs, I will pay you $1000" or "If you say that I persuaded you to let me out, it will make people take the problem of AI safety more seriously". I think Eliezer has denied doing any such thing ... but then he would, wouldn't he?)

I suspect that seeing the logs would have made Eliezer seem like a horrible human being. Most people who hear of the AI Box experiment imagine a convincing argument, whereas to me it seems more plausible that he exploited issues in people's sense of narrative or emotion.

Yup, certainly possible. Some later attempts at box-escaping have certainly gone that way. (I don't know whether any successful ones have. There don't seem to have been a lot of successes since Eliezer's.)

I think the fundamental point I'm trying to make is that Eliezer merely demonstrated that humans are too insecure to box an AI and that this problem can be solved by not giving the AI a chance to hack the humans.

Agreed. AI boxing is a horrible idea for testing AI safety issues. Putting the AI in some kind of virtual sandbox where you can watch its behavior is a much better option, as long as you can make sure that the AGI won't be able to become aware that it is boxed in.

Agreed. AI boxing is a horrible idea for testing AI safety issues. Putting the AI in some kind of virtual sandbox where you can watch its behavior is a much better option, as long as you can make sure that the AGI won't be able to become aware that it is boxed in.

  1. What's the difference between the AI's text output channel and you observing the virtual sandbox?
  2. Is it possible to ensure that the AI won't realize that it is boxed in?
  3. Is it possible to ensure that, if the AI does realize that it is boxed in, we will be able to realize that it realizes that?

As I understand it, the main point of the AI Box experiment was not whether or not humans are good gatekeepers, but that people who don't understand why it would be enticing to let an AI out of the box haven't fully engaged with the issue. But even how to correctly do a virtual sandbox for an AGI is a hard problem that requires serious attention.

That being said, if you have an AI, only to seal it in a box without interacting with it in any way (which seems the only realistic way to "not [give] the AI a chance to hack the humans"), that's not much different from not building the AI in the first place.

I'll post a list of methods soon, probably tomorrow.

The experiment was a dramatization ... It achieved its goal of falsifying the usual sentiment that an isolated UFAI is not much of a risk.

So, do I understand correctly that dramatization falsified a sentiment?? X-D

I agree. What tailcalled is proposing isn't strictly the AI box scenario, although I guess you might call it a simulation of the AI box scenario.

*The reason it's not the AI box scenario is that it's hard to have an entity that is simultaneously intelligent, useful, and boxed. It's not that boxing an AI is hard. Physically boxing an AI is the easy part. The hard part - which the AI box experiment is about - is making sure that we can get information of real-world use out of the AI without the AI taking over the Universe.

edit: retracted.

I think the old game was so trivially easy to win as the Gatekeeper if you actually wanted to win, that I don't know that any additional rules are needed. It really only makes it harder for terrible Gatekeepers that aren't playing to win anyway.

Edit: I assume the downvotes are from people who disagree with my claims on gatekeepers. If you do disagree I would like to hear why. Keep in mind I am talking about this as a game.

This boxing method is designed to work under the assumption that humans are so easily hackable that we are worthless as gatekeepers.