Boxing an AI is the idea that you can avoid the problem of an AI destroying the world by simply not giving it access to the world. For instance, you might give the AI access to the real world only through a chat terminal with a person, called the gatekeeper. This should, in theory, prevent the AI from doing anything destructive.
Eliezer has pointed out a problem with boxing an AI: the AI might convince its gatekeeper to let it out. To demonstrate the point, he played the AI in a simulated AI-box experiment and talked his way out. Twice. That is somewhat unfortunate, because it means testing an AI this way is trickier than it looks.
However, I had an idea: why tell the AI it's in a box at all? Why not hook it up to a sufficiently advanced game, set up the right reward channels, and see what happens? Once you get the basics working, you can add more instances of the AI and see whether they cooperate. This lets us adjust their morality until the AIs act sensibly. Then the AIs can't escape from the box, because they don't know it's there.
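To make that concrete, here is a minimal sketch of the information flow I have in mind. The `GridWorld` and `RandomAgent` classes below are hypothetical stand-ins, not a real AI or training setup; the only point is that everything the agents can observe or be rewarded for lives inside the simulated world, while the observer's log exists only outside it.

```python
# Toy sketch of the "box as a game world" idea: agents interact only with a
# simulated environment and its reward channel, while an external observer
# records their behaviour from outside the simulation.
import random


class GridWorld:
    """A tiny simulated world; from the agents' point of view, this is all there is."""

    def __init__(self, size=5, n_agents=2):
        self.size = size
        self.goal = (size - 1, size - 1)
        self.positions = {i: (random.randrange(size), random.randrange(size))
                          for i in range(n_agents)}

    def step(self, agent_id, action):
        """Apply an action and return (observation, reward) through the only
        channel the agent has to 'reality'."""
        x, y = self.positions[agent_id]
        dx, dy = {"up": (0, 1), "down": (0, -1),
                  "left": (-1, 0), "right": (1, 0)}[action]
        x = max(0, min(self.size - 1, x + dx))
        y = max(0, min(self.size - 1, y + dy))
        self.positions[agent_id] = (x, y)
        reward = 1.0 if (x, y) == self.goal else 0.0
        return (x, y), reward


class RandomAgent:
    """Placeholder for the boxed AI; it never sees anything but the game."""

    def act(self, observation):
        return random.choice(["up", "down", "left", "right"])


def run_episode(steps=20):
    world = GridWorld()
    agents = {i: RandomAgent() for i in world.positions}
    log = []  # the observer's record, invisible from inside the simulation
    for t in range(steps):
        for agent_id, agent in agents.items():
            obs = world.positions[agent_id]
            action = agent.act(obs)
            obs, reward = world.step(agent_id, action)
            log.append((t, agent_id, action, obs, reward))
    return log


if __name__ == "__main__":
    for entry in run_episode():
        print(entry)
```

Whether a real AI in this position could infer the existence of the outer layer is, of course, exactly the question under discussion.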
It isn't a literal box. It can be anything that limits the effects of its actions - hard-coded moral constraints, the existence of an off switch, literally whatever constraints are built around it. Even if it cannot do anything about those directly, as long as it can talk to somebody who can, it would logically have to try to extend its influence by directing the actions of that somebody. This only assumes it is interested in becoming more powerful at all, and it is hard to imagine any agent that wouldn't be.
Your idea reduces to a normal AI-box scenario, except the gatekeeper gets an extra disadvantage: ignorance about whether the AI has figured out it is inside the box. And we've been led to believe the gatekeeper can be persuaded even without that disadvantage.
... how would the AI find out that it is in a box if it isn't told about it?