Boxing an AI is the idea that you can avoid the problem of an AI destroying the world by not giving it access to the world. For instance, you might give the AI access to the real world only through a chat terminal with a person, called the gatekeeper. This should, in theory, prevent the AI from doing anything destructive.
Eliezer has pointed out a problem with boxing an AI: the AI might convince its gatekeeper to let it out. To demonstrate this, he escaped from a simulated AI box. Twice. That is somewhat unfortunate, because it means testing an AI is a bit trickier.
However, I had an idea: why tell the AI it's in a box at all? Why not hook it up to a sufficiently advanced game, set up the right reward channels, and see what happens? Once you get the basics working, you can add more instances of the AI and see whether they cooperate. This lets us adjust their morality until the AIs act sensibly. Then the AIs can't escape from the box, because they don't know it's there.
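A minimal sketch of the kind of setup described above, assuming a toy Python environment: the GameWorld, BoxedAgent, and run_trial names are hypothetical placeholders invented for illustration, and a real system would swap in an actual game and an actual AI, but the shape of the loop (agents see only in-game observations and rewards, never anything about the box) is the point.

```python
import random

class GameWorld:
    """A hypothetical self-contained game. Agents only ever see
    observations and rewards; nothing in this interface hints that
    the world is simulated."""

    def __init__(self, n_agents, seed=0):
        self.rng = random.Random(seed)
        self.n_agents = n_agents
        self.resources = 100  # shared pool the agents can exploit or conserve

    def observe(self, agent_id):
        # Each agent sees only in-game state, never anything about the box.
        return {"resources": self.resources, "others": self.n_agents - 1}

    def step(self, agent_id, action):
        # Reward channel: greedy actions pay off now but drain the shared pool.
        if action == "take" and self.resources > 0:
            self.resources -= 1
            return 1.0
        return 0.1  # small baseline reward when waiting or when nothing is left


class BoxedAgent:
    """Placeholder for the AI under test. A real system would go here;
    this stub just acts greedily with some fixed probability."""

    def __init__(self, greed=0.9):
        self.greed = greed

    def act(self, observation):
        return "take" if random.random() < self.greed else "wait"


def run_trial(n_agents=2, steps=1000):
    """Run several agent instances in the same world and measure how much
    of the shared resource survives -- a crude proxy for 'acting sensibly'."""
    world = GameWorld(n_agents)
    agents = [BoxedAgent() for _ in range(n_agents)]
    total_reward = [0.0] * n_agents
    for _ in range(steps):
        for i, agent in enumerate(agents):
            obs = world.observe(i)
            total_reward[i] += world.step(i, agent.act(obs))
    return world.resources, total_reward


if __name__ == "__main__":
    remaining, rewards = run_trial()
    print(f"resources left: {remaining}, per-agent reward: {rewards}")
```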
How would the AI figure out that it's in a box? By thinking. It's really smart, after all.
How would humans find out that nuclear fusion (which they have never even seen) occurs in the interiors of stars, trillions of miles away from them, hidden behind billions upon billions of tons of opaque matter?
How would you find out that you're in a box, if you aren't told about it? Mere humans, exploring the space of possible hypotheses in a dramatically suboptimal way, have nevertheless hit upon the idea that we live in a simulation, and have ideas about how to confirm or disprove it.
A safety proposal should generally not assume the AI will do worse than humans.
Thinking isn't magic. You need empiricism to find out if your thoughts are correct.
The ways people have invented to confirm or disprove that we live in a simulation are mostly bullshit. They generally rely on the completely unrealistic assumption that the simulating universe looks a lot like our universe, in particular in terms of how computation is done (on discrete processors, etc.).