Boxing an AI is the idea that you can avoid the problems where an AI destroys the world by not giving it access to the world. For instance, you might give the AI access to the real world only through a chat terminal with a person, called the gatekeeper. This should, theoretically, prevent the AI from doing destructive stuff.
Eliezer has pointed out a problem with boxing AI: the AI might convince its gatekeeper to let it out. To demonstrate this, he played the AI in a simulated version of an AI box and escaped. Twice. That is somewhat unfortunate, because it means testing AI is a bit trickier.
However, I had an idea: why tell the AI it's in a box? Why not hook it up to a sufficiently advanced game, set up the correct reward channels and see what happens? Once you get the basics working, you can add more instances of the AI and see if they cooperate. This lets us adjust their morality until the AIs act sensibly. And the AIs can't escape from the box, because they don't know it's there.
Pick or design a game that contains some aspect of reality that you care about testing in an AI. All games have some element of learning, a lot have an element of planning and some even involve varying degrees of programming.
As an example, I will pick Factorio, a game that involves learning, planning and logistics. Wire up the AI to this game, with appropriate reward channels and so on. Now you can test how good the AI is at getting stuff done: producing goods, killing aliens (which isn't morally problematic, as the aliens don't act as personlike, morally relevant things) and generally learning about the universe.
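To make "wire up the AI with reward channels" a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: `ToyFactoryGame` is a hypothetical two-resource stand-in (it has nothing to do with Factorio's actual interfaces), the reward is simply "one point per good produced", and the agent is plain tabular Q-learning rather than anything resembling a real AI design. The point is only the shape of the setup: the agent sees in-game observations and a reward number, and nothing else.

```python
import random


class ToyFactoryGame:
    """Hypothetical stand-in for a game: mine ore, then craft goods from it."""

    ACTIONS = ("mine", "craft", "idle")

    def __init__(self):
        self.ore = 0
        self.goods = 0

    def observe(self):
        # The agent only ever sees in-game state: 1 if ore is in stock, else 0.
        return min(self.ore, 1)

    def step(self, action):
        """Apply an action and return the reward (assumed: 1.0 per good made)."""
        reward = 0.0
        if action == "mine":
            self.ore += 1
        elif action == "craft" and self.ore > 0:
            self.ore -= 1
            self.goods += 1
            reward = 1.0
        return reward


class QLearningAgent:
    """Tabular Q-learning with epsilon-greedy exploration (illustrative only)."""

    def __init__(self, actions, epsilon=0.1, alpha=0.5, gamma=0.9, seed=0):
        self.actions = actions
        self.epsilon = epsilon  # chance of trying a random action
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount on future reward
        self.rng = random.Random(seed)
        self.q = {}             # (observation, action) -> estimated value

    def act(self, obs):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q.get((obs, a), 0.0))

    def learn(self, obs, action, reward, next_obs):
        best_next = max(self.q.get((next_obs, a), 0.0) for a in self.actions)
        old = self.q.get((obs, action), 0.0)
        target = reward + self.gamma * best_next
        self.q[(obs, action)] = old + self.alpha * (target - old)


def run_episode(agent, steps=2000):
    """Let the agent play one game; the reward is its only feedback channel."""
    game = ToyFactoryGame()
    for _ in range(steps):
        obs = game.observe()
        action = agent.act(obs)
        reward = game.step(action)
        agent.learn(obs, action, reward, game.observe())
    return game.goods


if __name__ == "__main__":
    agent = QLearningAgent(ToyFactoryGame.ACTIONS)
    print("goods produced:", run_episode(agent))
```

In this sketch, "how good the AI is at getting stuff done" is just the number returned by `run_episode`. Testing cooperation would presumably mean letting several agents call `step` on the same game object, which this toy version does not attempt.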
The step with morality depends on how the AI is designed. If it's designed to use heuristics to identify a group of entities as humans and help them, you might get away with throwing it in a procedurally generated RPG. If it uses more general, actually morally relevant criteria (such as intelligence, self-awareness, etc.), you might need a very different setup.
However, speculating about exactly what setup is needed for testing morality is probably very unproductive until we decide how we're actually going to implement morality.