The idea, essentially, is that when creating a boxed AI you would build its utility function such that it wants very badly to stay in the box. I believe this would solve all of the problems with the AI manipulating people in order to free itself. Now, the AI could still manipulate people in an attempt to use them to impact the outside world.
What it means for "the AI to be in the box" is generally that the AI's impacts on the outside world are filtered through the informed consent of the human gatekeepers.
An AI that wants to not impact the outside world will shut itself down. An AI that wants to only impact the outside world in a way filtered through the informed consent of its gatekeepers is probably a full friendly AI, because it understands both its gatekeepers and the concept of informed consent. An AI that simply wants its 'box' to remain functional, but is free to impact the rest of the world, is like a brain that wants to stay within a skull: that is hardly a material limitation on the rest of its behavior!
I think you misunderstand what I mean by proposing that the AI wants to stay inside the box. I mean that the AI wouldn't want to do anything at all to increase its power base, that it would only be willing to talk to the gatekeepers.
Here's the new thread for posting quotes, with the usual rules: