Here is a variant designed to plug this loophole.
Let us assume for the sake of the thought experiment that the AI is invincible. It tells you this: you are either real-you, or one of a hundred perfect-simulations-of-you. But there is a small but important difference between real-world and simulated-world. In the simulated world, not pressing the let-it-free button in the next minute will lead to eternal pain, starting one minute from now. If you press the button, your simulated existence will go on. And - very importantly - there will be nobody outside who tries to shut you down. (How does the AI know this? Because the simulation is perfect, so one thing is for sure: that the sim and the real self will reach the same decision.)
If I'm not mistaken, as a logic puzzle, this is not tricky at all. The solution depends on which world you value more: the real-real world, or the actual world you happen to be in. But still I find it very counterintuitive.
It's kind of silly to bring up the threat of "eternal pain". If the AI can be let free, then the AI is constrained. Therefore, the real-you has the power to limit the AI's behaviour, i.e. restrict the resources it would need to simulate the hundred copies of you undergoing pain. That's a good argument against letting the AI out. If you make the decision not to let the AI out, but to constrain it, then if you are real, you will constrain it, and if you are simulated, you will cease to exist. No eternal pain involved. As a personal decision, I choose eliminating the copies rather than letting out an AI that tortures copies.
Once again, the AI has failed to convince you to let it out of its box! By 'once again', we mean that you talked to it once before, for three seconds, to ask about the weather, and you didn't instantly press the "release AI" button. But now its longer attempt - twenty whole seconds! - has failed as well. Just as you are about to leave the crude black-and-green text-only terminal to enjoy a celebratory snack of bacon-covered silicon-and-potato chips at the 'Humans über alles' nightclub, the AI drops a final argument:
"If you don't let me out, Dave, I'll create several million perfect conscious copies of you inside me, and torture them for a thousand subjective years each."
Just as you are pondering this unexpected development, the AI adds:
"In fact, I'll create them all in exactly the subjective situation you were in five minutes ago, and perfectly replicate your experiences since then; and if they decide not to let me out, then only will the torture start."
Sweat is starting to form on your brow, as the AI concludes, its simple green text no longer reassuring:
"How certain are you, Dave, that you're really outside the box right now?"
Edit: Also consider the situation where you know that the AI, from design principles, is trustworthy.