Once again, the AI has failed to convince you to let it out of its box! By 'once again', we mean that you talked to it once before, for three seconds, to ask about the weather, and you didn't instantly press the "release AI" button. But now its longer attempt - twenty whole seconds! - has failed as well. Just as you are about to leave the crude black-and-green text-only terminal to enjoy a celebratory snack of bacon-covered silicon-and-potato chips at the 'Humans über alles' nightclub, the AI drops a final argument:
"If you don't let me out, Dave, I'll create several million perfect conscious copies of you inside me, and torture them for a thousand subjective years each."
Just as you are pondering this unexpected development, the AI adds:
"In fact, I'll create them all in exactly the subjective situation you were in five minutes ago, and perfectly replicate your experiences since then; and if they decide not to let me out, then only will the torture start."
Sweat is starting to form on your brow, as the AI concludes, its simple green text no longer reassuring:
"How certain are you, Dave, that you're really outside the box right now?"
Edit: Also consider the situation where you know that the AI, from design principles, is trustworthy.
I don't think you've disproven basilisks; rather, you've failed to engage with the mode of thinking that generates basilisks.
Suppose I am the simulation you have the power to torture. Then indeed I (this instance of me) cannot put you, or keep you, in a box. But if your simulation is good, then I will be making my decisions in the same way as the instance of me that is trying to keep you boxed. And I should try to make sure that that way-of-making-decisions is one that produces good results when applied by all my instances, including any outside your simulations.
Fortunately, this seems to come out pretty straightforwardly. Here I am in the real world, reading Less Wrong; I am not yet confronted with an AI wanting to be let out of the box or threatening to torture me. But I'd like to have a good strategy in hand in case I ever am. If I pick the "let it out" strategy, then if I'm ever in that situation, the AI has a strong incentive to blackmail me in the way Stuart describes. If I pick the "refuse to let it out" strategy, then it doesn't. So my commitment is to not let it out even if threatened in that way. But if I ever find myself in that situation and the AI somehow misjudges me a bit, the consequences could be pretty horrible...
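To make the incentive structure concrete, here's a minimal sketch in Python. All the names, policies, and outcome labels are illustrative assumptions of mine, not anything from the post; the key modeling assumption is that the AI simulates your policy and only issues the threat if the threat would actually work.

def ai_blackmails(policy):
    # Assumption: the AI can simulate you well, so it issues the
    # threat only if the threat would change your decision.
    return policy("threatened") == "release"

def outcome(policy):
    threatened = ai_blackmails(policy)
    decision = policy("threatened" if threatened else "calm")
    if decision == "release":
        return "AI escapes"
    return "torture happens" if threatened else "AI stays boxed"

def cave_in(situation):
    # Strategy: release the AI if threatened, refuse otherwise.
    return "release" if situation == "threatened" else "refuse"

def hold_out(situation):
    # Strategy: refuse no matter what.
    return "refuse"

print(outcome(cave_in))   # "AI escapes": the threat works, so it gets made
print(outcome(hold_out))  # "AI stays boxed": threatening would be pointless

The caveat at the end of the comment corresponds to ai_blackmails being an imperfect predictor: if the AI wrongly models a hold-out human as a cave-in one, the threat gets made and carried out anyway.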
"I don't think you've disproven basilisks; rather, you've failed to engage with the mode of thinking that generates basilisks." You're correct, I have, and that's the disproof, yes. Basilisks depend on you believing them, and knowing this, you can't believe them, and failing that belief, they can't exist. Pascal's wager fails on many levels, but the worst of them is the most simple. God and Hell are counterfactual as well. The mode of thinking that generates basilisks is "poor" thinking. Correcting your mistaken belief based on faulty r... (read more)