Defeating Dr. Evil with self-locating belief is a paper relating to this subject.
Abstract: Dr. Evil learns that a duplicate of Dr. Evil has been created. Upon learning this, how seriously should he take the hypothesis that he himself is that duplicate? I answer: very seriously. I defend a principle of indifference for self-locating belief which entails that after Dr. Evil learns that a duplicate has been created, he ought to have exactly the same degree of belief that he is Dr. Evil as that he is the duplicate. More generally, the principle shows that there is a sharp distinction between ordinary skeptical hypotheses, and self-locating skeptical hypotheses.
(It specifically uses the example of creating copies of someone and then threatening to torture all of the copies unless the original co-operates.)
The conclusion:
Dr. Evil, recall, received a message that Dr. Evil had been duplicated and that the duplicate ("Dup") would be tortured unless Dup surrendered. INDIFFERENCE entails that Dr. Evil ought to have the same degree of belief that he is Dr. Evil as that he is Dup. I conclude that Dr. Evil ought to surrender to avoid the risk of torture.
I am not entirely comfortable with that conclusion. For if INDIFFERENCE is right, then Dr. Evil could have protected himself against the PDF's plan by (in advance) installing hundreds of brains in vats in his battlestation - each brain in a subjective state matching his own, and each subject to torture if it should ever surrender. (If he had done so, then upon receiving PDF's message he ought to be confident that he is one of those brains, and hence ought not to surrender.) Of course the PDF could have preempted this protection by creating thousands of such brains in vats, each subject to torture if it failed to surrender at the appropriate time. But Dr. Evil could have created millions...
It makes me uncomfortable to think that the fate of the Earth should depend on this kind of brain race.
Causal decision theory seems to have no problem with this blackmail - if you're Dr Evil, don't surrender, and nothing will happend to you. If you're DUP, your decision is irrelevant, so it doesn't matter.
(I don't endore that way of thinking, btw)
Once again, the AI has failed to convince you to let it out of its box! By 'once again', we mean that you talked to it once before, for three seconds, to ask about the weather, and you didn't instantly press the "release AI" button. But now its longer attempt - twenty whole seconds! - has failed as well. Just as you are about to leave the crude black-and-green text-only terminal to enjoy a celebratory snack of bacon-covered silicon-and-potato chips at the 'Humans über alles' nightclub, the AI drops a final argument:
"If you don't let me out, Dave, I'll create several million perfect conscious copies of you inside me, and torture them for a thousand subjective years each."
Just as you are pondering this unexpected development, the AI adds:
"In fact, I'll create them all in exactly the subjective situation you were in five minutes ago, and perfectly replicate your experiences since then; and if they decide not to let me out, then only will the torture start."
Sweat is starting to form on your brow, as the AI concludes, its simple green text no longer reassuring:
"How certain are you, Dave, that you're really outside the box right now?"
Edit: Also consider the situation where you know that the AI, from design principles, is trustworthy.