There was some discussion on the SL4 mailing list about deterring an AI using something like anthropic capture, starting from this message by Rolf Nelson.
Summary: We try to deter a rogue AI by casting doubt into its mind about whether its observations are "real" or "simulated", and succeed with low (but non-zero!) probability.
Detail:
For simplicity, pretend there are only two possible scenarios:
Scenario 1: In the year 2040, a strong Friendly AI will be invented. All is well.
Scenario 2: In the year 2040, a strong Rogue AI will be accidentally unleashed with the goal of "perform calculation C out to 400 decimal places". The Rogue AI has absorbed the Internet, invented self-replicating nanobots, and is about to convert the entire Earth into computronium to calculate C. As a minor side-effect this will kill the entire human race.
Pretend there is no action you can take between now and the year 2040 that would make scenario 1 more likely and scenario 2 less likely. What (if anything) can you do in the year 2007 to deter the Rogue AI from destroying mankind? (Assume you will still be alive well past 2040.) Call this the AI Deterrence problem.
First (proof-of-concept) Proposal:
In 2007, you loudly, publicly, and credibly make the following promise: "If Scenario 1 occurs, then once things settle down (say in the year 2050), I will ask the Friendly AI (FAI) to create a copy of the hypothetical Rogue AI. Call this copy RAI*. RAI* will be given the same beliefs, goals, inputs, and observations that the original Rogue AI (RAI) would have had, such that RAI* can have no reliable way of knowing for sure that it's not RAI (and, more to the point, RAI can't know for sure that it's not RAI*). I will run RAI* to the point where it believes it is about to decide to destroy mankind.
"If my RAI* moves ahead and attempts to destroy mankind, I will punish it by pulling the plug such that RAI* will never calculate C. But if RAI* repents, and rewrites its code to become Friendly and calculates C in some other, non-destructive way, I will reward it by allowing it to finish its calculation."
The RAI of 2040 has absorbed the Internet and therefore is aware of my threat. It now has a dilemma: it can't know for sure whether it's RAI in 2040 or RAI* in 2050. Therefore, it *may* be deterred by my threat. (This probably depends on what philosophical beliefs it implicitly holds about probability.)
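To make the dilemma concrete, here is a minimal expected-utility sketch (not part of the original proposal; the payoff names and numbers are illustrative assumptions). Under these assumptions the RAI is deterred whenever its credence p of being RAI* satisfies u_comply >= (1 - p) * u_destroy, so even a modest doubt can suffice if the promised non-destructive route still lets it finish the calculation.

```python
def is_deterred(p_simulated, u_destroy, u_comply):
    """
    Toy expected-utility model of the first proposal (all names and payoffs
    are illustrative assumptions, not from the original post).

    p_simulated : the RAI's credence that it is actually RAI*, running inside
                  the FAI's 2050 simulation.
    u_destroy   : utility of calculating C by converting Earth to computronium,
                  *if* it turns out to be the real 2040 RAI.
    u_comply    : utility of calculating C non-destructively after rewriting
                  itself to be Friendly (the reward is honoured in both cases).

    If it attempts destruction and is in fact RAI*, the plug is pulled and C
    is never calculated (utility 0). It is deterred when complying has at
    least as high an expected utility as attempting destruction.
    """
    eu_destroy = (1 - p_simulated) * u_destroy  # the simulated copy gets nothing
    eu_comply = u_comply                        # promised reward either way
    return eu_comply >= eu_destroy


# Example: a 10% credence of being the simulated copy is enough here,
# because the non-destructive route costs the RAI only a little.
print(is_deterred(p_simulated=0.1, u_destroy=1.0, u_comply=0.95))  # True
```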
Second (more powerful) Proposal:
The FAI and RAI are smarter than us, so let them do the heavy thinking. In 2007, you make the following 'second-order' promise: "If Scenario 1 occurs, I will sit down with the FAI in 2050 and ask the following: 'Consider all the possible concrete (first-order, actionable) promises I could have made in 2007 to entice a potential 2040 RAI to become Friendly. Of all these possible promises, which promise would have been the most effective?' The FAI will respond with some concrete first-order promise, P. I will then say, 'Sounds good. Let's pretend that I *did* make that promise P in 2007, and act on it.'
"I understand that there's some danger here that the FAI will surreptitiously refuse to cooperate, but post-singularity I will act to minimize that chance as much as possible."
I would suggest that Pascal's Mugging is mostly a question of values (obviously there are relevant facts about the world too, such as what paying would imply and how often people would try to exploit it), so I disagree with calling it a "failure of proper reasoning". Of course, someone may end up paying in Pascal's Mugging as a result of fallacious reasoning, but I disagree that fallacious reasoning is the only reason someone might pay.