Here's a more difficult version of the AI box experiment. I haven't seen this particular version anywhere, but I'd be pleased to be proven wrong.
Imagine we've come up with a very intelligent AI that is free to manipulate the environment and uses an action-reward system like Hutter's AIXI. Also imagine that we've somehow figured out a way to make the rewards very hard to counterfeit (perhaps we require the rewards to be cryptographically signed). It's clear that in such a system, the 'weak point' would be the people in control of the private key. In this case the AI will not attempt to modify its own reward system (to see why, look at Hutter's AIXI book, where he discusses this in some detail).
How could such an AI convince someone to hand over the private key? Note that it can't promise things like ending human suffering, because it already has the means to do that (it is 'free') as well as the incentive (obtaining reward).
I don't understand how this encryption would work. What do people physically do to reward the AI, and how do you ensure that only people can do that? Would humans compute RSA signatures in their heads? Would humans type reusable passwords into a "secure" reward computer that is "outside the AI's control"? Do humans precompute and memorize a finite number of one-time reward phrases before the AI is turned on, and reward the AI by uttering a phrase aloud?
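To make the one-time phrase option concrete, here is a minimal sketch of how it might be set up. Everything in it is an assumption for illustration: the names `make_reward_phrases` and `RewardVerifier` are hypothetical, and the scheme (the verifier stores only SHA-256 digests of random phrases and strikes each one off when it is redeemed) is just one plausible reading of "precompute and memorize". It only addresses the reward checker itself; it does nothing about phrases leaking from the humans who hold them.

```python
import hashlib
import secrets


def make_reward_phrases(n):
    """Return n random one-time phrases (kept by the humans) and
    the set of their SHA-256 digests (handed to the verifier)."""
    phrases = [secrets.token_hex(16) for _ in range(n)]
    digests = {hashlib.sha256(p.encode()).hexdigest() for p in phrases}
    return phrases, digests


class RewardVerifier:
    """Hypothetical reward checker. It stores digests only, so reading
    its memory does not directly reveal any unused phrase."""

    def __init__(self, digests):
        self._unused = set(digests)

    def redeem(self, phrase):
        """Grant a reward once per valid phrase; reject replays and guesses."""
        digest = hashlib.sha256(phrase.encode()).hexdigest()
        if digest in self._unused:
            self._unused.remove(digest)  # each phrase works exactly once
            return True
        return False


phrases, digests = make_reward_phrases(3)
verifier = RewardVerifier(digests)
first = verifier.redeem(phrases[0])    # valid, unused phrase
replay = verifier.redeem(phrases[0])   # same phrase a second time
guess = verifier.redeem("open sesame") # not one of the phrases
```

Here `first` is accepted while `replay` and `guess` are rejected, and the number of rewards that can ever be granted is capped at the number of phrases generated up front.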
In the precomputed, one-time cookie case, I'd just make the human think about the reward phrase. I'm sure humans leak thoughts like a sieve through subvocalization, nerve impulses, etc.
If it's worth saying, but not worth its own post (even in Discussion), then it goes here.