I'm similarly confused. My instinct is that P( AI is safe ) == P( AI is safe | AI said X AND gatekeeper can't identify safe AI ). The standard assumption is that ( AI significantly smarter than gatekeeper ) => ( gatekeeper can't identify safe AI ), so the gatekeeper's estimate should never move from its prior, no matter what X the AI says.
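To make that intuition concrete, here is a minimal sketch of the Bayes update, assuming that "gatekeeper can't identify safe AI" means a safe AI and an unsafe AI are equally likely to say X. The function name `posterior_safe` and all the numbers are made up for illustration; nothing here comes from an actual experiment.

```python
def posterior_safe(prior_safe, p_x_given_safe, p_x_given_unsafe):
    """Bayes' rule for P(AI is safe | AI said X)."""
    # Total probability of hearing X from either kind of AI.
    p_x = p_x_given_safe * prior_safe + p_x_given_unsafe * (1 - prior_safe)
    return p_x_given_safe * prior_safe / p_x


prior = 0.1  # hypothetical prior that the AI is safe

# Gatekeeper can't tell safe from unsafe: X is equally likely either way,
# so the likelihood ratio is 1 and the posterior stays at the prior.
indistinguishable = posterior_safe(prior, 0.7, 0.7)

# For contrast: if X really were less likely from an unsafe AI,
# hearing X would move the estimate upward.
distinguishable = posterior_safe(prior, 0.7, 0.35)

print(indistinguishable, distinguishable)
```

Under that assumption the likelihoods cancel and the posterior is just the prior. The estimate only moves when what the AI says is more likely from one kind of AI than the other, which is exactly what the standard assumption rules out.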
Update 2013-09-05:
I have since played two more AI box experiments, winning both.
Update 2013-12-30:
I have lost two more AI box experiments, and won two more. Current record is 3 wins, 3 losses.