How would humanity harness the AI's potential when the only information that escapes the system is a status bit? (Maybe I misunderstood your model.)
Bearing in mind that most of my thinking on AI is more in the vein of cooking up science fiction plots ...
The idea I was trying to evoke is this: if you get a "fail" response from the system, you completely wipe the memory and never look at what happened. If you get a "pass," you stop everything and check, layer by layer, that none of the gatekeepers were tricked or blackmailed in any obvious way, and only then do you tentatively look at what the AI was doing.
Ideally the AI has no idea how many layers of containment it's in, ...
Update 2013-09-05:
I have since played two more AI box experiments after this one, winning both.
Update 2013-12-30:
I have lost two more AI box experiments and won two more. My current record is 3 wins and 3 losses.