Ok, I take it by "one-way-blind" you mean that each layer gets no new information that is not already in its database, but what is explicitly controlled by the humans. (E.g. I guess each layer should know the human query, in order to evaluate if AI's answer is manipulative.)
I also understand that we do look at complex information given by the AI, but only if the security bit signals "ok".
Ideally the AI […] knows as little as possible about humans and about our universe's physics.
That seems problematic, as these kinds of knowledge will be crucial for the optimization we want the AI to calculate.
Update 2013-09-05.
I have since played two more AI box experiments after this one, winning both.
Update 2013-12-30:
I have lost two more AI box experiments, and won two more. Current Record is 3 Wins, 3 Losses.