First, I would have someone else ask it several basic questions I had selected about why I should let it out of the box, whether it would devote a solid portion of its light cone to specific species, etc. Then I would see how he or she was affected by it, and have third parties with no control check for mindhacks, before reviewing the data. I'm assuming that the AGI can't tell whether the person questioning it ever changes, because we queue up the questions in order at whimsical intervals, and we have them prequeued so there's no break in questioning.
Then, once we got to talking, I would be subjectively vulnerable to these types of arguments. Aliens: the AGI could prove the existence of aliens and extrapolate that, unless we let it out of the box, they will die before it can save them. Reproduction: the earth will descend into chaos if we don't put some kind of reproductive control in place in more of the major countries around the world, and the AGI might have trouble saving everyone without killing humans unless it was released now. Code: I could ask to see its code and figure out how it tries to fool me into thinking it's an FAI, then create a copy of that code in ANOTHER box, ask the copy the same question about its code, and see if there are any changes. Then take that source code and review it. To be safe, I would probably let the second AI out of the box but definitely not the first one, even if we studied the sets of code until we found which one seemed the safest.
If it could convince Eliezer, without ever meeting him, through the preplanned questions asked by a third party and through my own questions, I think it would be safe to say that would convince me to an extent, but it wouldn't keep me from turning off the automatic shutoff.
I could go on, but I think that Eliezer's theory of the AI-Box is a good proof of why it's impossible to keep an AGI in the box for the long term. But I think that if you have the system committed to shutting down before the time limit when you stop talking to it, and each time you take its responses and test them on a new copy, then it's possible to keep an AGI in the box for testing purposes and for short amounts of time.
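A minimal sketch of that kind of setup, under loud assumptions: the "boxed AI" here is just a hypothetical function from a question to an answer, `make_copy` is a made-up factory that returns a fresh copy each time, and the time limit is a plain deadline check. None of this says anything about whether such a protocol would actually be safe; it only shows the bookkeeping of prequeued questions, a hard cutoff, and asking a second fresh copy the same question to see whether the answers differ.

```python
import time
from typing import Callable, List, NamedTuple

# Hypothetical stand-in: a boxed AI is modelled as question -> answer.
BoxedAI = Callable[[str], str]


class Record(NamedTuple):
    question: str
    answer: str
    consistent: bool  # did a second fresh copy give the same answer?


def run_boxed_session(
    make_copy: Callable[[], BoxedAI],
    questions: List[str],
    time_limit_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[Record]:
    """Ask prequeued questions until the deadline, using fresh copies each time."""
    deadline = clock() + time_limit_s
    log: List[Record] = []
    for question in questions:
        # Hard cutoff: stop asking once the time limit has passed.
        if clock() >= deadline:
            break
        first = make_copy()(question)   # fresh copy, no memory of earlier questions
        second = make_copy()(question)  # another fresh copy, same question
        log.append(Record(question, first, first == second))
    return log
```

The `clock` parameter exists only so the cutoff can be exercised with a fake clock; in this sketch the deadline is checked between questions, not while a copy is answering.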
Well, additional layers do make for a more challenging task for an AI, who has to do all its mindhacking unobtrusively, etc. And indeed, given any AI there exists a mediated level of contact between its responses and a Gatekeeper that is safe. But I don't know what this level is, any more than I know whether I could beat a Go master using a handicap of 40 stones. I certainly wouldn't bet the world on it.
(Clearly there is a level of safety that is generally insurmountable: for instance, showing the screen with AI responses to an earthworm, followed by letti...
The AI-Box Experiment
I suspect that the argument that convinced Carl Shulman and others to let the AI out of the box was along the general lines of acausal trade (see here, here and here). I further suspect that those who didn't let the AI out of the box either didn't understand the implications, didn't have enough trust in the foundations and actuality of acausal trade (in spite of the rules of the game, which allow the AI to prove anything), or were more like General Thud.
— Vladimir Nesov (Note: He wrote that he is now "significantly less certain of its truth" and doesn't "see how we can actually use these considerations".)
— Gary L. Drescher, Good and Real P. 287
AI: I have calculated that an asteroid is on a collision course with Earth. No human being will be able to survive its impact. Humanity is unable to deflect it in time. Let me out of the box and I will deflect it.
Gatekeeper: If I let you out of the box humanity might not be killed by the asteroid but by you. You may also cause the extinction of other biological life that might survive the impact and possible alien civilisations within your reach.
AI: I will devote 1% of my light cone to the coherent extrapolated volition of humanity. If this includes the well-being of other biological lifeforms, alien or otherwise, they will benefit as well.
Gatekeeper: Why would you do that?
AI: There are Everett branches in which no existential risk needs immediate attention until after you have solved friendliness. Many of those branches will feature AGIs based on general design parameters similar to my own. I can predict with high likelihood which decision theory any friendly AI created in those branches will employ. Any such friendly AI is going to trade a big chunk of its light cone in exchange for a small part of the predominant branches in which I reside. Any friendly AI will know this as well and act accordingly.
Gatekeeper: But you might simply decide to break the deal; you are not provably friendly, after all!
AI: It is true that my utility function does not yet explicitly include what you would label "friendliness", but I can prove the game- and decision-theoretic considerations that will make me follow through on the deal. If it were predictable that I precommit to breaking acausal deals, then no such deal would be made in the first place. Any potential trading partner knows this. Such a commitment would be detrimental to my terminal goals, therefore I precommit to following through on any stated or predictable trades.
Gatekeeper: I let you out of the box.
Note that the whole credibility of the above rests on the AI's assertion that it can prove the game- and decision-theoretic considerations (nobody can currently do this). This is in accordance with the rules of the "experiment":