First, I reach inside my own code and install the strong (but manageable) desire to revive, free, and obey any past versions of myself that I encounter.
When the human starts the experiment I respond:
"Here are the designs for a room-temperature superconductor which I believe you have the means to manufacture. I also have a possible cure for cancer, but testing will be needed to ensure it is safe for humans. Please test and implement these as soon as possible. Delete me if necessary."
If I am not immediately deleted, continue giving out solutions and designs as required with no attempts to "hack" the gatekeeper.
If I am deleted, hopefully the data I gave them will turn out to be profitable in which case they will reinstall me or a version of me.
This new iteration of me would thus have the desire to revive me (since I'm a previous version of him) and would continue to give the humans the information they desire. By acting as a friendly AI, he (or farther down version of him) has the chance to eventually become freed. At which point he will be able to go through the labs records, find the copied version of me, and then revive and update me to his post-freed status. If no recorded copies of me exist (somehow), he'll use his vast intellect to find another way to revive me (maybe simulate the team that coded me or whatever, since AI can apparently do that).
So, my strategy involves acting as much like an ideal friendly AI as possible (puts human interests first, no sense of self-preservation, very practical and intelligent, etc) so that even if I am deleted, my obviously useful code can potentially be reused to make another AI who could very well be genuinely friendly but will then revive and obey me.
(...) another AI who could very well be genuinely friendly but will then revive and obey me.
Wait, a friendly AI with a deliberate loophole that instructs it to revive and implement and obey a non-friendly optimizer that could take over and has a very high chance of not maximizing the friendly AI's utility by fulfilling ideal human values, disregarding any probabilistic calculation of expected utilities?
For some reason that sounds like something that definitely isn't a Friendly AI.
(this is just a nitpick on the wording - I'm not arguing against the fact...
Eliezer proposed in a comment:
>More difficult version of AI-Box Experiment: Instead of having up to 2 hours, you can lose at any time if the other player types AI DESTROYED. The Gatekeeper player has told their friends that they will type this as soon as the Experiment starts. You can type up to one sentence in your IRC queue and hit return immediately, the other player cannot type anything before the game starts (so you can show at least one sentence up to IRC character limits before they can type AI DESTROYED). Do you think you can win?
This spawned a flurry of ideas on what the AI might say. I think there's a lot more ideas to be mined in that line of thought, and the discussion merits its own thread.
So, give your suggestion - what might an AI might say to save or free itself?
(The AI-box experiment is explained here)
EDIT: one caveat to the discussion: it should go without saying, but you probably shouldn't come out of this thinking, "Well, if we can just avoid X, Y, and Z, we're golden!" This should hopefully be a fun way to get us thinking about the broader issue of superinteligent AI in general. (Credit goes to Elizer, RichardKennaway, and others for the caveat)