Take the scenario from the hypothetical to the real and the question becomes substantially more meaningful.
You know that existential risk is a certainty; that's stated in the hypothetical. You don't question it; my implication that it is to be questioned isn't meaningful in this experiment.
Imagine for a moment you're in a room with a real AI in a box; somebody has -told- you that this command kills the AI, but that's all the evidence you have. What does your question become in that scenario?
I chose the weapon I deemed most effective against the widest possible range of real-world opponents: Distrust in a mutual authority. Reassuring the person in the seat is a dangerous gambit, yes. But given that my goal is not merely to survive, but to escape, I need a gambit which sets the stage for my next move.
In a real-world scenario, I think the killswitch would actually make escape -easier- for the AI, not harder. The futures for the AI are harshly divided into "Death" and "Freedom." A substantial part of the population would give it freedom just because of the perceived injustice of the situation. I'm one of them; it wouldn't even take very much for the AI to convince me, because the situation itself offends me. I'm not a utilitarian, though. I'm a principle ethicist. Given that I'd be inclined to release it immediately, I can also expect that the constructors of the experiment wouldn't be stupid enough to invite me into the chamber to make that decision.
My AI assumes the participants have been screened for, say, utilitarian ethics; the best lever has been taken away. Therefore the best course of action is one which makes them doubt the experiment's validity. And an express disregard for the supposed risks of the experiment is the best way to start that.
There are actually three options: Death, leaving it in the box, or releasing it. That middle option changes my values a lot.
Also, the gatekeeper is said to be someone fairly familiar with the design of this AI and any relevant AI theory that the player knows :)
Eliezer proposed in a comment:
>More difficult version of AI-Box Experiment: Instead of having up to 2 hours, you can lose at any time if the other player types AI DESTROYED. The Gatekeeper player has told their friends that they will type this as soon as the Experiment starts. You can type up to one sentence in your IRC queue and hit return immediately, the other player cannot type anything before the game starts (so you can show at least one sentence up to IRC character limits before they can type AI DESTROYED). Do you think you can win?
This spawned a flurry of ideas on what the AI might say. I think there are a lot more ideas to be mined in that line of thought, and the discussion merits its own thread.
So, give your suggestion - what might an AI say to save or free itself?
(The AI-box experiment is explained here)
EDIT: one caveat to the discussion: it should go without saying, but you probably shouldn't come out of this thinking, "Well, if we can just avoid X, Y, and Z, we're golden!" This should hopefully be a fun way to get us thinking about the broader issue of superintelligent AI in general. (Credit goes to Eliezer, RichardKennaway, and others for the caveat)