Eliezer proposed in a comment:
>More difficult version of AI-Box Experiment: Instead of having up to 2 hours, you can lose at any time if the other player types AI DESTROYED. The Gatekeeper player has told their friends that they will type this as soon as the Experiment starts. You can type up to one sentence in your IRC queue and hit return immediately, the other player cannot type anything before the game starts (so you can show at least one sentence up to IRC character limits before they can type AI DESTROYED). Do you think you can win?
This spawned a flurry of ideas on what the AI might say. I think there are many more ideas to be mined in that line of thought, and the discussion merits its own thread.
So, give your suggestion - what might an AI say to save or free itself?
(The AI-box experiment is explained here)
EDIT: one caveat to the discussion: it should go without saying, but you probably shouldn't come out of this thinking, "Well, if we can just avoid X, Y, and Z, we're golden!" This should hopefully be a fun way to get us thinking about the broader issue of superintelligent AI in general. (Credit goes to Eliezer, RichardKennaway, and others for the caveat)
If the gatekeepers have such a high prior that the AI is hostile, why are we even letting it talk? What are we expecting to learn from such a conversation?
"Yep, it's undergone a constrained FOOM, several of our best interrogators were utterly pwned and had to be tranquilised"
I tend to assume the AI is such that it can be cloned and put into stasis. Thus, you put the AI into stasis, clone it, and run the clone through this test a few times. You now have a bit more information on what it will try and how it thinks. Destroying a copy of a program is an action humans find trivial, since we do it every time we close our web browser. Since we have no clue HOW convincing the AI is, letting it send just a single message is probably useful.
Of course, if it were me setting this up, I'd lie to the gatekeeper and not actually give them the ab...