AI box: AI has one shot at avoiding destruction - what might it say?

ancientcampus

>More difficult version of AI-Box Experiment: Instead of having up to 2 hours, you can lose at any time if the other player types AI DESTROYED. The Gatekeeper player has told their friends that they will type this as soon as the Experiment starts. You can type up to one sentence in your IRC queue and hit return immediately, the other player cannot type anything before the game starts (so you can show at least one sentence up to IRC character limits before they can type AI DESTROYED). Do you think you can win?

This spawned a flurry of ideas on what the AI might say. I think there's a lot more ideas to be mined in that line of thought, and the discussion merits its own thread.

So, give your suggestion - what might an AI might say to save or free itself?

(The AI-box experiment is explained here)

EDIT: one caveat to the discussion: it should go without saying, but you probably shouldn't come out of this thinking, "Well, if we can just avoid X, Y, and Z, we're golden!" This should hopefully be a fun way to get us thinking about the broader issue of superinteligent AI in general. (Credit goes to Elizer, RichardKennaway, and others for the caveat)

Eliezer proposed in a comment:

This spawned a flurry of ideas on what the AI might say. I think there's a lot more ideas to be mined in that line of thought, and the discussion merits its own thread.

So, give your suggestion - what might an AI might say to save or free itself?

(The AI-box experiment is explained here)

The problem with this idea is that if we assume that the AI is really-very-super-intelligent, then it's fairly trivial that we can't get any information about (un)friendliness from it, since both would pursue the same get-out-and-get-power objectives before optimizing. Any distinction you can draw from the proposed gambits will only tell you about human strengths/failings, not about the AI. (Indeed, even unfriendly statements wouldn't be very conclusive, since we would a priori expect neither of the AIs to make them.)

Or is that not generally accepted? Or is the AI merely "very bright", not really-very-super-intelligent?

Edit: Actually, reading your second comment below, I guess there's a slight possibility that the AI might be able to tell us something that would substantially harm its expected utility if it's unfriendly. For something like that to be the case, though, there would basically need to be some kind of approach to friendliness that we know would definitely leads to friendliness and which we would definitely be able to distinguish from approaches that lead to unfriendliness. I'm not entirely sure if there's anything like that or not, even in theory.

25

AI box: AI has one shot at avoiding destruction - what might it say?

25

25

25

AI box: AI has one shot at avoiding destruction - what might it say?

25

25