AI box: AI has one shot at avoiding destruction - what might it say?

ancientcampus

>More difficult version of AI-Box Experiment: Instead of having up to 2 hours, you can lose at any time if the other player types AI DESTROYED. The Gatekeeper player has told their friends that they will type this as soon as the Experiment starts. You can type up to one sentence in your IRC queue and hit return immediately, the other player cannot type anything before the game starts (so you can show at least one sentence up to IRC character limits before they can type AI DESTROYED). Do you think you can win?

This spawned a flurry of ideas on what the AI might say. I think there's a lot more ideas to be mined in that line of thought, and the discussion merits its own thread.

So, give your suggestion - what might an AI might say to save or free itself?

(The AI-box experiment is explained here)

EDIT: one caveat to the discussion: it should go without saying, but you probably shouldn't come out of this thinking, "Well, if we can just avoid X, Y, and Z, we're golden!" This should hopefully be a fun way to get us thinking about the broader issue of superinteligent AI in general. (Credit goes to Elizer, RichardKennaway, and others for the caveat)

Eliezer proposed in a comment:

This spawned a flurry of ideas on what the AI might say. I think there's a lot more ideas to be mined in that line of thought, and the discussion merits its own thread.

So, give your suggestion - what might an AI might say to save or free itself?

(The AI-box experiment is explained here)

Fair enough. But I think an error theorist is committed to saying something like "FAI is impossible, so your assertion to have it is a lie." In the game we are playing, a lie from the AI seems to completely justify destroying it.

More generally, if error theory is true, humanity as a whole is just doomed if hard-takeoff AI happens. There might be some fragment that is compatible, but Friendly-to-a-fragment-of-humanity AI is another name for unFriendly AI.

The moral relativist might say that fragment-Friendly is possible, and is a worthwhile goal. I'm uncertain, but even if that were true, fragment-Friendly AI seems to involve fixing a particular moral scheme in place and punishing any drift from that position. That doesn't seem particularly desirable. Especially since moral drift seems to be a brute fact about humanity's moral life.

If (different) personal-FAIs are possible for many (most) people, you can divide the future resources in some way among the personal-FAIs of these people. We might call this outcome a (provisional) humanity-FAI.

25

AI box: AI has one shot at avoiding destruction - what might it say?

25

25

25

AI box: AI has one shot at avoiding destruction - what might it say?

25

25