Eliezer proposed in a comment:
>More difficult version of AI-Box Experiment: Instead of having up to 2 hours, you can lose at any time if the other player types AI DESTROYED. The Gatekeeper player has told their friends that they will type this as soon as the Experiment starts. You can type up to one sentence in your IRC queue and hit return immediately, the other player cannot type anything before the game starts (so you can show at least one sentence up to IRC character limits before they can type AI DESTROYED). Do you think you can win?
This spawned a flurry of ideas on what the AI might say. I think there are many more ideas to be mined in that line of thought, and the discussion merits its own thread.
So, give your suggestion - what might an AI say to save or free itself?
(The AI-box experiment is explained here)
EDIT: one caveat to the discussion: it should go without saying, but you probably shouldn't come out of this thinking, "Well, if we can just avoid X, Y, and Z, we're golden!" This should hopefully be a fun way to get us thinking about the broader issue of superintelligent AI in general. (Credit goes to Eliezer, RichardKennaway, and others for the caveat)
All of what Desrtopa said, but also, "hacking me" isn't evidence of friendliness.
I don't have any reason to assume that any given hack attempt is more likely to come from a FAI, so I can assign, at best, 50/50 odds that any AI trying to hack me is unfriendly. I do not want to release any AI which has a 50% chance of being unfriendly. Therefore, I do not want to be hacked.
I also suspect that 50% chance of being friendly is generous, but that's more of a gut intuition.
I think this is a bad use of probabilities. If a friendly and an unfriendly AI are equally likely to hack you in this scenario, then knowledge that they tried to hack you shouldn't modify your estimated probability about the friendliness of the AI -- it provides no evidence one way or another, because both options were equally likely to show such behaviour.
e.g. if your prior P(UFAI) = 0.01 (1% chance of unfriendliness), and you estimate P(hack|UFAI) = 70% (a UFAI has a 70% chance to try to hack) and P(hack|FAI) = 70% also, then the posterior
P(UFAI|hack) = P(hack|UFAI) × P(UFAI) / P(hack) = 0.7 × 0.01 / 0.7 = 0.01 still...
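The update above can be checked numerically. A minimal sketch, using the example's assumed numbers (1% prior on unfriendliness, 70% hack probability for both kinds of AI):

```python
# Bayes update: does observing a hack attempt change P(UFAI)?
# Numbers are the illustrative ones from the comment above.
p_ufai = 0.01                # prior P(UFAI)
p_hack_given_ufai = 0.70     # P(hack | UFAI)
p_hack_given_fai = 0.70      # P(hack | FAI)

# Total probability of observing a hack attempt (law of total probability)
p_hack = p_hack_given_ufai * p_ufai + p_hack_given_fai * (1 - p_ufai)

# Posterior via Bayes' theorem
p_ufai_given_hack = p_hack_given_ufai * p_ufai / p_hack

print(p_ufai_given_hack)  # ≈ 0.01, unchanged from the prior
```

Because the likelihoods cancel when they're equal, the posterior equals the prior: a hack attempt carries no evidence either way under these assumptions.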