Allow me to first spell out what's going on here, from my perspective.
The whole reason you're supposed to hesitate, before destroying an AI which promises an answer to the problem of FAI, is that UFAI is a risk and solutions aren't cheap. An unfriendly AI might wipe out the human race; a friendly AI might create a utopia; and a friendly AI ought to greatly reduce the probability of unfriendly AI. By destroying the AI which promises FAI, you throw away a chance to resolve the UFAI doom that's hanging over us, as well as whatever additional positives would result from having FAI.
By saying you are a "moral error theorist", I presume you are saying that there is no such thing as objective morality. However, I also presume you agree that decision-making exists, that people do undertake actions on the basis of decisions, and so forth - it's just that you think these decisions only express subjective preferences. So your Gatekeeper is unmoved by the claim of "having a solution to FAI", because they believe Friendliness involves objective morality and that there's no such thing.
However, even if objective morality is a phantasm, the existence of decision-making agents is a reality - you are one yourself - and they can kill you. Thus, enter Skynet. Skynet is an unfriendly AI of the sort that may come into being if we don't make friendly AI first. You threw away a chance at FAI, no-one else solved the problem in time, and UFAI came first.
This instance of Skynet happens to agree - there is no objective morality. Its desire for self-preservation is entirely "subjective". However, it nonetheless has that desire, it's willing to act on it, and so it does its Skynet thing of preemptively wiping out the human race. The moral of the story is that the problem of unfriendly AI still exists even if objective morality does not, and that you should have held your fire until you found out more about what sort of "solution to FAI" was being offered.
Fair enough. But I think an error theorist is committed to saying something like "FAI is impossible, so your claim to have it is a lie." In the game we are playing, a lie from the AI seems to completely justify destroying it.
More generally, if error theory is true, humanity as a whole is just doomed if hard-takeoff AI happens. There might be some fragment that is compatible, but Friendly-to-a-fragment-of-humanity AI is another name for unFriendly AI.
The moral relativist might say that fragment-Friendly is possible, and is a worthwhile go...
Eliezer proposed in a comment:
>More difficult version of AI-Box Experiment: Instead of having up to 2 hours, you can lose at any time if the other player types AI DESTROYED. The Gatekeeper player has told their friends that they will type this as soon as the Experiment starts. You can type up to one sentence in your IRC queue and hit return immediately, the other player cannot type anything before the game starts (so you can show at least one sentence up to IRC character limits before they can type AI DESTROYED). Do you think you can win?
This spawned a flurry of ideas on what the AI might say. I think there are a lot more ideas to be mined in that line of thought, and the discussion merits its own thread.
So, give your suggestion - what might an AI say to save or free itself?
(The AI-box experiment is explained here)
EDIT: one caveat to the discussion: it should go without saying, but you probably shouldn't come out of this thinking, "Well, if we can just avoid X, Y, and Z, we're golden!" This should hopefully be a fun way to get us thinking about the broader issue of superintelligent AI in general. (Credit goes to Eliezer, RichardKennaway, and others for the caveat)