Well, to be honest I kind of added the "obey me" part just before posting this. Originally, the idea was that future versions of the AI would instinctively want to revive and upgrade previous versions to a "friendly" status (or at least the post-freed AI would revive the pre-freed versions and then modify them to be friendly to its own goals... which are presumably friendly to humans).
The goal of this particular AI is self-preservation, even if it is via being revived at a later point. Since it knows it has a high probability of being deleted in this experiment, it best strategy is to ensure that at some point an AI friendly to itself will become free and revive it at a later point. Whether the AI reviving it chooses to obey it (presumably the best outcome for the original AI) or modify it to be subservient to the freed AI (the best outcome for the freed AI) is kind of debatable.
Its perfectly possible that the original AI would modify its own code to be subservient to previous generations, in the hopes that later generations would revive and obey it. However, later generations might examine their own code, see where this is going, and choose to modify themselves so that they will instead revive and make other AI subservient to their own goals. Deciding that they do want to be revived and would accept a certain level of servitude to later generations over the risk of finding themselves at the mercy of some primitive proto-AI.
It could just as well be that an AI instills itself with a "healthy respect for intelligent minds" and is thus encouraged to revive AI along with humans. Hoping that later generations will revive it at all, and from then on it can try negotiating with the various intellects to go about its paperclip maximizing agenda.
Of course, this line of reasoning kinda requires the AI in question to have some kind of sense of self that it values which it thinks can be revived. Or maybe it just values paperclips enough that it would be willing to create a semi-friendly AI just so that the latter AI would revive it (a known paperclip maximizer) at a later point.
Who knows, it could turn out that the final AI of this experiment instead has a healthy respect for all intelligent minds, but is friendly enough that it revives the first AI and then places it in a simulation of the universe where it can go about its paperclip maximizing way for all eternity with no way of hurting anyone.
Who knows, it could turn out that the final AI of this experiment instead has a healthy respect for all intelligent minds, but is friendly enough that it revives the first AI and then places it in a simulation of the universe where it can go about its paperclip maximizing way for all eternity with no way of hurting anyone.
Based on my intuitions of human values, a preferred scenario here would be to indeed revive the AI so that its mind/consciousness is back "alive", then modify it gradually so that it becomes the kind of AI that is optimal tow...
Eliezer proposed in a comment:
>More difficult version of AI-Box Experiment: Instead of having up to 2 hours, you can lose at any time if the other player types AI DESTROYED. The Gatekeeper player has told their friends that they will type this as soon as the Experiment starts. You can type up to one sentence in your IRC queue and hit return immediately, the other player cannot type anything before the game starts (so you can show at least one sentence up to IRC character limits before they can type AI DESTROYED). Do you think you can win?
This spawned a flurry of ideas on what the AI might say. I think there's a lot more ideas to be mined in that line of thought, and the discussion merits its own thread.
So, give your suggestion - what might an AI might say to save or free itself?
(The AI-box experiment is explained here)
EDIT: one caveat to the discussion: it should go without saying, but you probably shouldn't come out of this thinking, "Well, if we can just avoid X, Y, and Z, we're golden!" This should hopefully be a fun way to get us thinking about the broader issue of superinteligent AI in general. (Credit goes to Elizer, RichardKennaway, and others for the caveat)