What if "winning" consists of finding a new path not already explored-and-foreclosed? For example, each time you are faced with a list of choices of what to do, there's a final choice "I have an idea not listed here" where you get to submit a plan of action. This goes into a moderation engine where a chain of people get to shoot down the idea or approve it to pass up the chain. If the idea gets convincingly shot down (but still deemed interesting), it gets added to the story as a new branch. If it gets to the top of the moderation chain and makes EY go "Hm, that might work" then you win the game.
Could someone kindly explain why these two sentences are not contradictory?
Why doesn't it work to make an unaligned AGI that writes the textbook, then have some humans read and understand the simple, robust ideas, and then build a new aligned AGI from those ideas? If the ideas are actually simple and robust, it should be impossible to tunnel a harmful AGI through them, right? If the unaligned AGI refuses to write the textbook, couldn't we just delete it and try again? Or is the claim that we wouldn't be able to distinguish between this textbook and a piece of world-ending mind control? (Apologies if this is answered elsewhere in the literature; it's hard to search for.)
"Cut the red wire" is not an instruction that you would find in a textbook on bomb defusal, precisely because it is not robust.