User Comment Replies

"Cut the red wire" is not an instruction that you would find in a textbook on bomb defusal, precisely because it is not robust.

1Tapatakt2y

I'm not sure I understand correctly what you mean by "robust". Can you elaborate?

AGI Ruin: A List of Lethalities

AdamB3y50

What if "winning" consists of finding a new path not already explored-and-foreclosed? For example, each time you are faced with a list of choices of what to do, there's a final choice "I have an idea not listed here" where you get to submit a plan of action. This goes into a moderation engine where a chain of people get to shoot down the idea or approve it to pass up the chain. If the idea gets convincingly shot down (but still deemed interesting), it gets added to the story as a new branch. If it gets to the top of the moderation chain and makes EY go "Hm, that might work" then you win the game.

4Thane Ruthenis3y

Mmm. If the CYOA idea is implemented as a quirky-but-primarily-educational article, then sure, integrating the "adapt to feedback" capability like this would be worthwhile. Might also attach a monetary prize to submitting valuable ideas, by analogy to the ELK contest. For a game-like implementation, where you'd be playing it partly for the fun/challenge of it, that wouldn't suffice. The feedback loop's too slow, and there'd be an ugh-field around the expectation that submitting a proposal would then require arguing with the moderators about it, defending it. It wouldn't feel like a game. It'd make the upkeep cost pretty high, too, without a corresponding increase in the pay-off. Just making it open-ended might work, even without the moderation engine? Track how many branches the player explored; once they've explored a lot (i.e., are expected to "get" the full scope of the problem), an option appears for something like "I really don't know what to do, but we should keep trying", leading to some appropriately-subtle and well-integrated call to support alignment research? Not excited about this approach either.

AGI Ruin: A List of Lethalities

AdamB3y65

Could someone kindly explain why these two sentences are not contradictory?

1. "If a textbook from one hundred years in the future fell into our hands, containing all of the simple ideas that actually work robustly in practice, we could probably build an aligned superintelligence in six months."
2. "There is no pivotal output of an AGI that is humanly checkable and can be used to safely save the world but only after checking it."

Why doesn't it work to make an unaligned AGI that writes the textbook, then have some humans read and understand the simple robust...

Tapatakt3y112

simple and robust != checkable

Imagine you have to defuse a bomb, and you know nothing about bombs, and someone tells you "cut the red one, then blue, then yellow, then green". If this really is a way to defuse a bomb, it is simple and robust. But you (since you have no knowledge about bombs) can't check it, you can only take it on faith (and if you tried it and it's not the right way - you're dead).

2SurvivalBias3y

What Steven Byrnes said, but also my reading is that 1) in the current paradigm it's near-damn-impossible to build such an AI without creating an unaligned AI in the process (how else do you gradient-descend your way into a book on aligned AIs?) and 2) if you do make an unaligned AI powerful enough to write such a textbook, it'll probably proceed to convert the entire mass of the universe into textbooks, or do something similarly incompatible with human life.

3Steven Byrnes3y

I think it’s the last thing you said. I think the claim is that there are very convincing possible fake textbooks, such that we wouldn’t be able to see anything wrong or fishy about the fake textbook just by reading it, but if we used the fake textbook to build an AGI then we would die.

All of AdamB's Comments + Replies