(Response to this challenge)
I've read two things recently on similar strategies: Zvi's post on ChatGPT, and Scott Alexander's post on Redwood Research.
They both seem to have a similar strategy, to train AI to not do misaligned things by giving it a bunch of examples of misaligned things, and saying don't do that.
They both fail to out of training distribution errors. ChatGPT with asking it to pretend, or even just to speak in uwu voice, and Redwood with poetic descriptions, plus weirder things.
The Redwood post shows that they thought about this problem. They have a step for "find out of distribution errors, and train more, especially on them." Plus an idea that this... (read 198 more words →)
I think I agree with a vibe I see in the comments that an AI that causes this problem is perhaps threading a very small needle.
Yudkowsky wrote The Hidden Complexity of Wishes to explain that a genie that does what you say will almost certainly cause problems. If people have this kind of Superintelligence, it won't take long before someone asks it to get them as many paperclips as possible and we all die. The kind of AI that does what humans want without killing everyone is one that does what we mean.
But how does this work? If you ask such a superintelligence to pull your kid from the rubble of a... (read more)