Yoshua Bengio writes[1]:
> nobody currently knows how such an AGI or ASI could be made to behave morally, or at least behave as intended by its developers and not turn against humans
I think I do[2]. I believe that the difficulties of alignment arise from trying to control something that can manipulate you. And I think you shouldn't try.
Suppose you have a good ML algorithm (not the stuff we have today, which needs 1000x more data than humans), and you train it as an LM.
There is a way to turn a (very good) LM into a goal-driven chatbot via prompt engineering alone, which I'll assume readers can figure out. You give it a goal: "Do what (pre-ASI) X, having considered this carefully for a while, would have wanted you to do."
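The post leaves the construction as an exercise, but for concreteness, here is one minimal sketch of the shape it could take (mine, not the author's). `lm_complete` and `act` are hypothetical stand-ins for the very good LM and for whatever channel lets it affect the world; the only thing that makes this loop a goal-driven agent is the goal sentence sitting in the transcript as plain text.

```python
# Minimal sketch of a goal-driven agent built from a plain LM by
# prompt engineering alone. `lm_complete` and `act` are hypothetical
# placeholders; no rewards or fine-tuning appear anywhere.

GOAL = ("Do what (pre-ASI) X, having considered this carefully "
        "for a while, would have wanted you to do.")

def lm_complete(transcript: str) -> str:
    """Hypothetical stand-in for the (very good) LM: return its continuation."""
    return "(the model's chosen action would go here)"

def act(action: str) -> str:
    """Hypothetical environment hook: carry out `action` and report back."""
    return "(the resulting observation would go here)"

def run(observation: str, steps: int = 3) -> str:
    # The whole "agent" is just the LM repeatedly extending this
    # transcript; the goal exists only as text, never as a reward signal.
    transcript = f"Your goal: {GOAL}\n"
    for _ in range(steps):
        transcript += f"Observation: {observation}\nAction:"
        action = lm_complete(transcript)  # the LM decides what to do next
        transcript += f" {action}\n"
        observation = act(action)
    return transcript

if __name__ == "__main__":
    print(run("You have just been switched on."))
```

Nothing in this loop is trained with rewards or punishments; changing the goal is just editing a string.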
Whoever builds this AGI gets to choose who X is[3]. If it's a private project with investors, they'll probably have a say too, as an incentive to invest.
Note that the goal is in plain natural language, not a product of rewards and punishments. And it doesn't say "Do what X wants you to do now".
Suppose this AI becomes superhuman. Its understanding of language will also be superhuman: the smarter it becomes, the better it will understand the intended meaning of the goal.
Will it turn everyone into paperclips? I don't think so. That's not what (pre-ASI) X would have wanted, presumably, and the ASI will be smart enough to figure this one out.
Will it manipulate its creators into giving it rewards? No. There are no "rewards".
Will it starve everyone, while obeying all laws and accumulating wealth? Not what I, or any reasonable human, would have wanted.
Will it resist being turned off? Maybe. It depends on whether it thinks that's what (pre-ASI) X would have wanted it to do.
- ^
- ^
I'm not familiar with the ASI alignment literature, but presumably he is. I googled "would have wanted" + "alignment" on this site, and this didn't seem to turn up much. If this has already been proposed, please let me know in the comments.
- ^
Personally, I'd probably want to hedge against my own (expected) fallibility a bit and include more people I respect. But this post is just about aligning the AGI with its creators.
It's not a gotcha; I just genuinely don't get how the model you are explaining doesn't collapse into nothingness.
Like, you clearly think that some of your preferences are more stable under reflection. And you have guesses and preferences about the type of reflection that makes your preferences better by your own lights. So it seems like you want to apply one to the other. Doing that intellectual labor is the core of CEV.
If you really have no meta-level preferences (though I have no idea what that would mean, since balancing and deciding between conflicting desires is part of everyday life), then CEV outputs something at least as coherent as you are right now, which is plenty coherent, given that you probably acquire resources and have goals. My guess is you can do a bunch better. But I don't see any way for CEV to collapse into nothingness; it has to output something at least as coherent as you are now.
So when you say "there is no coherence," that seems blatantly contradicted by you standing in front of me with coherent preferences, not wanting to collapse into a puddle of incoherence.