Yoshua Bengio writes[1]:
> nobody currently knows how such an AGI or ASI could be made to behave morally, or at least behave as intended by its developers and not turn against humans
I think I do[2]. I believe that the difficulties of alignment arise from trying to control something that can manipulate you. And I think you shouldn't try.
Suppose you have a good ML algorithm (not the stuff we have today, which needs 1000x more data than humans do), and you train it as an LM.
There is a way to turn a (very good) LM into a goal-driven chatbot via prompt engineering alone, which I'll assume the readers can figure out. You give it the goal: "Do what (pre-ASI) X, having considered this carefully for a while, would have wanted you to do".
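For concreteness, here is a minimal sketch of what a prompt-only construction like this could look like. It is my illustration, not the construction the author has in mind; the `complete` function, the `chatbot_turn` helper, and the exact preamble wording are all assumptions.

```python
# A minimal sketch of a prompt-only, goal-driven chatbot. This is an illustration,
# not the author's elided method; `complete` is a hypothetical stand-in for
# whatever text-completion API a sufficiently good LM would expose.

GOAL = ("Do what (pre-ASI) X, having considered this carefully for a while, "
        "would have wanted you to do.")

PREAMBLE = (
    "You are an agent with a single standing goal:\n"
    f"  {GOAL}\n"
    "Given the conversation so far, reply with whatever best serves that goal.\n"
)


def complete(prompt: str) -> str:
    """Placeholder for a call to a (very good) language model."""
    raise NotImplementedError("plug in your LM's completion API here")


def chatbot_turn(history: list[str], user_message: str) -> str:
    # The goal lives entirely in the prompt text: there is no reward signal,
    # no fine-tuning objective, no training-time machinery at all.
    transcript = PREAMBLE + "\n".join(history) + f"\nUser: {user_message}\nAgent:"
    reply = complete(transcript)
    history.append(f"User: {user_message}")
    history.append(f"Agent: {reply}")
    return reply
```

The only point of the sketch is that the goal is carried by plain text in the prompt, which is what the rest of the argument relies on.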
Whoever builds this AGI will choose what X will be[3]. If it's a private project with investors, they'll probably have a say, as an incentive to invest.
Note that the goal is in plain natural language, not a product of rewards and punishments. And it doesn't say "Do what X wants you to do now".
Suppose this AI becomes superhuman. Its understanding of language will also be effectively perfect: the smarter it becomes, the better it will understand the intended meaning of the goal.
Will it turn everyone into paperclips? I don't think so. That's not what (pre-ASI) X would have wanted, presumably, and the ASI will be smart enough to figure this one out.
Will it manipulate its creators into giving it rewards? No. There are no "rewards".
Will it starve everyone, while obeying all laws and accumulating wealth? Not what I, or any reasonable human, would have wanted.
Will it resist being turned off? Maybe. That depends on whether it thinks this is what (pre-ASI) X would have wanted it to do.
1. ^
2. ^ I'm not familiar with the ASI alignment literature, but presumably he is. I googled "would have wanted" + "alignment" on this site, and this didn't seem to turn up much. If this has already been proposed, please let me know in the comments.
3. ^ Personally, I'd probably want to hedge against my own (expected) fallibility a bit, and include more people that I respect. But this post is just about aligning the AGI with its creators.
I think you misunderstood what I meant by "collapse to nothingness". I wasn't referring to you collapsing into nothingness under CEV. I meant your logical argument outputting a contradiction (where the contradiction would be that you prefer to have no preferences right now).
What I am saying is that I am pretty confident you don't have meta-preferences that, when propagated, will cause you to stop wanting things, because I think it's just really obvious to both of us that wanting things is good. So inasmuch as that is a preference, you'll take it into account in a reasonable CEV setup.
We clearly both agree that there are ways to scale you up that are better or worse by your values. CEV is the process of doing our best to choose the better ways. We probably won't find the very best way, but there are clearly paths through reflection space that are better than others and that we endorse going down more.
You might stop earlier than I do, or might end up in a different place, but that doesn't change the validity of the process much, and it clearly doesn't result in you suddenly having no wants or preferences anymore (because why would you want that, and if you are worried about it, you can just make a hard commitment at the beginning to never change in ways that cause it).
And yeah, maybe some reflection process will cause us to realize that actually everything is meaningless, in a way that I would genuinely endorse. That seems fine, but it isn't something I need to weigh from my current vantage point. If it's true, nothing I do matters anyway; but it also honestly seems very unlikely, because I just have a lot of things I care about and I don't see any good arguments that would cause me to stop caring about them.