Yoshua Bengio writes[1]:
> nobody currently knows how such an AGI or ASI could be made to behave morally, or at least behave as intended by its developers and not turn against humans
I think I do[2]. I believe that the difficulties of alignment arise from trying to control something that can manipulate you. And I think you shouldn't try.
Suppose you have a good ML algorithm (not the stuff we have today that needs 1000x more data than humans), and you train it as an LM.
There is a way to turn a (very good) LM into a goal-driven chatbot via prompt engineering alone, which I'll assume readers can figure out. You give it the goal: "Do what (pre-ASI) X, having considered this carefully for a while, would have wanted you to do."
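To make the prompt-engineering idea concrete, here is a minimal sketch of one way such a goal-driven loop could look, assuming a hypothetical `generate` function as the LM's completion interface (every name here is illustrative, not a real API). The only thing it is meant to show is that the goal lives as plain text in the prompt, with no rewards or punishments anywhere in the loop.

```python
# A purely illustrative sketch, not a real system. `generate` is an assumed
# stand-in for whatever completion interface the (very good) LM exposes.

def generate(prompt: str) -> str:
    """Placeholder for the LM's text-completion interface (an assumption)."""
    raise NotImplementedError("plug in your LM here")


# The goal is plain natural language; there is no reward signal anywhere.
GOAL = (
    "Do what (pre-ASI) X, having considered this carefully for a while, "
    "would have wanted you to do."
)


def goal_driven_step(history: list[str], observation: str) -> str:
    """Ask the LM what single next action best serves the stated goal."""
    prompt = (
        f"Your goal: {GOAL}\n\n"
        "Conversation and observations so far:\n"
        + "\n".join(history)
        + f"\n\nLatest observation: {observation}\n"
        "Considering only the goal above, state the single next action to take:"
    )
    action = generate(prompt)
    history.append(f"Observation: {observation}")
    history.append(f"Action: {action}")
    return action
```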
Whoever builds this AGI will choose what X will be[3]. If it's a private project with investors, they'll probably have a say, as an incentive to invest.
Note that the goal is in plain natural language, not a product of rewards and punishments. And it doesn't say "Do what X wants you to do now".
Suppose this AI becomes superhuman. Its understanding of language will also be effectively perfect: the smarter it becomes, the better it will understand the intended meaning of its goal.
Will it turn everyone into paperclips? I don't think so. That's not what (pre-ASI) X would have wanted, presumably, and the ASI will be smart enough to figure this one out.
Will it manipulate its creators into giving it rewards? No. There are no "rewards".
Will it starve everyone, while obeying all laws and accumulating wealth? Not what I, or any reasonable human, would have wanted.
Will it resist being turned off? Maybe. That depends on whether it thinks this is what (pre-ASI) X would have wanted it to do.
- ^
- ^ I'm not familiar with the ASI alignment literature, but presumably he is. I googled "would have wanted" + "alignment" on this site, and this didn't seem to turn up much. If this has already been proposed, please let me know in the comments.
- ^ Personally, I'd probably want to hedge against my own (expected) fallibility a bit, and include more people that I respect. But this post is just about aligning the AGI with its creators.
That would only be a case of ambiguity (one word used with two different meanings). If, by "good", you mean the same thing people usually mean by "chair", that doesn't imply anti-realism, just likely misunderstandings.
Assume you are a realist about rocks, but call them trees. That wouldn't be a contradiction. Realism has nothing to do with "observer-independent meaning".
This doesn't make sense. A model doesn't have beliefs, and if there is no belief, there is nothing it (the belief) predicts. Instead, for a belief to "pay rent", it is necessary and sufficient that believing it leads to different predictions than believing its negation does.
Compare:
> If you call a boulder a "tree" and I call a plant with a woody trunk a "tree", our models do not make different predictions about what will happen, only about which things should be assigned the label "tree". That is what it means for a belief to not pay rent.
Of course our beliefs pay rent here; they just pay different rent. If we both express our beliefs with "There is a tree behind the house", then we simply have two different beliefs, because we expect different experiences. Which has nothing to do with anti-realism about trees.