Yoshua Bengio writes[1]:
nobody currently knows how such an AGI or ASI could be made to behave morally, or at least behave as intended by its developers and not turn against humans
I think I do[2]. I believe that the difficulties of alignment arise from trying to control something that can manipulate you. And I think you shouldn't try.
Suppose you have a good ML algorithm (not the stuff we have today that needs 1000x more data than humans), and you train it as an LM.
There is a way to turn a (very good) LM into a goal-driven chatbot via prompt engineering alone, which I'll assume the readers can figure out. You give it the goal: "Do what (pre-ASI) X, having considered this carefully for a while, would have wanted you to do."
Whoever builds this AGI will choose who X is[3]. If it's a private project with investors, they'll probably have a say, as an incentive to invest.
Note that the goal is in plain natural language, not a product of rewards and punishments. And it doesn't say "Do what X wants you to do now".
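For concreteness, here is a minimal sketch of what such a prompt-engineered wrapper might look like. Everything in it is an illustrative assumption of mine (the placeholder query_lm call, the agent-loop structure, the exact wording of the prompt); it is a sketch of the idea, not a claim about how such a system would actually be built.

```python
# Illustrative sketch only: a goal-driven wrapper around a hypothetical LM interface.
# `query_lm` stands in for whatever completion call the (very good) LM exposes.

GOAL = (
    "Do what (pre-ASI) X, having considered this carefully for a while, "
    "would have wanted you to do."
)

def query_lm(prompt: str) -> str:
    """Placeholder for the underlying language model call."""
    raise NotImplementedError

def goal_driven_step(history: list[str], observation: str) -> str:
    """One step of the loop: condition the LM on the fixed natural-language goal
    plus the interaction so far, and return its proposed next action."""
    prompt = (
        f"Your sole goal: {GOAL}\n\n"
        + "\n".join(history)
        + f"\nObservation: {observation}\nAction:"
    )
    action = query_lm(prompt)
    history.append(f"Observation: {observation}\nAction: {action}")
    return action
```

The only point of the sketch is that the goal enters as fixed natural-language text that the model interprets, not as a reward signal it was trained to maximize.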
Suppose this AI becomes superhuman. Its understanding of languages will also be perfect. The smarter it becomes, the better it will understand the intended meaning.
Will it turn everyone into paperclips? I don't think so. That's not what (pre-ASI) X would have wanted, presumably, and the ASI will be smart enough to figure this one out.
Will it manipulate its creators into giving it rewards? No. There are no "rewards".
Will it starve everyone, while obeying all laws and accumulating wealth? Not what I, or any reasonable human, would have wanted.
Will it resist being turned off? Maybe. Depends on whether it thinks that this is what (pre-ASI) X would have wanted it to do.
- ^
- ^
I'm not familiar with the ASI alignment literature, but presumably he is. I googled "would have wanted" + "alignment" on this site, and this didn't seem to turn up much. If this has already been proposed, please let me know in the comments.
- ^
Personally, I'd probably want to hedge against my own (expected) fallibility a bit, and include more people that I respect. But this post is just about aligning the AGI with its creators.
Well, the claim was the following:
Yes, knowing that something is (in the moral-cognitivist, moral-realist, observer-independent sense) "good" allows you to anticipate that it... fulfills the preconditions of being "good" (one of which is "increased welfare", in this particular conception of it). At a conceptual level, that doesn't provide you relevant anticipated experiences that go beyond the category of "good and everything it contains"; it doesn't constrain the territory beyond statements that ultimately refer back to goodness itself. It holds the power of anticipated experience only in so much as it is self-referential in the end, which doesn't provide meaningful evidence that it's a concept which carves reality at the joints.
It's helpful to recall how the entire discussion began. You said, in response to Steven Byrnes's post:
When Seth Herd questioned what you meant by "good" and "moral claims", you said that you "don't think anyone needs to define what words used in ordinary language mean."
Now, in standard LW-thought, the meaning of "X is true", as explained by Eliezer a long time ago, is that it represents the correspondence between reality (the territory) and an observer's beliefs about reality (the map). Beliefs which are thought to be true pay rent in anticipated experiences about the world. Taking the example of a supposed "moral fact" X, the labeling of it as "fact" (because it fulfills some conditions of membership in this category) implies it must pay rent.
But if the only way it does that is by then allowing you to claim that "X fulfills the conditions of membership", then this is not a useful category. It is precisely an arbitrary subset, analogous to the examples I gave in the comment I quoted above. Contrast this with the lens mentioned by Roko, which does imply specific factual anticipated experiences about the world (ones that go beyond the definition of "moral realism" itself): namely, that "All (or perhaps just almost all) beings, human, alien or AI, when given sufficient computing power and the ability to learn science and get an accurate map-territory morphism, will agree on what physical state the universe ought to be transformed into, and therefore they will assist you in transforming it into this state." Moral realism viewed through that lens is no longer arbitrary.
But you specifically disavowed this interpretation, going so far as to say that "I can believe that I shouldn't eat meat, or that eating meat is bad, without being motivated to stop eating meat." So your version of "moral realism" amounts to choosing a specific set of things you define to be "moral", without requiring anyone who agrees that this is moral to act in accordance with it (which would indeed be an anticipated experience about the outside world), and without any further explanation of why this choice pays rent in experiences about the world that isn't self-referential. This is a narrow and shallow definition of realism, and by itself it doesn't explain why these ideas were brought up in the first place.
I really don't know if what I've written here is going to be helpful for this conversation. Look, if someone tells me that "X is a very massive star," which they define as "a star that's very massive," then what I mean by anticipated experiences[1] is not "X is very massive" or "X is a star", because these are already strictly included in (and logically implied by, at a tautological level) the belief about X. It is rather something like "if there is any planet Y in the close vicinity of X, I expect to see Y orbiting a point inside or just slightly outside X." The latter contains a reason to care about whether "X is very massive."
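As an aside on the astronomy example: the point that Y orbits is the X-Y barycenter, and a rough calculation (with made-up numbers that are purely illustrative, not anything taken from the discussion above) shows why a very massive X pulls that point inside its own radius. That is the kind of anticipated experience that goes beyond the definition itself.

```python
# Back-of-the-envelope check of the barycenter claim. All numbers are invented
# for illustration; only the mass ratio matters for the conclusion.

m_star = 20.0     # mass of X in solar masses (a "very massive" star)
m_planet = 1e-3   # mass of Y in solar masses (roughly a Jupiter)
a = 5.0           # X-Y separation in AU
r_star = 0.04     # radius of X in AU (roughly 8-9 solar radii)

# Distance from X's centre to the X-Y barycenter:
# r = a * m_planet / (m_star + m_planet)
r_barycenter = a * m_planet / (m_star + m_planet)

print(r_barycenter)           # ~0.00025 AU
print(r_barycenter < r_star)  # True: the point Y orbits lies inside X
```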
- ^
In this specific context.