[Metadata: crossposted from https://tsvibt.blogspot.com/2023/08/human-wanting.html. First completed August 22, 2023.]
We have pretheoretic ideas of wanting that come from our familiarity with human wanting, in its variety. To see what way of wanting can hold sway in a strong and strongly growing mind, we have to explicate these ideas and create new ones.
Human wanting
The problem of AGI alignment is sometimes posed along these lines: How can you make an AGI that wants to not kill everyone, and also wants to do some other thing that's very useful?
What role is the idea of "wanting" playing here? It's a pretheoretical concept. It makes an analogy to humans.
The meaning of wanting
What does it say about a human? When a human wants X, in a deep sense, then:
And, it is at least sometimes feasible for a human to choose to want X, or even for a human to choose that another human will want X. Wanting is specifiable.
Baked-in claims
The concept of wanting is, like all concepts, problematic. It comes along with some claims:
These claims are dubious when applied to human wanting, and more dubious when applied to the wanting of other minds.
The variety of human wanting
If we follow the reference to wanting in humans, we find a menagerie of wants that are:
The role of human wanting
Human wanting plays two roles in AGI alignment:
Human "wanting" is a network of conjectural concepts
Our familiarity with human wanting can't be relied on too much without further analysis. We might observe behavior in another mind and then say "This mind wants such and such", and then draw conclusions from that statement, but those conclusions may not follow from the observations, even though they would follow if the mind were a human. The desirable properties that come along with a human wanting X may not come along with designs, incentives, selection, behavior, or any other feature, even if that feature does overlap in some ways with our familiar idea of wanting.
That human wanting shows great variety does not in general argue against the use of any other idea of wanting. Both our familiar ideas about human wanting and our more theoretical ideas about wanting might prove to be useful starting points for creating capable minds with specifiable effects.
Human wanting is half of an aligned AGI
Human wanting is proleptic, ambiguous, process-level, inexplicit, and so on. Human wanting is provisional. Because human wanting is provisional, the AGI must be correctable (corrigible). The AGI must be correctable through and through, in all aspects (since all aspects touch on how the AGI+human system wants), even to the point of a paradox of tolerance: the human might want to correct the AGI in a way that the AGI recognizes as ruining the correctable nature of the AGI, and that should be allowed (with warning).