MIRI recently blogged about the workshop paper that I presented at AAAI.
My abstract:
Hypothetical “value learning” AIs learn human values and then try to act according to those values. The design of such AIs, however, is hampered by the fact that there exists no satisfactory definition of what exactly human values are. After arguing that the standard concept of preference is insufficient as a definition, I draw on reinforcement learning theory, emotion research, and moral psychology to offer an alternative definition. In this definition, human values are conceptualized as mental representations that encode the brain’s value function (in the reinforcement learning sense) by being imbued with a context-sensitive affective gloss. I finish with a discussion of the implications that this hypothesis has on the design of value learners.
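(To unpack the reinforcement learning jargon a little: the value function referred to here is the standard RL quantity, the expected discounted sum of future rewards starting from a state,

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\Big|\, s_0 = s \right],$$

so a state's value reflects the rewards it is expected to eventually lead to, not just any reward received in that state itself. The hypothesis in the paper is that human values are mental representations which track roughly this quantity, with the tracking implemented as a context-sensitive affective gloss on those representations.)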
Their summary:
Economic treatments of agency standardly assume that preferences encode some consistent ordering over world-states revealed in agents’ choices. Real-world preferences, however, have structure that is not always captured in economic models. A person can have conflicting preferences about whether to study for an exam, for example, and the choice they end up making may depend on complex, context-sensitive psychological dynamics, rather than on a simple comparison of two numbers representing how much one wants to study or not study.
Sotala argues that our preferences are better understood in terms of evolutionary theory and reinforcement learning. Humans evolved to pursue activities that are likely to lead to certain outcomes — outcomes that tended to improve our ancestors’ fitness. We prefer those outcomes, even if they no longer actually maximize fitness; and we also prefer events that we have learned tend to produce such outcomes.
Affect and emotion, on Sotala’s account, psychologically mediate our preferences. We enjoy and desire states that are highly rewarding in our evolved reward function. Over time, we also learn to enjoy and desire states that seem likely to lead to high-reward states. On this view, our preferences function to group together events that lead on expectation to similarly rewarding outcomes for similar reasons; and over our lifetimes we come to inherently value states that lead to high reward, instead of just valuing such states instrumentally. Rather than directly mapping onto our rewards, our preferences map onto our expectation of rewards.
Sotala proposes that value learning systems informed by this model of human psychology could more reliably reconstruct human values. On this model, for example, we can expect human preferences to change as we find new ways to move toward high-reward states. New experiences can change which states my emotions categorize as “likely to lead to reward,” and they can thereby modify which states I enjoy and desire. Value learning systems that take these facts about humans’ psychological dynamics into account may be better equipped to take our likely future preferences into account, rather than optimizing for our current preferences alone.
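To make the "preferences map onto our expectation of rewards" point a bit more concrete, here's a toy temporal-difference learning example (purely illustrative; the state names, numbers, and three-step structure are all made up). In standard TD learning, a state that merely predicts reward ends up with a high learned value itself, so an agent acting on its learned values will come to "prefer" reliable predictors of reward, not just the rewarding outcome:

```python
# Toy TD(0) illustration: a three-step chain where only the final state is
# directly rewarding. After learning, the earlier states acquire high value
# themselves, because they reliably predict the reward.

states = ["see_dance_class", "dancing", "social_reward"]
reward = {"see_dance_class": 0.0, "dancing": 0.0, "social_reward": 1.0}

alpha, gamma = 0.1, 0.9          # learning rate, discount factor
V = {s: 0.0 for s in states}     # learned value estimates

for _ in range(1000):            # many episodes of experiencing the chain
    for i, s in enumerate(states):
        next_v = V[states[i + 1]] if i + 1 < len(states) else 0.0
        # TD(0) update: nudge V(s) toward observed reward + discounted next value
        V[s] += alpha * (reward[s] + gamma * next_v - V[s])

print(V)   # roughly {see_dance_class: 0.81, dancing: 0.9, social_reward: 1.0}
```

The learned values of the earlier states end up almost as high as that of the rewarding state itself, which is the sense in which preferences track expected rather than experienced reward.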
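And here's a rough sketch of what the last paragraph's "taking likely future preferences into account" could look like computationally. Everything in it is hypothetical - the outcome categories, the numbers, and the idea that such a table could be estimated at all - but the point is that a value learner would model the person's underlying reward function over outcomes and then predict which activities they are likely to come to value, rather than only fitting what they currently report enjoying:

```python
# Hypothetical sketch (all names and numbers invented): projecting which
# activities a person is likely to come to value, given an estimate of their
# underlying reward function over outcomes.

# Estimated reward the person gets from different kinds of outcomes.
outcome_reward = {"social_connection": 1.0, "mastery": 0.8, "novelty": 0.4}

# How strongly each activity is expected to produce each outcome (0..1).
activity_outcomes = {
    "dancing":       {"social_connection": 0.9, "mastery": 0.5, "novelty": 0.3},
    "rock_climbing": {"social_connection": 0.3, "mastery": 0.9, "novelty": 0.6},
    "doomscrolling": {"social_connection": 0.1, "mastery": 0.0, "novelty": 0.7},
}

currently_valued = {"dancing"}   # what the person already enjoys

def projected_value(activity):
    """Expected reward of an activity under the estimated reward function."""
    return sum(strength * outcome_reward[outcome]
               for outcome, strength in activity_outcomes[activity].items())

projected = {a: projected_value(a) for a in activity_outcomes}

# Activities the person might come to value once tried, even though they
# don't currently report valuing them.
likely_future_values = {a for a, v in projected.items()
                        if v > 0.8 and a not in currently_valued}

print(likely_future_values)   # {'rock_climbing'} under these made-up numbers
```

A value learner built along these lines would then optimize for something like the projected values rather than only the current snapshot, while staying appropriately uncertain about the projection.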
Would be curious to hear whether anyone here has any thoughts. This is basically a "putting rough ideas together and seeing if they make any sense" kind of paper, aimed at clarifying the hypothesis and seeing whether others can find any obvious holes in it, rather than being at the stage of a serious scientific theory yet.
So to first note a few things:
Those things said, your final step does sound reasonably close to the kind of thing I was thinking of. We can look at some particular individual, note that a combination of the surrounding culture and their own sexuality led them to try out flirting and dancing, find both rewarding, and come to value those things for their own sake, and then conclude that their ideal future would probably include fair amounts of both.
Though of course there are also all kinds of questions about, for example, exactly how rewarding and enjoyable they find those things. Maybe someone feels positive about the concept of being the kind of person who'd enjoy dance, but isn't actually the kind of person who'd enjoy dance. Resolving that kind of a conflict would probably mean either helping them learn to enjoy dance, or helping them give up the ideal of needing to be that kind of a person. The correct action would depend on exactly how deeply their reasons for not enjoying dance ran, and on what their other values were.
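If one wanted to make that kind of decision rule explicit, it might look something like the following sketch - with the huge caveat that the inputs are exactly the quantities that would be hard to measure, and the threshold is pulled out of thin air:

```python
# Hypothetical sketch: resolving a conflict between a person's self-image
# ("I'm the kind of person who enjoys dance") and their actual enjoyment.

def suggest_resolution(self_image_strength, experienced_enjoyment,
                       enjoyment_learnability, supports_other_values):
    """
    self_image_strength:    how much they value being "someone who enjoys dance" (0..1)
    experienced_enjoyment:  how much they actually enjoy dancing right now (0..1)
    enjoyment_learnability: estimate of how easily they could learn to enjoy it (0..1)
    supports_other_values:  whether enjoying dance would serve their other values
    """
    if experienced_enjoyment >= self_image_strength:
        return "no conflict to resolve"
    if enjoyment_learnability > 0.5 and supports_other_values:
        # Their dislike seems shallow and the ideal fits the rest of their values.
        return "help them learn to enjoy dance"
    # Their dislike runs deep, or the ideal doesn't serve their other values.
    return "help them let go of the ideal of being that kind of person"

print(suggest_resolution(self_image_strength=0.8, experienced_enjoyment=0.2,
                         enjoyment_learnability=0.7, supports_other_values=True))
# -> help them learn to enjoy dance
```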
It's also possible that, upon examining the person's psychology, the AI would conclude that while they did enjoy flirting and dancing, there were other things they would enjoy more - either right now, or given enough time and effort. The AI might then work to create a situation where they could focus more on those more rewarding things.
With these kinds of questions there's also the issue of exactly what kinds of interventions the AI is allowed to make. After all, the most effective way of making somebody have maximally rewarding experiences would be to rewire their brain to always receive maximal reward. Here's where I take a page out of Paul Christiano's book and his suggestion of approval-directed agents, and propose that the AI is only allowed to do the kinds of things to the human that the human's current values would approve of. So if the human doesn't want to have their brain rewired, but is okay with the AI suggesting new kinds of activities that they might enjoy, then that's what happens. (Of course, "only doing the kinds of things that the human's current values would approve of" is really vague and hand-wavy at this point, and would need to be defined a lot better.)
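To make that hand-waving very slightly more concrete, here's one hypothetical way the constraint could be structured - with "predicted approval" hiding essentially all of the hard parts, including whether approval can be reduced to a single number at all. Candidate interventions (including ones like those from the previous paragraphs, such as steering toward more rewarding activities) are first filtered by whether the person's current values would approve of them, and only then ranked by expected benefit:

```python
# Hypothetical sketch of the approval constraint; all numbers are placeholders.

candidate_interventions = [
    # (description, predicted long-run benefit, predicted approval under the
    #  person's *current* values, both on a 0..1 scale)
    ("suggest a new dance class",             0.6, 0.9),
    ("introduce them to rock climbing",       0.7, 0.8),
    ("rewire their brain for maximal reward", 1.0, 0.0),
]

APPROVAL_THRESHOLD = 0.7   # made-up cutoff for "current values would approve"

# Only approval-passing interventions are even considered; among those, pick
# the one expected to benefit the person the most.
permitted = [i for i in candidate_interventions
             if i[2] >= APPROVAL_THRESHOLD]
best = max(permitted, key=lambda i: i[1])

print(best[0])   # -> introduce them to rock climbing
```

Filtering by approval before ranking, rather than trading approval off against expected benefit, is meant to capture the idea that the human's current values act as a hard constraint on the AI rather than as just one more term in its objective.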