If an AI observes strong inconsistency among my liking, wanting, and approving, should it stop (and inform me about it), or try to aggregate my preferences anyway?
I like the idea of multidimensional preferences, such as liking/wanting/approving, rather than maximizing a single utility function. I suspect there are more dimensions worth considering. For example, "deserving" is one often missed by those with a reasonably happy childhood. In people who faced emotional abuse growing up and eventually internalized it, the gap between wanting and deserving can be considerable. It is quite common to hear "I want to be happy," but when you ask something like "Do you feel that you deserve to be happy?" the answer is often either a pause or a negative: "I am not a good person, I do not deserve to be happy." I am not sure whether this can be incorporated into your model, or what other axes are worth considering.
Yes, other types of "preferences" are conceivable. For example, a person acting under another person's orders, like a soldier, may neither like, want, nor approve of an order, but still obey it, because they have to.
Interesting! I'm still concerned that, since you need to aggregate these things in the end anyhow (because everything is commensurable in the metric of affecting decisions), the aggregation function will be allowed to be very complicated and to depend on factors that don't respect the separation of this trichotomy.
But it does make me consider how one might try to import this into value learning. I don't think it would work to take these categories as given and then try to learn meta-preferences to sew them together, but most (particularly more direct) value learning schemes have to start with some "seed" of examples. If we draw that seed only from "approving," does that mean that the trained AI isn't going to value wanting or liking enough? Or would everything probably be fine, because we wouldn't approve of bad stuff?
We analyze the usefulness of the framework of preference types [Berridge et al. 2009] for value learning by an artificial intelligence. In the context of AI, the purpose of value learning is to give an AI goals aligned with humanity's. We lay the groundwork for establishing how human preferences of different types are (descriptively) or ought to be (normatively) aggregated.
This blog post (1) describes the framework of preference types and how these types can be inferred, (2) considers how an AI could aggregate our preferences, and (3) suggests how to choose an aggregation method. Lastly, we consider potential future directions for this area.
Motivation
The reason that the concept of multiple preference types is useful for AI is that people often have internal conflicts. Examples of internal conflicts:
We think that these internal conflicts can be understood as us having preferences of different types that compete with one another. If an AI ignores the fact that we can have competing preferences, then when it considers a state it will infer our preference for that state from only one proxy, which will often leave the AI with an incomplete picture. Examples in which taking into account only one data source for preferences leads to complications:
Anticipated application of the approach:
Identifying and aggregating different preference types could help with the value learning problem, where value learning is “making AI’s goals more accurately aligned with human goals”. It could help with value learning in one of the following ways:
Some very concrete examples of where our approach would be used are:
Framework: Liking, Wanting and Approving
In this post we focus on three specific preference types that we think are valid and distinct from each other. We work with the preference framework of liking, wanting and approving, which are defined as follows:
The following examples are inspired by this blog post [Alexander 2011].
+liking/+wanting/+approving: Experiencing love.
+liking/+wanting/-approving: Eating chocolate.
+liking/-wanting/+approving: Many hobbies (they are enjoyable, and although people approve of them, they can rarely bring themselves to do them).
+liking/-wanting/-approving: Eating foie gras.
-liking/+wanting/+approving: Running just before the runner’s high.
-liking/+wanting/-approving: Addicts that no longer enjoy their drug.
-liking/-wanting/+approving: Working.
-liking/-wanting/-approving: Torture.
There are several motivations for our choice of the framework of liking, wanting and approving. Firstly, for every combination of positive or negative liking, wanting and approving there is an example that fits the combination, so the three dimensions can vary independently. Secondly, we (humans) already use data on body language, stated preferences and actions to form our theory of mind of other people. Lastly, there is a large body of research on these preference types, which makes them easier to work with.
Relations between preference types:
These preferences are distinct [Berridge et al. 2009], but can influence one another [van Gelder 1998]. For example, doing something you approve of may in general give you more pleasure. We would like to see a comprehensive, descriptive model of preference types in humans emerge from cognitive science. These interactions could, for example, be modeled as a dynamical system [van Gelder 1998].
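To make the dynamical-system idea concrete, here is a minimal sketch (our own illustration, not a model from [van Gelder 1998]; all coupling constants are made up) in which an initial burst of approval transiently pulls liking, and then wanting, upward:

```python
import numpy as np

def step(state, dt=0.1):
    """One Euler step of a toy linear system over (liking, wanting, approving)."""
    L, W, A = state
    dL = -0.5 * L + 0.3 * A   # approving something slowly raises liking
    dW = -0.2 * W + 0.4 * L   # liking feeds wanting
    dA = -0.1 * A             # approval decays toward a baseline of zero
    return np.array([L + dt * dL, W + dt * dW, A + dt * dA])

# Start: the activity is approved of, but not yet liked or wanted.
state = np.array([0.0, 0.0, 1.0])
for _ in range(100):
    state = step(state)
# Liking and wanting have been pulled above zero by the initial approval,
# even as approval itself decays.
```

A real descriptive model would need empirically fitted coupling terms; the point of the sketch is only that cross-type influence is easy to express in this form.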
The observables:
Liking, wanting and approving are for the most part hidden processes. They are not directly observable, but they influence observable behaviors. As a proxy for liking we propose to use facial expressions, body language or responses to questionnaires. Although a brain scan may be the most accurate proxy for liking, there is evidence both that facial expressions and body language are indicators of pleasure and pain [Algom et al. 1994] and that they can be classified well enough to make them technically feasible proxies [Giorgana et al. 2012]. The observable proxy for wanting is revealed preferences. We propose to encode a proxy for approval via stated preferences.
Extracting reward functions from the data sources:
In order to aggregate (however we choose to do that), we need to make sure the preferences are encoded in commensurable ways. To make our approach attuned to reinforcement learning we propose to encode the preferences as reward functions. For this we need to collect human data in a specific environment with well-defined states, in order to ensure all three sets of data refer to (different) preference types about the same state, and then normalise.
Examples of collecting data:
Personal assistant: Define states of the living room.
Taxi-driver dataset: Define states of taxi-drivers.
The reward functions extracted should be renormalized, to make them commensurable.
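As a minimal sketch of the renormalisation step (our own illustration; the post does not fix a particular scheme), each extracted reward function could be rescaled over the shared state set to zero mean and unit range before aggregation:

```python
def normalise(reward, states):
    """Rescale a reward function (dict: state -> float) so that its values
    over `states` have zero mean and a max-minus-min range of 1."""
    values = [reward[s] for s in states]
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    if spread == 0:  # a constant reward carries no preference signal
        return {s: 0.0 for s in states}
    return {s: (reward[s] - mean) / spread for s in states}

# Made-up raw "liking" scores over three states, for illustration only.
states = ["s1", "s2", "s3"]
liking = {"s1": 10.0, "s2": -5.0, "s3": 0.0}
liking_norm = normalise(liking, states)
```

Other choices (e.g. normalising by variance, or over trajectories rather than states) are possible; the key property is that all three reward functions end up on a commensurable scale.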
We have also considered other preference types, such as disgust, serotonin, oxytocin, rationalizing and remembered preferences; see this doc.
Aggregating Preferences
Our initial approach to establishing a method for aggregating preference types was to find desiderata that any potential aggregation function should satisfy.
As a source of desiderata, we examined existing bodies of research that dealt with aggregating preferences, either across individuals or between different types. We looked at the following fields and corresponding desiderata:
To illustrate what we mean by aggregation desiderata and aggregation methods, and how these could be used with the preference-types framework, we give the following examples.
Desiderata for aggregating:
The approach of borrowing desiderata from these fields provides initial inspiration, but it does not generalize to an in-depth analysis.
Aggregation methods:
We now consider some specific aggregation methods and see whether they satisfy the desiderata.
Example:
Consider the situation where Bob is on a diet, but still has a sweet tooth. He can be in the following states {s1 = on a diet, saw sugar and ate sugar, s2 = on a diet, saw sugar and did not eat sugar, s3 = on a diet, did not see sugar and did not eat it}.
Bob has the following reward functions:
Define aggregate functions:
Then s1 maximizes R1 and R2, and s3 maximizes R3.
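The numbers and aggregate definitions below are made up for illustration (the post's original reward values are not reproduced here): a hypothetical R1 sums the three preference types, R2 takes the most positive type, and R3 uses approval alone. Under these assumptions the conclusion above holds: eating the sugar (s1) wins under R1 and R2, while never seeing it (s3) wins under the approval-only R3.

```python
# States: s1 = saw sugar and ate it, s2 = saw sugar and resisted,
#         s3 = did not see sugar and did not eat it.
states = ["s1", "s2", "s3"]

# Hypothetical per-type rewards for Bob (invented for this sketch):
liking    = {"s1":  1.0, "s2": -0.5, "s3": 0.0}  # sugar tastes good
wanting   = {"s1":  1.0, "s2": -1.0, "s3": 0.0}  # the craving pulls toward eating
approving = {"s1": -1.0, "s2":  0.5, "s3": 0.8}  # the diet is what Bob endorses

def R1(s):
    """Simple sum across preference types."""
    return liking[s] + wanting[s] + approving[s]

def R2(s):
    """Optimistic aggregate: the most positive type wins."""
    return max(liking[s], wanting[s], approving[s])

def R3(s):
    """Approval-only aggregate: ignore liking and wanting."""
    return approving[s]

def best(R):
    return max(states, key=R)
```

Which aggregate is "right" is exactly the question the desiderata are meant to settle; the sketch only shows that different, individually plausible aggregates endorse different states.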
Choosing an Aggregation Method
As the choice of aggregation method will depend on the particular scenario, it should be determined on a case-by-case basis.
Some useful approaches would be:
For a more general approach to the problem, future work could:
Final Remarks
Applicability:
This approach is useful, even with an unsophisticated aggregation method:
Future work:
Helpful next steps for any researchers who would like to take on the project would be:
Other directions:
If we want to allow the separate reward functions to have different discount factors, then they cannot be aggregated into one reward function, unless we include time as a state feature.
References
Kent C. Berridge, Terry E. Robinson, and J. Wayne Aldridge, Dissecting components of reward: 'liking', 'wanting', and learning, Current Opinion in Pharmacology 9(1): 65–73, 2009.
Kent C. Berridge, John P. O’Doherty, Experienced Utility to Decision Utility, in Neuroeconomics (Second Edition), 2014.
Scott Alexander, Approving Reinforces Low-Effort Behaviours, https://www.lesswrong.com/posts/yDRX2fdkm3HqfTpav/approving-reinforces-low-effort-behaviors, 2011.
Tim van Gelder, The Dynamical Hypothesis in Cognitive Science, Behavioral and Brain Sciences 21(5): 615–628, 1998.
Daniel Algom, Sonia Lubel, Psychophysics in the field: Perception and memory for labor pain, Perception & Psychophysics 55: 133. https://doi.org/10.3758/BF03211661, 1994.
Geovanny Giorgana, Paul G. Ploeger, Facial expression recognition for domestic service robots, in RoboCup 2011: Robot Soccer World Cup XV, pp. 353-364, 2012.
Daniel Read, George Loewenstein, Shobana Kalyanaraman, Mixing virtue and vice: combining the immediacy effect and the diversification heuristic, Journal of Behavioral Decision Making; Dec; 12, 4; ABI/INFORM Global pg. 257, 1999.
Seth Baum, Social Choice Ethics in Artificial Intelligence, Forthcoming, AI & Society, DOI 10.1007/s00146-017-0760-1 https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3046725, 2017.