This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far, see the announcement post. For the schedule of future topics, see MIRI's reading guide.
Welcome. This week we discuss the twenty-first section in the reading guide: Value learning.
This post summarizes the section and offers a few relevant notes and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.
There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable (and where I remember), page numbers indicate the rough part of the chapter that is most relevant; they do not necessarily mean the chapter is being cited for the specific claim.
Reading: “Value learning” from Chapter 12
Summary
- One way an AI could come to have human values without humans having to formally specify what their values are is for the AI to learn about the desired values from experience.
- To implement this 'value learning' we would need to at least implicitly define a criterion for what is valuable, which we could cause the AI to care about. Some examples of criteria:
  - 'F', where 'F' is a thing people talk about, and their words are considered to be about the concept of interest (Yudkowsky's proposal) (p197-8, box 11)
  - Whatever another AI elsewhere in the universe values (Bostrom's 'Hail Mary' proposal) (p198-9, box 12)
  - What a specific virtual human would report to be his value function, given a large amount of computing power and the ability to create virtual copies of himself. The virtual human can be specified mathematically as the simplest system that would match some high resolution data collected about a real human (Christiano's proposal). (p200-1)
- The AI would try to maximize these implicit goals given its best understanding, while at the same time being motivated to learn more about its own values.
- A value learning agent might have a prior probability distribution over possible worlds, and also over correct sets of values conditional on possible worlds. Then it could choose its actions to maximize their expected value, given these probabilities.
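
As a rough illustration of the last point, here is a minimal sketch of an agent that averages over both possible worlds and candidate value functions before picking an action. Everything in it is a made-up toy: the worlds, the two value functions, the probabilities, and the function names are assumptions for illustration, not anything from the book.

```python
# Toy sketch only: all names and numbers below are illustrative assumptions.

# Prior probability of each possible world.
world_prior = {"w1": 0.7, "w2": 0.3}

# Probability that each candidate value function is the correct one,
# conditional on which world we are in.
values_given_world = {
    "w1": {"u_a": 0.9, "u_b": 0.1},
    "w2": {"u_a": 0.2, "u_b": 0.8},
}

# Each candidate value function scores an (action, world) pair.
utility = {
    "u_a": lambda action, world: 1.0 if action == "help" else 0.0,
    "u_b": lambda action, world: 1.0 if action == "wait" else 0.0,
}


def expected_value(action):
    """Sum utility over worlds and value functions, weighted by probability."""
    total = 0.0
    for world, p_world in world_prior.items():
        for name, p_values in values_given_world[world].items():
            total += p_world * p_values * utility[name](action, world)
    return total


def choose_action(actions):
    """Pick the action with the highest expected value under current beliefs."""
    return max(actions, key=expected_value)


print(choose_action(["help", "wait"]))
```

In this toy, learning more about values would mean updating `world_prior` on evidence; the decision rule itself stays the same.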
Another view
Paul Christiano describes an alternative to loading values into an AI at all:
Most thinking about “AI safety” has focused on the possibility of goal-directed machines, and asked how we might ensure that their goals are agreeable to humans. But there are other possibilities.
In this post I will flesh out one alternative to goal-directed behavior. I think this idea is particularly important from the perspective of AI safety.
Approval-directed agents
Consider a human Hugh, and an agent Arthur who uses the following procedure to choose each action:
Estimate the expected rating Hugh would give each action if he considered it at length. Take the action with the highest expected rating.
I’ll call this “approval-directed” behavior throughout this post, in contrast with goal-directed behavior. In this context I’ll call Hugh an “overseer.”
Arthur’s actions are rated more highly than those produced by any alternative procedure. That’s comforting, but it doesn’t mean that Arthur is optimal. An optimal agent may make decisions that have consequences Hugh would approve of, even if Hugh can’t anticipate those consequences himself. For example, if Arthur is playing chess he should make moves that are actually good—not moves that Hugh thinks are good.
...[However, there are many reasons Hugh would want to use the proposal]...
In most situations, I would expect approval-directed behavior to capture the benefits of goal-directed behavior, while being easier to define and more robust to errors.
If this interests you, I recommend the much longer post, in which Christiano describes and analyzes the proposal in much more depth.
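
For concreteness, the action-selection rule quoted above (estimate Hugh's expected rating for each action, then take the highest) might be sketched as follows. This is only a toy under stated assumptions: Arthur's estimate is modeled as an average over a few guessed ratings, and the action names, numbers, and function names are hypothetical, not from Christiano's post.

```python
from statistics import mean

# Toy sketch only: action names and ratings are illustrative assumptions.


def expected_rating(action, sampled_ratings):
    """Arthur's estimate of the rating Hugh would give `action` after reflection.

    Modeled here as the average of a few guessed ratings.
    """
    return mean(sampled_ratings[action])


def approval_directed_choice(sampled_ratings):
    """Take the action with the highest expected rating."""
    return max(sampled_ratings, key=lambda a: expected_rating(a, sampled_ratings))


# Arthur's guesses about how Hugh would rate each available action.
ratings = {
    "send_draft": [0.6, 0.7, 0.8],
    "ask_first": [0.9, 0.8, 0.9],
}

print(approval_directed_choice(ratings))
```

Note that nothing here scores outcomes in the world; the only quantity being maximized is the overseer's estimated rating of the action itself.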
Notes
1. An analogy
An AI doing value learning is in a similar situation to me if I want to help my friend but don't know what she needs. Even though I don't know explicitly what I want to do, it is defined indirectly, so I can learn more about it. I would presumably follow my best guesses, while trying to learn more about my friend's actual situation and preferences. This is also what we hope the value learning AI will do.
2. Learning what to value
If you are interested in value learning, Dewey's paper is the main thing written on it in the field of AI safety.
3. Related topics
I mentioned inverse reinforcement learning and goal inference last time, but should probably have kept them for this week, to which they are more relevant. Preference learning is another related subfield of machine learning, and learning by demonstration is generally related. Here is a quadcopter using inverse reinforcement learning to infer what its teacher wants it to do. Here is a robot using goal inference to help someone build a toy.
4. Value porosity
Bostrom has lately written about a new variation on the Hail Mary approach, in which the AI at home is motivated to trade with foreign AIs (via everyone imagining each other's responses), and has preferences that are very cheap for foreign AIs to guess at and fulfil.
5. What's the difference between value learning and reinforcement learning?
We heard about reinforcement learning last week, and Bostrom found it dangerous. Since it also relies on teaching the AI values by giving it feedback, you might wonder how exactly the proposals relate to each other.
Suppose the owner of an AI repeatedly comments that various actions are 'friendly'. A reinforcement learner would perhaps care about hearing the word 'friendly' as much as possible. A value learning AI, on the other hand, would treat the use of the word 'friendly' as a clue about a hidden thing that it cares about. This means that if the value learning AI could trick the person into saying 'friendly' more, this would be no help to it: the trick would just make the person's words a worse clue. The reinforcement learner, on the other hand, would love to get the person to say 'friendly' whenever possible. This difference also means the value learning AI might end up doing things which it does not expect its owner to say 'friendly' about, if it thinks those actions are supported by the values that it learned from hearing 'friendly'.
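
One way to make this contrast concrete is a small Bayesian toy. The value learner treats the word as evidence about a hidden fact (whether the action was actually good), while the reinforcement learner just counts the word. The probabilities and function names below are assumptions chosen for illustration.

```python
# Toy sketch only: all probabilities are illustrative assumptions.


def posterior_good(prior, p_word_if_good, p_word_if_bad):
    """Bayes' rule: P(action was good | owner said 'friendly')."""
    numerator = p_word_if_good * prior
    return numerator / (numerator + p_word_if_bad * (1 - prior))


def reinforcement_reward(utterances):
    """A reinforcement learner's reward: how often 'friendly' was heard."""
    return sum(1 for u in utterances if u == "friendly")


# Honest owner: says 'friendly' mostly when the action really was good.
honest = posterior_good(prior=0.5, p_word_if_good=0.9, p_word_if_bad=0.1)

# Tricked owner: says 'friendly' no matter what, so the word carries no information.
tricked = posterior_good(prior=0.5, p_word_if_good=1.0, p_word_if_bad=1.0)

print(honest, tricked)
print(reinforcement_reward(["friendly", "friendly", "friendly"]))
```

Under these assumptions the honest 'friendly' moves the value learner's belief well above its prior, while the tricked 'friendly' leaves it at the prior. The reinforcement learner's reward, by contrast, goes up either way.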
In-depth investigations
If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.
- Expand upon the value learning proposal. What kind of prior over what kind of value functions should a value learning AI be given? As an input to this, what evidence should be informative about the AI's values?
- Analyze the feasibility of Christiano’s proposal for addressing the value-loading problem.
- Analyze the feasibility of Bostrom’s “Hail Mary” approach to the value-loading problem.
- Analyze the feasibility of Christiano's newer proposal to avoid learning values.
- Investigate the applicability of the related fields mentioned above to producing beneficial AI.
How to proceed
This has been a collection of notes on the chapter. The most important part of the reading group, though, is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!
Next week, we will talk about the two other ways to direct the values of AI. To prepare, read “Emulation modulation” through “Synopsis” from Chapter 12. The discussion will go live at 6pm Pacific time next Monday 9 February. Sign up to be notified here.
If the AI takes your saying 'friendly' to be a consequence of something being a positive example, then it doesn't think changing your words manually will change whether it is a positive example. If it thinks your actions cause something to be a positive example, then it does think changing your actions will change whether it is a positive example.
Shouting "Friendly!" isn't just correlated with positive examples, it literally causes them. Torturing the supervisor to make them say "Friendly!" is a perfectly valid generalization of the training set. Unless you include negative examples of that, and all the countless other ways it can go wrong.