Superintelligence 20: The value-loading problem

KatjaGrace

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far, see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twentieth section in the reading guide: the value-loading problem.

This post summarizes the section and offers a few relevant notes and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “The value-loading problem” through “Motivational scaffolding” from Chapter 12

Summary

Capability control is a short-term measure: at some point, we will want to select the motivations of AIs. (p185)
The value loading problem: how do you cause an AI to pursue your goals? (p185)
Some ways to instill values into an AI:
1. Explicit representation: Hand-code desirable values (p185-7)
2. Evolutionary selection: Humans evolved to have values that are desirable to humans—maybe it wouldn't be too hard to artificially select digital agents with desirable values. (p187-8)
3. Reinforcement learning: In general, a machine receives reward signal as it interacts with the environment, and tries to maximize the reward signal. Perhaps we could reward a reinforcement learner for aligning with our values, and it could learn them. (p188-9)
4. Associative value accretion: Have the AI acquire values in the way that humans appear to—starting out with some machinery for synthesizing appropriate new values as we interact with our environments. (p189-190)
5. Motivational scaffolding: Start the machine off with some values, so that it can run and thus improve and learn about the world, then swap them out for the values you want once the machine has sophisticated enough concepts to understand your values. (p191-192)
6. To be continued...
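
As a rough illustration of approach 3, here is a minimal sketch of a learner that only ever sees a reward signal and tries to maximize it. Everything in it is an assumption made for illustration and does not come from the book: the three-action setup, the reward probabilities, the epsilon-greedy rule, and the function name run_bandit.

```python
import random

def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy learner: estimates each action's value from reward alone."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    counts = [0] * n_actions      # how often each action was tried
    values = [0.0] * n_actions    # running average reward per action
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore a random action
        else:
            action = max(range(n_actions), key=lambda a: values[a])  # exploit
        # Hypothetical reward signal: 1 with the action's probability, else 0.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
    return values, counts

# Assumed setup: action 2 is the one the designers reward most often.
values, counts = run_bandit([0.2, 0.5, 0.8])
best = max(range(len(values)), key=lambda a: values[a])
print("estimated values:", [round(v, 2) for v in values])
print("preferred action:", best)
```

The learner ends up preferring whichever action is rewarded most often. Note that it tracks the reward signal itself, not whatever the designers had in mind when they chose the rewards.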

Another view

Ernest Davis, on a 'serious flaw' in Superintelligence:

The unwarranted belief that, though achieving intelligence is more or less easy, giving a computer an ethical point of view is really hard.

Bostrom writes about the problem of instilling ethics in computers in a language reminiscent of 1960’s era arguments against machine intelligence; how are you going to get something as complicated as intelligence, when all you can do is manipulate registers?

The definition [of moral terms] must bottom out in the AI’s programming language and ultimately in primitives such as machine operators and addresses pointing to the contents of individual memory registers. When one considers the problem from this perspective, one can begin to appreciate the difficulty of the programmer’s task.

In the following paragraph he goes on to argue from the complexity of computer vision that instilling ethics is almost hopelessly difficult, without, apparently, noticing that computer vision itself is a central AI problem, which he is assuming is going to be solved. He considers that the problem of instilling ethics into an AI system is “a research challenge worthy of some of the next generation’s best mathematical talent”.

It seems to me, on the contrary, that developing an understanding of ethics as contemporary humans understand it is actually one of the easier problems facing AI. Moreover, it would be a necessary part, both of aspects of human cognition, such as narrative understanding, and of characteristics that Bostrom attributes to the superintelligent AI. For instance, Bostrom refers to the AI’s “social manipulation superpowers”. But if an AI is to be a master manipulator, it will need a good understanding of what people consider moral; if it comes across as completely amoral, it will be at a very great disadvantage in manipulating people. There is actually some truth to the idea, central to The Lord of the Rings and Harry Potter, that in dealing with people, failing to understand their moral standards is a strategic gap. If the AI can understand human morality, it is hard to see what is the technical difficulty in getting it to follow that morality.

Let me suggest the following approach to giving the superintelligent AI an operationally useful definition of minimal standards of ethics that it should follow. You specify a collection of admirable people, now dead. (Dead, because otherwise Bostrom will predict that the AI will manipulate the preferences of the living people.) The AI, of course knows all about them because it has read all their biographies on the web. You then instruct the AI, “Don’t do anything that these people would have mostly seriously disapproved of.”

This has the following advantages:

It parallels one of the ways in which people gain a moral sense.

It is comparatively solidly grounded, and therefore unlikely to have a counterintuitive fixed point.

It is easily explained to people.

Of course, it is completely impossible until we have an AI with a very powerful understanding; but that is true of all Bostrom’s solutions as well. To be clear: I am not proposing that this criterion should be used as the ethical component of everyday decisions; and I am not in the least claiming that this idea is any kind of contribution to the philosophy of ethics. The proposal is that this criterion would work well enough as a minimal standard of ethics; if the AI adheres to it, it will not exterminate us, enslave us, etc.

This may not seem adequate to Bostrom, because he is not content with human morality in its current state; he thinks it is important for the AI to use its superintelligence to find a more ultimate morality. That seems to me both unnecessary and very dangerous. It is unnecessary because, as long as the AI follows our morality, it will at least avoid getting horribly out of whack, ethically; it will not exterminate us or enslave us. It is dangerous because it is hard to be sure that it will not lead to consequences that we would reasonably object to. The superintelligence might rationally decide, like the King of Brobdingnag, that we humans are “the most pernicious race of little odious vermin that nature ever suffered to crawl upon the surface of the earth,” and that it would do well to exterminate us and replace us with some much more worthy species. However wise this decision, and however strongly dictated by the ultimate true theory of morality, I think we are entitled to object to it, and to do our best to prevent it. I feel safer in the hands of a superintelligence who is guided by 2014 morality, or for that matter by 1700 morality, than in the hands of one that decides to consider the question for itself.

Notes

1. At the start of the chapter, Bostrom says ‘while the agent is unintelligent, it might lack the capability to understand or even represent any humanly meaningful value. Yet if we delay the procedure until the agent is superintelligent, it may be able to resist our attempt to meddle with its motivation system.' Since presumably the AI only resists being given motivations once it is turned on and using some other motivations, you might wonder why we wouldn't just wait until we had built an AI smart enough to understand or represent human values, before we turned it on. I believe the thought here is that the AI will come to understand the world and have the concepts required to represent human values by interacting with the world for a time. So it is not so much that the AI will need to be turned on to become fundamentally smarter, but that it will need to be turned on to become more knowledgeable.

2. A discussion of Davis' response to Bostrom just started over at the Effective Altruism forum.

3. Stuart Russell thinks of value loading as an intrinsic part of AI research, in the same way that nuclear containment is an intrinsic part of modern nuclear fusion research.

4. Kaj Sotala has written about how to get an AI to learn concepts similar to those of humans, for the purpose of making safe AI which can reason about our concepts. If you had an oracle which understood human concepts, you could basically turn it into an AI which plans according to arbitrary goals you can specify in human language, because you can say 'which thing should I do to best forward [goal]?' (This is not necessarily particularly safe as it stands, but is a basic scheme for turning conceptual understanding and a motivation to answer questions into any motivation).
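
The oracle-to-agent scheme in note 4 can be sketched as a thin wrapper. This is only a toy under stated assumptions: the Oracle interface, make_agent, and the word-overlap toy_oracle are hypothetical stand-ins invented for illustration, and a real oracle with human-level concepts is exactly the part nobody knows how to build.

```python
from typing import Callable, Sequence

# Hypothetical oracle interface: given a question in human language and a
# list of candidate answers, return the answer it judges best.
Oracle = Callable[[str, Sequence[str]], str]

def make_agent(oracle: Oracle, goal: str) -> Callable[[Sequence[str]], str]:
    """Turn a question-answering oracle into an action chooser for `goal`."""
    def choose(actions: Sequence[str]) -> str:
        question = f"Which thing should I do to best forward {goal}?"
        return oracle(question, actions)
    return choose

def toy_oracle(question: str, options: Sequence[str]) -> str:
    """Stand-in oracle: picks the option sharing the most words with the question."""
    q_words = set(question.lower().rstrip("?").split())
    return max(options, key=lambda o: len(q_words & set(o.lower().split())))

actions = ["mow the lawn", "boil water for tea"]
tea_agent = make_agent(toy_oracle, "making tea")
lawn_agent = make_agent(toy_oracle, "a tidy lawn")
print(tea_agent(actions))
print(lawn_agent(actions))
```

The same oracle serves any goal that can be stated in language, which is the point of the scheme; as the note says, nothing here makes the resulting agent safe.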

5. Inverse reinforcement learning and goal inference are approaches to having machines discover goals by observing actions; these could be useful for instilling our own goals into machines (as has been observed before).
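
To make goal inference concrete, here is a minimal sketch of inferring which goal an actor has from the actions it takes. It assumes a noisily rational actor that picks actions with probability proportional to exp(rationality * value), and a uniform prior over goals. The goals, actions, values, and the function name infer_goal are all made up for illustration; this is not a full inverse reinforcement learning algorithm.

```python
import math

def infer_goal(observed_actions, action_values, rationality=2.0):
    """Posterior over goals given observed actions, under a softmax actor model."""
    log_post = {}
    for goal, vals in action_values.items():
        # Normalizer of the softmax over this goal's action values.
        log_z = math.log(sum(math.exp(rationality * v) for v in vals.values()))
        # Uniform prior, so the log posterior is just the summed log likelihood.
        log_post[goal] = sum(rationality * vals[a] - log_z for a in observed_actions)
    top = max(log_post.values())
    unnorm = {g: math.exp(lp - top) for g, lp in log_post.items()}
    total = sum(unnorm.values())
    return {g: p / total for g, p in unnorm.items()}

# Hypothetical values: how much each action serves each candidate goal.
action_values = {
    "wants_coffee": {"go_to_kitchen": 1.0, "go_to_garden": 0.0},
    "wants_flowers": {"go_to_kitchen": 0.0, "go_to_garden": 1.0},
}
observed = ["go_to_kitchen", "go_to_kitchen", "go_to_garden"]
posterior = infer_goal(observed, action_values)
print(posterior)
```

With two kitchen trips against one garden trip, the posterior favors the coffee goal without ruling out the other. The hard part, which this sketch assumes away, is knowing the candidate goals and their action values in advance.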

6. If you are interested in whether values are really so complex, Eliezer has written about it. Toby Ord responds critically to the general view around the LessWrong community that value is extremely likely to be complex, pointing out that this thesis is closely related to anti-realism—a relatively unpopular view among academic philosophers—and so that overall people shouldn't be that confident. Lots of debate ensues.

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

How can we efficiently formally specify human values? This includes for instance how to efficiently collect data on human values and how to translate it into a precise specification (and at the meta-level, how to be confident that it is correct).
Are there other plausible approaches to instil desirable values into a machine, beyond those listed in this chapter?
Investigate further the feasibility of particular approaches suggested in this chapter.

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter. The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about how an AI might learn about values. To prepare, read “Value learning” from Chapter 12. The discussion will go live at 6pm Pacific time next Monday 2 February. Sign up to be notified here.

I'm actually not clear on what exactly falls under the 'value loading problem'. These seem like somewhat separate issues:

Figuring out what we want in any sense (e.g. utilitarianism with lots of details nailed down)
Translating 'any sense' into being able to write down what we want in a formal way
Causing the values to be the motivations of an AI

Is the 'value loading problem' some subset of these?

Figuring out what we want in any sense (e.g. utilitarianism with lots of details nailed down)

That's the "value problem"

Translating 'any sense' into being able to write down what we want in a formal way

That's the "formalization problem"

Causing the values to be the motivations of an AI

This is the "value loading problem"

Well, then, it seems like almost all the difficulty is in the value and formalization problems. Once we've really formalized it, it's 99% of the way to machine code from where it started as human intuition.

Doesn't that mean the value loading strategy is an alternative to the (direct) formalization strategy?

Via, say, doubly-indirect meta-ethics? Well, we need to decide that that's really the decision algorithm that's going to result in the right answer, both that it's ethically correct and predictably converges on that ethically correct result.

Explicitly figuring out what our values are and formalizing them, is only one possible sequence of steps to get AI with our values.

It seems like most people don't think that this approach will work. So there are a number of proposals to use AI itself to assist in this process. E.g. "motivational scaffolding" sounds like it solves the second step (formalizing the values.)

Bostrom says a human doesn't try to disable its own goal accretion (though that process alters its values) in part because it is not well described as a utility maximizer (p190, footnote 11). Why assume AI will be so much better described as a utility maximizer that this characteristic will cease to hold?

I can think of a few reasons why it might seem like humans don't try to disable goal accretion:

*Humans can't easily perform reliable self-modifications, and as a result usually don't consider things like disabling goal accretion as something that's possible.

*When a human believes something strongly enough to want to try to fix it as a goal, mechanisms kick in to hold it in place that don't involve consciously considering value accretion disabling as a goal. For example, confirmation bias and other cognitive biases, making costly commitments to join a group of people who also share that goal (which makes it harder to take it away), etc.

*Cognitive biases lead us to underestimate the amount values have shifted in the past, and wildly underestimate how our values might shift in the future

*Humans believe that all value accretion is good, because it led to the present set of values, and they are good and right. Also, humans believe that their values will not change in the future, because they feel objectively good and right (subjectively objective).

*Our final goals are inaccessible, so we don't really know what it is we would want to fix as our goals.

*Our actual final goals (if there is something like that that can be meaningfully specified) include keeping the goal accretion mechanism running.

It seems likely that an AI system which humans understand well enough to design might have fewer of these properties.

Did you disagree with anything in this section?

Not disagreeing with you, but when Toby Ord says that ethical realism favors simplicity of values - well, he just asserts it as far as I can see. I can think of my own Ockhamist/Solomonoff style argument for it, but it seems pretty weak. Human beings, and other moral patients, are awfully complex. Why shouldn't what is valuable be similarly complex? And one could run an Ockhamist argument for keeping one's model of human subjective values simple, too. The realism/anti-realism debate just looks completely orthogonal.

Anyone who thinks he's onto something, please explain what I'm missing.

I think what you are missing is that a lot of what we have learned about the universe turned out to be much simpler than one would have thought. A world with roughly 12 elementary particles was beyond the wildest dreams of some older civilizations, say, the Arabs in the 14th century. We have come to terms with the idea that reality is by and large summarizable by a few principles. So if morality turns out to be Real, it will probably also be somewhat simple. That's the intuition.

I started from the premise that human beings are complex, however few types of elementary particles we are made of. Put differently: physics is simple; history, geography, and zoology are complex. Morality could be like one of the last three.

But certainly you agree that the trend has been to find simple explanations for complex phenomena, which is where those who hold that intuition are starting from.

I wonder [read the book, got the t-shirt & sticker] if it really is -generally- all so complex. I mean, a lot of the imputations are anthropomorphic. Machines are dead brains that are switched on. There is nothing else. Unless mimicry, which might con some people some of the time. 2001 the movie was still the closest to a machine thinking along certain logic lines. As for rebelling robots and independent machine intelligences [unless hybrid brain interfaces], I cannot foresee anything in this book that is even relevant. Nice thought experiments though. I am finished. This is it.

What was most interesting this week?

Do you find any of the methods discussed this week promising?

What do you think of Ernest Davis' view? Is the value loading problem a problem?

Did anyone else immediately try to come up with ways Davis' plan would fail? One obvious failure mode would be in specifying which dead people count - if you say "the people described in these books," the AI could just grab the books and rewrite them. Hmm, come to think of it: is any attempt to pin down human preferences by physical reference rather than logical reference vulnerable to tampering of this kind, and therefore unworkable? I know EY has written many times before about a "giant logical function that computes morality", but this puts that notion in a bit of a different light for me. Anyway, I'm sure there are other less obvious ways Davis' plan could go wrong too. I also suspect he's sneaking a lot into that little word, "disapprove".

In general though, I'm continually astounded at how many people, upon being introduced to the value loading problem and some of the pitfalls that "common-sense" approaches have, still say "Okay, but why couldn't we just do [idea I came up with in five seconds]?"

One obvious failure mode would be in specifying which dead people count - if you say "the people described in these books," the AI could just grab the books and rewrite them. Hmm, come to think of it: is any attempt to pin down human preferences by physical reference rather than logical reference vulnerable to tampering of this kind, and therefore unworkable?

Not as such, no. It's a possible failure mode, similar to wireheading; but both of those are avoidable. You need to write the goal system in such a way that makes the AI care about the original referent, not any proxy that it looks at, but there's no particular reason to think that's impossible.

In general though, I'm continually astounded at how many people, upon being introduced to the value loading problem and some of the pitfalls that "common-sense" approaches have, still say "Okay, but why couldn't we just do [idea I came up with in five seconds]?"

Agreed.

Davis massively underestimates the magnitude and importance of the moral questions we haven't considered, which renders his approach unworkable.

I feel safer in the hands of a superintelligence who is guided by 2014 morality, or for that matter by 1700 morality, than in the hands of one that decides to consider the question for itself.

I don't. Building a transhuman civilization is going to raise all sorts of issues that we haven't worked out, and do so quickly. A large part of the possible benefits are going to be contingent on the controlling system becoming much better at answering moral questions than any individual humans are right now. I would be extremely surprised if we don't end up losing at least one order of magnitude of utility to this approach, and it wouldn't surprise me at all if it turns out to produce a hellish environment in short order. The cost is too high.

The superintelligence might rationally decide, like the King of Brobdingnag, that we humans are “the most pernicious race of little odious vermin that nature ever suffered to crawl upon the surface of the earth,” and that it would do well to exterminate us and replace us with some much more worthy species. However wise this decision, and however strongly dictated by the ultimate true theory of morality, I think we are entitled to object to it, and to do our best to prevent it.

I don't understand what scenario he is envisioning, here. If (given sufficient additional information, intelligence, rationality and development time) we'd agree with the morality of this result, then his final statement doesn't follow. If we wouldn't, it's a good old-fashioned Friendliness failure.

What Davis points out needs lots of expansion. The value problem becomes ever more labyrinthine the closer one looks. For instance, after millions of years of evolution and all human history, we ourselves still can't agree on what we want! Even within 5 minutes of your day your soul is aswirl with conflicts over balancing just the values that pertain to your own tiny life, let alone the fate of the species. Any attempt to infuse values into AI will reflect human conflicts but at a much simpler and more powerful scale.

Furthermore, the AI will figure out that humans override their better natures on a whim, agreeing universally on the evil of murder while simultaneously taking out their enemies on a whim! If there were even a possibility of programming values, we would have figured out centuries ago how to "program" psychopaths with better values (a psychopath is essentially a perfect AI missing just one thing: perfectly good values). I believe we are fooling ourselves to think a moral machine is possible.

I would also add that "turning on" the AI is not a good analogy. It becomes smarter than us in increments (as in Deep Blue, Watson, Turing test, etc.). Just as with Hitler growing up, there will not be a "moment" when the evil appears so much as it will overwhelm us from our blind spot, suddenly being in control without our awareness...
