I'm pleased to announce a new paper from MIRI about The Value Learning Problem.
Abstract:
A superintelligent machine would not automatically act as intended: it would act as programmed, but the fit between human intentions and formal specification could be poor. We discuss methods by which a system could be constructed to learn what to value. We highlight open problems specific to inductive value learning (from labeled training data), and raise a number of questions about the construction of systems which model the preferences of their operators and act accordingly.
This is the sixth of six papers supporting the MIRI technical agenda. It briefly motivates the need for value learning and gives some early thoughts on how the problem could be approached, while pointing to some early open problems in the field.
I'm pretty excited to have the technical agenda and all its supporting papers published. Next week I'll be posting an annotated bibliography that gives more reading for each subject. The introduction to the value learning paper has been reproduced below.
Consider a superintelligent system, in the sense of Bostrom (2014), tasked with curing cancer by discovering some process which eliminates cancerous cells from a human body without causing harm to the human (no easy task to specify in its own right). The resulting behavior may be quite unsatisfactory. Among the behaviors not ruled out by this goal specification are stealing resources, proliferating robotic laboratories at the expense of the biosphere, and kidnapping human test subjects.
The intended goal, hopefully, was to cure cancer without doing any of those things, but computer systems do not automatically act as intended. Even a system smart enough to figure out what was intended is not compelled to act accordingly: human beings, upon learning that natural selection "intended" sex to be pleasurable only for purposes of reproduction, do not thereby conclude that contraceptives are abhorrent. While one should not anthropomorphize natural selection, humans are capable of understanding the process which created them while being unmotivated to alter their preferences accordingly. For similar reasons, when constructing an artificially intelligent system, it is not sufficient to construct a system intelligent enough to understand human intentions; the system must also be purposefully constructed to pursue them (Bostrom 2014, chap. 8).
How can this be done? Human goals are complex, culturally laden, and context-dependent. Furthermore, the notion of "intention" itself may not lend itself to clean formal specification. By what methods could an intelligent machine be constructed to reliably learn what to value and to act as its operators intended?
A superintelligent machine would be useful for its ability to find plans that its programmers never imagined, to identify shortcuts that they never noticed or considered. That capability is a double-edged sword: a machine that is extraordinarily effective at achieving its goals might have unexpected negative side effects, as in the case of robotic laboratories damaging the biosphere. There is no simple fix: a superintelligent system would need to learn detailed information about what is and isn't considered valuable, and be motivated by this knowledge, in order to safely solve even simple tasks.
This value learning problem is the focus of this paper. Section 2 discusses an apparent gap between most intuitively desirable human goals and attempted simple formal specifications. Section 3 explores the idea of frameworks through which a system could be constructed to learn concrete goals via induction on labeled data, and details possible pitfalls and early open problems. Section 4 explores methods by which systems could be built to safely assist in this process.
Given a system which is attempting to act as intended, philosophical questions arise: How could a system learn to act as intended when the operators themselves have poor introspective access to their own goals and evaluation criteria? These philosophical questions are discussed briefly in Section 5.
A superintelligent system under the control of a small group of operators would present a moral hazard of extraordinary proportions. Is it possible to construct a system which would act in the interests of not only its operators, but of all humanity, and possibly all sapient life? This is a crucial question of philosophy and ethics, touched upon only briefly in Section 6, which also motivates a need for caution and then concludes.
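As a toy illustration of the phrase "learn concrete goals via induction on labeled data" from the Section 3 roadmap above, here is a minimal sketch of the simplest possible setting: a classifier fit to operator-labeled outcomes, standing in for a learned value function. This is my own illustration, not the paper's proposal; the features, labels, and choice of model are invented placeholders.

```python
# Toy illustration of inductive value learning: fit a classifier to
# operator-labeled outcomes and treat it as a stand-in "value function".
# The features, labels, and model choice are placeholders, not a proposal
# from the paper.
from sklearn.linear_model import LogisticRegression

# Each outcome is described by hypothetical features:
# [patients_cured, resources_stolen, humans_harmed]
outcomes = [
    [1.0, 0.0, 0.0],  # cured a patient, no side effects
    [1.0, 1.0, 0.0],  # cured a patient by stealing resources
    [1.0, 0.0, 1.0],  # cured a patient by harming test subjects
    [0.0, 0.0, 0.0],  # did nothing
]
labels = [1, 0, 0, 0]  # operator judgments: 1 = valuable, 0 = not

value_model = LogisticRegression().fit(outcomes, labels)

# The learned model then scores novel outcomes; the worry raised in the
# paper is that such a model may generalize badly outside the labeled
# training distribution.
print(value_model.predict_proba([[1.0, 0.5, 0.0]])[:, 1])
```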
Regarding 2: I am a little surprised that the claim that valuable goals cannot be directly specified is taken as a given.
If we consider an AI as a rational optimizer of the ONE TRUE UTILITY FUNCTION, we might want to look for the best available approximations of it in the short term. The function I have in mind is life expectancy (DALY or QALY), since to me it is easier to measure than happiness. It also captures a lot of intuition when you ask a person the following hypothetical:
If you could be born into any society on Earth today, what single number would be most congruent with your preference? Average life expectancy captures very well which societies are good to be born into.
I am also aware of a ton of problems with this, since one has to be careful to consider humans vs. human/cyborg hybrids, and time spent in cryo-sleep or normal sleep vs. experiential mind-moments. However, I'd rather have an approximate starting point for direct specification than give up on the approach altogether.
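For concreteness, here is a minimal sketch of the kind of quality-adjusted life expectancy metric I have in mind. The health states and quality weights below are made-up placeholders, not values from any standard DALY/QALY table.

```python
# Minimal sketch of a quality-adjusted life expectancy (QALY-style) metric.
# The health states and quality weights are made-up placeholders.

def quality_adjusted_life_years(years_in_state):
    """Sum years lived in each health state, weighted by a quality factor in [0, 1]."""
    quality_weights = {
        "full_health": 1.0,        # placeholder weight
        "chronic_illness": 0.7,    # placeholder weight
        "severe_disability": 0.4,  # placeholder weight
    }
    return sum(quality_weights[state] * years
               for state, years in years_in_state.items())

# Example: 60 years in full health, 15 with chronic illness, 5 with severe disability
print(quality_adjusted_life_years(
    {"full_health": 60, "chronic_illness": 15, "severe_disability": 5}
))  # 60*1.0 + 15*0.7 + 5*0.4 = 72.5
```

Of course, where those quality weights come from is exactly the kind of question the direct-specification approach has to answer.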
Regarding 5: There is an interesting "problem" with "do what I would want if I had more time to think" that happens not in the case of failure, but in the case of success. Let's say we have our happy-go-lucky, life-expectancy-maximizing, death-defeating FAI. It starts to look at society and sees that some widely accepted acts are totally horrifying from its perspective. Its "morality" surpasses ours, which is just an obvious consequence of its intelligence surpassing ours. For example, maybe the amount of time we make children sit at their desks at school destroys their health to the point of ruling out immortality. This particular example might not be so hard to convince people of, but there could be others. At that point, the AI would go against a large number of people and try to create its own schools which teach how bad the other schools are (or something). The governments don't like this and shut it down, because we still can for some reason.
Basically the issue is: the AI is behaving in a friendly manner, which we would understand if we had enough time and intelligence. But we don't, so we don't have enough intelligence to determine whether it is actually friendly or not.
Regarding 6: I feel that you haven't even begun to approach the problem of a sub-group of people controlling the AI. The issue gets into the question of peaceful transitions of that power over the long term. There is also the issue that even if you come up with a scheme for who gets to call the shots around the AI that is actually a good idea, convincing people that it is a good idea instead of the default "let the government do it" is itself a problem. It's similar in principle to 5.
Whoa, how are you measuring the disability/quality adjustment? That sounds like sneaking in 'happiness' measurements, and there are a bunch of challenges: we already run into issues where people who have a condition rate it as less bad than people who don't have it. (For example, sighted people rate being blind as worse than blind people rate being blind.)