I think of ambitious value learning as a proposed solution to the specification problem, which I define as the problem of *defining* the behavior that we would want to see from our AI system. I italicize “defining” to emphasize that this is not the problem of actually computing behavior that we want to see -- that’s the full AI safety problem. Here we are allowed to use hopelessly impractical schemes, as long as the resulting definition would allow us, at least in theory, to compute the behavior that the AI system should take, perhaps with assumptions like infinite computing power or arbitrarily many queries to a human. (Although we do prefer specifications that seem like they could admit an efficient implementation.) In terms of DeepMind’s classification, we are looking for a design specification that exactly matches the ideal specification. HCH and indirect normativity are examples of attempts at such specifications.
We will consider a model in which our AI system is maximizing the expected utility of some explicitly represented utility function that can depend on history. (It does not matter materially whether we consider utility functions or reward functions, as long as they can depend on history.) The utility function may be learned from data, or designed by hand, but it must be an explicit part of the AI that is then maximized.
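To make this model concrete, here is a minimal sketch of what “an explicitly represented, history-dependent utility function that is then maximized” means, under the idealized assumption of unlimited compute. The names here (`History`, `expected_utility`, `idealized_agent`, `utility_from_rewards`) are purely illustrative, not a proposed implementation:

```python
from typing import Callable, List, Tuple

# A history is the full sequence of (observation, action) pairs so far.
History = List[Tuple[str, str]]

# An explicit, history-dependent utility function: History -> float.
UtilityFn = Callable[[History], float]
Policy = Callable[[History], str]

def expected_utility(policy: Policy,
                     utility: UtilityFn,
                     sample_rollout: Callable[[Policy], History],
                     num_samples: int = 1000) -> float:
    """Monte Carlo estimate of E[utility(history)] when acting according to `policy`."""
    return sum(utility(sample_rollout(policy)) for _ in range(num_samples)) / num_samples

def idealized_agent(candidate_policies: List[Policy],
                    utility: UtilityFn,
                    sample_rollout: Callable[[Policy], History]) -> Policy:
    """The (hopelessly impractical) idealization: with unlimited compute,
    pick the policy that maximizes expected utility."""
    return max(candidate_policies,
               key=lambda pi: expected_utility(pi, utility, sample_rollout))

def utility_from_rewards(reward: Callable[[History], float]) -> UtilityFn:
    """A history-dependent reward summed over time is just another
    history-dependent utility function, so the two framings are interchangeable."""
    return lambda history: sum(reward(history[: t + 1]) for t in range(len(history)))
```

The last helper is the sense in which it does not matter materially whether we talk about utility functions or reward functions, as long as they can depend on history.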
I will not justify this model for now, but simply assume it by fiat and see where it takes us. I’ll note briefly that this model is often justified by the VNM utility theorem and AIXI, and as the natural idealization of reinforcement learning, which aims to maximize the expected sum of rewards -- although rewards in RL typically depend only on the current state, not the full history.
A lot of conceptual arguments, as well as experience with specification gaming, suggest that we are unlikely to be able to simply think hard and write down a good specification, since even small errors in a specification can lead to bad results. However, machine learning is particularly good at using data to narrow a vast space of hypotheses down to the correct one, so perhaps we could determine a good specification from some suitably chosen source of data? This leads to the idea of ambitious value learning: learn an explicit utility function from human behavior, and have the AI maximize it.
This is closely related to inverse reinforcement learning (IRL) in the machine learning literature, though not all work on IRL is relevant to ambitious value learning. For example, much work on IRL is aimed at imitation learning, which would in the best case allow you to match human performance, but not to exceed it. Ambitious value learning is, well, more ambitious -- it aims to learn a utility function that captures “what humans care about”, so that an AI system that optimizes this utility function more capably can exceed human performance, making the world better for humans than they could have made it themselves.
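For concreteness, here is a toy sketch of the kind of inference ambitious value learning would require, in the style of Bayesian IRL with a Boltzmann-rational model of the human. The finite set of candidate utility functions, the trajectory-level likelihood, and the reuse of `idealized_agent` from the earlier sketch are all simplifying assumptions for illustration, not a claim about how this would actually be done:

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

History = List[Tuple[str, str]]            # as in the sketch above
UtilityFn = Callable[[History], float]

def boltzmann_likelihood(demo: History, utility: UtilityFn, beta: float = 1.0) -> float:
    """(Unnormalized) probability that a noisily rational human produces `demo`
    if their values are described by `utility`. A toy, trajectory-level version."""
    return math.exp(beta * utility(demo))

def infer_utility(demos: Sequence[History],
                  candidate_utilities: Dict[str, UtilityFn],
                  beta: float = 1.0) -> Dict[str, float]:
    """Posterior over a finite set of candidate utility functions, given human
    demonstrations and a uniform prior."""
    posterior = {name: 1.0 for name in candidate_utilities}
    for name, utility in candidate_utilities.items():
        for demo in demos:
            posterior[name] *= boltzmann_likelihood(demo, utility, beta)
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}

# Imitation learning would stop at "reproduce the demonstrations". Ambitious value
# learning instead hands the inferred utility function to a stronger optimizer:
#   best = candidate_utilities[max(posterior, key=posterior.get)]
#   policy = idealized_agent(all_policies, best, sample_rollout)
```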
It may sound like we would have solved the entire AI safety problem if we could do ambitious value learning -- surely if we have a good utility function we would be done. Why then do I think of it as a solution to just the specification problem? This is because ambitious value learning by itself would not be enough for safety, except under the assumption of as much compute and data as desired. These are really powerful assumptions -- for example, I'm assuming you can get data where you put a human in an arbitrarily complicated simulated environment with fake memories of their life so far and see what they do. This allows us to ignore many things that would likely be a problem in practice, such as:
- Attempting to use the utility function to choose actions before it has converged
- Distributional shift causing the learned utility function to become invalid
- Local minima preventing us from learning a good utility function, or from optimizing the learned utility function correctly
The next few posts in this sequence will consider the suitability of ambitious value learning as a solution to the specification problem. Most of them will consider whether ambitious value learning is possible in the setting above (infinite compute and data). One post will consider practical issues with the application of IRL to infer a utility function suitable for ambitious value learning, while still assuming that the resulting utility function can be perfectly maximized (which is equivalent to assuming infinite compute and a perfect model of the environment after IRL has run).
Maybe it's not that bad? For example, I can imagine learning the human utility function in two stages. The first stage uses the current human to learn a partial utility function (or some other kind of data structure) describing how they want their life to go prior to figuring out their full utility function. E.g., perhaps they want a safe and supportive environment in which to think, talk to other humans, and solve various philosophical problems related to figuring out one's utility function, with various kinds of assistance, safeguards, etc. from the AI (but otherwise no strong optimizing forces acting upon them). In the second stage, the AI uses that information to compute a distribution over "preferred" future lives and then learns the full utility function only from those lives.
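To spell out the structure (and only the structure) of that proposal, here is some purely illustrative pseudocode; every helper in it (`learn_partial_preferences`, `simulate_preferred_lives`, `learn_full_utility`) is a hypothetical stand-in for a step nobody currently knows how to do:

```python
def learn_partial_preferences(current_human):
    """Hypothetical: learn a partial utility function (or other data structure)
    about how the human wants their deliberation to go."""
    raise NotImplementedError

def simulate_preferred_lives(current_human, partial_prefs):
    """Hypothetical: compute a distribution over 'preferred' future lives, e.g. a
    safe, supportive environment with assistance and safeguards from the AI and
    no strong optimizing forces acting on the human."""
    raise NotImplementedError

def learn_full_utility(preferred_lives):
    """Hypothetical: ambitious value learning, run only on the preferred lives."""
    raise NotImplementedError

def two_stage_value_learning(current_human):
    # Stage 1: use the current human to learn the partial preferences.
    partial_prefs = learn_partial_preferences(current_human)
    # Stage 2: learn the full utility function only from the preferred lives.
    preferred_lives = simulate_preferred_lives(current_human, partial_prefs)
    return learn_full_utility(preferred_lives)
```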
Another possibility: if we could design an Oracle AI that is really good at answering philosophical questions (including understanding what our confused questions mean), we could just ask it "What is my utility function?"
So I would argue that your proposal is one example of how you could learn a utility function from humans assuming you know the full human policy: you are proposing that we pay attention to a very small part of the human policy (the part that specifies our answers, at the current time, to the question "how do we want our life to go?", and then the part that specifies our behavior in the "preferred" future lives).
You can think of this as ambitious value learning with a hardcoded structure by which the AI is supposed to infer the utility function.