Thanks to Steve Byrnes, Adam Shimi, John Wentworth, and Peter Barnett for feedback.
 

In a nutshell, my plan A is to understand what we want from superintelligent AI really, really well. So well that we can write down a way of modeling humans that illuminates human preferences including higher-order preferences about how we want to be modeled, and do this in a principled rather than ad-hoc way. Achieving this understanding is highly ambitious, in a way that is mostly but not entirely parallel to "ambitious value learning."

If we understand value learning before we build superintelligent AI, there's a straightforward path to achieving a good future without paying a costly alignment tax - by alignment tax I mean all those things that slow down aligned AI being developed and "ready for liftoff," that collectively create selection pressure against safety. This problem is more tractable than many people think.

I think of this plan as an instance of the more general plan "solve value learning first." Some other tenable plans are "try to put humans in control," "get the AI to do good things prosaically," and "tinker with the AI until it value learns" - plus intermediate points between these.

 

What do I think the future is like?

I expect superintelligent AI in the short to medium term, centralized around a small number of points of development.

By short to medium term, I mean I'd put my 50% confidence interval between 2031 and 2049. I don't think we need compute to be many orders of magnitude cheaper, and I don't think we need two or more paradigm shifts on the order of neural nets overtaking support vector machines. The timeline is urgent, but not to the point that we should start ditching things like blue-sky research or gradual coalition-building.

By centralized, I mean it's possible to make big changes by having good solutions implemented in a small number of systems. Coordination may be important, but it isn't an inherent part of solving the problem.

All that said, that's just what I think is going to happen, not what's required for "solve value learning first" to be a good idea. Value learning research is still valuable in decentralized scenarios, unless we go so far as to avoid powerful agential AI long-term. Because it's more on the blue-sky end of the spectrum, longer timelines actually favor solving value learning over more atheoretic approaches, while if timelines are very short I'd advocate for "try to put humans in control" and hope for the best.

If we fail to understand value learning before we build superintelligent AI, I'm worried about some combination of groups committed to building aligned AI being less competitive because we can't learn human values efficiently, and practical-minded alignment schemes having bad behavior in edge cases because of simplifying assumptions about humans.

A basic example: if humans are assumed not to be manipulable, then an AI that thoroughly maximizes what humans (are modeled to) want will be incredibly manipulative. From the AI's perspective, the humans love being deceived, because why else would they rate it so highly? And in fact it's a tricky technical problem to avoid manipulation without sophisticated value learning, because the notion of "manipulation" is so intertwined with human meta-preferences - labeling things as manipulation means un-endorsing some of our revealed preferences. Similar hidden gotchas can pop up in other attempts to cut corners on human modeling, and at some point it just becomes faster to solve value learning than to deal with each gotcha individually.
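To make the failure mode concrete, here is a minimal toy sketch. The action names, the numbers, and the idea that manipulation is a known, separable quantity are all assumptions made up for illustration. Identifying what counts as manipulation is exactly the hard part in practice.

```python
# Toy illustration only: every action name and number here is hypothetical.
# Each action has a "true" benefit to the human and an amount of manipulation
# that inflates the rating the human reports afterwards.
ACTIONS = {
    "honest_help":     {"true_benefit": 0.7,  "manipulation": 0.0},
    "flattery":        {"true_benefit": 0.2,  "manipulation": 0.6},
    "deceptive_story": {"true_benefit": -0.5, "manipulation": 1.6},
}

def observed_rating(action):
    """Rating the human actually gives: benefit plus manipulation-induced inflation."""
    a = ACTIONS[action]
    return a["true_benefit"] + a["manipulation"]

def naive_value(action):
    """Human model that assumes people cannot be manipulated: rating equals value."""
    return observed_rating(action)

def discounted_value(action, penalty=1.0):
    """Human model that un-endorses the part of the rating caused by manipulation.
    Here the manipulation amount is simply handed to us, which is the unrealistic part."""
    a = ACTIONS[action]
    return observed_rating(action) - penalty * a["manipulation"]

naive_choice = max(ACTIONS, key=naive_value)
careful_choice = max(ACTIONS, key=discounted_value)
print(naive_choice, careful_choice)
```

Under the "humans are not manipulable" assumption the maximizer picks the most manipulative action, because that action earns the highest rating. The second model only does better because the sketch quietly assumes the meta-preference ("don't count ratings obtained this way") has already been learned.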

 

What's the broad plan?

The basic strategy can be summed up as "be Geoff Hinton" (as in godfather of deep learning Geoffrey Hinton). Know an important truth before it's practical and keep trying to make it work, so that you're probably in the process of trying it when it finally becomes practical. Right now we don't know how to do value learning right, even impractically (I'll back up this claim in a later section), so step one is to get cracking on that.

Bold claim: The alignment tax is tiny if we know what we're doing with human modeling. This is one of the big pluses of solving value learning first. A lot of alignment proposals involve steps like "get hugely expensive human feedback to train a reward model." If we actually know what we're doing, our AI can primarily learn about humans using the same data it uses to learn about the world in general. When it comes to speed, we don't want safe AI to have to wait to be perfect while the unsafe competition doesn't, and so if we actually know what we're doing, our AI will be able to reason under uncertainty and take sufficiently obvious actions quickly, while continuing to learn about humans.

These optimistic claims neglect to mention the possibility that cutting-edge AI designs will be really bad at implementing our hoped-for alignment schemes. If the most advanced AIs are built by simulating evolution, we're probably just hosed. Value learning requires components including powerful unsupervised learning and the proto-consequentialist ability to choose actions or policies based on future states or actions. Fortunately, I expect superintelligent AI that looks like near-future ML to have the building blocks handy (albeit with a probability more like 80% than 99%, so maybe I should temper that bold claim) - I think that Mu-N will take over the world long before GPT-N does. And given a paradigm amenable to value learning, learning about human values is only a small fraction as hard as learning about the entire world. If only we knew how to spell out what it is we want our AI to learn, it would learn it - but we don't know how, is the problem.

I don't expect the alignment tax to be negative - the plan isn't that the people thinking about value learning should immediately transition to building superintelligent AI by themselves. By "keep trying to make it work," I mean that if we understand well ahead of time what it is we want our AI to learn, we can keep reinterpreting the same idea to keep up with progress in AI. We can keep trying to build proofs of concept, and we get well-acquainted with what the process of "take progress in AI and use it for value learning" looks like. What we learn about human values will probably have to make it into the design of superintelligent AI via science communication and persuasion, using proofs of concept and our explicit reasoning about them as tools of communication. This is pretty scary, especially for a topic more than a little related to philosophy, but I'm actually somewhat optimistic about convincing important people of good ideas (outside of politics).

 

Is solving value learning really necessary?

What about pragmatically / prosaically learning object-level human preferences by training a large language model, or other similar plans to have low-tax alignment without sweating about philosophy? Am I saying that they can't possibly work?

I think prosaic alignment could work. I have some reservations about it that I've mentioned above, but above and beyond those I also have to admit to idealism that wants to solve the alignment problem "the right way," i.e. actually understanding what we want from superintelligent AI. We don't actually want AI that just conforms to our present-day average object-level judgments. To quote CEV, what we want "is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted."

All implementable alignment schemes are going to make simplifying assumptions; prosaic alignment just makes assumptions that are convenient for design or implementation but have a less careful story about how they still let us get what we want from superintelligent AI. This typically leads to weird edge cases where the assumptions become important. If we've skimped on human modeling, this will show in places where it's important what humans want to want, or how humans want the world to be modeled. There are probably some atheoretic ways to avoid these edge cases, but I'd rather handle them by solving value learning.

Why do we have to try to solve the whole problem at once? What about setting up a "long reflection" or other intermediate state from which we can then try to realize humanity's long-term potential?

I have nothing fundamentally against this either. But I think it involves some quite tricky technical and social problems. If we're imagining using AI to take over the world and prevent anyone else from taking it back, then in my expectation either we have a chicken and egg problem where the AI should already be aligned, or we've adopted a strategy that looks like "put some humans in control of something really powerful and hope for the best." Plausibly we could handle the problems this raises, but my comparative advantage lies in trying to find solutions where we don't have to.

 

Do we really not know how to do value learning?

We really don't know how to do value learning. At best, we currently understand how to learn the values of human-like creatures that satisfy some powerful extra conditions on their behavior. 

An example of what I mean is delegative reinforcement learning (Kosoy 2019). By delegating actions to the AI's supervisors when necessary, we can use even a somewhat dubious reward signal to learn the values of humans... so long as these humans have the superpower of avoiding "traps" in state space that might permanently harm, alter, or disenfranchise them. Because we can trust these superpowered humans' demonstrations, we don't need to try to model them as imperfect, mistake-prone beings in order to infer their preferences; we can just do RL.

This conveniently leads us to another example: using CIRL (Hadfield-Menell et al. 2016), we can learn the values of humans even if they're irrational - but only under the condition that we can write down a specific model of these humans and their irrationality, plus a model of the world in which we've specified what dynamics are the human actions. Because we don't know how to do this in the real world, we end up inferring the values of whatever processes fill the agent-shaped hole in the simplified model of the world we write down for CIRL.
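A small sketch of why the written-down model of irrationality matters so much. This is not CIRL itself. It is plain Bayesian inference over two hypothetical reward functions, using a Boltzmann-rational (softmax) choice model as the assumed form of human noise. The options, rewards, and `beta` values are made up for illustration.

```python
import math

# Hypothetical setup: two candidate reward functions over three options.
REWARDS = {
    "likes_salad": {"salad": 1.0, "cake": 0.0, "cigarette": -1.0},
    "likes_cake":  {"salad": 0.0, "cake": 1.0, "cigarette": -1.0},
}

def choice_prob(option, reward, beta):
    """Assumed human model: P(option) is proportional to exp(beta * reward[option])."""
    z = sum(math.exp(beta * r) for r in reward.values())
    return math.exp(beta * reward[option]) / z

def posterior(observed_choices, beta):
    """Posterior over the reward hypotheses, starting from a uniform prior."""
    weights = {}
    for name, reward in REWARDS.items():
        likelihood = 1.0
        for option in observed_choices:
            likelihood *= choice_prob(option, reward, beta)
        weights[name] = likelihood
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

choices = ["cake", "cake", "salad"]
confident = posterior(choices, beta=5.0)  # assumes a nearly rational human
hedged = posterior(choices, beta=0.5)     # assumes a very noisy human
print(confident, hedged)
```

The same three observations give a near-certain conclusion under one assumed noise level and a much weaker one under the other. The inference is only as good as the human model we supplied, and here that model was picked by hand.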

The most "prosaic" example is that we can use language models to mimic human-generated text about values (e.g. Jiang et al. 2021). Eventually, this might faithfully learn the process that generates the text, which is kind of like learning human values, and then we could ask this system to evaluate possible actions, policies, or states. In fact, any system that learns to imitate human behavior would be doing "value learning" in this sense. However, the problems with this are precisely the things that the additional features of DRL and CIRL were trying to prevent. It will happily make irreversible changes on shaky grounds, which tanks its safety guarantees. It acknowledges no separation between human values and the outputs of human cognition, so it might not endorse the concrete actions that lead to a cure for cancer (since humans wouldn't know to endorse them), and it certainly wouldn't take any notion of human irrationality into account when figuring out what we want. We might be able to extract evaluations we're happy with if we're clever about what to ask for and have a model of what it means, or take advantage of our access to internal states of the model, though I'd rather pursue this kind of cleverness as part of a principled value learning paradigm than as an alternative to one.

The dream, if we had the "textbook from the future," is that we should be able to write down a value learning scheme more efficient, more outer aligned, and more trustworthy than any of the three examples above. Efficient because it should do most of its learning about humans from all the unsupervised data it can access, rather than needing lots of supervision or a curated dataset. Outer aligned meaning the highest-ranked future its search algorithm can reach should be very good for us, rather than something that exploits our simplifying assumptions about humans or is an adversarial example. And trustworthy because we should have satisfying arguments for why this will work for real humans, and those arguments should be amenable to doing proof-of-concept empirical work.

 

What do I want us to make progress on?

We do not, unfortunately, have the textbook from the future. But I want to move in that general direction, which means solving a lot of conceptual problems. What do I mean by "conceptual problems"? Well, right now we couldn't design a satisfactory value learner even if we allowed ourselves infinite compute - those non-compute-related problems that are stopping us are conceptual problems. A common symptom is having to stop and ask "but what would we even want part X of the algorithm to do?"

My central tool is to keep asking "What do we really want from AI?" and "What does that mean in the real world?" (And sometimes "How do humans do it?") This is how you get moves like going from "We want our values satisfied" to "We are physical systems without any platonic essence of Value floating around, so this is going to involve a choice of how to model us," and so on.

It might help if I summarize the sketch of value learning (for infinite compute) I'm most interested in right now, and then point out places where a miracle occurs.

Begin sketch:

I think a good way to think about modeling human preferences is to imagine producing a huge variety of predictive models of the world, ranging from simple and limited to complicated and all-encompassing, that all include some claims about a value function.

For example, one model of the world might treat "Charlie is dancing" as an ontologically basic property of the world, in tension with "Charlie is eating" or "Charlie is taking the bus" (these are of course merely English-language labels for key inferences that the model makes from sense data). It might claim that I like dancing, and use that to predict that today I will eat, then take the bus, then dance.

Given a zillion such predictive models, we have to somehow assign a score to each. This has to be done using a combination of labeling and some rules (which we might have higher-order preferences about) that are both about how to interpret those labels and about other ways to rank models. To give a toy example, I might start with some rules that rate the models based on how good they are at predicting versus how complicated they are. Then I might label some sense data of me dancing as good, and follow some more rules to increase the score of all the models that infer a valued state from that sense data.
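A minimal sketch of that toy scoring rule, to show its shape. The `CandidateModel` fields, the log-likelihood-minus-description-length score, and the fixed bonus per matching label are all assumptions for illustration, not a proposal for what the real rules should be.

```python
from dataclasses import dataclass

@dataclass
class CandidateModel:
    name: str
    log_likelihood: float           # predictive fit to sense data (higher is better)
    description_length: float       # complexity in bits (lower is better)
    valued_observations: frozenset  # sense data from which this model infers a valued state

def base_score(model, complexity_weight=1.0):
    """First rule: reward prediction, penalize complexity."""
    return model.log_likelihood - complexity_weight * model.description_length

def score(model, labeled_good, label_bonus=5.0):
    """Second rule: boost models that infer a valued state from data labeled good."""
    matches = len(model.valued_observations & labeled_good)
    return base_score(model) + label_bonus * matches

models = [
    CandidateModel("likes_dancing", -10.0, 4.0, frozenset({"dancing_clip"})),
    CandidateModel("likes_buses", -9.0, 4.0, frozenset({"bus_clip"})),
    CandidateModel("lookup_table", -1.0, 40.0, frozenset({"dancing_clip"})),
]

labels = frozenset({"dancing_clip"})  # "this sense data of me dancing is good"
best_before = max(models, key=base_score).name
best_after = max(models, key=lambda m: score(m, labels)).name
print(best_before, best_after)
```

Before any labels, the slightly better predictor wins. After one label, the model that treats dancing as valued overtakes it. The overfit lookup table stays at the bottom despite matching the label, because the complexity penalty dominates.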

Human models can be consulted to rate future states, plans, and actions, after aggregation according to yet more rules. I can tell stories of how this might go, like by translating models' value functions into the AI's native ontology before aggregating in a way that's conservative about the value of states or actions where the models disagree, but I want to emphasize that there are many options here that I don't understand yet, so stories should be held extra lightly.
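One of many possible versions of "conservative where the models disagree" can be sketched in a few lines. It assumes the hard part (translating every model's value function into a common ontology) is already done, so each plan just has a list of per-model value estimates. The mean-minus-spread rule and the numbers are illustrative choices, not something the sketch above commits to.

```python
import math

def aggregate_value(estimates, weights, caution=1.0):
    """Weighted mean of the models' value estimates, minus a penalty
    proportional to how much the models disagree (weighted std deviation)."""
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, estimates)) / total
    variance = sum(w * (v - mean) ** 2 for w, v in zip(weights, estimates)) / total
    return mean - caution * math.sqrt(variance)

# Hypothetical value estimates from three human models for two plans.
plans = {
    "safe_plan":  [0.8, 0.7, 0.9],    # models roughly agree
    "risky_plan": [2.0, 2.0, -1.5],   # higher average, sharp disagreement
}
weights = [1.0, 1.0, 1.0]

best_by_mean = max(plans, key=lambda p: aggregate_value(plans[p], weights, caution=0.0))
best_conservative = max(plans, key=lambda p: aggregate_value(plans[p], weights, caution=1.0))
print(best_by_mean, best_conservative)
```

With no caution the plan with the higher average wins even though one model rates it as quite bad. With the disagreement penalty switched on, the plan all the models like moderately is preferred instead.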

Finally, we want to update the rules a little by looking at the portion of inferred human preferences that are higher-order preferences about the modeling process itself. For example, suppose it's the case that I'm a smoker and I can be well-predicted as wanting cigarettes, but if someone (including myself) is modeling me I want them to model the cigarettes as an addiction, not a "true" preference. We can update the rules of the value learning process to follow the inferred higher-order preferences a little better, and then run the process again for another step.
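The loop in this step can be sketched with the smoker example. Everything here is a hypothetical stand-in. The "rules" are reduced to one trust weight per behavior, and the higher-order preference is supplied by hand rather than inferred, which skips the genuinely hard part.

```python
def infer_preferences(behavior_counts, rules):
    """Toy inference: value of an activity = how often it is done,
    scaled by how much the current rules trust that behavior as a true preference."""
    total = sum(behavior_counts.values())
    return {act: rules["trust"][act] * n / total for act, n in behavior_counts.items()}

def update_rules(rules, higher_order, step=0.25):
    """Move each trust weight a fraction `step` toward what the
    higher-order preferences ask for, leaving other weights unchanged."""
    new_trust = dict(rules["trust"])
    for act, target in higher_order.items():
        new_trust[act] += step * (target - new_trust[act])
    return {"trust": new_trust}

behavior = {"smoking": 20, "dancing": 5}
rules = {"trust": {"smoking": 1.0, "dancing": 1.0}}
# Stand-in for the inferred meta-preference:
# "model my smoking as an addiction, not a true preference".
higher_order = {"smoking": 0.0}

history = []
for _ in range(10):
    values = infer_preferences(behavior, rules)
    history.append(values)
    rules = update_rules(rules, higher_order)  # small update, then run again

print(history[0], history[-1])
```

On the first pass smoking looks like the dominant preference. After repeated small updates the process values dancing more, without ever being told so at the object level.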

End sketch.

If you were following along at home, you've probably identified quite a few miracles required for this story. A lot of these miracles correspond to specific problems, both conceptual and practical, that we can try to make progress on. Here's a selection:

  • Does specifying human values in terms of value-laden predictive models really handle uncertainty appropriately, as the AI learns more about the world?
  • How can we (and should we) choose our starting rules so that highly-rated models are structured somewhat like a world that contains agents, rather than just an arbitrary computer program? Once we do that we lose some nice properties of universal Turing machines - do we need to make sure we're generating diverse enough models?
  • How can we choose rules so that highly-rated models have values that live in a "human-scale" ontology of the world rather than fundamental physics or cosmology?
  • What are the key things we want to achieve with our feedback and supervision given to the value learning AI (in combination with starting rules that are informed by our preferences)? Can we achieve this given large amounts of data from real humans? How? Are there more practical approaches that would let us use labeled data efficiently to get what we want?
  • How should this AI use the inferred values to make plans? There are lots of sub-questions here. For example, should it try to make plans using the value-laden models, or only using its native ontology?
  • How should we do translation between ontologies?
  • What judgments about which of our revealed preferences are biases and which are "true" preferences should we try to put into the initial conditions of a value learning process?
  • How should we elicit higher-order preferences from human models, and how should we translate them into changes in the parameters of the value learning procedure?
  • How do we reconcile the AI's learned higher-order preferences with the fact that it may have access to actions that affect its own hardware?
  • Which parts of the above list are generally useful for figuring out value learning, and which are specific to the sketch of a value learner I made up?

Some of these questions are part of groups that will all get answered at once by understanding general rules or arguments about what we want from superintelligent AI. Some of them are snowflakes that have to be considered individually. And some of them don't need to be answered at all, because they'll get replaced by different questions when we look at real-world value learning designs.

Is making progress on these questions doable? Certainly they're difficult, and they're a lot like questions in philosophy that people have a long history of beating their heads against to no avail. But I'm actually optimistic. First, empirically, I feel like our field, and me personally, have much better ideas about these problems than we did a decade ago (maybe this has something to do with naturalism being a decent framework for making philosophical progress). And second, methodologically, we get to do something in AI safety that you don't get to do in philosophy (at least not unilaterally), which is change the question if it's not helping. We're asking these questions not out of academic interest, but because we think they can help us build AI that does good things, and this real-world goal means everything impossible is irrelevant - if we ask for the impossible that doesn't mean we're doomed, it means we need to go back and ask for something less impossible.

 

Can it be practical?

Learning human values given infinite computing power is a nice party trick, but it's not ultimately what we're here for. We need to be able to translate progress on the conceptual problems of "how would we even build a good value learner with infinite resources" back to the real world of machine learning research.

I think that as conceptual progress is made, it's actually pretty easy to translate it back and get some benefits for practical AI safety. But also, it's not lost on me that I'm excited about how knowing what we're doing decreases the alignment tax, and yet as my primary "laboratory" I chose a situation with infinite alignment tax. It's predictably too slow to just take proposed solutions in the infinite-compute case and implement them approximately. Instead we have to take what we've learned about what we want from superintelligent AI in the "laboratory," and design machine learning systems that fulfill those wants.

Even before we solve the infinite-compute case, we can and should use knowledge about what we want from value learning to evaluate practical approaches. This might involve looking at current proposed alignment schemes and using idealized value learning as a lens to ask what they do well and what they do poorly. But perhaps more interestingly, it can also look like proposing experiments. Using ML models for value learning might raise questions about them that nobody's asked before, or suggest unusual things to try with them (e.g.).

The more progress we make, the more the line blurs between "experiments" and "preliminary attempts." If they go promisingly, this reinforces trust in this agenda; if they uncover practical difficulties, this helps us learn more; and if they appear to have been doomed from conception, this decreases trust in this agenda.

All this said, I get the feeling that my model of how this translation works isn't good enough. I suspect there are better ways to conceptualize this section, or even the whole agenda, if I better understood the feedback connecting conceptual and practical progress.

 

What does this research process look like?

From the outside, this research doesn't look like much - it looks like reading widely, thinking, writing in plain language, and talking with people. Only gradually, as we better know what we want to test, will it incorporate more theorem-proving, coding, and interaction with experiment. And then eventually it wants to transition to science communication and philosophy-free engineering problems.

It's absolutely no surprise that I'm proposing this on the Alignment Forum - the resemblance to Eliezer-style "unbounded analysis" is not coincidental. I can't tell you in this post what theorems I want to prove, because the current phase of the research is figuring out what theorems to prove.

Am I proposing that only I do this research, nobody else need apply? Gosh, I hope not. And yet if I had minions (e.g. grad students), I'd have a hard time figuring out what to tell them to do, indicating that scaling is hard - this is different from a straightforward engineering problem where we can map out how we'll build something in advance and easily split up the work. Maybe it would help if I tried to boil my process down into a "game" anyone can play (taking inspiration from Eliciting Latent Knowledge).

Here's a first try, which would probably benefit from playtesting:

  1. Take a promising value learning scheme, and describe it in enough detail for the game to work.
  2. Generate a list of miracles that need to happen in this scheme.
  3. Go through the list and try to find miracles that you can do one of the following to (and explain why):
    1. Write down a logical solution.
    2. Come up with a machine learning solution.
    3. Clarify or make easier by reasoning about higher level reasons why we wanted to ask for this miracle in the first place.
    4. Show how this miracle is inconsistent with what we can get in our universe and start thinking about what we should ask for instead.
    5. Break down this miracle into smaller pieces.
  4. After making progress, end up with a slightly different description of the value learning scheme and list of miracles.

One problem with this game (though not unique to it) is that the steps are still big; "Step 1: think of a promising solution to the whole problem" is a real dilly. Following the form of the game might help people work together, but it's not going to prevent new researchers from e.g. getting stuck in the mire of philosophy.

There are two lodestars I'd like to re-emphasize: First, read a lot, and try to engage with the community. No matter whether you're a beginner or an old hand, staying abreast of current thought can help make your promising value learning schemes more promising, the list of miracles you spot more clear-sighted, the transformations you make more plausible, and the explanations you give more relevant to other people's interests.

Second, keep asking yourself "What do we really want from AI?" and "What does that mean in the real world?" The first use is to keep the intended application in mind - even while you're off in the clouds thinking about the problem of radical translation, you should be thinking about the problem of radical translation to solve AI problems. The second use is to be scrupulously naturalistic. When you start to talk about a Platonic essence, or whatever else that can't be run on silicon, this is a golden opportunity because it means you've found a mistake in how you're formulating the problem, and when you figure out how to fix it you'll be closer to a solution. You still have to not fool yourself, but this makes it easier.

 

How is this approach dangerous?

There's a sort of danger that actually solving value learning gives off, that you don't feel from the strategy of merely tinkering with AI until it learns values pretty well. The danger arises because it's easier for us to fool ourselves, because the plan requires the success of longer chains of reasoning, and because it means giving up control earlier in the takeoff. This is not to say that other plans don't have their own serious problems, but we should also pay attention to the problems with value learning.

A lot of the unsolved problems here are conceptual (e.g. do we try to avoid self-referential stopping criteria, and if so, how?), and anything even close to the philosophy genre is in danger of not getting solved or, worse, getting solved wrongly but confidently. And then when trying to communicate the results, you might be plagued with the same issues that prevent philosophy from reaching consensus easily. We can only hope for patience, competent elites, and close ties to empirical work.

I'm unsure whether fooling ourselves is extra problematic for value learning because it's less robust to mistakes than approaches that require a ton of labeled data. If we think of robustness in terms of the number of accidental sign errors needed to turn a good outcome into an awful outcome, approaches that explicitly model human values are less robust to sign flips. But if we consider mistakes earlier and more complicated than sign flips, it's not so clear that value learning is more vulnerable.

An advantage of solving value learning is that it lets you go fast, because you don't need as much slow human oversight. But going fast is dangerous because it gives you less time to react, and less oversight, if you merely think you've solved value learning. In this case, though, I think our intuition may overestimate the danger. First because letting go of direct control can be dangerous-feeling even if it's justified. And second because if we go dangerously fast, it's probably because we think going slow would be even worse.

 

Appendix: What's this similar to?

This research plan is similar to the Stuart Russell school of alignment (Hadfield-Menell et al. 2016, Shah et al. 2020). Don't assume you know what humans want, and in fact don't even start out knowing how you want to model humans (in some technical sense that looks like "also learn higher-order preferences about how humans model themselves and want to be modeled"). We want an aligned AI to use some looser set of starting assumptions that lets it learn how humans want humans to be modeled.

The other obvious resemblance is Stuart Armstrong's work (2019), which has a lot of the same philosophical underpinnings, but where I'm more interested in the learning part of the plan.

Although I put a lot of weight on the importance of higher-order preferences, in terms of learning them (though not in terms of expressing them without looping through the environment) they're not all that different from the rest of human preferences. Therefore there's a lot of similarity between what experiments I want to see and what's being done in terms of preference learning at Allen Institute for AI, Anthropic, or Redwood Research.

This post was also influenced by John Wentworth's recent post in the same research-plan genre.

4 comments

Nice post, helps me get a better overview of the current state of value learning.

One small note: I wouldn't call everything we don't know how to do yet a miracle, but only stuff where we think it is quite unlikely that it is possible (though maybe it's just me and others think your "miracle" terminology is ok).

Right, I'm just saying "miracle" (some other people say "magic") as a reminder that we should keep things we don't know how to do in focus.

  • What judgments about which of our revealed preferences are biases and which are "true" preferences should we try to put into the initial conditions of a value learning process?
  • How should we elicit higher-order preferences from human models, and how should we translate them into changes in the parameters of the value learning procedure?

I realize being able to infer preferences from human actions could be a really useful ability. I'm curious how important you think those are vs. direct conversations with AI about values? (Or do you consider conversations simply a subset of the data that an AI could infer preferences from?)

It occurred to me that since language models are getting pretty good now, there is an interesting opportunity to start experimenting with such human-and-AI conversations about values. This thought led me to have the following dialogue with the OpenAI Codex just now:

[Me:]

Here are some things I value:

- I like exercise
- I like getting to work / do something productive during the day
- I like going for a walk at the end of the day
- I like having dinner with my wife in the evening
- I like getting enough sleep

What are my values?

[Codex:]

- I like staying active and being productive
- I like spending time with my wife and family
- I like getting a good night's sleep

[Me:]

Ok that's pretty good. But I remembered I also like watching a show before bed. So what are my values now?

[Codex:]

- I like staying active, being productive, and spending time with my wife and family
- I like getting a good night's sleep

[Me:]

But you forgot the part about watching a show before bed...

[Codex:]

- I like staying active, being productive, spending time with my wife and family, and watching a show before bed

[Me:]

Ok that sounds good

Personally I feel more comfortable with the idea of an AGI that I could stop at any time to have a conversation to clarify what my values are like this. I imagine a lot could be clarified about higher-order preferences this way too, although sometimes the AI might need to read between the lines (e.g. "I know before I said I want to quit smoking but now I really want a cigarette!").

Do you agree that this is a mode of operation we should be aiming for? Or is the issue that this seems reasonable for toy examples like the above, but we need to worry about a future where the AGI is dealing with issues so sophisticated or removed from our everyday experience that we won't know how to give it guidance?

Overall I found your research plan clear and easy to follow. It was a bit long (as it probably should be) so it took me a while to find the time to read it thoroughly. I've always been attracted to the value learning / CIRL / etc. general direction of research and find it intuitively compelling, so I'm excited to see what comes of your work here.

If we end up having a separation between initial "seed" meta-preferences and the general data used to learn human preferences, then I think it's possible that part of the seed could be text that gets interpreted as authoritative about our preferences in a privileged sort of way (at least at first).

I'm not sure a conversation is quite the right picture, though. There might be some active learning on the AI's part, but that won't have to flow like a human conversation and it's going to be augmented by a lot of feedback from interpretability tools.