I constructed my AI alignment research agenda piece by piece, stumbling around in the dark and going down many false and true avenues.

But now it is increasingly starting to feel natural to me, and indeed, somewhat inevitable.

What do I mean by that? Well, let's look at the problem in reverse. Suppose we had an AI that was aligned with human values/preferences. How would you expect that to have been developed? I see four natural paths:

  1. Effective proxy methods. For example, Paul's amplification and distillation, or variants of revealed preferences, or a similar approach. The point of this is that it reaches alignment without defining what a preference fundamentally is; instead it uses some proxy for the preference to do the job.
  2. Corrigibility: the AI is safe and corrigible, and along with active human guidance, manages to reach a tolerable outcome.
  3. Something new: a bold new method that works, for reasons we haven't thought of today (this includes most strains of moral realism).
  4. An actual grounded definition of human preferences.

So, if we focus on scenario 4, we need a few things. We need a fundamental definition of what a human preference is (since we know this can't be defined purely from behaviour). We need a method of combining contradictory and underdefined human preferences. We also need a method for taking into account human meta-preferences. And both of these methods have to actually reach an output, and not get caught in loops.

If those are the requirements, then it's obvious why we need most of the elements of my research agenda, or something similar. We don't need the exact methods sketched out there; there may be other ways of synthesising preferences and meta-preferences together. But the overall structure - a way of defining preferences, and ways of combining them that produce an output - seems, in retrospect, inevitable. The rest is, to some extent, just implementation details.
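
To make that structure concrete, here is a toy sketch of the pipeline it implies. This is not the agenda's actual method; every name, data type, and synthesis rule below is a hypothetical stand-in, chosen only to show the shape: define partial preferences, apply meta-preferences, and combine everything into a single output without looping.

```python
# Hypothetical structural sketch only -- nothing here is the research agenda's
# actual formalism.  It just illustrates the pipeline shape the post describes.

from dataclasses import dataclass


@dataclass
class PartialPreference:
    """A preference read off a human's internal models: possibly contradictory
    with others and defined only in a narrow context."""
    statement: str   # e.g. "more free time is better, all else equal"
    weight: float    # strength of endorsement on some normalised scale


def extract_preferences(human_model: dict) -> list[PartialPreference]:
    """Step 1: a grounded definition of what counts as a preference.
    Here we simply read them off a stub model; the hard part is hidden."""
    return human_model["partial_preferences"]


def apply_meta_preferences(prefs: list[PartialPreference],
                           meta: dict[str, float]) -> list[PartialPreference]:
    """Step 2: meta-preferences re-weight the object-level preferences.
    A single non-looping pass guarantees the process terminates."""
    return [PartialPreference(p.statement, p.weight * meta.get(p.statement, 1.0))
            for p in prefs]


def synthesise(prefs: list[PartialPreference]) -> dict[str, float]:
    """Step 3: combine contradictory and underdefined preferences into one
    output.  A weighted sum is the crudest possible choice; a real synthesis
    would be far more careful."""
    utility: dict[str, float] = {}
    for p in prefs:
        utility[p.statement] = utility.get(p.statement, 0.0) + p.weight
    return utility


if __name__ == "__main__":
    toy_human = {"partial_preferences": [
        PartialPreference("more free time", 1.0),
        PartialPreference("more money", 0.8),
        PartialPreference("more free time", -0.3),  # a contradictory preference
    ]}
    meta = {"more money": 0.5}  # meta-preference: discount this one
    prefs = apply_meta_preferences(extract_preferences(toy_human), meta)
    print(synthesise(prefs))   # {'more free time': 0.7, 'more money': 0.4}
```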

25 comments

I really hope people get this point, since I so far feel like I've failed to convince many people actively working in AI of the need for (4). I know lots of people who agree that (4) is needed because it's been my position for a while, but they are also mostly outsiders because it's not been the major focus in AI safety work. I'm hopeful your work is an important step in building the movement to work on (4) and for work on it to be taken seriously (even as recently as this weekend at EA Global SF 2019 I encountered serious AI safety researchers who were dismissive of the need to work on (4)).

So, with that context in mind, I think it's worth asking what is needed to elevate (4) to a question many will work on and consider important. Maybe with your work we are now over the threshold, and I'd love to see evidence of that in the coming months, but if we are not, what should we be working on so that (4) can take off and not be a fringe topic in the minds of those building AI systems?

I encountered serious AI safety researchers who were dismissive of the need to work on (4)

The argument against (4) is that the AI will be able to figure out our preferences since it is superintelligent, so all we need to do is ensure that it is incentivized to figure out and satisfy our preferences, and then it will do the rest. I wouldn't dismiss work on (4), but it doesn't seem like the highest priority given this argument.

One potential counterargument is that the AI must look like an expected utility maximizer due to coherence arguments, and so we need to figure out the utility function, but I don't buy this argument.

The argument against (4) is that the AI will be able to figure out our preferences since it is superintelligent, so all we need to do is ensure that it is incentivized to figure out and satisfy our preferences, and then it will do the rest. I wouldn't dismiss work on (4), but it doesn't seem like the highest priority given this argument.

I mostly view this as a kind of passing-the-buck argument in a way that is potentially dangerous, because it assumes we can keep a superintelligent AI aligned long enough that it can figure out (4) in a way we would be happy with. We maybe don't have to solve literally all of (4) and everything connected to it, but it would seem we have to solve enough of it that we can reasonably trust that a system we build will stay aligned with us as it works out the rest of the details for us.

I think of it something like this: we need an adequate theory of values, preferences, metapreferences, volition, or whatever you want to call it, such that we can evaluate whether some AI would, if we asked it to, work out (4) in a way that satisfies both that theory and us. That reduces back to having a theory of values that satisfies us well enough to capture everything we care about in principle (even if we can't, say, work out exactly what it implies about specific behaviors in all scenarios), in a way we would endorse after operating on more data.

To make an analogy, I think we need enough values "math" to be sure we get something that hangs together, the way basing math on category theory rather than naive set theory does, and that doesn't produce unexpected gaps, incompletenesses, or inconsistencies which only become apparent much later, after working out the theory further. Basically, where we stand on understanding values today is about where we stood 120 years ago in mathematical foundations: we have something that kind of works, but we can tell it will probably blow up in our faces if we push it to its limits. I suspect we need something that won't do that if we're to adequately assess whether we can safely hand off the remainder of the task to an AI. Without it, there are likely to be leaps of faith in our proofs that doing so would be safe, the same way people in the past had to put up signposts saying "just don't go over here" in order to get on with a theory of mathematical foundations.

The thing you're describing is a theory of human preferences, not (4): An actual grounded definition of human preferences (which implies that in addition to the theory we need to run some computation that produces some representation of human preferences). I was mostly arguing against requiring an actual grounded definition of human preferences.

I am unsure on the question of whether it is necessary to have a theory of human preferences or values. I agree that such a theory would help us evaluate whether or not a particular AI agent is going to be aligned or not. But how much does it help? I can certainly see other paths that don't require it. For example, if we had a theory of optimization and agents, and a method of "pointing" optimization power at humans so that the AI is "trying to help the human", I could imagine feeling confident enough to turn on that AI system. (It obviously depends on the details.)

Ah, sounds like we were working with different definitions of "definition".

For example, if we had a theory of optimization and agents, and a method of "pointing" optimization power at humans so that the AI is "trying to help the human", I could imagine feeling confident enough to turn on that AI system. (It obviously depends on the details.)

That still seems dangerous to me, since I see no reason to believe it wouldn't end up optimizing for something we didn't want. I guess you would have a theory of optimization and agents so good you could know that it wouldn't optimize in ways you didn't want it to, but I think this also begs the question by hiding details in "want" that would ultimately require a sufficient theory of human preferences.

As I often say, the reason I think we need to prioritize a theory of human preferences is not that I have a slam-dunk proof that we need it. It's that we have no slam-dunk argument, on the other side, for why we won't end up needing it, so by not working on it we fail to adequately mitigate a known risk of superintelligent AI. I'd rather live in a world where we worked it out and didn't need it than one where we didn't work it out and do need it.

That still seems dangerous to me, since I see no reason to believe it wouldn't end up optimizing for something we didn't want. I guess you would have a theory of optimization and agents so good you could know that it wouldn't optimize in ways you didn't want it to

In my head, the theory + implementation ensures that all of the optimization is pointed toward the goal "try to help the human". If you could then legitimately say "it could still end up optimizing for something else", then we don't have the right theory + implementation as I'm imagining it.

but I think this also begs the question by hiding details in "want" that would ultimately require a sufficient theory of human preferences.

I think it's hiding details in "optimization", "try" and "help" (and to a lesser extent, "human"). I don't think it's hiding details in "want". You could maybe argue that any operationalization of "help" would necessarily have "want" as a prerequisite, but this doesn't seem obvious to me.

You could also argue that any beneficial future requires us to figure out our preferences, but that wouldn't explain why it had to happen before building superintelligent AI.

As I often say, the reason I think we need to prioritize a theory of human preferences is not that I have a slam-dunk proof that we need it. It's that we have no slam-dunk argument, on the other side, for why we won't end up needing it, so by not working on it we fail to adequately mitigate a known risk of superintelligent AI. I'd rather live in a world where we worked it out and didn't need it than one where we didn't work it out and do need it.

I agree with this, but it's not an argument on the margin. There are many aspects of AI safety I could work on. Why a theory of human preferences in particular, as opposed to e.g. detecting optimization?

I agree with this, but it's not an argument on the margin. There are many aspects of AI safety I could work on. Why a theory of human preferences in particular, as opposed to e.g. detecting optimization?

How we decide whether to work on one thing versus another seems a matter of both how important the project is overall and how well placed any individual is to do something about it relative to their ability to do something about something else. That is, I don't think of human researchers as a commodity, so much of the answer to this question is about what any one of us can do, not just what we could do if we were all equally capable of working on anything.

I think of this as a function of impact, neglectedness, tractability, skill, and interest. The first three hold constant across all people, but the last two vary. So when I choose what to work on, given that I have decided to work on AI safety, I also decide based on what I am most skilled to do relative to others (my comparative advantage) and what I am most interested or excited about working on (to what extent I am excited to work on something for its own sake, regardless of how much it advances AI safety).

Weighing the first three together suggests whether or not it is worth anyone working on a problem, and including the last two points to whether it's worth you working on a problem. So when you ask about what "I could work on", all of those factors are playing a role.

But if we go back to just the question of why I think a theory of human preferences is impactful, neglected, and tractable, we can take these in turn.

First, I think it's pretty clear that developing a theory of human preferences precise enough to be useful in AI alignment is neglected. Stuart aside, everyone else I can think of thinking about this is still in the proto-formal stage, trying to figure out what the formalisms would even look like to begin making precise statements to allow forming the theory. Yes, we have some theories about human preferences already, but they seem inadequate for AI safety purposes.

Second, I think we have reason to believe it's tractable, in that no one has previously needed a theory of human preferences accurate enough to address the needs presented by AI alignment, so we haven't spent much time really trying to solve the problem of "a precise theory of human values that doesn't fall apart under heavy optimization". The closest we have is the theory of preferences from behavioral economics, which I view Stuart's work as an evolution of; that theory was worked out over a few decades, once markets became a powerful enough optimization force that it was valuable to have it so we could get more of what we wanted from markets and other modern transactional means of interaction. Yes, this time is different, since we are working ahead of the optimization force being present in our lives, and we are doing that for reasons relevant to the last point, impact.

On the question of impact, I think Stuart addressed this question well several months ago.

I also have an intuition that thinking we can get away without an adequate theory of human preferences is the same mistake we made 20 years ago, when folks argued that we didn't have to worry about safety at all because a sufficiently intelligent AI would be smart enough to be nice to us, i.e. that sufficient intelligence would result in sufficient ethics. Now of course we have the orthogonality thesis, which says the two are not correlated, and I think of trying to build powerful optimizers without an adequate understanding of what we want them to optimize for (in this case human preferences) as making the same mistake we made early on: thinking we will easily solve one very different problem by solving another.

All of this suggests to me that we should be making space for some people to work on a theory of human preferences, just like we make space for some people to work on agent foundations and some people to work on alignment in the context of current ML systems. There's no one overseeing AI safety as a Manhattan-style project, so the natural way to organize ourselves is not around driving towards a single objective, but towards a goal we might approach by many means. Lacking clear consensus, it is worth pursuing these many means as a hedge against the likelihood that we are wrong; and we are not resourced to the frontier such that we must make tradeoffs between different directions, so much as we can simply expand the frontier by bringing in more people to work on AI safety.

I agree with basically all of this; maybe I'm more pessimistic about tractability, but not enough to matter for any actual decision.

It sounds to me like, given these beliefs, the thing you would want to advocate is "let those who want to figure out a theory of human preferences do so and don't shun them from AI safety". Perhaps also "let's have some introductory articles for such a theory so that new entrants to the field know that it is a problem that could use more work and can make an informed decision about what to work on". Both of these I would certainly agree with.

In your original comment it sounded to me like you were advocating something stronger: that a theory of human preferences was necessary for AI safety, and (by implication) at least some of us who don't work on it should switch to working on it. In addition, we should differentially encourage newer entrants to the field to work on a theory of human preferences, rather than some other problem of AI safety, so as to build a community around (4). I would disagree with these stronger claims.

Do you perhaps only endorse the first paragraph and not the second?


I endorse what you propose in the first paragraph. I do think a theory of human preferences is necessary and that at least someone should work on it (and if I didn't think this I probably wouldn't be doing it myself), although not necessarily that someone should switch to it all else equal, and I wouldn't say we should encourage folks to work on it more than other problems as a general policy since there's a lot to be done and I remain uncertain about prioritization so can't make a strong recommendation there beyond "let's make sure we don't fail to work on as much as seems relevant as possible".

So it sounds like we only disagree on the necessity aspect, and that seems to be the result of an inferential gap I'm not sure how to bridge yet, i.e. why I believe it to be necessary hinges in part on deeper beliefs we may not share and haven't figured out how to make explicit. That's good to know, because it points towards something worth thinking about and addressing, so that existing and new entrants to AI safety work may better accept it as important and useful work.

so all we need to do is ensure that it is incentivized to figure out and satisfy our preferences, and then it will do the rest.

That's actually what I'm aiming at with the research agenda, but the Occam's razor argument shows that this itself is highly non-trivial, and we need some strong grounding of the definition of preference.

There's a difference between "creating an explicit preference learning system" and "having a generally capable system learn preferences". I think the former is difficult (because of the Occam's razor argument) but the latter is not.

Suppose I told you that we built a superintelligent AI system without thinking at all about grounded human preferences. Do you think that AI system doesn't "know" what humans would want it to do, even if it doesn't optimize for it? (See also this failed utopia story.)

Do you think that AI system doesn't "know" what humans would want, even if it doesn't optimize for it?

I think the AI would not know that, because "what humans would want" is not defined. "What humans say they want", "what, upon reflection, humans would agree they want...", etc. can be defined, but "what humans want" is not a defined thing about the world or about humans - not without extra assumptions (which cannot be deduced from observation).

Shmi:

I agree that 4 needs to be taken seriously, as 1 and 2 are hard to succeed at without making a lot of progress on 4, and 3 is just a catch-all for every other approach. It is also the hardest, as it probably requires breaking a lot of new ground, so people tend to work on what appears solvable. I thought some people were working on it, though, no? There is also a chance of proving that "An actual grounded definition of human preferences" is impossible in a self-consistent way, and we would have to figure out what to do in that case. The latter feels like a real possibility to me.

My impression continues to be that (4) is neglected. Stuart has been the most prolific person I can think of to work on this question, and the distribution falls off sharply after that, with myself having done some work and then not much else that comes to mind in terms of work addressing (4) in a technical manner that might lead to solutions useful for AI safety.

I have no doubt others have done things (Alexey has thought (and maybe published?) some on this), and others could probably forget my work or Stuart's as easily as I've forgotten theirs, because we don't have a lot of momentum on this problem right now to keep it fresh in our minds. Or so is my impression of things now. I've had some good conversations with folks, and a few seem excited about working on (4) and seem qualified in ways to do it, but no one but Stuart has yet produced very much published work on it.

(Yes, there is Eliezer's work on CEV, which is more like a placeholder and wishful thinking than anything more serious, and it has probably accidentally been the biggest bottleneck to work on (4) because so many people I talk to say things like "oh, we can just do CEV and be done with this, so let's worry about the real problems".)

I agree there is a risk it is an impossible problem, and I actually think it's quite high in that we may not be able to adequately aggregate human preferences in ways that result in something coherent. In that case I view safety and alignment as more about avoiding catastrophe and cutting down aligned AI solution space to remove the things that clearly don't work rather than building towards things that clearly do. I hope I'm being too pessimistic.

Vaniver:

In my experience, people mostly haven't had the view of "we can just do CEV, it'll be fine" and instead have had the view of "before we figure out what our preferences are, which is an inherently political and messy question, let's figure out how to load any preferences at all."

It seems like there needs to be some interplay here--"what we can load" informs "what shape we should force our preferences into" and "what shape our preferences actually are" informs "what loading needs to be capable of to count as aligned."

I wouldn't say it's neglected, just that people are busy laying foundation and that it's probably too early to tackle the problem directly. In particular, grounding the preferences of real-world agents is an obvious application for any potential theory of embedded agency. (At least the way I think about it, grounding models and preferences is the main problem of embedded agency.)

dxu:
1 and 2 are hard to succeed at without making a lot of progress on 4

It's not obvious to me why this ought to be the case. Could you elaborate?

Even if we succeeded at (1), it would be hard to know that we'd succeeded without progress on (4). If we're using one or more proxies, we don't have a way to talk about how accurate they are without (4) - we can't evaluate how closely the proxies match the thing they're supposed to proxy, without grounding that thing.

For (2), if we want to talk about "low-impact" or anything like it, then we need a grounding of what kind of impact we care about - and that question falls under (4). If we forget about some kind of impact that humans actually do care about, then we're in trouble.

Yep ^_^ I make those points in the research agenda (section 3).

Exactly. You explained it better than I could :)

I also am curious why this should be so.

I also continue to disagree with Stuart on low impact in particular being intractable without learning human values.

To be precise: I argue low impact is intractable without learning a subset of human values; the full set is not needed.

Thanks for clarifying! I haven't brought this up on your research agenda because I prefer to have the discussion during an upcoming sequence of mine, and it felt unfair to comment on your agenda, "I disagree but I won't elaborate right now".

Here's what I imagine a solution looks like: you have some giant finite element model. There may or may not be some agenty subsystems embedded in it. You pass it in to solution.py, and out pops a list of subsystems with approximately-agenty behavior, along with approximate utility functions and ranges of validity for each subsystem.
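
Purely to illustrate the kind of interface being imagined here (solution.py and everything below is hypothetical; the detection step is left as an unimplemented stub), a sketch might look like:

```python
# Hypothetical interface sketch only -- no such library exists.  The "giant
# finite element model" is stood in for by an arbitrary object, and finding
# agenty subsystems is left as a stub, since that search is the whole problem.

from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class AgentySubsystem:
    variables: list[str]                 # state variables the subsystem spans
    utility: Callable[[Any], float]      # approximate utility function over states
    validity_region: dict[str, tuple]    # region of state space where the fit holds
    fit_error: float                     # how approximately "agenty" it really is


def find_agenty_subsystems(model: Any) -> list[AgentySubsystem]:
    """Scan the model for subsystems whose behaviour is well-approximated by
    'optimising some utility function within some region', and return each
    with its approximate utility function and range of validity."""
    return []  # the actual search is the unsolved part
```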

Does it need a method of combining contradictory preferences etc? Only if it's actually constructing preferences as a distinct step to begin with. It could, for instance, just directly formulate "subsystem with utility function" as a functional equation, and then look for subsystems which approximately satisfy that functional equation within some region. That would implicitly combine contradictory preferences and underdefined preferences and so forth, but "combining preferences" wouldn't necessarily be a very useful way to think about it.

The hard part isn't synthesizing a utility function from preferences; the hard part is figuring out which part of the system to draw a box around, and what it means for that subsystem to have "preferences". Which part of the system even has preferences to begin with, and what's the physical manifestation of those preferences? By the time all that is worked out, it's entirely plausible that "preferences" won't even be a useful intermediate abstraction to think about.

The hard part isn't synthesizing a utility function from preferences; the hard part is figuring out which part of the system to draw a box around, and what it means for that subsystem to have "preferences". Which part of the system even has preferences to begin with, and what's the physical manifestation of those preferences? By the time all that is worked out, it's entirely plausible that "preferences" won't even be a useful intermediate abstraction to think about.

This is exactly the issue I've been concerning myself with lately: I think preferences as we typically model them are not a natural category and are instead better thought of as a complex illusion over some more primitive operation. I suspect it's something like error minimization and homeostasis, but that's just a working guess and I endeavor to be more confused before I become less confused.

Nonetheless, I also appreciate Stuart's work here formalizing this model in enough detail that maybe we can use it as a well-known starting point to build from, much as other theories that ultimately aren't quite right were right enough to get people working in the right part of problem/solution space.