Ultimately, our goal is to build AI systems that do what we want them to do. One way of decomposing this is first to define the behavior that we want from an AI system, and then to figure out how to obtain that behavior, which we might call the definition-optimization decomposition. Ambitious value learning aims to solve the definition subproblem. I interpret this post as proposing a different decomposition of the overall problem. One subproblem is how to build an AI system that is trying to do what we want, and the second subproblem is how to make the AI competent enough that it actually does what we want. I like this motivation-competence decomposition for a few reasons:
This is a great comment, and maybe it should even be its own post. It clarified a bunch of things for me, and I think was the best concise argument for "we should try to build something that doesn't look like an expected utility maximizer" that I've read so far.
I agree with habryka that this is a really good explanation. I also agree with most of your pros and cons, but for me another major con is that this decomposition moves some problems that I think are crucial and urgent out of "AI alignment" and into the "competence" part, with the implicit or explicit implication that they are not as important. One example is the problem of obtaining, or helping humans to obtain, a better understanding of their values, and of defending those values against manipulation by other AIs.
In other words, the motivation-competence decomposition seems potentially very useful to me as a way to break down a larger problem into smaller parts so it can be solved more easily, but I don't agree that the urgent/not-urgent divide lines up neatly with the motivation/competence divide.
Aside from the practical issue of confusion between different usages of "AI alignment" (I think others like MIRI had been using "AI alignment" in a broader sense before Paul came up with his narrower definition), even using "AI alignment" in a context where it's clear that I'm using Paul's definition gives me the feeling that I'm implicitly agreeing to his understanding of how various subproblems should be prioritized.
Aside from the practical issue of confusion between different usages of "AI alignment" (I think others like MIRI had been using "AI alignment" in a broader sense before Paul came up with his narrower definition)
I switched to this usage of AI alignment in 2017, after an email thread involving many MIRI people where Rob suggested using "AI alignment" to refer to what Bostrom calls the "second principal-agent problem" (he objected to my use of "control"). I think I misunderstood what Rob intended in that discussion, but my definition is meant to be in line with it: if the agent is trying to do what the principal wants, it seems like you've solved the principal-agent problem. I think the main way this definition is narrower than what was discussed in that email thread is by excluding things like boxing.
In practice, essentially all of MIRI's work seems to fit within this narrower definition, so I'm not too concerned at the moment with this practical issue (I don't know of any work MIRI feels strongly about that doesn't fit in this definition). We had a thread about this after it came up on LW in April, where we...
I’m happy to clarify my opinion if it seems useful to you—perhaps my way of stating things will click where Paul’s way didn’t
I would highly welcome that. BTW if you see me argue with Paul in the future (or in the past) and I seem to be not getting something, please feel free to jump in and explain it a different way. I often find it easier to understand one of Paul's ideas from someone else's explanation.
it seems likely to be easy to get a well motivated AI system to realize that it should help us understand our values
Yes, that seems easy, but actually helping seems much harder.
and that it should not do irreversible high-impact actions until then
How do you determine what is "high-impact" before you have a utility function? Even "reversible" is relative to a utility function, right? It doesn't mean that you literally can reverse all the consequences of an action, but rather that you can reverse the impact of that action on your utility?
It seems to me that "avoid irreversible high-impact actions" would only work if one had a small amount of uncertainty over one's utility function, in which case you could just avoid actions that are considered "irreversible high-impact" by
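To make the point concrete, here is a tiny toy numerical sketch (the `loss` matrix, prior, and threshold are all my own illustrative assumptions, not anything from the discussion): if "impact" is measured as value irreversibly lost according to a utility function, then under broad uncertainty over candidate utilities there may be no action that counts as low-impact for every credible candidate.

```python
import numpy as np

# Toy sketch: "impact" of an action is only defined relative to a utility function,
# so under wide uncertainty over utilities almost every action may be high-impact
# according to *some* credible candidate utility.

# Rows: candidate utility functions; columns: actions. Entry = value irreversibly
# lost (in that utility's terms) by taking the action.
loss = np.array([[0.0, 0.1, 9.0],
                 [0.1, 8.0, 0.0],
                 [7.0, 0.0, 0.2]])
prior = np.array([0.4, 0.4, 0.2])   # credence over the candidate utilities
threshold = 1.0                      # "high-impact" cutoff

safe_for_all = (loss <= threshold).all(axis=0)
print(safe_for_all)    # [False False False]: no action is low-impact under every candidate
print(loss.T @ prior)  # expected loss per action if we instead average over the prior
```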
...How to prevent "aligned" AIs from unintentionally corrupting human values? We know that ML systems tend to have problems with adversarial examples and distributional shifts in general. There seems to be no reason not to expect that human value functions have similar problems, which even "aligned" AIs could trigger unless they are somehow designed not to. For example, such AIs could give humans so much power so quickly or put them in such novel situations that their moral development can't keep up, so their value systems no longer give sensible answers. (Sort of the AI assisted version of the classic "power corrupts" problem.) AIs could give us new options that are irresistible to some parts of our motivational systems, like more powerful versions of video game and social media addiction. Even in the course of trying to figure out how the world could be made better for us, they could in effect be searching for adversarial examples on our value functions. Finally, at our own request or in a sincere attempt to help us, they could generate philosophical or moral arguments that are wrong but extremely persuasive.
My position on this (that might be clear...
In this essay Paul Christiano proposes a definition of "AI alignment" which is narrower than other definitions that are often employed. Specifically, Paul suggests defining alignment in terms of the motivation of the agent (which should be helping the user), rather than what the agent actually does. That is, as long as the agent "means well", it is aligned, even if errors in its assumptions about the user's preferences or about the world at large lead it to actions that are bad for the user.
Rohin Shah's comment on the essay (which I believe is endorsed by Paul) reframes it as a particular way to decompose the AI safety problem. An often used decomposition is "definition-optimization": first we define what it means for an AI to be safe, then we understand how to implement a safe AI. In contrast, Paul's definition of alignment decomposes the AI safety problem as "motivation-competence": first we learn how to design AIs with good motivations, then we learn how to make them competent. Both Paul and Rohin argue that the "motivation" is the urgent part of the problem, the part on which technical AI safety research should focus.
In contrast, I will argue that the "motivation-competence
...This opens the possibility of agents whose "well intentioned" mistakes take the form of sophisticated plans that are catastrophic for the user.
Agreed that this is in theory possible, but it would be quite surprising, especially if we are specifically aiming to train systems that behave corrigibly.
The acausal attack is an example of how it can happen for systematic reasons. As for the other part, that seems like conceding that intent-alignment is insufficient and you need "corrigibility" as another condition (also it is not so clear to me what this condition means).
If Alpha can predict that the user would say not to do the irreversible action, then at the very least it isn't corrigible, and it would be rather hard to argue that it is intent aligned.
It is possible that Alpha cannot predict it, because in Beta-simulation-world the user would confirm the irreversible action. It is also possible that the user would confirm the irreversible action in the real world because the user is being manipulated, and whatever defenses we put in place against manipulation are thrown off by the simulation hypothesis.
Now, I do believe that if you set up the prior correctly then i
...I think that intent alignment is too ill-defined, and to the extent it is well-defined it is a very weak condition that is not sufficient to address the urgent core of the problem.
Okay, so there seem to be two disagreements:
The first one seems primarily about our disagreements on the utility of theory, which I'll get to later.
For the second one, I don't know what your argument is that the non-intent-alignment work is urgent. I agree that the simulation example you give is an example of how flawed epistemology can systematically lead to x-risk. I don't see the argument that it is very likely (maybe the first few AGIs don't think about simulations; maybe it's impossible to construct such a convincing hypothesis). I especially don't see the argument that it is more likely than the failure mode in which a goal-directed AGI is optimizing for something different from what humans want.
(You might respond that intent alignment brings risk down from say 10% to 3%, whereas your agenda brings risk down from 10% to 1%. My response would be that once we have successfully figured out...
For the second one, I don't know what your argument is that the non-intent-alignment work is urgent. I agree that the simulation example you give is an example of how flawed epistemology can systematically lead to x-risk. I don't see the argument that it is very likely.
First, even working on unlikely risks can be urgent, if the risk is great and the time needed to solve it might be long enough compared to the timeline until the risk. Second, I think this example shows that it is far from straightforward to even informally define what intent-alignment is. Hence, I am skeptical about the usefulness of intent-alignment.
For a more "mundane" example, take IRL. Is IRL intent aligned? What if its assumptions about human behavior are inadequate and it ends up inferring an entirely wrong reward function? Is it still intent-aligned since it is trying to do what the user wants, it is just wrong about what the user wants? Where is the line between "being wrong about what the user wants" and optimizing something completely unrelated to what the user wants?
It seems like intent-alignment depends on our interpretation of what the algorithm does, rather than only on the algorithm itself. But actual
...I do think that some term needs to refer to this problem, to separate it from other problems like “understanding what humans want,” “solving philosophy,” etc.
Worth noting here that (it looks like) Paul eventually settled upon "intent alignment" as the term for this.
I hadn't realized this post was nominated partially because of my comment, so here's a late review. I basically continue to agree with everything I wrote then, and I continue to like this post for those reasons, and so I support including it in the LW Review.
Since writing the comment, I've come across another argument for thinking about intent alignment -- it seems like a "generalization" of assistance games / CIRL, which itself seems like a formalization of an aligned agent in a toy setting. In assistance games, the agent explicitly maintains a distribution over possible human reward functions, and instrumentally gathers information about human preferences by interacting with the human. With intent alignment, since the agent is trying to help the human, we expect the agent to instrumentally maintain a belief over what the human cares about, and gather information to refine this belief. We might hope that there are ways to achieve intent alignment that instrumentally incentivize all the nice behaviors of assistance games, without requiring the modeling assumptions that CIRL does (e.g. that the human has a fixed known reward function).
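To gesture at the contrast, here is a minimal sketch of the explicit belief-maintenance that assistance games require (a toy Bayesian update of my own construction, not the actual CIRL formalism; the candidate rewards, `beta`, and the observed choices are all assumed). The hope described above is that an intent-aligned agent would do something like this instrumentally, without us baking the modeling assumptions in.

```python
import numpy as np

# Toy sketch (not the CIRL formalism itself): maintain a posterior over candidate human
# reward functions and update it from observed human choices, assuming a noisily-rational human.

candidate_rewards = np.array([
    [1.0, 0.0, 0.0],   # hypothesis 0: the human only values option 0
    [0.0, 1.0, 0.0],   # hypothesis 1: the human only values option 1
    [0.3, 0.3, 0.4],   # hypothesis 2: the human values all options somewhat
])
posterior = np.full(len(candidate_rewards), 1.0 / len(candidate_rewards))
beta = 3.0  # assumed human rationality

def choice_likelihood(choice, reward):
    p = np.exp(beta * reward)
    return (p / p.sum())[choice]

for human_choice in [1, 1, 2, 1]:  # observed human picks among the three options
    posterior *= np.array([choice_likelihood(human_choice, r) for r in candidate_rewards])
    posterior /= posterior.sum()

print(posterior)  # mass concentrates on the hypotheses most consistent with the human's choices
```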
Changes I'd make...
I think that using a broader definition (or the de re reading) would also be defensible, but I like it less because it includes many subproblems that I think (a) are much less urgent, (b) are likely to involve totally different techniques than the urgent part of alignment.
I think it would be helpful for understanding your position and what you mean by "AI alignment" to have a list or summary of those other subproblems and why you think they're much less urgent. Can you link to or give one here?
Also, do you have a preferred term for the broader definition, or the de re reading? What should we call those things if not "AI alignment"?
Is there a concept of a safe, partially aligned AI? One that recognizes the limits of its own understanding of the human (or of humanity) and restricts its actions to those it knows, with high probability, fall within those limits?
Nominating this primarily for Rohin’s comment on the post, which was very illuminating.
Crystallized my view of what the "core problem" is (as I explained in a comment on this post). I think I had intuitions of this form before, but at the very least this post clarified them.
Thank you Paul, this post clarifies many open points related to AI (inner) alignment, including some of its limits!
I recently described a technique called control vectors to force an LLM to show specific dispositional traits, in order to condition some form of alignment (but definitely not true alignment).
I'd be happy to be challenged! In my opinion, the importance of control vectors is definitely underestimated for AI safety. https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits
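For readers unfamiliar with the technique, here is a rough sketch of the activation-steering idea as I understand it (this is not the linked post's code; the model, layer index, scale, and the random placeholder vector are all assumptions, and in practice the control vector would be derived from contrasting prompts rather than sampled randomly):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of steering with a "control vector": add a fixed direction to one layer's
# hidden states at generation time to push the model toward a dispositional trait.

model_name = "gpt2"  # small model, purely for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx, scale = 6, 4.0
control_vector = torch.randn(model.config.hidden_size)  # placeholder; normally a learned/derived trait direction
control_vector = control_vector / control_vector.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple; output[0] holds the hidden states (batch, seq, hidden)
    hidden = output[0] + scale * control_vector
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steer)
ids = tok("I think the plan is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()  # remove the hook to restore normal behavior
```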
I do agree that an AI that is underdeveloped in terms of its goals, yet allowed to exist, is all too likely to become an ethical and/or existential catastrophe, but I have a few questions.
I'm not tech savvy, and I'm well aware that it may be a lack of understanding that lets me live without fear of AI, but it seems an important issue around here and I would like to have some understanding. And a little context on my perspective: I grew up in the shadow of the Cold War, i.e. mutually assured destruction in 6 minutes or less (it might have been 12 minutes; I can't quite remember anymore).
This post caught my eye on the review list.
I need to clarify something before reading forward.
getting your AI to try to do the right th...
Are there any plans to generalize this kind of alignment later to include CEV or some other plausible metaethics, or should this be "the final stop"?
When I say an AI A is aligned with an operator H, I mean:

A is trying to do what H wants it to do.
The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators.
This is significantly narrower than some other definitions of the alignment problem, so it seems important to clarify what I mean.
In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.
Analogy
Consider a human assistant who is trying their hardest to do what H wants.
I’d say this assistant is aligned with H. If we build an AI that has an analogous relationship to H, then I’d say we’ve solved the alignment problem.
“Aligned” doesn’t mean “perfect”:
I use alignment as a statement about the motives of the assistant, not about their knowledge or ability. Improving their knowledge or ability will make them a better assistant — for example, an assistant who knows everything there is to know about H is less likely to be mistaken about what H wants — but it won’t make them more aligned.
(For very low capabilities it becomes hard to talk about alignment. For example, if the assistant can’t recognize or communicate with H, it may not be meaningful to ask whether they are aligned with H.)
Clarifications
Postscript on terminological history
I originally described this problem as part of “the AI control problem,” following Nick Bostrom’s usage in Superintelligence, and used “the alignment problem” to mean “understanding how to build AI systems that share human preferences/values” (which would include efforts to clarify human preferences/values).
I adopted the new terminology after some people expressed concern with “the control problem.” There is also a slight difference in meaning: the control problem is about coping with the possibility that an AI would have different preferences from its operator. Alignment is a particular approach to that problem, namely avoiding the preference divergence altogether (so excluding techniques like “put the AI in a really secure box so it can’t cause any trouble”). There currently seems to be a tentative consensus in favor of this approach to the control problem.
I don’t have a strong view about whether “alignment” should refer to this problem or to something different. I do think that some term needs to refer to this problem, to separate it from other problems like “understanding what humans want,” “solving philosophy,” etc.
This post was originally published here on 7th April 2018.
The next post in this sequence will go up on Saturday, and will be "An Unaligned Benchmark" by Paul Christiano.
Tomorrow's AI Alignment Sequences post will be the first in a short new sequence of technical exercises from Scott Garrabrant.