When I say an AI A is aligned with an operator H, I mean:
A is trying to do what H wants it to do.
The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators.
This is significantly narrower than some other definitions of the alignment problem, so it seems important to clarify what I mean.
In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.
Analogy
Consider a human assistant who is trying their hardest to do what H wants.
I’d say this assistant is aligned with H. If we build an AI that has an analogous relationship to H, then I’d say we’ve solved the alignment problem.
“Aligned” doesn’t mean “perfect”:
- They could misunderstand an instruction, or be wrong about what H wants at a particular moment in time.
- They may not know everything about the world, and so fail to recognize that an action has a particular bad side effect.
- They may not know everything about H’s preferences, and so fail to recognize that a particular side effect is bad.
- They may build an unaligned AI (while attempting to build an aligned AI).
I use alignment as a statement about the motives of the assistant, not about their knowledge or ability. Improving their knowledge or ability will make them a better assistant — for example, an assistant who knows everything there is to know about H is less likely to be mistaken about what H wants — but it won’t make them more aligned.
(For very low capabilities it becomes hard to talk about alignment. For example, if the assistant can’t recognize or communicate with H, it may not be meaningful to ask whether they are aligned with H.)
Clarifications
- The definition is intended de dicto rather than de re. An aligned A is trying to “do what H wants it to do.” Suppose A thinks that H likes apples, and so goes to the store to buy some apples, but H really prefers oranges. I’d call this behavior aligned because A is trying to do what H wants, even though the thing it is trying to do (“buy apples”) turns out not to be what H wants: the de re interpretation is false but the de dicto interpretation is true.
- An aligned AI can make errors, including moral or psychological errors, and fixing those errors isn’t part of my definition of alignment except insofar as it’s part of getting the AI to “try to do what H wants” de dicto. This is a critical difference between my definition and some other common definitions. I think that using a broader definition (or the de re reading) would also be defensible, but I like it less because it includes many subproblems that I think (a) are much less urgent, (b) are likely to involve totally different techniques than the urgent part of alignment.
- An aligned AI would also be trying to do what H wants with respect to clarifying H’s preferences. For example, it should decide whether to ask if H prefers apples or oranges, based on its best guesses about how important the decision is to H, how confident it is in its current guess, how annoying it would be to ask, etc. (A toy numerical sketch of this tradeoff appears after this list.) Of course, it may also make a mistake at the meta level — for example, it may not understand when it is OK to interrupt H, and therefore avoid asking questions that it would have been better to ask.
- This definition of “alignment” is extremely imprecise. I expect it to correspond to some more precise concept that cleaves reality at the joints. But that might not become clear, one way or the other, until we’ve made significant progress.
- One reason the definition is imprecise is that it’s unclear how to apply the concepts of “intention,” “incentive,” or “motive” to an AI system. One naive approach would be to equate the incentives of an ML system with the objective it was optimized for, but this seems to be a mistake. For example, humans are optimized for reproductive fitness, but it is wrong to say that a human is incentivized to maximize reproductive fitness.
- “What H wants” is even more problematic than “trying.” Clarifying what this expression means, and how to operationalize it in a way that could be used to inform an AI’s behavior, is part of the alignment problem. Without additional clarity on this concept, we will not be able to build an AI that tries to do what H wants it to do.
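As a toy illustration of the ask-or-act tradeoff mentioned in the third bullet above: here is a minimal sketch of how an assistant might weigh the expected cost of acting on its current best guess against the cost of interrupting H with a question. The function name, probabilities, and costs are all made-up assumptions for illustration; nothing here is from the original post.

```python
# Toy value-of-information calculation: should the assistant ask H which
# fruit H prefers, or just act on its current best guess?
# All names and numbers below are hypothetical, chosen only for illustration.

def should_ask(p_best_guess_correct: float,
               cost_if_wrong: float,
               cost_of_asking: float) -> bool:
    """Ask iff the expected cost of acting on the best guess exceeds
    the cost (annoyance, delay) of interrupting H with a question."""
    expected_cost_of_guessing = (1 - p_best_guess_correct) * cost_if_wrong
    return expected_cost_of_guessing > cost_of_asking

# Example: the assistant is 70% sure H wants apples, getting it wrong costs
# 5 "units" of H's satisfaction, and interrupting H costs 1 unit.
print(should_ask(p_best_guess_correct=0.7, cost_if_wrong=5.0, cost_of_asking=1.0))
# -> True: the expected cost of guessing (1.5) exceeds the cost of asking (1.0).
```

A real assistant's version of this calculation would be far messier, but the structure is the same: weigh the value of the clarifying information against the cost of obtaining it.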
Postscript on terminological history
I originally described this problem as part of “the AI control problem,” following Nick Bostrom’s usage in Superintelligence, and used “the alignment problem” to mean “understanding how to build AI systems that share human preferences/values” (which would include efforts to clarify human preferences/values).
I adopted the new terminology after some people expressed concern with “the control problem.” There is also a slight difference in meaning: the control problem is about coping with the possibility that an AI would have different preferences from its operator. Alignment is a particular approach to that problem, namely avoiding the preference divergence altogether (so excluding techniques like “put the AI in a really secure box so it can’t cause any trouble”). There currently seems to be a tentative consensus in favor of this approach to the control problem.
I don’t have a strong view about whether “alignment” should refer to this problem or to something different. I do think that some term needs to refer to this problem, to separate it from other problems like “understanding what humans want,” “solving philosophy,” etc.
This post was originally published on 7th April 2018.
Longer form of my opinion:
Metaphilosophy is hard, and we need to solve it eventually. This might happen by default, i.e. if we simply build a well-motivated AI without thinking about metaphilosophy and without running any social interventions designed to get the AI's operators to think about metaphilosophy, humanity might still realize that metaphilosophy needs to be solved, and then go ahead and solve it. I'm quite unsure right now whether or not this will happen by default.
However, in the world where the AI's operators don't agree that we need to solve metaphilosophy, I am very pessimistic about the AI realizing that it should help us with metaphilosophy and doing so. The one way I could imagine it happening is by programming in the right utility function (not even learning it, since if you learn it then you probably learn that metaphilosophy doesn't need to be solved), which seems hopelessly doomed. It seems really hard to make an AI system where you can predict in advance that it will help us solve metaphilosophy regardless of the operator's wishes.
In the world where the AI's operators do agree that we need to solve metaphilosophy, I think we're in a much better position. A background assumption I have is that humans motivated to solve metaphilosophy will be able to do so given enough time -- I share Paul's intuition that humans who no longer have to worry about food, water, shelter, disease, etc. could deliberate for a long time and make progress. In that case, a well-motivated AI would be fine -- it would stay deferential, perhaps learn more things in order to become more competent, and do the things we ask it to do, which might include helping us in our deliberation by bringing up arguments we hadn't considered yet. (And note that a well-motivated AI should only bring up arguments it believes are true, or likely to be true.)
I've laid out two extreme ways the world could be, and of course there's a spectrum between them. But thinking about the extremes makes me think of this not as a part of AI alignment, but as a social coordination problem: we need humanity (especially the AI's operators) to agree that metaphilosophy is hard and needs to be solved. I'd support interventions that make this more likely, e.g. more public writing that talks about what we do after AGI, or about the possibility of a Great Deliberation before using the cosmic endowment, etc. If we succeed at that and at building a well-motivated AI system, I think that would be sufficient.
I mean something more like "don't do things that a human wouldn't do, that seem crazy from a human perspective". I'm not suggesting that the AI has a perfect understanding of what "irreversible" and "high-impact" mean. But it should be able to predict which things a human would find crazy, and for those it should probably get the human's approval before doing the thing. (As an analogy, most employees have a sense of what it is okay for them to take initiative on, vs. what they should get their manager's approval for.)
Yeah, I more mean something like "continuation of the status quo" rather than "irreversible high-impact", as TurnTrout talks about below.
I am not sure. I think it is relatively easy to look back at how we have responded to similar events in the past and notice that something is amiss -- for example, it seems relatively easy for an AGI to figure out that power corrupts and that humanity has not liked it when that happened, or that many humans don't like it when you take advantage of their motivational systems, and so to at least not be confident in the actions you mention. On the other hand, there may be similar types of events in the future that we can't anticipate by looking at the past. I don't know how to deal with these sorts of unknown unknowns.
I think sufficiently narrow AI systems have essentially no hope of solving or avoiding these problems in general, regardless of what safety techniques we develop, so in the short term, to avoid these problems, you want to intervene on the humans who are deploying AI systems.
Yeah, looking back I don't like that reason. I think I had an intuition that it wasn't an urgent problem and wanted to jot down a quick sentence to that effect, and the sentence came out wrong.
One reason it might not be urgent is that we need to aim for competitiveness anyway -- our AI systems need to be competitive so that economic incentives don't cause us to use unaligned variants.
We can also aim to have the world mostly run by aligned AI systems rather than unaligned ones, which would mean that there isn't much opportunity for us to be manipulated. You might have the intuition that even one unaligned AI could successfully manipulate everyone's values, and so we would still need the aligned AI systems to be able to defend against that. I'm not sure where I stand on that -- it seems possible to me that such manipulation is just very hard to pull off, especially when there are aligned superintelligent systems that would by default put a stop to it if they found out about it.
But really I'm just confused on this topic and would need to think more about it.