This is a great post! I know there have been lots of conversations here and elsewhere about this topic, often going for dozens of comments, and I felt like a lot of them needed summarising, or else they'd be lost to history. Thanks for summarising them briefly and linking back to them.
Thanks! Yeah, one of my motivations for this post is that I was losing track of these discussions myself and falling back into confusion that was already cleared up. For example, after reading one of Paul's latest clarifications, I had a strong feeling that he had told me that already on a previous occasion, but I couldn't remember when. Another push came from my discussion with Raymond Arnold (Raemon) about distillation where we talked about how it's weird to summarize a debate/disagreement as one of the participants, and it kind of made me realize that summarizing resolved confusions has less of this problem.
I'm not sure how "resolved" this confusion is, but I've gone back and forth a few times on what the core reason(s) are that we're supposed to expect IDA to create systems that won't do anything catastrophic: (1) because we're starting with human imitation / human approval which is safe, and the amplification step won't make it unsafe? (2) because "Corrigibility marks out a broad basin of attraction"? (3) because we're going to invent something along the lines of Techniques for optimizing worst-case performance? and/or (4) something else?
For example, in Challenges to Christiano’s capability amplification proposal Eliezer seemed to be under the impression that it's (1), but Paul replied that it was really (3), if I'm reading it correctly?
act-based = based on short-term preferences-on-reflection
For others who were confused about what "short-term preferences-on-reflection" would mean, I found this comment and its reply to be helpful.
Putting it into my own words: short-term preferences-on-reflection are about what you would want to happen in the near term, if you had a long time to think about it.
By way of illustration, AlphaZero's long-term preference is to win the chess game, its short-term preference is whatever its policy network spits out as the best move to make next, and its short-term preference-on-reflection is the move it wants to make next after doing a fuck-ton of MCTS.
Short-term preferences are the value function one or a few moves out. If the algorithm is "reasonable," then its short-term preferences-on-reflection are the true function P(I win the game | I make this move). You could also talk about intermediate degrees of reflection.
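To make the AlphaZero analogy concrete, here is a minimal sketch on a toy game rather than chess. Everything in it is an illustrative assumption and not anything from AlphaZero itself: the game is a small Nim variant (take 1 to 3 stones, taking the last stone wins), a fixed-depth exhaustive lookahead stands in for MCTS, and the names `win_estimate` and `preferred_move` are hypothetical. The `depth` parameter plays the role of "how long you get to think": depth 0 is the unreflective guess, and a depth at least as large as the pile gives the true P(I win | I make this move) under perfect play.

```python
MOVES = (1, 2, 3)  # toy rules (assumed): take 1-3 stones; taking the last stone wins


def legal_moves(stones):
    """Moves available to the player facing a pile of `stones`."""
    return [m for m in MOVES if m <= stones]


def win_estimate(stones, depth):
    """Estimated P(player to move wins) after `depth` plies of lookahead.

    With no lookahead left the estimate is an uninformed 0.5; this is a
    crude stand-in for a learned value function, not a real one.
    """
    if stones == 0:
        return 0.0  # the previous player took the last stone, so we already lost
    if depth == 0:
        return 0.5  # out of "reflection": no idea who is winning
    # Our chance of winning is one minus the opponent's chance after our best move.
    return max(1.0 - win_estimate(stones - m, depth - 1) for m in legal_moves(stones))


def preferred_move(stones, depth):
    """Move preferred after `depth` plies of reflection.

    Ties are broken toward taking more stones, which acts as the
    unreflective "greedy" prior when lookahead gives no information.
    """
    return max(
        legal_moves(stones),
        key=lambda m: (1.0 - win_estimate(stones - m, depth), m),
    )


if __name__ == "__main__":
    for depth in (0, 1, 3, 10):
        print(f"depth={depth}: with 9 stones, prefer taking {preferred_move(9, depth)}")
```

In this toy, the long-term preference is "win the game", the depth-0 choice corresponds to the short-term preference, and the large-depth choice corresponds to the short-term preference-on-reflection. Values of `depth` in between are the "intermediate degrees of reflection" mentioned above.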
The existing literature on IDA (including a post about "reward engineering") seems to have neglected to describe an outer alignment problem associated with using RL for distillation. (Analogous problems may also exist if using other ML techniques such as SL.) Source
I'm confused about what outer alignment problems might exist when using supervised learning for distillation (though maybe this is just due to me using an incorrect/narrower interpretation of "outer alignment problems" or "using supervised learning for distillation").
I still feel confused about "distill ≈ RL". In RL+Imitation (which I assume is also talking about distillation, and which was written after Semi-supervised reinforcement learning), Paul says things like "In the same way that we can reason about AI control by taking as given a powerful RL system or powerful generative modeling, we could take as given a powerful solution to RL+imitation. I think that this is probably a better assumption to work with" and "Going forward, I’ll preferentially design AI control schemes using imitation+RL rather than imitation, episodic RL, or some other assumption".
Was there a later place where Paul went back to just RL? Or is RL+Imitation about something other than distillation? Or is the imitation part such a small contribution that writing "distill ≈ RL" is still accurate?
ETA: From the FAQ for Paul's agenda:
1.2.2: OK, so given this amplified aligned agent, how do you get the distilled agent?
Train a new agent via some combination of imitation learning (predicting the actions of the amplified aligned agent), semi-supervised reinforcement learning (where the amplified aligned agent helps specify the reward), and techniques for optimizing robustness (e.g. creating red teams that generate scenarios that incentivize subversion).
and:
The imitation learning is more about getting this new agent off the ground than about ensuring alignment. The bulk of the alignment guarantee comes from the semi-supervised reinforcement learning, where we train it to work on a wide range of tasks and answer questions about its cognition.
At some point Paul used "short-term preferences" and "narrow preferences" interchangeably, but no longer does (or at least no longer endorses doing so).
I would like to have these two terms defined. Let me offer my understanding from reading the relevant thread.
Short-term preferences refer to the most useful action I can take next, given my ultimate goals. This is to be contrasted with my current best guess about the outcome of that process. It's what I would want, not what I do want.
An AI optimising for my short-term preferences may reasonably say "No, don't take this action, because you'd actually prefer this alternative action if you only thought longer. It fits your true short-term preferences, you're just mistaken about them." This is in contrast with something you might call narrow preferences, which is where you tell the AI to do what you said anyway.
My understanding is that Paul never meant to introduce the term "narrow preferences" (i.e. "narrow" is not an adjective that applies to preferences), and the fact that he talked about narrow preferences in the act-based agents post was an accident/something he no longer endorses.
Instead, when Paul says "narrow", he's talking not about preferences but about narrow vs ambitious value learning. This is what Paul means when he says "I've only ever used [the term "narrow"] in the context of value learning, in order to make this particular distinction between two different goals you might have when doing value learning."
See also this comment and the ambitious vs narrow value learning post.
Oh, okay. Is it not important to have a name for the class of thing we could accidentally train an ML system to optimise for that isn't our ultimate preferences? Is there a term for that?
I think Paul calls that "preferences-as-elicited", so if we're talking about act-based agents, it would be "short-term preferences-as-elicited" (see this comment).
Seems odd to have the idealistic goal get to be the standard name, and the dime-a-dozen failure mode be a longer name that is more confusing.
I note that Wei says a similar thing happened to 'act-based':
My understanding is that "act-based agent" used to mean something different (i.e., a simpler kind of AI that tries to do the same kind of action that a human would), but most people nowadays use it to mean an AI that is designed to satisfy someone's short-term preferences-on-reflection, even though that no longer seems particularly "act-based".
Is there a reason why the standard terms are not being used to refer to the standard, short-term results?
(I suppose that economics assumes rational agents who know their preferences, so taking language from economics might lead to this situation with the 'short-term preferences' decision.)
In the post Wei contrasts "current" and "actual" preferences. "Stated" vs "reflective" preferences also seem like nice alternatives.
Seems odd to have the idealistic goal get to be the standard name, and the dime-a-dozen failure mode be a longer name that is more confusing.
I agree this is confusing.
Is there a reason why the standard terms are not being used to refer to the standard, short-term results?
As far as I know, Paul hasn't explained his choice in detail. One reason he does mention, in this comment, is that in the context of strategy-stealing, preferences like "help me stay in control and be well-informed" do not make sense when interpreted as preferences-as-elicited, since the current user has no way to know if they are in control or well-informed.
In the post Wei contrasts "current" and "actual" preferences. "Stated" vs "reflective" preferences also seem like nice alternatives.
I think current=elicited=stated, but actual≈reflective (because there is the possibility that undergoing reflection isn't a good way to find out our actual preferences, or as Paul says 'There’s a hypothesis that “what I’d say after some particular idealized process of reflection” is a reasonable way to capture “actual preferences,” but I think that’s up for debate—e.g. it could fail if me-on-reflection is selfish and has values opposed to current-me, and certainly it could fail for any particular process of reflection and so it might just happen to be the case that there is no process of reflection that satisfies it.')
As far as I know, Paul hasn't explained his choice in detail. One reason he does mention, in this comment, is that in the context of strategy-stealing, preferences like "help me stay in control and be well-informed" do not make sense when interpreted as preferences-as-elicited, since the current user has no way to know if they are in control or well-informed.
I agree this example adds nuance, and I'm unsure how to correctly categorise it.
You have a section titled
learning user preferences for corrigibility isn't enough for corrigible behavior
Would this be more consistently titled "Learning narrow preferences for corrigibility isn't enough for corrigible behavior"?
I understand Paul to be saying that he hopes that corrigibility will fall out if we train an AI to score well on your short-term preferences, not just your narrow preferences.
AI Alignment is a confusing topic in general, but even compared to other alignment topics, IDA seems especially confusing. Some of it is surely just due to the nature of communicating subtle and unfinished research ideas, but other confusions can be cleared up with more specific language or additional explanations. To help people avoid some of the confusions I or others fell into in the past while trying to understand IDA (and to remind myself about them in the future), I came up with this list of past confusions that I think have mostly been resolved at this point. (However, there's some chance that I'm still confused about some of these issues and just don't realize it. I've included references to the original discussions where I think the confusions were cleared up so you can judge for yourself.)
I will try to maintain this list as a public reference so please provide your own resolved confusions in the comments.
alignment = intent alignment
At some point Paul started using "alignment" to refer to the top-level problem that he is trying to solve, and this problem is narrower (i.e., leaves more safety problems to be solved elsewhere) than the problem that other people were using "alignment" to describe. He eventually settled upon "intent alignment" as the formal term to describe his narrower problem, but occasionally still uses just "aligned" or "alignment" as shorthand for it. Source
short-term preferences ≠ narrow preferences
At some point Paul used "short-term preferences" and "narrow preferences" interchangeably, but no longer does (or at least no longer endorses doing so). Source
preferences = "actual" preferences (e.g., preferences-on-reflection)
When Paul talks about preferences he usually means "actual" preferences (for example the preferences someone would arrive at after having a long time to think about it while having access to helpful AI assistants, if that's a good way to find someone's "actual" preferences). He does not mean their current revealed preferences or the preferences they would state or endorse now if you were to ask them. Source
corrigibility ≠ based on short-term preferences
I had misunderstood Paul to be using "corrigibility to X" as synonymous with "based on X's short-term preferences". Actually "based on X's short-term preferences" is a way to achieve corrigibility to X, because X's short-term preferences likely include "be corrigible to X" as a preference. "Corrigibility" itself means something like "allows X to modify the agent" or a generalization of this concept. Source
act-based = based on short-term preferences-on-reflection
My understanding is that "act-based agent" used to mean something different (i.e., a simpler kind of AI that tries to do the same kind of action that a human would), but most people nowadays use it to mean an AI that is designed to satisfy someone's short-term preferences-on-reflection, even though that no longer seems particularly "act-based". Source
act-based corrigibility
Evan Hubinger used "act-based corrigibility" to mean both a method of achieving corrigibility (based on short-term preferences) and the kind of corrigibility achieved by that method. (I'm not sure if he still endorses using the term this way.) Source
learning user preferences for corrigibility isn't enough for corrigible behavior
Because an act-based agent is about "actual" preferences not "current" preferences, it may be incorrigible even if it correctly learns that the user currently prefers the agent to be corrigible, if it incorrectly infers or extrapolates the user's "actual" preferences, or if the user's "actual" preferences do not actually include corrigibility as a preference. (ETA: Although in the latter case presumably the "actual" preferences include something even better than corrigibility.) Source
distill ≈ RL
Summaries of IDA often describe the "distill" step as using supervised learning, but Paul and others working on IDA today usually have RL in mind for that step. Source
outer alignment problem exists? = yes
The existing literature on IDA (including a post about "reward engineering") seems to have neglected to describe an outer alignment problem associated with using RL for distillation. (Analogous problems may also exist if using other ML techniques such as SL.) Source
corrigible to the user? ≈ no
IDA is typically described as being corrigible to the user. But in reality it would be trying to satisfy a combination of preferences coming from the end user, the AI developer/overseer, and even law enforcement or other government agencies. I think this means that "corrigible to the user" is very misleading, because the AI is actually not likely to respect the user's preferences to modify (most aspects of) the AI or to be "in control" of the AI. Sources: this comment and a talk by Paul at an AI safety workshop
strategy stealing ≠ literally stealing strategies
When Paul says "strategy stealing" he doesn't mean observing and copying someone else's strategy. It's a term borrowed from game theory that he's using to refer to coming up with strategies that are as effective as someone else's strategy in terms of gaining resources and other forms of flexible influence. Source