List of resolved confusions about IDA

Wei Dai

30th Sep 2019

AI alignment is a confusing topic in general, but IDA (Iterated Distillation and Amplification) seems especially confusing even compared to other alignment topics. Some of this is surely due to the nature of communicating subtle and unfinished research ideas, but other confusions can be cleared up with more specific language or additional explanations. To help people avoid some of the confusions I or others fell into in the past while trying to understand IDA (and to remind myself about them in the future), I came up with this list of past confusions that I think have mostly been resolved at this point. (However, there's some chance that I'm still confused about some of these issues and just don't realize it. I've included references to the original discussions where I think the confusions were cleared up so you can judge for yourself.)

I will try to maintain this list as a public reference, so please provide your own resolved confusions in the comments.

alignment = intent alignment

At some point Paul started using "alignment" to refer to the top-level problem that he is trying to solve, and this problem is narrower (i.e., leaves more safety problems to be solved elsewhere) than the problem that other people were using "alignment" to describe. He eventually settled upon "intent alignment" as the formal term to describe his narrower problem, but occasionally still uses just "aligned" or "alignment" as shorthand for it. Source

short-term preferences ≠ narrow preferences

At some point Paul used "short-term preferences" and "narrow preferences" interchangeably, but no longer does (or at least no longer endorses doing so). Source

preferences = "actual" preferences (e.g., preferences-on-reflection)

When Paul talks about preferences he usually means "actual" preferences (for example the preferences someone would arrive at after having a long time to think about it while having access to helpful AI assistants, if that's a good way to find someone's "actual" preferences). He does not mean their current revealed preferences or the preferences they would state or endorse now if you were to ask them. Source

corrigibility ≠ based on short-term preferences

I had misunderstood Paul to be using "corrigibility to X" as synonymous with "based on X's short-term preferences". Actually "based on X's short-term preferences" is a way to achieve corrigibility to X, because X's short-term preferences likely include "be corrigible to X" as a preference. "Corrigibility" itself means something like "allows X to modify the agent" or a generalization of this concept. Source

act-based = based on short-term preferences-on-reflection

My understanding is that "act-based agent" used to mean something different (i.e., a simpler kind of AI that tries to do the same kind of action that a human would), but most people nowadays use it to mean an AI that is designed to satisfy someone's short-term preferences-on-reflection, even though that no longer seems particularly "act-based". Source

act-based corrigibility

Evan Hubinger used "act-based corrigibility" to mean both a method of achieving corrigibility (based on short-term preferences) and the kind of corrigibility achieved by that method. (I'm not sure if he still endorses using the term this way.) Source

learning user preferences for corrigibility isn't enough for corrigible behavior

Because an act-based agent is about "actual" preferences not "current" preferences, it may be incorrigible even if it correctly learns that the user currently prefers the agent to be corrigible, if it incorrectly infers or extrapolates the user's "actual" preferences, or if the user's "actual" preferences do not actually include corrigibility as a preference. (ETA: Although in the latter case presumably the "actual" preferences include something even better than corrigibility.) Source

distill ≈ RL

Summaries of IDA often describe the "distill" step as using supervised learning, but Paul and others working on IDA today usually have RL in mind for that step. Source

outer alignment problem exists? = yes

The existing literature on IDA (including a post about "reward engineering") seems to have neglected to describe an outer alignment problem associated with using RL for distillation. (Analogous problems may also exist if using other ML techniques such as SL.) Source

corrigible to the user? ≈ no

IDA is typically described as being corrigible to the user. But in reality it would be trying to satisfy a combination of preferences coming from the end user, the AI developer/overseer, and even law enforcement or other government agencies. I think this means that "corrigible to the user" is very misleading, because the AI is actually not likely to respect the user's preferences to modify (most aspects of) the AI or to be "in control" of the AI. Sources: this comment and a talk by Paul at an AI safety workshop

strategy stealing ≠ literally stealing strategies

When Paul says "strategy stealing" he doesn't mean observing and copying someone else's strategy. It's a term borrowed from game theory that he's using to refer to coming up with strategies that are as effective as someone else's strategy in terms of gaining resources and other forms of flexible influence. Source


Ben Pace (6y)

This is a great post! I know there's been lots of conversations here and elsewhere about this topic, often going for dozens of comments, and I felt like a lot of them needed summarising else they'd be lost to history. Thanks for summarising them briefly and linking back to them.

Wei Dai (6y)

Thanks! Yeah, one of my motivations for this post is that I was losing track of these discussions myself and falling back into confusion that was already cleared up. For example, after reading one of Paul's latest clarifications, I had a strong feeling that he had told me that already on a previous occasion, but I couldn't remember when. Another push came from my discussion with Raymond Arnold (Raemon) about distillation where we talked about how it's weird to summarize a debate/disagreement as one of the participants, and it kind of made me realize that summarizing resolved confusions has less of this problem.

Ben Pace (6y)

Curated.

Steven Byrnes (6y)

I'm not sure how "resolved" this confusion is, but I've gone back and forth a few times on what's the core reason(s) that we're supposed to expect IDA to create systems that won't do anything catastrophic: (1) because we're starting with human imitation / human approval which is safe, and the amplification step won't make it unsafe? (2) because "Corrigibility marks out a broad basin of attraction"? (3) because we're going to invent something along the lines of Techniques for optimizing worst-case performance? and/or (4) something else?

For example, in Challenges to Christiano’s capability amplification proposal, Eliezer seemed to be under the impression that it's (1), but Paul replied that it was really (3), if I'm reading it correctly?

ESRogs (6y)

act-based = based on short-term preferences-on-reflection

For others who were confused about what "short-term preferences-on-reflection" would mean, I found this comment and its reply to be helpful.

Putting it into my own words: short-term preferences-on-reflection are about what you would want to happen in the near term, if you had a long time to think about it.

By way of illustration, AlphaZero's long-term preference is to win the chess game, its short-term preference is whatever its policy network spits out as the best move to make next, and its short-term preference-on-reflection is the move it wants to make next after doing a fuck-ton of MCTS.

paulfchristiano (6y)

By way of illustration, AlphaZero's long-term preference is to win the chess game, its short-term preference is whatever its policy network spits out as the best move to make next, and its short-term preference-on-reflection is the move it wants to make next after doing a fuck-ton of MCTS.

Short-term preferences are the value function one or a few moves out. If the algorithm is "reasonable," then its short-term preference-on-reflection is the true function P(I win the game|I make this move). You could also talk about intermediate degrees of reflection.
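As a toy illustration of the value-function framing above (a sketch only: the subtraction game, the flat `heuristic` stand-in for a learned value estimate, and every function name here are assumptions made for illustration, not anything taken from AlphaZero or from the thread), search depth can play the role of "degree of reflection". At depth 0 the preferred move is just whatever the crude estimate favors; with enough depth the estimate becomes the true P(I win the game | I make this move).

```python
from functools import lru_cache

# Toy game (an assumption for illustration): players alternately remove
# 1-3 stones from a pile; whoever takes the last stone wins.
MOVES = (1, 2, 3)


def heuristic(stones):
    """Hypothetical stand-in for a learned value estimate: knows nothing."""
    return 0.5


@lru_cache(maxsize=None)
def win_prob(stones, depth):
    """Estimated P(player to move wins), looking `depth` plies ahead."""
    if stones == 0:
        return 0.0  # the opponent just took the last stone
    if depth == 0:
        return heuristic(stones)  # out of reflection: fall back on the guess
    return max(1.0 - win_prob(stones - m, depth - 1)
               for m in MOVES if m <= stones)


def preferred_move(stones, depth):
    """Move maximizing estimated P(I win | I make this move)."""
    legal = [m for m in MOVES if m <= stones]
    return max(legal, key=lambda m: 1.0 - win_prob(stones - m, depth))


if __name__ == "__main__":
    print(preferred_move(6, 0))  # no reflection: all moves look alike, picks 1
    print(preferred_move(6, 6))  # full reflection: picks 2, leaving 4 stones
```

Intermediate values of `depth` correspond to intermediate degrees of reflection: in this toy game two plies of lookahead already recover the move that full search prefers from a pile of 6.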

riceissa (6y)

I used to think that after the initial distillation step, the AI would be basically human-level. Now I understand that after the initial distillation step, the AI will be superhuman in some respects and subhuman in others, but wouldn't be "basically human" in any sense. Source

Ofer (6y)

The existing literature on IDA (including a post about "reward engineering") seems to have neglected to describe an outer alignment problem associated with using RL for distillation. (Analogous problems may also exist if using other ML techniques such as SL.) Source

I'm confused about what outer alignment problems might exist when using supervised learning for distillation (though maybe this is just due to me using an incorrect/narrower interpretation of "outer alignment problems" or "using supervised learning for distillation").

riceissa (6y, edited)

I still feel confused about "distill ≈ RL". In RL+Imitation (which I assume is also talking about distillation, and which was written after Semi-supervised reinforcement learning), Paul says things like "In the same way that we can reason about AI control by taking as given a powerful RL system or powerful generative modeling, we could take as given a powerful solution to RL+imitation. I think that this is probably a better assumption to work with" and "Going forward, I’ll preferentially design AI control schemes using imitation+RL rather than imitation, episodic RL, or some other assumption".

Was there a later place where Paul went back to just RL? Or is RL+Imitation about something other than distillation? Or is the imitation part such a small contribution that writing "distill ≈ RL" is still accurate?

ETA: From the FAQ for Paul's agenda:

1.2.2: OK, so given this amplified aligned agent, how do you get the distilled agent?

Train a new agent via some combination of imitation learning (predicting the actions of the amplified aligned agent), semi-supervised reinforcement learning (where the amplified aligned agent helps specify the reward), and techniques for optimizing robustness (e.g. creating red teams that generate scenarios that incentivize subversion).

and:

The imitation learning is more about getting this new agent off the ground than about ensuring alignment. The bulk of the alignment guarantee comes from the semi-supervised reinforcement learning, where we train it to work on a wide range of tasks and answer questions about its cognition.

Ben Pace (6y)

At some point Paul used "short-term preferences" and "narrow preferences" interchangeably, but no longer does (or at least no longer endorses doing so).

I would like to have these two terms defined. Let me offer my understanding from reading the relevant thread.

short-term preferences = short-term preferences-on-reflection ≠ narrow preferences

Short-term preferences refer to the most useful action I can take next, given my ultimate goals. This is to be contrasted with my current best guess about the outcome of that process. It's what I would want, not what I do want.

An AI optimising for my short-term preferences may reasonably say "No, don't take this action, because you'd actually prefer this alternative action if you only thought longer. It fits your true short-term preferences, you're just mistaken about them." This is in contrast with something you might call narrow preferences, which is where you tell the AI to do what you said anyway.

riceissa (6y)

My understanding is that Paul never meant to introduce the term "narrow preferences" (i.e. "narrow" is not an adjective that applies to preferences), and the fact that he talked about narrow preferences in the act-based agents post was an accident/something he no longer endorses.

Instead, when Paul says "narrow", he's talking not about preferences but about narrow vs ambitious value learning. This is what Paul means when he says "I've only ever used [the term "narrow"] in the context of value learning, in order to make this particular distinction between two different goals you might have when doing value learning."

Ben Pace (6y)

Oh, okay. Is it not important to have a name for the class of thing we could accidentally train an ML system to optimise for that isn't our ultimate preferences? Is there a term for that?

riceissa (6y)

I think Paul calls that "preferences-as-elicited", so if we're talking about act-based agents, it would be "short-term preferences-as-elicited" (see this comment).

Ben Pace (6y)

Seems odd to have the idealistic goal get to be the standard name, and the dime-a-dozen failure mode be a longer name that is more confusing.

I note that Wei says a similar thing happened to 'act-based':

My understanding is that "act-based agent" used to mean something different (i.e., a simpler kind of AI that tries to do the same kind of action that a human would), but most people nowadays use it to mean an AI that is designed to satisfy someone's short-term preferences-on-reflection, even though that no longer seems particularly "act-based".

Is there a reason why the standard terms are not being used to refer to the standard, short-term results?

(I suppose that economics assumes rational agents who know their preferences, so taking language from economics might lead to this situation with the 'short-term preferences' decision.)

In the post Wei contrasts "current" and "actual" preferences. "Stated" vs "reflective" preferences also seem like nice alternatives.

riceissa (6y)

Seems odd to have the idealistic goal get to be the standard name, and the dime-a-dozen failure mode be a longer name that is more confusing.

I agree this is confusing.

Is there a reason why the standard terms are not being used to refer to the standard, short-term results?

As far as I know, Paul hasn't explained his choice in detail. One reason he does mention, in this comment, is that in the context of strategy-stealing, preferences like "help me stay in control and be well-informed" do not make sense when interpreted as preferences-as-elicited, since the current user has no way to know if they are in control or well-informed.

In the post Wei contrasts "current" and "actual" preferences. "Stated" vs "reflective" preferences also seem like nice alternatives.

I think current=elicited=stated, but actual≈reflective (because there is the possibility that undergoing reflection isn't a good way to find out our actual preferences, or as Paul says 'There’s a hypothesis that “what I’d say after some particular idealized process of reflection” is a reasonable way to capture “actual preferences,” but I think that’s up for debate—e.g. it could fail if me-on-reflection is selfish and has values opposed to current-me, and certainly it could fail for any particular process of reflection and so it might just happen to be the case that there is no process of reflection that satisfies it.')

Ben Pace (6y)

As far as I know, Paul hasn't explained his choice in detail. One reason he does mention, in this comment, is that in the context of strategy-stealing, preferences like "help me stay in control and be well-informed" do not make sense when interpreted as preferences-as-elicited, since the current user has no way to know if they are in control or well-informed.

I agree this example adds nuance, and I'm unsure how to correctly categorise it.

Ben Pace (6y)

You have a section titled

learning user preferences for corrigibility isn't enough for corrigible behavior

Would this be more consistently titled "Learning narrow preferences for corrigibility isn't enough for corrigible behavior"?

Ben Pace (6y)

I understand Paul to be saying that he hopes that corrigibility will fall out if we train an AI to score well on your short-term preferences, not just your narrow preferences.
