Comment Permalink

Answer by rmoehnFeb 15, 202030

IDA includes looking inside the overseen agent: ‘As described here, we would like to augment this oversight by allowing Bⁿ⁻¹ to view the internal state of Aⁿ.’ (ALBA: An explicit proposal for aligned AI) If we can get enough information out of that internal state, we can avoid inner misalignment. This, however, is difficult and written about in The informed oversight problem.

See in context

7

[ Question ]

Does iterated amplification tackle the inner alignment problem?

by JanB

15th Feb 2020

AI Alignment Forum

1 min read

A

3 4

7 Ω 4

When iterated distillation and amplification (IDA) was published, some people described it described as "the first comprehensive proposal for training safe AI". Having read a bit more about it, it seems that IDA is mainly a proposal for outer alignment and doesn't deal with the inner alignment problem at all. Am I missing something?

Inner AlignmentIterated Amplification

Frontpage

7 Ω 4

New Answer

New Comment

3 Answers sorted by
top scoring

evhub

Feb 15, 2020

Ω10130

You are correct that amplification is primarily a proposal for how to solve outer alignment, not inner alignment. That being said, Paul has previously talked about how you might solve inner alignment in an amplification-style setting. For an up-to-date, comprehensive analysis of how to do something like that, see “Relaxed adversarial training for inner alignment.”

Ofer

Feb 16, 2020

Ω230

My understanding is that amplification-based approaches are meant to tackle inner alignment by using the amplified systems that are already trusted (e.g. humans + many invocations of a trusted model) to mitigate inner alignment problems in the next (slightly more powerful) models that are being trained. A few approaches for this have already been suggested (I'm not aware of published empirical results), see Evan's comment for some pointers.

I hope a lot more research will be done on this topic. It's not clear to me whether we should expect to have amplified systems that allow us to mitigate inner alignment risks to a satisfactory extent before the point where we have x-risk posing systems, how can we make that more likely, and if it's not feasible how do we realize that as soon as possible?

rmoehn

Feb 15, 2020

30

IDA includes looking inside the overseen agent: ‘As described here, we would like to augment this oversight by allowing Bⁿ⁻¹ to view the internal state of Aⁿ.’ (ALBA: An explicit proposal for aligned AI) If we can get enough information out of that internal state, we can avoid inner misalignment. This, however, is difficult and written about in The informed oversight problem.

7

[ Question ]

Does iterated amplification tackle the inner alignment problem?

7

Ω 4

7

Ω 4

3 Answers sorted by top scoring

Feb 15, 2020

Feb 16, 2020

Feb 15, 2020

3 Answers sorted by
top scoring