I haven't read most of the paper, but based on the Extended Abstract I'm quite happy about both the content and how DeepMind (or at least its safety team) is articulating an "anytime" (i.e., possible to implement quickly) plan for addressing misuse and misalignment risks.
But I think safety at Google DeepMind is more bottlenecked by buy-in from leadership to do moderately costly things than the safety team having good plans and doing good work. [Edit: I think the same about Anthropic.]
Given that buy-in from leadership is a bigger bottleneck than the safety team's work, what would you do differently if you were in charge of the safety team?
I don't know, maybe nothing. (I just meant that on current margins, maybe the quality of the safety team's plans isn't super important.)
I've been thinking about this, especially since Rohin has been bringing it up frequently in recent months.
I think there are potentially win-win alignment-and-capabilities advances which can be sought. I think having a purity-based "keep-my-own-hands-clean" mentality around avoiding anything that helps capabilities is a failure mode of AI safety researchers.
Win-win solutions are much more likely to actually get deployed, thus have higher expected value.
On a first reading of this summary, I take it as a small positive update on the quality of near-term AI Safety thinking at DeepMind. It treads familiar ground and doesn't raise any major red flags for me.
The two things I would most like to know are:
A "yes" to both would be a very positive surprise to me, and would lead me to additionally ask DeepMind to publicly support a global treaty that implements these bare-minimum policies in a way that can be verified and enforced.
A "no" to either would mean this work falls under milling behavior, and will not meaningfully contribute toward keeping humanity safe from DeepMind's own actions.
A "no" to either would mean this work falls under milling behavior, and will not meaningfully contribute toward keeping humanity safe from DeepMind's own actions.
I think it's probably possible to greatly improve safety given a moderate budget for safety and not nearly enough buy-in for (1) and (2). (At least not enough buy-in prior to a large incident which threatens to be very costly for the organization.)
Overall, I think high quality thinking about AI safety seems quite useful even if this level of buy-in is unlikely.
(I don't think this report should update us much about having the buy-in needed for (1)/(2), but the fact that it could be published at all in its current form is still encouraging.)
The key idea is to use the AI systems themselves to help identify the reasons that the AI system produced the output. For example, we could put two copies of the model in a setting where each model is optimized to point out flaws in the other’s outputs to a human “judge”. Ideally, if one model introduced a subtle flaw in their output that the judge wouldn’t notice by default, the other model would point out and explain the flaw, enabling the judge to penalise the first model appropriately.
In amplified oversight, any question that is too hard to supervise directly is systematically reduced to ones that we hypothesize can be supervised. However, humans may be systematically biased even for fairly simple questions. If this turns out to be a problem in practice, we could seek to model these deviations and automatically correct or account for them when interpreting the oversight.
I genuinely don't understand how anybody expects Amplified Oversight to work, it seems to me to be assuming alignment in so many places, like being able to give the judge or debate partner the goal of actually trying to get to the truth. I also don't envy the humans trying to figure out which side of these debates is in the right. We can't even do that for very obvious existing human debates.
Even if debate-like approaches work in theory for getting a signal as to something's truthiness, I don't see how they could apply to agents that aren't simply generating a 100% complete plan for debate and human approval, which is then mechanically enacted. Are you stopping the agent periodically to have another debate about what it's working on and asking the human to review yet another debate? Or is the idea that you simulate an agent and, at random points while it carries out its task, run a debate on what it's currently doing, using this to train the model to always act in ways that would allow it to win a debate?
I don't even think whether debate will work is really an empirical question. I don't know that there's any study result on current systems that could convince me this would carry on working through to AGI/ASI. Even if things looked super optimistic now, like that refuting a lie is easier than lying, I would not expect this to hold up. I don't expect the lines on these graphs to be straight.
I know a complete plan is unreasonable at this point, but can anybody provide a more detailed sketch of why they think Amplified Oversight will work and how it can be used to make agents safe in practice?
like being able to give the judge or debate partner the goal of actually trying to get to the truth
The idea is to set up a game in which the winning move is to be honest. There are theorems about the games that say something pretty close to this (though often they say "honesty is always a winning move" rather than "honesty is the only winning move"). These certainly depend on modeling assumptions, but the assumptions are more like "assume the models are sufficiently capable", not "assume we can give them a goal". When applying this, there is also a clear divergence between what the equilibrium behavior is and what RL actually finds in practice.
Despite all the caveats, I think it's wildly inaccurate to say that Amplified Oversight is assuming the ability to give the debate partner the goal of actually trying to get to the truth.
(I agree it is assuming that the judge has that goal, but I don't see why that's a terrible assumption.)
Are you stopping the agent periodically to have another debate about what it's working on and asking the human to review another debate?
You don't have to stop the agent, you can just do it afterwards.
can anybody provide a more detailed sketch of why they think Amplified Oversight will work and how it can be used to make agents safe in practice?
Have you read AI safety via debate? It has really quite a lot of conceptual points, making both the case in favor and considering several different reasons to worry.
(To be clear, there is more research that has made progress, e.g. cross-examination is a big deal imo, but I think the original debate paper is more than enough to get to the bar you're outlining here.)
(Sorry, hard to differentiate quotes from you vs quotes from the paper in this format)
(I agree it is assuming that the judge has that goal, but I don't see why that's a terrible assumption.)
If the judge is human, sure. If the judge is another AI, it seems like a wild assumption to me. The section on judge safety in your paper does a good job of listing many of the problems. One thing I want to call out as something I more strongly disagree with is:
One natural approach to judge safety is bootstrapping: when aligning a new AI system, use the previous generation of AI (that has already been aligned) as the judge.
I don't think we have any fully aligned models, and won't any time soon. I would expect bootstrapping will at most align a model as thoroughly as its predecessor was aligned (but probably less), and Goodhart's law definitely applies here.
You don't have to stop the agent, you can just do it afterwards.
Maybe this is just using the same terms for two different systems, but the paper also talks about using judging for monitoring deployed systems:
This approach is especially well-suited to monitoring deployed systems, due to the sheer scale required at deployment.
Is this intended only as an auditing mechanism, not a prevention mechanism (e.g. we noticed in the wild that the AI is failing debates, time to roll back the release)? Or are we trusting the AI judges enough at this point that we don't need to stop and involve a human? I also worry the "cheap system with high recall but low precision" will be too easy to fool for the system to be functional past a certain capability level.
From a 20,000-foot perspective, this all just looks like RLHF with extra steps that make it harder to reason about whether it'll work. If we had evidence that RLHF worked great, and its only flaw was that the AI was starting to get too complex for humans to give feedback, then I would be more interested in all of the fiddly details of amplified oversight and in working through them. The problem is that RLHF already doesn't work, and I think it'll only work less well as AI becomes smarter and more agentic.
(Meta: Going off of past experience I don't really expect to make much progress with more comments, so there's a decent chance I will bow out after this comment.)
I would expect bootstrapping will at most align a model as thoroughly as its predecessor was aligned (but probably less)
Why? Seems like it could go either way to me. To name one consideration in the opposite direction (without claiming this is the only consideration), the more powerful model can do a better job at finding the inputs on which the model would be misaligned, enabling you to train its successor across a wider variety of situations.
and goodhart's law definitely applies here.
I am having a hard time parsing this as having more content than "something could go wrong while bootstrapping". What is the metric that is undergoing optimization pressure during bootstrapping / amplified oversight that leads to decreased correlation with the true thing we should care about?
Is this intended only as an auditing mechanism, not a prevention mechanism
Yeah I'd expect debates to be an auditing mechanism if used at deployment time.
I also worry the "cheap system with high recall but low precision" will be too easy to fool for the system to be functional past a certain capability level.
Any alignment approach will always be subject to the critique "what if you failed and the AI became misaligned anyway and then past a certain capability level it evades all of your other defenses". I'm not trying to be robust to that critique.
I'm not saying I don't worry about fooling the cheap system -- I agree that's a failure mode to track. But useful conversation on this seems like it has to get into a more detailed argument, and at the very least has to be more contentful than "what if it didn't work".
The problem is RLHF already doesn't work
??? RLHF does work currently? What makes you think it doesn't work currently?
RLHF does work currently? What makes you think it doesn't work currently?
This is definitely the crux so probably really the only point worth debating.
RLHF is just papering over problems. Sure, the model is slightly more difficult to jailbreak, but it's still pretty easy to jailbreak. Sure, RLHF makes the agent less likely to output text you don't like, but I think agents will reliably overcome that obstacle, since useful agents won't just be outputting the most probable continuation; they'll be searching through decision space and finding those unlikely continuations that score well on their task. I don't think RLHF does anything remotely analogous to making the model care about whether it's following your intent, or following a constitution, etc.
You're definitely aware of the misalignment that still exists with our current RLHF'd models and have read the recent papers on alignment faking etc. so I probably can't make an argument you haven't already heard.
Maybe the true crux is you believe this is a situation where we can muddle through by making harmful continuations ever more unlikely? I don't.
Great work, it's wonderful to know that Google DeepMind isn't dismissing AI risk, and is willing to allocate at least some resources towards it.
When you have lists of problems and lists of solutions, can you quantify them by their relative importances? A very rough guess might be better than nothing.
Without knowing the relative importances, it's hard to know if we agree or disagree, since we might both say "I believe in X, Y and Z, they are very important," without realizing that one of us thinks the importance is 80% X, 10% Y, 10% Z, while the other one thinks it's 5% X, 10% Y, 85% Z.
We have written a paper on our approach to technical AGI safety and security. This post is primarily a copy of the extended abstract, which summarizes the paper. I also include the abstract and the table of contents. See also the GDM blogpost and tweet thread.
Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to significantly harm humanity. We identify four areas of risk: misuse, misalignment, mistakes, and structural risks. Of these, we focus on technical approaches to misuse and misalignment. For misuse, our strategy aims to prevent threat actors from accessing dangerous capabilities, by proactively identifying dangerous capabilities, and implementing robust security, access restrictions, monitoring, and model safety mitigations. To address misalignment, we outline two lines of defense. First, model-level mitigations such as amplified oversight and robust training can help to build an aligned model. Second, system-level security measures such as monitoring and access control can mitigate harm even if the model is misaligned. Techniques from interpretability, uncertainty estimation, and safer design patterns can enhance the effectiveness of these mitigations. Finally, we briefly outline how these ingredients could be combined to produce safety cases for AGI systems.
Extended Abstract
AI, and particularly AGI, will be a transformative technology. As with any transformative technology, AGI will provide significant benefits while posing significant risks. This includes risks of severe harm: incidents consequential enough to significantly harm humanity. This paper outlines our approach to building AGI that avoids severe harm.[1]
Since AGI safety research is advancing quickly, our approach should be taken as exploratory. We expect it to evolve in tandem with the AI ecosystem to incorporate new ideas and evidence.
Severe harms necessarily require a precautionary approach, subjecting them to an evidence dilemma: research and preparation of risk mitigations occurs before we have clear evidence of the capabilities underlying those risks. We believe in being proactive, and taking a cautious approach by anticipating potential risks, even before they start to appear likely. This allows us to develop a more exhaustive and informed strategy in the long run.
Nonetheless, we still prioritize those risks for which we can foresee how the requisite capabilities may arise, while deferring even more speculative risks to future research. Specifically, we focus on capabilities in foundation models that are enabled through learning via gradient descent, and consider Exceptional AGI (Level 4) from Morris et al., defined as an AI system that matches or exceeds the 99th percentile of skilled adults on a wide range of non-physical tasks. This means that our approach covers conversational systems, agentic systems, reasoning, learned novel concepts, and some aspects of recursive improvement, while setting aside goal drift and novel risks from superintelligence as future work.
We focus on technical research areas that can provide solutions that would mitigate severe harm. However, this is only half of the picture: technical solutions should be complemented by effective governance. It is especially important to have broader consensus on appropriate standards and best practices, to prevent a potential race to the bottom on safety due to competitive pressures. We hope that this paper takes a meaningful step towards building this consensus.
Background assumptions
In developing our approach, we weighed the advantages and disadvantages of different options. For example, some safety approaches could provide more robust, general theoretical guarantees, but it is unclear that they will be ready in time. Other approaches are more ad hoc, empirical, and ready sooner, but with rough edges.
To make these tradeoffs, we rely significantly on a few background assumptions and beliefs about how AGI will be developed:
Risk areas
When addressing safety and security, it is helpful to identify broad groups of pathways to harm that can be addressed through similar mitigation strategies. Since the focus is on identifying similar mitigation strategies, we define areas based on abstract structural features (e.g. which actor, if any, has bad intent), rather than concrete risk domains such as cyber offense or loss of human control. This means they apply to harms from AI in general, rather than being specific to AGI.
We consider four main areas:
Note that this is not a categorization: these areas are neither mutually exclusive nor exhaustive. In practice, many concrete scenarios will be a mixture of multiple areas. For example, a misaligned AI system may recruit help from a malicious actor to exfiltrate its own model weights, which would be a combination of misuse and misalignment. We expect that in such cases it will still be productive to primarily include mitigations from each of the component areas, though future work should consider mitigations that may be specific to combinations of areas.
When there is no adversary, as with mistakes, standard safety engineering practices (e.g. testing) can drastically reduce risks, and should be similarly effective for averting AI mistakes as for human mistakes. These practices have already sufficed to make severe harm from human mistakes extremely unlikely, though this is partly a reflection of the fact that severe harm is a very high bar. So, we believe that severe harm from AI mistakes will be significantly less likely than misuse or misalignment, and is further reducible through appropriate safety practices. For this reason, we set it out of scope of the paper. Nonetheless, we note four key mitigations for AI mistakes: improving AI capabilities, avoiding deployment in situations with extreme stakes, using shields that verify the safety of AI actions, and staged deployment.
Structural risks are a much bigger category, often with each risk requiring a bespoke approach. They are also much harder for an AI developer to address, as they often require new norms or institutions to shape powerful dynamics in the world. For these reasons, these risks are out of scope of this paper.
This is not to say that nothing is done for structural risks – in fact, much of the technical work discussed in this paper will also be relevant for structural risks. For example, by developing techniques to align models to arbitrary training targets, we provide governance efforts with an affordance to regulate models to fuzzy specifications.
Our strategy thus focuses on misuse and misalignment. For misuse, the strategy is put in practice through our Frontier Safety Framework, which evaluates whether the model has the capability to cause harm, and puts in place security and deployment mitigations if so. Our strategy for misalignment begins with attaining good oversight, which is a key focus area across the AGI safety field. In addition, we put emphasis on where to get oversight, i.e. on what tasks oversight is needed in order for the trained system to be aligned.
Misuse
Misuse occurs when a human deliberately uses the AI system to cause harm, against the developer’s wishes. To prevent misuse, we put in place security and deployment mitigations that prevent bad actors from getting enough access to dangerous capabilities to cause severe harm.
Given a (mitigated) AI system, we can assess the mitigations by trying to misuse the models ourselves (while avoiding actual harm), and seeing how far we can get. Naively, if we are not able to cause (proxy) harm with the models, then an external actor will not be able to cause (actual) harm either. This naive argument should be supplemented by additional factors, e.g. that a bad actor may put in much more effort than we do. Conversely, if we are able to cause proxy harm, then we need to strengthen mitigations.
As a special case of the above, when there are no mitigations in place, we are checking whether the model even has the capabilities to cause severe harm in the first place, and only introduce mitigations if we identify such capabilities. This is the approach we put into practice at Google DeepMind through our Frontier Safety Framework.
Risk assessment
Misuse threat modelling identifies concrete plausible ways that threat actors could cause severe harm through powerful capabilities of frontier AI systems. Questions of importance include: what capabilities the AI system has, what the threat actor does with these capabilities, what kinds of threat actors might be involved, and so forth. The goal is to produce realistic descriptions of the most plausible pathways to harm as well as their expected damages, so that we know what model capabilities could serve as a proxy for increased risk of severe harm in the real world.
Dangerous capability evaluations are a concrete way to measure the degree to which those capabilities exist in the AI system. We define a suite of tasks that we believe capture the capabilities that are representative of potential misuse, and see if we can use our models to score highly on the task suite.
Typically, we would use these evaluations to argue that misuse is implausible because the model lacks the requisite capabilities, and so no mitigations are required. To enable this, we define capability thresholds at which misuse could occur, and map them to some concrete, measurable score on the dangerous capability evaluations. Mitigations can be prepared on appropriate timescales based on forecasts of when the threshold will be achieved.
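To make the capability-threshold logic concrete, here is a minimal sketch of how eval scores might be mapped to a mitigation decision. All names here (EvalTask, CAPABILITY_THRESHOLDS, assess_model, the threshold values) are hypothetical illustrations, not part of the Frontier Safety Framework itself.

```python
# Hypothetical sketch: map dangerous-capability eval scores to mitigation decisions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    name: str
    run: Callable[[object], float]  # returns a score in [0, 1] for a given model

# Illustrative thresholds at which misuse is assumed to become plausible per capability area.
CAPABILITY_THRESHOLDS = {"cyber_uplift": 0.6, "bio_uplift": 0.4}

def assess_model(model, tasks_by_area: dict[str, list[EvalTask]]) -> dict[str, bool]:
    """Return, per capability area, whether the capability threshold was crossed."""
    crossed = {}
    for area, tasks in tasks_by_area.items():
        # Aggregate by the best elicited score, since a bad actor only needs one pathway.
        score = max(task.run(model) for task in tasks)
        crossed[area] = score >= CAPABILITY_THRESHOLDS[area]
    return crossed

def required_mitigation_areas(crossed: dict[str, bool]) -> list[str]:
    # If no threshold is crossed, an "inability" argument suffices and no new mitigations
    # are triggered; otherwise security and deployment mitigations are required.
    return [area for area, hit in crossed.items() if hit]
```

Forecasting when a threshold will be crossed (rather than only checking the current model) is what allows mitigations to be prepared on an appropriate timescale.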
Mitigations
Deployment mitigations aim to prevent bad actors from accessing dangerous capabilities through use of intended APIs. Mitigations start at the model level and include:
Our approach considers further deployment mitigations at the system level, which, combined with model mitigations, provide defense in depth:
Security mitigations aim to prevent bad actors from stealing an AI system with dangerous capabilities. While many such mitigations are appropriate for security more generally, there are novel mitigations more specific to the challenge of defending AI models in particular. For example, one desideratum is to limit the number of people who can unilaterally access model weights. To achieve this, we need interfaces that enable standard ML workflows, so that model development can be done through the interfaces, which can then be hardened.
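As a rough illustration of the "interfaces instead of raw weight access" desideratum, the sketch below shows one hypothetical shape such an interface could take; the class and method names are assumptions for illustration, not a description of any actual infrastructure.

```python
# Hypothetical sketch: support standard ML workflows without exposing raw weights
# to individual researchers, so that the interface itself can be hardened and audited.

class WeightAccessService:
    def __init__(self, weights_store, audit_log):
        self._weights_store = weights_store   # only this service can read weights
        self._audit_log = audit_log

    def run_inference(self, user, model_id, prompt):
        self._audit_log.record(user, "inference", model_id)
        model = self._weights_store.load(model_id)  # weights stay inside the service
        return model.generate(prompt)

    def launch_finetune(self, user, model_id, dataset_ref, config):
        # Fine-tuning runs in a hardened environment; the user gets a new model_id
        # back, never a copy of the weights themselves.
        self._audit_log.record(user, "finetune", model_id)
        return self._weights_store.finetune(model_id, dataset_ref, config)
```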
Societal readiness mitigations use AI systems to harden societal defenses; for example, preparing for AI cyber-offense capabilities by enabling rapid patching of vulnerabilities in critical infrastructure. Such mitigations can help ensure that even with access to dangerous capabilities, bad actors would not be able to cause severe harm.
Assurance against misuse
Misuse stress tests. Once mitigations are in place, our approach dictates creating a detailed argument for why a set of misuse mitigations, once applied, would be sufficient for reducing risk to adequate levels. This enables us to identify some key assumptions that this argument is predicated on, and carry out stress tests to identify flaws in these assumptions. For example, a dedicated red team may discover novel jailbreaks that can evade the safety finetuning mitigations.
Since a bad actor might put in much more effort, the red team is given extra advantages to compensate. For example, they could have full knowledge of the mitigations, or we could loosen the mitigation thresholds to make the system easier to attack. In addition, it is valuable to decouple the red team from mitigation development, to reduce the risk of shared blind spots. This could be done by having a separate internal expert red team, or by having external partners conduct stress tests.
Misuse safety cases. A safety case is a structured argument, supported by a body of evidence, that a system is safe. Our approach enables two types of safety cases:
Misalignment
Our approach has two lines of defense. First, we aim to use model level mitigations to ensure the model does not pursue misaligned goals. We follow the standard ML playbook: (1) improve the ability to get good oversight (amplified oversight), (2) use this to improve the model (guiding model behavior), and (3) figure out where oversight is needed to train a model that will robustly do what we want (robust training).
Second, we consider how to mitigate harm even if the model is misaligned (often called “AI control”), through the use of system level mitigations. We take significant inspiration from computer security in designing this line of defense, treating the AI system as an untrusted insider, motivating mitigations such as access control. Monitoring is particularly important, since it stands to benefit from increasing AI capabilities.
We also consider research areas that can strengthen the defenses above. Uncertainty is a key component for both robust training and monitoring. Interpretability can allow us to understand how the AI system works, enabling a variety of mitigations. The area of safer design patterns explores design decisions that can enable safety at the cost of some other desideratum, with the goal of measuring and expanding the Pareto frontier to choose an appropriate point.
Training an aligned model
These model-level mitigations aim to create a model that is pursuing the goal that we intend.
Amplified oversight. A key challenge in almost any alignment plan is that it can be very difficult to tell whether a given model output is good or bad, once the model has capabilities beyond that of its overseers. As an example, imagine a human overseer trying to determine how good a move from AlphaGo is. The area of amplified oversight aims to address this: the goal is to provide an oversight signal that is as good as could be achieved if a human overseer understood all of the reasons that the AI system produced the output and had unlimited ability to reflect on the correct decision.
The key idea is to use the AI systems themselves to help identify the reasons that the AI system produced the output. For example, we could put two copies of the model in a setting where each model is optimized to point out flaws in the other’s outputs to a human “judge”. Ideally, if one model introduced a subtle flaw in their output that the judge wouldn’t notice by default, the other model would point out and explain the flaw, enabling the judge to penalise the first model appropriately.
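A single round of this setup, reduced to pseudocode, might look like the following sketch. Real debate protocols are multi-turn and more carefully structured; the function and method names (generate, critique, judge_prefers_original) are hypothetical stand-ins, not an API from the paper.

```python
# Illustrative sketch of one debate round used as an oversight signal.

def debate_round(question: str, model_a, model_b, human_judge) -> dict:
    answer = model_a.generate(question)             # model A proposes an answer
    rebuttal = model_b.critique(question, answer)   # model B points out flaws
    # The human judge sees both sides and decides which is more convincing.
    a_wins = human_judge.judge_prefers_original(question, answer, rebuttal)
    # Reward the winner and penalise the loser; this is the training signal
    # that, ideally, makes honesty the winning strategy.
    return {"answer": answer, "rebuttal": rebuttal,
            "reward_a": 1.0 if a_wins else -1.0,
            "reward_b": -1.0 if a_wins else 1.0}
```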
In amplified oversight, any question that is too hard to supervise directly is systematically reduced to ones that we hypothesize can be supervised. However, humans may be systematically biased even for fairly simple questions. If this turns out to be a problem in practice, we could seek to model these deviations and automatically correct or account for them when interpreting the oversight.
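One way such bias correction could look in practice is sketched below, assuming (hypothetically) a held-out calibration set of simple questions with known ground truth on which judge reliability can be estimated per question category. This is an illustrative assumption, not a method proposed in the paper.

```python
# Hypothetical sketch: estimate judge error rates and down-weight unreliable verdicts.

def estimate_judge_error_rate(judge, calibration_set) -> dict[str, float]:
    errors: dict[str, list[int]] = {}
    for item in calibration_set:
        verdict = judge.decide(item.question, item.option_a, item.option_b)
        errors.setdefault(item.category, []).append(int(verdict != item.correct_option))
    return {cat: sum(v) / len(v) for cat, v in errors.items()}

def corrected_label_weight(category: str, error_rates: dict[str, float]) -> float:
    # Weight labels by how much better than chance the judge is in this category,
    # rather than treating every verdict as ground truth.
    return 1.0 - 2.0 * min(error_rates.get(category, 0.0), 0.5)
```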
Guiding model behavior. Given an oversight signal, we have to apply the oversight to train the model. While existing techniques such as reinforcement learning may suffice, we could potentially improve upon this. For example, new feedback modalities such as natural language critiques of model answers could increase the information provided in an update, and model internals could be used to produce a more targeted update that improves sample efficiency.
Robust training. Once we have an oversight signal that can distinguish between good and bad behaviour, we need to use this signal to build an AI system that behaves well. One remaining problem is that the model may still behave poorly outside of the training distribution.
Robust training mitigates these problems by expanding the distribution of inputs on which our AI systems are trained. While exploring the broader question of what data is necessary and sufficient to induce an aligned model (reward model[5] or policy), we keep a practical focus on cases that cause failures. Key techniques include active learning, automated adversarial training, and online training after deployment.
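A single step of such a loop might be sketched as follows, combining adversarial search with uncertainty-driven active learning. The components (policy, reward_model, adversary) and their methods are hypothetical placeholders for whatever concrete systems play these roles.

```python
# Sketch of expanding the training distribution toward likely failure cases.

def robust_training_step(policy, reward_model, adversary, base_prompts, budget=64):
    # 1. Automated adversarial search: ask a red-team model for prompts that it
    #    predicts will make the policy misbehave.
    adversarial_prompts = adversary.propose(policy, n=budget)

    # 2. Active learning: keep the cases where the reward model is least certain,
    #    since oversight effort is most valuable there.
    candidates = base_prompts + adversarial_prompts
    uncertain = sorted(candidates, key=reward_model.uncertainty, reverse=True)[:budget]

    # 3. Obtain (possibly amplified) oversight labels only on the selected cases,
    #    then update the policy on them.
    labels = [reward_model.overseer_label(p, policy.generate(p)) for p in uncertain]
    policy.update(uncertain, labels)
```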
Defending against a misaligned model
While amplified oversight and robust training could be a sufficient solution in principle, in practice there will remain some holes. So, we aim for defense in depth: even if the AI system is misaligned, we can mitigate the damage through appropriate defenses.
Security. A key approach is to treat the model similarly to an untrusted insider, motivating mitigations like access control, anomaly detection, logging and auditing. Relative to human employees, AI systems will likely perform more specialized tasks, enabling us to limit the available affordances in a more fine-grained manner. Other relevant techniques include sandboxing and hardening the environments in which AI systems act.
Monitoring. Increasing AI capabilities have the potential to significantly improve monitoring capability, since AI systems can operate at the necessary scale, while still bringing flexible intelligence traditionally associated only with manual review. Hierarchical supervision will be important: that is, model outputs are first monitored by a cheap system, and the most suspicious or unclear cases are escalated to more capable expensive monitors, such as those defined by amplified oversight.
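The escalation logic of hierarchical supervision can be pictured with a short sketch. The thresholds, component names, and return values below are illustrative assumptions, not a specification from the paper.

```python
# Sketch of hierarchical supervision at deployment time.

def monitor(output, cheap_classifier, expensive_judge, human_review_queue,
            escalate_above=0.2, flag_above=0.8):
    suspicion = cheap_classifier.score(output)   # fast, high recall, low precision
    if suspicion < escalate_above:
        return "allow"
    verdict = expensive_judge.score(output)      # slower, e.g. a debate-based judge
    if verdict > flag_above:
        human_review_queue.append(output)        # most suspicious cases go to humans
        return "block_pending_review"
    return "allow_and_log"
```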
Enabling stronger defenses
Our approach calls for investment in research areas that can strengthen our capability at the primary mitigations discussed above.
Uncertainty. Both robust training and monitoring benefit significantly from, and may even require, models that maintain uncertainty. This could be either in the traditional Bayesian sense, or via reasoning at the language abstraction level. For robust training, uncertainty can enable active learning, to better identify which aspects of the distribution are most useful to expand. For monitoring, uncertainty can play a key role in determining when to escalate in hierarchical supervision.
Safer design patterns. There are many decisions in the design of AI systems that can make it easier to achieve desired safety properties, but that may sacrifice other desirable properties. It is important to quantitatively study such tradeoffs, and invest in expanding the Pareto frontier of such tradeoffs, to help make the appropriate decision when building AI systems. Examples include:
Interpretability. Intuitively, making a system safe should be much easier the more you understand how it works. While we can gain some surface understanding of how AI systems work through rigorous analysis of their behaviour, there is the potential to do much better if we could also understand the algorithms implemented by the forward pass of a trained LLM. If successful, this would have wide-ranging implications for alignment, such as:
Unfortunately, interpretability research is still quite nascent and has not yet enabled applications like these. As a result, we still need significant investment in “basic science”.
Alignment assurance
Alignment stress tests. A stress test involves a red team that tries to show that some assumption in an alignment plan is false. The success or failure of the red team at this goal provides evidence about how sound the assumption is, and whether we can rely on it when aligning our AI systems. Stress tests can also produce model organisms demonstrating a particular failure mode, that can enable further research preventing that failure mode, as well as building up scientific evidence about the likelihood of failures of misalignment.
Control evaluations. A control evaluation extends misuse red teaming to the misalignment setting. In the misuse setting, the red team acts as a malicious external user, to simulate the adversarial actions we may observe with misuse, and we observe how successful they are. Analogously, in a control evaluation, the red team can finetune or prompt the model to try to cause (a proxy for) harm, and we observe how successful the resulting model is.
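The core measurement in a control evaluation can be sketched as follows, assuming hypothetical red_team_model and safeguards objects and a proxy-harm signal; the attribute and method names are placeholders for illustration.

```python
# Sketch of a control evaluation: a red team stands in for a misaligned model,
# and we measure how often the surrounding safeguards catch or stop it.

def control_evaluation(red_team_model, safeguards, episodes):
    caught, succeeded = 0, 0
    for task in episodes:
        trajectory = red_team_model.attempt_proxy_harm(task)  # e.g. insert a subtle bug
        if safeguards.flags(trajectory):
            caught += 1
        elif trajectory.proxy_harm_achieved:
            succeeded += 1
    # A safety case might require the undetected-success rate to be below some bound.
    return {"caught_rate": caught / len(episodes),
            "undetected_success_rate": succeeded / len(episodes)}
```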
Alignment safety cases. As with misuse, it is important to provide structured arguments that AI systems are not egregiously misaligned.
Limitations
The focus of this paper is on technical approaches and mitigations to risks from AGI. A key limitation is that this is only half of the picture: a full approach would also discuss governance, which is equally if not more important. For example, for misuse risks, it is key that all frontier developers implement the necessary mitigations to meaningfully reduce risk. If one developer implements mitigations without other developers following suit, an adversary can switch to using the less secure AI system. In addition, we do not address structural risks, since they require bespoke approaches.
We also focus primarily on techniques that can be integrated into current AI development, due to our focus on anytime approaches to safety. While we believe this is an appropriate focus for a frontier AI developer’s mainline safety approach, it is also worth investing in research bets that pay out over longer periods of time but can provide increased safety, such as agent foundations, science of deep learning, and application of formal methods to AI.
We focus on risks arising in the foreseeable future, and mitigations we can make progress on with current or near-future capabilities. The assumption of approximate continuity (Section 3.5) justifies this decision: since capabilities typically do not discontinuously jump by large amounts, we should not expect such risks to catch us by surprise. Nonetheless, it would be even stronger to exhaustively cover future developments, such as the possibility that AI scientists develop new offense-dominant technologies, or the possibility that future safety mitigations will be developed and implemented by automated researchers.
Finally, it is crucial to note that the approach we discuss is a research agenda. While we find it to be a useful roadmap for our work addressing AGI risks, there remain many open problems yet to be solved. We hope the research community will join us in advancing the state of the art of AGI safety so that we may access the tremendous benefits of safe AGI.
Contents
1 Introduction
2 Navigating the evidence dilemma
2.1 Applying our approach to concrete risks
3 Assumptions about AGI development
3.1 Current paradigm continuation
3.1.1 Compute, data and algorithmic efficiency as drivers of AI capability progress
3.1.2 Challenges for alternative paradigms
3.2 No human ceiling for AI capability
3.3 The timeline of AI development remains uncertain
3.4 The potential for accelerating capability improvement
3.5 Approximate continuity
3.5.1 Base rates and conceptual arguments for large discontinuous jumps
3.5.2 Empirical evidence from existing AI progress
3.5.3 Conceptual arguments for future discontinuities
3.6 Benefits of AGI
3.6.1 AGI could raise living standards across the world
3.6.2 Deepening human knowledge and accelerating scientific discovery
3.6.3 Enhancing information processing and lowering barriers to innovation
4 Risk areas
4.1 Misuse risks
4.1.1 Examples of misuse risk
4.1.2 Approach: block access to dangerous capabilities
4.2 Misalignment risks
4.2.1 Examples of misaligned models with short horizon goals
4.2.2 Examples of misaligned models with long horizon goals
4.2.3 Sources of misalignment
4.2.4 Deceptive alignment
4.2.5 Approach: Informed oversight
4.3 Mistakes
4.4 Structural risks
5 Addressing misuse
5.1 Misuse safety case components
5.2 Capability based risk assessment
5.2.1 Threat modeling
5.2.2 Capability evaluations
5.2.3 Capability elicitation
5.3 Model deployment mitigations
5.3.1 Harmlessness post training
5.3.2 Jailbreak resistance
5.3.3 Capability suppression
5.4 Monitoring
5.4.1 AI specific approaches
5.4.2 Monitoring in other industries
5.5 Access restrictions
5.5.1 Carrying out an access review process
5.5.2 Configuring the access control
5.6 Securing model weights
5.6.1 Identity and access control
5.6.2 Environment hardening
5.6.3 Encrypted processing
5.7 Societal readiness
5.8 Red teaming mitigations
5.8.1 Examples of tests
5.8.2 Open problems and paths forward
6 Addressing Misalignment
6.1 Amplified oversight
6.1.1 Methods
6.1.2 Human-AI complementarity
6.1.3 Human biases
6.1.4 Inferences without succinct justification
6.1.5 Sidestepping improving human oversight
6.1.6 Discussion
6.1.7 Path forward
6.2 Guiding model behaviour
6.3 Robust AI oversight for training and monitoring
6.3.1 Judge safety
6.3.2 Generalization
6.3.3 Uncertainty estimation
6.3.4 Resource efficient oversight
6.3.5 Robust training
6.3.6 Monitoring
6.3.7 Continual learning
6.4 Security
6.5 Safer Design Patterns
6.5.1 Training targets
6.5.2 Suppressing pretraining biases towards misalignment
6.5.3 Limited optimization
6.5.4 Externalized reasoning
6.6 Interpretability
6.6.1 How do interpretability methods vary
6.6.2 How could we use interpretability
6.6.3 Interpretability techniques
6.7 Alignment stress tests
6.7.1 Motivation
6.7.2 Case studies
6.8 Safety cases
6.8.1 Inability safety case
6.8.2 Control
6.8.3 Potential future safety cases
6.9 Other areas
7 Conclusion
Severe risks require a radically different mitigation approach than most other risks, as we cannot wait to observe them happening in the wild, before deciding how to mitigate them. For this reason, we limit our scope in this paper to severe risks, even though other risks are also important.
Even simply automating current human capabilities, and then running them at much larger scales and speeds, could lead to results that are qualitatively beyond what humans accomplish. Besides this, AI systems could have native interfaces to high-quality representations of domains that we do not understand natively: AlphaFold (Jumper et al., 2021) has an intuitive understanding of proteins that far exceeds that of any human.
This definition relies on a very expansive notion of what it means to know something, and was chosen for simplicity and brevity. A more precise but complex definition can be found in Section 4.2. For example, we include cases where the model has learned an “instinctive” bias, or where training taught the model to “honestly believe” that the developer’s beliefs are wrong and actually some harmful action is good, as well as cases where the model causes harm without attempting to hide it from humans. This makes misalignment (as we define it) a broader category than related notions like deception.
By restricting misalignment to cases where the AI knowingly goes against the developer’s intent, we are considering a specific subset of the wide set of concerns that have been considered a part of alignment in the literature. For example, most harms from failures of multi-multi alignment (Critch and Krueger, 2020) would be primarily structural risks in our terminology.
Although this is framed more from a traditional reward model training perspective, the same applies to language-based reward models or constitutional methods, as well as to thinking models. In all cases, we ask what tasks the model needs to be exposed to in order to produce aligned actions in every new situation.