We wanted to share a recap of our recent outputs with the AF community. Below, we fill in some details about what we have been working on, what motivated us to do it, and how we thought about its importance. We hope that this will help people build off things we have done and see how their work fits with ours.

Who are we?

We’re the main team at Google DeepMind working on technical approaches to existential risk from AI systems. Since our last post, we’ve evolved into the AGI Safety & Alignment team, which we think of as AGI Alignment (with subteams like mechanistic interpretability, scalable oversight, etc.), and Frontier Safety (working on the Frontier Safety Framework, including developing and running dangerous capability evaluations). We’ve also been growing since our last post: by 39% last year, and by 37% so far this year. The leadership team is Anca Dragan, Rohin Shah, Allan Dafoe, and Dave Orr, with Shane Legg as executive sponsor. We’re part of the overall AI Safety and Alignment org led by Anca, which also includes Gemini Safety (focusing on safety training for the current Gemini models), and Voices of All in Alignment, which focuses on alignment techniques for value and viewpoint pluralism. 

What have we been up to?

It’s been a while since our last update, so below we list out some key work published in 2023 and the first part of 2024, grouped by topic / sub-team. 

Our big bets for the past 1.5 years have been 1) amplified oversight, to enable the right learning signal for aligning models so that they don’t pose catastrophic risks, 2) frontier safety, to analyze whether models are capable of posing catastrophic risks in the first place, and 3) (mechanistic) interpretability, as a potential enabler for both frontier safety and alignment goals. Beyond these bets, we experimented with promising areas and ideas that help us identify new bets we should make. 

Frontier Safety

The mission of the Frontier Safety team is to ensure safety from extreme harms by anticipating, evaluating, and helping Google prepare for powerful capabilities in frontier models. While the focus so far has been primarily around misuse threat models, we are also working on misalignment threat models.

FSF

We recently published our Frontier Safety Framework, which, in broad strokes, follows the approach of responsible capability scaling, similar to Anthropic’s Responsible Scaling Policy and OpenAI’s Preparedness Framework. The key difference is that the FSF applies to Google: there are many different frontier LLM deployments across Google, rather than just a single chatbot and API (this in turn affects stakeholder engagement, policy implementation, mitigation plans, etc).

We’re excited that our small team led the Google-wide strategy in this space, and demonstrated that responsible capability scaling can work for large tech companies in addition to small startups. 

A key area of the FSF we’re focusing on as we pilot the Framework, is how to map between the critical capability levels (CCLs) and the mitigations we would take. This is high on our list of priorities as we iterate on future versions.

Some commentary (e.g. here) also highlighted (accurately) that the FSF doesn’t include commitments. This is because the science is in early stages and best practices will need to evolve. But ultimately, what we care about is whether the work is actually done. In practice, we did run and report dangerous capability evaluations for Gemini 1.5 that we think are sufficient to rule out extreme risk with high confidence.

Dangerous Capability Evaluations

Our paper on Evaluating Frontier Models for Dangerous Capabilities is the broadest suite of dangerous capability evaluations published so far, and to the best of our knowledge has informed the design of evaluations at other organizations. We regularly run and report these evaluations on our frontier models, including Gemini 1.0 (original paper), Gemini 1.5 (see Section 9.5.2), and Gemma 2 (see Section 7.4). We’re especially happy to have helped develop open sourcing norms through our Gemma 2 evals. We take pride in currently setting the bar on transparency around evaluations and implementation of the FSF, and we hope to see other labs adopt a similar approach.

Prior to that we set the stage with Model evaluation for extreme risks, which set out the basic principles behind dangerous capability evaluation, and also talked more holistically about designing evaluations across present day harms to extreme risks in Holistic Safety and Responsibility Evaluations of Advanced AI Models.

Mechanistic Interpretability

Mechanistic interpretability is an important part of our safety strategy, and lately we’ve focused deeply on Sparse AutoEncoders (SAEs). We released Gated SAEs and JumpReLU SAEs, new architectures for SAEs that substantially improved the Pareto frontier of reconstruction loss vs sparsity. Both papers rigorously evaluate the architecture change by running a blinded study evaluating how interpretable the resulting features are, showing no degradation. Incidentally, Gated SAEs was the first public work that we know of to scale and rigorously evaluate SAEs on LLMs with over a billion parameters (Gemma-7B).

We’ve also been really excited to train and release Gemma Scope, an open, comprehensive suite of SAEs for Gemma 2 2B and 9B (every layer and every sublayer). We believe Gemma 2 sits at the sweet spot of “small enough that academics can work with them relatively easily” and “large enough that they show interesting high-level behaviors to investigate with interpretability techniques”. We hope this will make Gemma 2 the go-to models of choice for academic/external mech interp research, and enable more ambitious interpretability research outside of industry labs. You can access Gemma Scope here, and there’s an interactive demo of Gemma Scope, courtesy of Neuronpedia.

You can also see a series of short blog posts on smaller bits of research in the team’s progress update in April.

Prior to SAEs, we worked on:

  • Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla: The key contribution here was to show that the circuit analysis techniques used in smaller models scaled: we gained significant understanding about how, after Chinchilla (70B) “knows” the answer to a multiple choice question, it maps that to the letter corresponding to that answer.
  • Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level: While this work didn’t reach its ambitious goal of mechanistically understanding how facts are computed in superposition in early MLP layers, it did provide further evidence that superposition is happening, and falsified some simple hypotheses about how factual recall might work. It also provided guidelines for future work in the area, such as viewing the early layers as producing a “multi-token embedding” that is relatively independent of prior context.
  • AtP∗: An efficient and scalable method for localizing LLM behaviour to components: A crucial aspect of circuit discovery is finding which components of the model are important for the behavior under investigation. Activation patching is the principled approach, but requires a separate pass for each component (comparable to training a model), whereas attribution patching is an approximation, but can be done for every component simultaneously with two forward & one backward pass. This paper investigated attribution patching, diagnosed two problems and fixed them, and showed that the resulting AtP* algorithm is an impressively good approximation to full activation patching.
  • Tracr: Compiled Transformers as a Laboratory for Interpretability: Enabled us to create Transformer weights where we know the ground truth answer about what the model is doing, allowing it to serve as a test case for our interpretability tools. We’ve seen a few cases where people used Tracr, but it hasn’t had as much use as we’d hoped for, because Tracr-produced models are quite different from models trained in the wild. (This was a known risk at the time the work was done, but we hoped it wouldn’t be too large a downside.)

Amplified Oversight

Our amplified oversight work aims to provide supervision on any single situation that is as close as possible to that of a human with complete understanding of all of the reasons that the AI system produced its output - including when the AI has a very broad range of superhuman capabilities. (The community often calls this “scalable oversight”, but we want to be clear that this does not necessarily include scaling to large numbers of situations, as in monitoring.)

Theoretical Work on Debate

On the theoretical side, the original debate protocol enables a polynomial-time verifier to decide any problem in PSPACE given debates between optimal debaters. But our AI systems are not optimal, and we should not assume they are! It doesn't matter if an optimal AI could refute lies, if the AI systems we train in practice cannot do so. The problem of obfuscated arguments is exactly when a dishonest debater lies by breaking an easy problem down into hard subproblems that an optimal honest AI could answer but a bounded one could not.

Doubly-efficient debate provides a new debate protocol that enables a polynomial-time honest strategy to prove facts to an even more limited judge, even against an unbounded dishonest strategy. This is not quite yet what we want: the honest strategy is only polynomial-time in the length of the human-judgeable argument, whereas we would like it to be efficient in terms of the length of the AI-judgeable argument. We have some work in progress that we hope will cross this gap, and we expect that if we do cross the gap this will influence which protocols we study in our empirical work.

Empirical Work on Debate

On the empirical side, we ran inference-only experiments with debate that help challenge what the community expects. First, on tasks with information asymmetry, theory suggests that debate should be close to as good as (or even better than) giving the judge access to the full information, whereas in these inference-only experiments debate performs significantly worse. Second, on tasks without information asymmetry, weak judge models with access to debates don’t outperform weak judge model without debate. Third, we find only limited evidence that stronger debaters lead to much higher judge accuracy – and we really need to make this be the case for debate to succeed in the long run.

Qualitatively, our sense is that these issues occur because the models are not very good at judging debates: the actual debater arguments seem quite good. Our current work is looking into training our LLM judges to be better proxies of human judges, after which we plan to try finetuning the debaters using the debate protocol, and checking that this closes the gaps we’ve observed.

Causal Alignment

A long-running stream of research in our team explores how understanding causal incentives can contribute to designing safe AI systems. Causality gives us pretty general tools for understanding what agents that are ‘trying’ to achieve goals will do, and provides explanations for how they act. We developed algorithms for discovering agents, which can help us identify which parts of systems can be understood through an agent-lens. In principle, this could allow us to empirically discover goal-directed agents, and determine what they are optimizing for.

We have also shown that causal world models are a key aspect of agent robustness, suggesting that some causal tools are likely to apply to any sufficiently powerful agent. The paper got an Honorable Mention for Best Paper at ICLR 2024. This work continues to inform the development of safety mitigations that work by managing an agent’s incentives, such as methods based on process supervision. It can also be used to design consistency checks that look at long-run behavior of agents in environments, extending the more short-horizon consistency checks we have today.

Emerging Topics

We also do research that isn’t necessarily part of a years-long agenda, but is instead tackling one particular question, or investigating an area to see whether it should become one of our longer-term agendas. This has led to a few different papers:

One alignment hope that people have (or at least had in late 2022) is that there are only a few “truth-like” features in LLMs, and that we can enumerate them all and find the one that corresponds to the “model’s beliefs”, and use that to create an honest AI system. In Challenges with unsupervised LLM knowledge discovery, we aimed to convincingly rebut this intuition by demonstrating a large variety of “truth-like” features (particularly features that model the beliefs of other agents). We didn’t quite hit that goal, likely because our LLM wasn’t strong enough to show such features, but we did show the existence of many salient features that had at least the negation consistency and confidence properties of truth-like features, which “tricked” several unsupervised knowledge discovery approaches.

Explaining grokking through circuit efficiency was a foray into “science of deep learning”. It tackles the question: in grokking, why does the network’s test performance improve dramatically upon continued training, despite having already achieved nearly perfect training performance? It gives a compelling answer to this question, and validates this answer by correctly predicting multiple novel phenomena in a similar setting. We hoped that better understanding of training dynamics would enable improved safety, but unfortunately that hope has mostly not panned out (though it is still possible that the insights would help with detection of new capabilities). We’ve decided not to invest more in “science of deep learning”, because there are other more promising things to do, but we remain excited about it and would love to see more research on it.

Power-seeking can be probable and predictive for trained agents is a short paper building on the power-seeking framework that shows how the risk argument would be made from the perspective of goal misgeneralization of a learned agent. It still assumes that the AI system is pursuing a goal, but specifies that the goal comes from a set of goals that are consistent with the behavior learned during training.

Highlights from Our Collaborations

We further broaden our portfolio by collaborating extensively with other teams at Google DeepMind, as well as external researchers.

Within Google, in addition to work that we do to inform the safe development of frontier models, we collaborate with our Ethics and Responsibility teams. In The Ethics of Advanced Assistants we helped consider the role of value alignment and issues around manipulation and persuasion as part of the ethical foundations for building artificial assistants. We further explored this with a focus on persuasion and a focus on justified trust.

Our researchers also do a lot of work with external collaborators and students. Sebastian Farquhar was lead author on Detecting hallucinations in large language models using semantic entropy, published in Nature and covered in roughly 500 newspapers and magazines. It explores a maximally simple case of ‘cross-examination’ showing how a model’s output consistency can be used to predict false answers. This work is now being taken forward by other teams in Google.

Many researchers on our team have mentored MATS scholars, including Arthur Conmy, Scott Emmons, Sebastian Farquhar, Victoria Krakovna, David Lindner, Neel Nanda, and Alex Turner, on wide-ranging topics including power-seeking, steering model behavior, or avoiding jailbreaks. Neel Nanda’s stream of interpretability work through MATS is especially prolific, producing more than ten papers including work on copy suppressioncheaply jailbreaking models using their internals, and interpreting attention layer outputs with SAEs.

Much of our work on Causal Incentives is done with external collaboration alongside members of the causal incentives working group including work on intention and instrumental goalsmitigating deception, and linking causal models to decision theories.

Lastly, we have collaborated externally to help produce benchmarks and evaluations of potentially unsafe behavior including contributing environments to OpenAI’s public suite of dangerous capability evaluations and contributing to the BASALT evaluations for RL with fuzzily specified goals.

What are we planning next?

Perhaps the most exciting and important project we are working on right now is revising our own high level approach to technical AGI safety. While our bets on frontier safety, interpretability, and amplified oversight are key aspects of this agenda, they do not necessarily add up to a systematic way of addressing risk. We’re mapping out a logical structure for technical misalignment risk, and using it to prioritize our research so that we better cover the set of challenges we need to overcome. 

As part of that, we’re drawing attention to important areas that require addressing. Even if amplified oversight worked perfectly, that is not clearly sufficient to ensure alignment. Under distribution shift, the AI system could behave in ways that amplified oversight wouldn’t endorse, as we have previously studied in goal misgeneralization. Addressing this will require investments in adversarial training, uncertainty estimation, monitoring, and more; we hope to evaluate these mitigations in part through the control framework.

We’re looking forward to sharing more of our thoughts with you when they are ready for feedback and discussion. Thank you for engaging and for holding us to high standards for our work, epistemics, and actions.

New Comment


33 comments, sorted by Click to highlight new comments since:

Some commentary (e.g. here) also highlighted (accurately) that the FSF doesn’t include commitments. This is because the science is in early stages and best practices will need to evolve.   

I think it’s really alarming that this safety framework contains no commitments, and I’m frustrated that concern about this is brushed aside. If DeepMind is aiming to build AGI soon, I think it’s reasonable to expect they have a mitigation plan more concrete than a list of vague IOU’s. For example, consider this description of a plan from the FSF:

When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan based on the analysis of the CCL and evaluation results.

This isn’t a baseline of reasonable practices upon which better safety practices can evolve, so much as a baseline of zero—the fact that DeepMind will take some unspecified action, rather than none, if an evaluation triggers does not appear to me to be much of a safety strategy at all.

Which is especially concerning, given that DeepMind might create AGI within the next few years. If it’s too early to make commitments now, when do you expect that will change? What needs to happen, before it becomes reasonable to expect labs to make safety commitments? What sort of knowledge are you hoping to obtain, such that scaling becomes a demonstrably safe activity? Because without any sense of when these “best practices” might emerge, or how, the situation looks alarmingly much like the labs requesting a blank check for as long as they deem fit, and that seems pretty unacceptable to me. 

AI is a nascent field. But to my mind its nascency is what’s so concerning: we don’t know how to ensure these systems are safe, and the results could be catastrophic. This seems much more reason to not build AGI than it does to not offer any safety commitments. But if the labs are going to do the former, then I think they have a responsibility to provide more than a document of empty promises—they should be able to state, exactly, what would cause them to stop building these potentially lethal machines. So far no lab has succeeded at this, and I think that’s pretty unacceptable. If labs would like to take up the mantle of governing their own safety, then they owe humanity the service of meaningfully doing it.

Thanks for sharing this! Keep up the good work. I'm excited to hear about your new research directions/strategy!

Yay DeepMind safety humans for doing lots of (seemingly-)good safety work. I'm particularly happy with DeepMind's approach to creating and sharing dangerous capability evals.

Yay DeepMind for growing the safety teams substantially:

We’ve also been growing since our last post: by 39% last year, and by 37% so far this year.

What's the size of the AGI Alignment and Frontier Safety teams now?

It depends fairly significantly on how you draw the boundaries; I think anywhere between 30 and 50 is defensible. (For the growth numbers I chose one specific but arbitrary way of drawing the boundaries, I expect you'd get similar numbers using other methods of drawing the boundaries.) Note this does not include everyone working on safety, e.g. it doesn't include the people working on present day safety or adversarial robustness.

I took a look at the debate papers. I think that's a good angle to take, but they're missing some factors that sometimes make debates between humans fail.

  1. Humans and neural networks both have some implicit representation of probability distributions of output types. The basis behind "I can't explain why but that seems unlikely" can be more accurate than "here's an argument for why that will happen". You're basically delegating the problem of "making AI thinking explainable" to the AI itself, but if you could do that, you could just...make neural networks explainable, perhaps by asking an AI what another AI is doing and doing RLHF on the response. But that doesn't seem to work in general. In other words, the problem is that by using only the arguments output by NNs, those are weaker agents than NNs that don't have to go through production of arguments.

  2. Reasoning about probability distributions means argument branches can be of the type "X is a likely type of thing" vs "X is rare". And empirically checking the distribution can be too expensive. That makes the debate framework not work as well.

Strongly agree on the first challenge; on the theory workstream we're thinking about how to deal with this problem. Some past work (not from us) is here and here.

Though to be clear, I don't think the empirical evidence clearly rules out "just making neural networks explainable". Imo, if you wanted to do that, you would do things in the style of debate and prover-verifier games. These ideas just haven't been tried very much yet. I don't think "asking an AI what another AI is doing and doing RLHF on the response" is nearly as good; that is much more likely to lead to persuasive explanations that aren't correct.

I'm not that compelled by the second challenge yet (though I'm not sure I understand what you mean). My main question here is how the AI system knows that X is likely or that X is rare, and why it can't just explain that to the judge. E.g. if I want to argue that it is rare to find snow in Africa, I would point to weather data I can find online, or point to the fact that Africa is mostly near the Equator, I wouldn't try to go to different randomly sampled locations and times in Africa and measure whether or not I found snow there.

To clarify the 2nd point, here's an example. Suppose someone presents you with a large box that supposedly produces electricity endlessly. Your boss thinks it works, and you're debating the inventor in front of your boss.

"Perpetual motion machines are known to be impossible" you say, but your boss isn't familiar with that conceptual class or the reasoning involved.

The inventor says, "Here, let's plug in a thing, we can see that the box does in fact produce a little electricity." Your boss finds this very convincing.

The process proposed in the paper is something like, "let's randomly sample every possible machine to see if it does perpetual motion". So the inventor points to the sun and says, "that thing has been making energy continuously and never stops for as long as we've been able to tell". They point to some stars and say the same thing.

The sampling and evaluation is dependent on a conceptual framework that isn't agreed on, and waiting for the sun and stars to burn out isn't very practical.

There are several different outs to this example:

  • You should at least be able to argue that the evidence does not support the conclusion, and that the boss should have substantial probability on "the box can make some electricity but not infinitely much".
  • You can recursively decompose the claim "perpetual motion machines are known to be impossible" until you get down to a claim like "such and such experiment should have such and such outcome", which the boss can then perform to determine a winner.
    • This does not mean that the boss then understands why perpetual motion machines are impossible -- an important aspect of debate that it aims to produce good oversight of claims without giving the judge an understanding of those claims.
    • This particular approach will likely run into the problem of obfuscated arguments though.
  • The debaters are meant to be copies of the same AI, and to receive exactly the same information, with the hope that each knows what the other knows. In the example, this hopefully means that you understand how the inventor is tricking your boss, and you can simply point it out and explain it.
    • If the inventor legitimately believes the box produces infinite electricity, this won't work, but also I consider that out of scope for what debate needs to do. We're in the business of getting the best answer given the AI's knowledge, not the true answer.
    • If both you and the inventor know that the claim is impossible from theory, but don't know the local error that the inventor made, this won't work.
  • You can cross-examine the inventor and show that in other contexts they would agree that perpetual energy machines are impossible. (Roughly speaking, cross-examination = wiping memory and asking a new question.)

The process proposed in the paper

Which paper are you referring to? If you mean doubly efficient debate, then I believe the way doubly efficient debate would be applied here is to argue about what the boss would conclude if he thought about it for a long time.

You can recursively decompose the claim "perpetual motion machines are known to be impossible" until you get down to a claim like "such and such experiment should have such and such outcome", which the boss can then perform to determine a winner.

Ah, I don't think you can. Making that kind of abstract conclusion from a practical number of experiments requires abstractions like potential energy, entropy, Noether's theorem, etc - which in this example, the judge doesn't understand. (Without such abstractions, you'd need to consider every possible type of machine separately, which isn't feasible.) This seems like a core of our disagreement here.

You can cross-examine the inventor and show that in other contexts they would agree that perpetual energy machines are impossible.

The debaters are the same AI with different contexts, so the same is true of both debaters. Am I missing something here?

Which paper are you referring to? If you mean doubly efficient debate

Yes, "doubly efficient debate".

Making that kind of abstract conclusion from a practical number of experiments requires abstractions like potential energy, entropy, Noether's theorem, etc - which in this example, the judge doesn't understand. (Without such abstractions, you'd need to consider every possible type of machine separately, which isn't feasible.)

I agree, but I don't see why that matters. As I mentioned, a main point of debate is to produce good oversight of claims without giving the judge an understanding of those claims. In this example I would imagine that you decompose the argument as:

  1. A fundamental law of physics is conservation of energy: energy can neither be created nor destroyed, only transformed from one form to another.
  2. Electricity is a form of energy.
  3. This box does not have an infinite source of energy.
  4. The above three together imply that the box cannot produce infinite electricity.

The inventor can disagree with one or more of these claims, then we sample one of the disagreements, and continue debating that one alone, ignoring all the others. This doesn't mean the judge understands the other claims, just that the judge isn't addressing them when deciding who wins the overall debate.

If we recurse on #1, which I expect you think is the hardest one, then you could have a decomposition like "the principle has been tested many times", "in the tests, confirming evidence outweighs the disconfirming evidence", "there is an overwhelming scientific consensus behind it", "there is significant a priori theoretical support" (assuming that's true), "given the above the reasonable conclusion is to have very high confidence in conservation of energy". Again, find disagreements, sample one, recurse. It seems quite plausible to me that you get down to something fairly concrete relatively quickly.

If you want to disallow appeals to authority, on the basis that the correct analogy is to superhuman AIs that know tons of stuff that aren't accepted by any authorities the judge trusts, I still think it's probably doable with a larger debate, but it's harder for me to play out what the debate would look like because I don't know in enough concrete detail the specific reasons why we believe conservation of energy to be true. I might also disagree that we should be thinking about such big gaps between AI and the judge, but that's not central.

The debaters are the same AI with different contexts, so the same is true of both debaters. Am I missing something here?

That seems right, but why is it a problem?

The honest strategy is fine under cross-examination, it will give consistent answers across contexts. Only the dishonest strategy will change its answers (sometimes saying the perpetual energy machines are impossible sometimes saying that they are possible).

If you want to disallow appeals to authority

I do, but more importantly, I want to disallow the judge understanding all the concepts here. Suppose the judge says to #1: "What is energy?" or "What is conservation?" and it can't be explained to them - what then?

Also, argument 1 isn't actually correct, E=mc^2 and so on.

That seems right, but why is it a problem? The honest strategy is fine under cross-examination, it will give consistent answers across contexts.

"The honest strategy"? If you have that, you can just ask it and not bother with the debate. If the problem is distinguishing it, and only dishonest actors are changing their answers based on the provided situation, you can just use that info. But why are you assuming you have an "honest strategy" available here?

I do, but more importantly, I want to disallow the judge understanding all the concepts here.

I think I don't actually care about being robust to this assumption. Generally I think of arbitrarily-scalable-debate as depending on a universality assumption (which in turn would rule out "the judge can never understand the concepts"). But even if the universality assumption is false, it wouldn't bother me much; I don't expect such a huge gap between debaters and judges that the judge simply can't understand the debaters' concepts, even given arbitrary amounts of time and arbitrary amounts of explanation from the debaters. (Importantly, I would want to bootstrap alignment, to keep the gaps between debaters and the judge relatively small.)

"The honest strategy"? If you have that, you can just ask it and not bother with the debate. If the problem is distinguishing it, and only dishonest actors are changing their answers based on the provided situation, you can just use that info. But why are you assuming you have an "honest strategy" available here?

The general structure of a debate theorem is: if you set up the game in such-and-such way, then a strategy that simply answers honestly will dominate any other strategy.

So in this particular case I am saying: if you penalize debaters that are inconsistent under cross-examination, you are giving an advantage to any debater that implements an honest strategy, and so you should expect training to incentivize honesty.

I don't expect such a huge gap between debaters and judges that the judge simply can't understand the debaters' concepts

You don't? But this is a major problem in arguments between people. The variation within humans is already more than enough for this! There's a gap like that every 35 IQ points or so. I don't understand why you're confident this isn't an issue.

I guess we've found our main disagreement, at least?

So in this particular case I am saying: if you penalize debaters that are inconsistent under cross-examination, you are giving an advantage to any debater that implements an honest strategy, and so you should expect training to incentivize honesty.

Now you're training for multiple objectives:

  1. You want the debater AI to argue for proposition A or not-A according to its role and convince human judges of that.
  2. You want it to not change its position on sub-arguments.

But (2) is ill-defined. Can sub-arguments be combined for less weighting? Are they all worth the same? What if you have several sub-arguments that all depend on a single sub-2-argument? Good arguments for A or not-A should have lots of disagreements - or do you want to train AI that makes all the same sub-arguments for A or not-A and then says "this implies A / not-A"? I don't think this works.


In response to the linked "HCH" post:

Yes, an agent past some threshold can theoretically make a more-intelligent agent. But that doesn't say anything about alignment; the supposed "question-answering machine" would be subject to instrumental convergence and mesaoptimizer issues, and you'd get value drift with each HCH stage, just as you would with RSI schemes.

@bhauth @Rohin Shah I think that bhauth has an important point here about the danger of large gaps between judge and debaters. Similarly, between a trusted overseer and a smarter worker. Keeping the gaps small is really important for a lot of oversight plans to work out well!

Here's some research I am doing which I think answers this point thoroughly: it is possible to smoothly, continuously, incrementally scale-down the capabilities of a model by injecting carefully controlled amounts of noise into its activations. I'm calling this 'noise injection impairment'.

This removes the need to have precisely created a whole series of models with precise capability steps between each one. You can instead train a single strong model, and scale it all the way down to be just a tiny step above the next most strong model. Then you create as large a number of intermediate steps of capability as you need by reducing the noise magnitude.

Without this technique, then I believe bhauth's point would stand, and capability gaps between model versions would lead to dangerous failures of various control and monitoring schemes.

Link to details of ongoing research: https://www.apartresearch.com/project/sandbag-detection-through-model-degradation 

I think the basic idea of using more steps of smaller size is worth considering. Maybe it reduces overall drift, but I suspect it doesn't, because my view is:

Models have many basins of attraction for sub-elements. As model capability increases continuously, there are nearly-discrete points where aspects of the model jump from 1 basin to another, perhaps with cascading effect. I expect this to produce large gaps from small changes to models.

I'm not going to repeat all of the literature on debate here, but as brief pointers:

  • Factored cognition discusses intuitively why we can hope to approximate exponentially-sized trees of arguments (which would be tremendously bigger than arguments between people)
  • AI safety via debate makes the same argument for debate (by showing that a polynomial time judge can supervise PSPACE -- PSPACE-complete problems typically involve exponential-sized trees)
  • Cross-examination is discussed here
  • This paper discusses the experiments you'd do to figure out what the human judge should be doing to make debate more effective
  • The comments on this post discuss several reasons not to anchor to human institutions. There are even more reasons not to anchor to disagreements between people, but I didn't find a place where they've been written up with a short search. Most centrally, disagreements between people tend to focus on getting both people to understand their position, but the theoretical story for debate does not require this.

(Also, the "arbitrary amounts of time and arbitrary amounts of explanation" was pretty central to my claim; human disagreements are way more bounded than that.)

The scope of our argument seems to have grown beyond what a single comment thread is suitable for.

AI safety via debate is 2 years before Writeup: Progress on AI Safety via Debate so the latter post should be more up-to-date. I think that post does a good job of considering potential problems; the issue is that I think the noted problems & assumptions can't be handled well, make that approach very limited in what it can do for alignment, and aren't really dealt with by "Doubly-efficient debate". I don't think such debate protocols are totally useless, but they're certainly not a "solution to alignment".

(The community often calls this “scalable oversight”, but we want to be clear that this does not necessarily include scaling to large numbers of situations, as in monitoring.)

I like this terminology and think the community should adopt it

This is really helpful to get such an overview. It's an impressive body of work!

Building on the Frontier Safety Team's recent work on persuasion, do you see an expansion of human-AI interaction experiments?

Based on Dafoe's work, how is the AGI Safety Team currently thinking about structural risks and corresponding threat models? For instance, cyber and CBRN threats are recognized as misuse cases. Are there evals/red teaming planned for capabilities that could significantly impact, say, nuclear deterrence, especially in scenarios with asymmetric access to such capabilities? Or is this seen as too contextual and covered elsewhere? 

Relatedly on the mitigations aspect of the FSF, is the team confident about having sufficient buffer time to implement the security mitigations? For example, effective supply chain security and vetting infrastructure have long lead times.

For a while, there has been a growing focus into safety training using activation engineering, such as via circuit breakers and LAT (more LAT). There's also new work on improving safety training and always plenty of new red-teaming attacks that (ideally) create space for new defenses. I'm not sure if what I'm illustrating here is 100% a coherent category, but generally I mean to include methods that are applicable IRL (i.e. the Few Tokens Deep paper uses the easiest form of data augmentation ever and it seems to fix some known vulnerabilities effectively), can be iterated alongside red-teaming to get increasingly better defenses, and focus on interventions to safety-relevant phenomena (more on this below).

Is DM exploring this sort of stuff? It doesn't seem to be under the mantle of "AGI Safety" given the post above. Maybe it's another team? It's true that it's more "AI" than "AGI" safety, and that we need the more scientific/theoretical AGI Safety research illustrated in the post too, if we are to have a reasonably good future alongside AGIs. With that said, this sort of more empirical red-teaming + safety-training oriented research has some benefits:

  • You get to create interesting questions for the MI people that are totally toy models, thereby making their work more useful IRL and creating more information on which to gain broader theoretical understanding of phenomena.
  • You actually fix problems today. You can also fail fast. I don't know much about the debate literature, but look at the debate example from my perspective: (1) 6 years ago someone conceptualized debate and made some theoretical argument (2) there was a community with expectations and probably a decent amount of discussion about debate in the span of those 6 years, even including (theoretical) improvements on the original debate ideas, (3) someone actually tried debate and it didn't work as expected... today... 6 years later. I understand that for debate you probably need good-enough models (unless you are more clever than me—and probably can be), so maybe harping on debate is not fair. That's not what I'm trying to do here, anyways. Mainly I'm just highlighting that when we can iterate by solving real problems and getting feedback in shorter timespans, we can actually get a lot more safety.
  • A lot of safety training is about controlling/constraining what the AI can say/do so that it won't say/do the bad things. The tools of this sort of control are pretty generic, so it's not unlikely that they would provide some benefits in future situations as well. As models scale (capabilities), so long as we keep improving our methods for red-teaming+safety training, these sorts of semi-empirical tools should roughly scale (in their capability to control/constrain the AI). It is more likely that by working on pure science & tools that "will be useful eventually" we are overall less safe and have larger jumps in the size of the gap between the ability of an AI to cause harm and our ability to keep it from doing so.

The way I see it there are roughly 4 categories (though maybe this is rather procrustean and I'm missing some) of research that can be done in AI Safety:

  1. Pure science: this is probably very useful but in a very long time. It will be very interesting and not show up in the real world until kind of late. I think a large proportion of MI falls into this. AFAIK no one uses SAEs IRL for safety tasks? With that said, they will surely be very scientifically useful. Maybe steering vectors are the exception, but they also are in 4 (below). Pure science is usually more about understanding how things work first before being able to intervene.
  2. Evals: (in the broad sense of the word, including safety and capabilities benchmarks) self explanatory. Useful at every stage.
  3. Safety Theory: into this I lump ideas like debate and amplified oversight, which don't really do much in the real world (products people use, etc...) right now AFAIK (not sure, am I wrong?) but are a combination of (primarily, still) conceptual frameworks for how we could have AGI Safety plus the tools to enact that framework. Usually, things from this category arose from someone thinking about how we could have AI Safety in the future, and coming up with some strategy. That strategy is often not really enactable until the future, with perhaps some toy models as exceptions, so I call these "theory."
  4. Safety Practice: into this I lump basically most red-teaming attacks, prompt/activation engineering, and safety training methods that people use or are possible to plug in.  These methods usually arise because there is a clear, real-world problem and so their goal is to fix that problem. They are usually applicable in short timespans and are sometimes a little bit of a patchwork, but iterative and possibly to test and improve. More so than 3 (above), they arise from a realistic current need instead of a likely future need. Unlike 1(above) they are focused on making interventions first and understanding later.

In this categorization, it seems like DM's AGI Safety team is very much more focused on 1,2, and 3. There's nothing wrong with any of these, but it would seem like 2 and 4 should be the bread and butter right? Is there any sort of 4 work going on? Aren't companies like DM in a much better position to do this sort of work than the academic labs and other organizations that you find publishing this stuff? You guys have access to the surrounding systems (meaning you can gain a better understanding of attack vectors and side-effects than someone who is just testing the input/output of a chatbot) , have access to the model internals, have boatloads of compute (it would also be nice to know how things like LAT work on a full-scale model instead of just Llama3-8B), and are a common point of failure (most people are using models from OAI, Anthropic, DM, Meta). Maybe I'm conflating DM with other parts of Alphabet?

Anyways, I'm curious where things along the lines of 4 figure in to your plan for AGI Safety. It would be criminal to try and make AI "safe" while ignoring all the real world, challenging-but-tractable, information-rich challenges that arise from things such as red-teaming attacks that can happen today.  Also curious to hear if you think this categorization is flawed in some key way. 

Google DeepMind does lots of work on safety practice, mostly by other teams. For example, Gemini Safety (mentioned briefly in the post) does a lot of automated red teaming. The AGI Safety & Alignment team has also contributed to safety practice work. GDM usually doesn't publish about that work, mainly because the work here is primarily about doing all the operational work necessary to translate existing research techniques into practice, which doesn't really lend itself to paper publications.

I disagree that the AGI safety team should have 4 as its "bread and butter". The majority of work needed to do safety in practice has little relevance to the typical problems tackled by AGI safety, especially misalignment. There certainly is some overlap, but in practice I would guess that a focus solely on 4 would cause around an order of magnitude slowdown in research progress. I do think it is worth doing to some extent from an AGI safety perspective, because of (1) the empirical feedback loops it provides, which can identify problems you would not have thought of otherwise, and (2) at some point we will have to put our research into practice, and it's good to get some experience with that. But at least while models are still not that capable, I would not want it to be the main thing we do.

A couple of more minor points:

  • I still basically believe the story from the 6-year-old debate theory, and see our recent work as telling us what we need to do on the journey to making our empirical work better match the theory. So I do disagree fairly strongly with the approach of "just hill climb on what works" -- I think theory gives us strong reasons to continue working on debate.
  • It's not clear to me where empirical work for future problems would fit in your categorization (e.g. the empirical debate work). Is it "safety theory"? Imo this is an important category because it can get you a lot of the benefits of empirical feedback loops, without losing the focus on AGI safety.

Is DM exploring this sort of stuff?

 

Yes. On the AGI safety and alignment team we are working on activation steering - e.g. Alex Turner who invented the technique with collaborators is working on this, and the first author of a few tokens deep is currently interning on the Gemini Safety team mentioned in this post. We don't have sharp and fast lines between what counts as Gemini Safety and what counts as AGI safety and alignment, but several projects on AGI safety and alignment, and most projects on Gemini Safety would see "safety practices we can test right now" as a research goal.

That's great! Activation/representational steering is definitely important, but I wonder if it being applied right now to improve safety. I've read only a little bit of the literature, so maybe I'll just find out later :P

The fact that refusal steering is possible definitely opens the possibility to gradient-based optimization attacks, or may make it possible to explain why some attacks work. Maybe you can use this to build a jailbreak detector of some kind? I do think it's important to push to try and get techniques usable in the real world, though I also understand that science is not so linear.  Where and how do you think DM's research could get more real world grounding? (Or do you think that it's all well and good as it stands?)

Our current work is looking into training our LLM judges to be better proxies of human judges

How does this scale to superintelligent AI capabilities? Wouldn't Debate be severely restricted by a lack of accurate human judges at that point? Or is the idea akin to Weak to Strong generalisation wherein the human judge can act like a weak teacher judge at that point.

The goal with debate is to scale to situations where the debaters are much more capable than the judge, see AI safety via debate for discussion of why this seems plausible.

My apologies I didn't frame my question correctly.

Our current work is looking into training our LLM judges to be better proxies of human judges

My understanding from this statement is that the team plans to finetune Weak LLMs on human judges and then use them as a judge for Strong LLM Debates. This makes sense right now, when human judges are able to assess Strong LLM Debates fairly robustly. 

What happens when we want to use a Weak LLM as a judge but there is no accurate or good enough human judge? At that point we won't be able to finetune the Weak LLM because there is no good human judge. Do we assume that at that stage the Weak LLM itself will be pretty robust?

Oh I see. The main reason we're training weak LLMs as judges right now is because it lets us iterate faster on our research (relative to using human judges). But we're imagining having human judges when aligning a model in practice.

(To be clear, I could imagine that we use LLMs as judges even when aligning a model in practice, but we would want to see significantly more validation of the LLM judges first.)

Ah that makes sense, thank you.
Did the team also ensure that there wasn't any data leakage between the tasks being evaluated and the pretraining data? For context, I'm thinking of replicating the results with Llama so wondering about the same.

I don't know for sure, but I doubt we checked that in any depth. It would be quite hard to do, and doesn't seem that important for our purposes, since we're comparing different post-training algorithms (so pretraining data leakage would affect all of them, hopefully to similar extents).

Oh that's interesting. Wouldn't that slightly bias the results? For eg. the paper claims no advantage of debate over QA without article. Intuitively if the weak LLM isn't pretrained on QA without article then debate should work better than consultancy. On the other hand, if it is, then intuitively there should be no difference between Debate and Consultancy which is what the team observes. Wdyt?

It clearly can't be having a large effect, since the accuracies aren't near-100% for any of the methods. I agree leakage would have some effect. The mechanism you suggest is plausible, but it can't be the primary cause of the finding that debate doesn't have an advantage -- since accuracies aren't near-100% we know there are some cases the model hasn't memorized, so the mechanism you suggest doesn't apply to those inputs.

More generally, all sorts of things have systematic undesired effects on our results, aka biases. E.g. I suspect the prompts are a bigger deal. Basically any empirical paper will be subject to the critique that aspects of the setup introduce biases.

since accuracies aren't near-100% we know there are some cases the model hasn't memorized, so the mechanism you suggest doesn't apply to those inputs

That makes sense.

I suspect the prompts are a bigger deal

Do you suppose a suitable proxy for prompt quality can be replicating these experiments with LLM debaters/judges of different sizes? Let's say P is the optimal prompt and Q is a suboptimal one, then LLM performance with prompt Q <= LLM performance with prompt P <= bigger LLM performance with prompt Q.

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?