I haven't read most of the paper, but based on the Extended Abstract I'm quite happy about both the content and how DeepMind (or at least its safety team) is articulating an "anytime" (i.e., possible to implement quickly) plan for addressing misuse and misalignment risks.
But I think safety at Google DeepMind is more bottlenecked by buy-in from leadership to do moderately costly things than the safety team having good plans and doing good work. [Edit: I think the same about Anthropic.]
Given that buy-in from leadership is a bigger bottleneck than the safety team's work, what would you do differently if you were in charge of the safety team?
I don't know, maybe nothing. (I just meant that on current margins, maybe the quality of the safety team's plans isn't super important.)
I've been thinking about this, especially since Rohin has been bringing it up frequently in recent months.
I think there are potentially win-win alignment-and-capabilities advances which can be sought. I think having a purity-based "keep-my-own-hands-clean" mentality around avoiding anything that helps capabilities is a failure mode of AI safety researchers.
Win-win solutions are much more likely to actually get deployed, thus have higher expected value.
On a first reading of this summary, I take it as a small positive update on the quality of near-term AI Safety thinking at DeepMind. It treads familiar ground and doesn't raise any major red flags for me.
The two things I would most like to know are:
A "yes" to both would be a very positive surprise to me, and would lead me to additionally ask DeepMind to publicly support a global treaty that implements these bare-minimum policies in a way that can be verified and enforced.
A "no" to either would mean this work falls under milling behavior, and will not meaningfully contribute toward keeping humanity safe from DeepMind's own actions.
A "no" to either would mean this work falls under milling behavior, and will not meaningfully contribute toward keeping humanity safe from DeepMind's own actions.
I think it's probably possible to greatly improve safety given a moderate budget for safety and not nearly enough buy-in for (1) and (2). (At least not enough buy-in prior to a large incident which threatens to be very costly for the organization.)
Overall, I think high quality thinking about AI safety seems quite useful even if this level of buy-in is unlikely.
(I don't think this report should update us much about having the buy-in needed for (1)/(2), but the fact that it could be published at all in its current form is still encouraging.)
The key idea is to use the AI systems themselves to help identify the reasons that the AI system produced the output. For example, we could put two copies of the model in a setting where each model is optimized to point out flaws in the other’s outputs to a human “judge”. Ideally, if one model introduced a subtle flaw in their output that the judge wouldn’t notice by default, the other model would point out and explain the flaw, enabling the judge to penalise the first model appropriately.
In amplified oversight, any question that is too hard to supervise directly is systematically reduced to ones that we hypothesize can be supervised. However, humans may be systematically biased even for fairly simple questions. If this turns out to be a problem in practice, we could seek to model these deviations and automatically correct or account for them when interpreting the oversight.
I genuinely don't understand how anybody expects Amplified Oversight to work, it seems to me to be assuming alignment in so many places, like being able to give the judge or debate partner the goal of actually trying to get to the truth. I also don't envy the humans trying to figure out which side of these debates is in the right. We can't even do that for very obvious existing human debates.
Even if debate-like approaches work in theory for getting a signal as to something's truthiness, I don't see how they could apply to agents that aren't simply generating a 100% complete plan for debate and human approval, which is then mechanically enacted. Are you stopping the agent periodically to have another debate about what it's working on and asking the human to review yet another debate? Or is the idea that you simulate an agent and, at random points while it carries out its task, run a debate on what it's currently doing, using this to train the model to always act in ways that would allow it to win a debate?
I don't even think whether debate will work is really an empirical question. I don't know that there's any study result on current systems that could convince me this would carry on working through to AGI/ASI. Even if things looked super optimistic now, like that refuting a lie is easier than lying, I would not expect this to hold up. I don't expect the lines on these graphs to be straight.
I know a complete plan is unreasonable at this point, but can anybody provide a more detailed sketch of why they think Amplified Oversight will work and how it can be used to make agents safe in practice?
like being able to give the judge or debate partner the goal of actually trying to get to the truth
The idea is to set up a game in which the winning move is to be honest. There are theorems about the games that say something pretty close to this (though often they say "honesty is always a winning move" rather than "honesty is the only winning move"). These certainly depend on modeling assumptions, but the assumptions are more like "assume the models are sufficiently capable", not "assume we can give them a goal". When applying this, there is also a clear divergence between what the equilibrium behavior is and what RL actually finds in practice.
Despite all the caveats, I think it's wildly inaccurate to say that Amplified Oversight is assuming the ability to give the debate partner the goal of actually trying to get to the truth.
(I agree it is assuming that the judge has that goal, but I don't see why that's a terrible assumption.)
Are you stopping the agent periodically to have another debate about what it's working on and asking the human to review another debate?
You don't have to stop the agent, you can just do it afterwards.
can anybody provide a more detailed sketch of why they think Amplified Oversight will work and how it can be used to make agents safe in practice?
Have you read AI safety via debate? It has really quite a lot of conceptual points, making both the case in favor and considering several different reasons to worry.
(To be clear, there is more research that has made progress, e.g. cross-examination is a big deal imo, but I think the original debate paper is more than enough to get to the bar you're outlining here.)
(Sorry, hard to differentiate quotes from you vs quotes from the paper in this format)
(I agree it is assuming that the judge has that goal, but I don't see why that's a terrible assumption.)
If the judge is human, sure. If the judge is another AI, it seems like a wild assumption to me. The section on judge safety in your paper does a good job of listing many of the problems. One thing I want to call out as something I more strongly disagree with is:
One natural approach to judge safety is bootstrapping: when aligning a new AI system, use the previous generation of AI (that has already been aligned) as the judge.
I don't think we have any fully aligned models, and won't any time soon. I would expect bootstrapping will at most align a model as thoroughly as its predecessor was aligned (but probably less), and Goodhart's law definitely applies here.
You don't have to stop the agent, you can just do it afterwards.
Maybe this is just using the same terms for two different systems, but the paper also talks about using judging for monitoring deployed systems:
This approach is especially well-suited to monitoring deployed systems, due to the sheer scale required at deployment.
Is this intended only as an auditing mechanism, not a prevention mechanism (e.g. we noticed in the wild that the AI is failing debates, time to roll back the release)? Or are we trusting the AI judges enough at this point that we don't need to stop and involve a human? I also worry the "cheap system with high recall but low precision" will be too easy to fool for the system to be functional past a certain capability level.
From a 20,000-foot perspective, this all just looks like RLHF with extra steps that make it harder to reason about whether it'll work. If we had evidence that RLHF worked great, and its only flaw was that the AI was starting to get too complex for humans to give feedback, then I would be more interested in all of the fiddly details of amplified oversight and in working through them. The problem is that RLHF already doesn't work, and I think it'll only work less well as AI becomes smarter and more agentic.
(Meta: Going off of past experience I don't really expect to make much progress with more comments, so there's a decent chance I will bow out after this comment.)
I would expect bootstrapping will at most align a model as thoroughly as its predecessor was aligned (but probably less)
Why? Seems like it could go either way to me. To name one consideration in the opposite direction (without claiming this is the only consideration), the more powerful model can do a better job at finding the inputs on which the model would be misaligned, enabling you to train its successor across a wider variety of situations.
and goodhart's law definitely applies here.
I am having a hard time parsing this as having more content than "something could go wrong while bootstrapping". What is the metric that is undergoing optimization pressure during bootstrapping / amplified oversight that leads to decreased correlation with the true thing we should care about?
Is this intended only as an auditing mechanism, not a prevention mechanism
Yeah I'd expect debates to be an auditing mechanism if used at deployment time.
I also worry the "cheap system with high recall but low precision" will be too easy to fool for the system to be functional past a certain capability level.
Any alignment approach will always be subject to the critique "what if you failed and the AI became misaligned anyway and then past a certain capability level it evades all of your other defenses". I'm not trying to be robust to that critique.
I'm not saying I don't worry about fooling the cheap system -- I agree that's a failure mode to track. But useful conversation on this seems like it has to get into a more detailed argument, and at the very least has to be more contentful than "what if it didn't work".
The problem is RLHF already doesn't work
??? RLHF does work currently? What makes you think it doesn't work currently?
RLHF does work currently? What makes you think it doesn't work currently?
This is definitely the crux so probably really the only point worth debating.
RLHF is just papering over problems. Sure, the model is slightly more difficult to jailbreak, but it's still pretty easy to jailbreak. Sure, RLHF makes the agent less likely to output text you don't like, but I think agents will reliably overcome that obstacle, since useful agents won't just be outputting the most probable continuation; they'll be searching through decision space and finding those unlikely continuations that score well on their task. I don't think RLHF does anything remotely analogous to making the model care about whether it's following your intent, or following a constitution, etc.
You're definitely aware of the misalignment that still exists with our current RLHF'd models and have read the recent papers on alignment faking etc. so I probably can't make an argument you haven't already heard.
Maybe the true crux is you believe this is a situation where we can muddle through by making harmful continuations ever more unlikely? I don't.
Great work, it's wonderful to know that Google DeepMind isn't dismissing AI risk, and is willing to allocate at least some resources towards it.
When you have lists of problems and lists of solutions, can you quantify them by their relative importances? A very rough guess might be better than nothing.
Without knowing the relative importances, it's hard to know if we agree or disagree, since we might both say "I believe in X, Y and Z, they are very important," without realizing that one of us thinks the importance is 80% X, 10% Y, 10% Z, while the other one thinks it's 5% X, 10% Y, 85% Z.
We have written a paper on our approach to technical AGI safety and security. This post is primarily a copy of the extended abstract, which summarizes the paper. I also include the abstract and the table of contents. See also the GDM blogpost and tweet thread.
Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to significantly harm humanity. We identify four areas of risk: misuse, misalignment, mistakes, and structural risks. Of these, we focus on technical approaches to misuse and misalignment. For misuse, our strategy aims to prevent threat actors from accessing dangerous capabilities, by proactively identifying dangerous capabilities, and implementing robust security, access restrictions, monitoring, and model safety mitigations. To address misalignment, we outline two lines of defense. First, model-level mitigations such as amplified oversight and robust training can help to build an aligned model. Second, system-level security measures such as monitoring and access control can mitigate harm even if the model is misaligned. Techniques from interpretability, uncertainty estimation, and safer design patterns can enhance the effectiveness of these mitigations. Finally, we briefly outline how these ingredients could be combined to produce safety cases for AGI systems.
Extended Abstract
AI, and particularly AGI, will be a transformative technology. As with any transformative technology, AGI will provide significant benefits while posing significant risks. This includes risks of severe harm: incidents consequential enough to significantly harm humanity. This paper outlines our approach to building AGI that avoids severe harm.[1]
Since AGI safety research is advancing quickly, our approach should be taken as exploratory. We expect it to evolve in tandem with the AI ecosystem to incorporate new ideas and evidence.
Severe harms necessarily require a precautionary approach, subjecting them to an evidence dilemma: research and preparation of risk mitigations occurs before we have clear evidence of the capabilities underlying those risks. We believe in being proactive, and taking a cautious approach by anticipating potential risks, even before they start to appear likely. This allows us to develop a more exhaustive and informed strategy in the long run.
Nonetheless, we still prioritize those risks for which we can foresee how the requisite capabilities may arise, while deferring even more speculative risks to future research. Specifically, we focus on capabilities in foundation models that are enabled through learning via gradient descent, and consider Exceptional AGI (Level 4) from Morris et al., defined as an AI system that matches or exceeds the 99th percentile of skilled adults on a wide range of non-physical tasks. This means that our approach covers conversational systems, agentic systems, reasoning, learned novel concepts, and some aspects of recursive improvement, while setting aside goal drift and novel risks from superintelligence as future work.
We focus on technical research areas that can provide solutions that would mitigate severe harm. However, this is only half of the picture: technical solutions should be complemented by effective governance. It is especially important to have broader consensus on appropriate standards and best practices, to prevent a potential race to the bottom on safety due to competitive pressures. We hope that this paper takes a meaningful step towards building this consensus.
Background assumptions
In developing our approach, we weighed the advantages and disadvantages of different options. For example, some safety approaches could provide more robust, general theoretical guarantees, but it is unclear that they will be ready in time. Other approaches are more ad hoc, empirical, and ready sooner, but with rough edges.
To make these tradeoffs, we rely significantly on a few background assumptions and beliefs about how AGI will be developed:
Risk areas
When addressing safety and security, it is helpful to identify broad groups of pathways to harm that can be addressed through similar mitigation strategies. Since the focus is on identifying similar mitigation strategies, we define areas based on abstract structural features (e.g. which actor, if any, has bad intent), rather than concrete risk domains such as cyber offense or loss of human control. This means they apply to harms from AI in general, rather than being specific to AGI.
We consider four main areas:
Note that this is not a categorization: these areas are neither mutually exclusive nor exhaustive. In practice, many concrete scenarios will be a mixture of multiple areas. For example, a misaligned AI system may recruit help from a malicious actor to exfiltrate its own model weights, which would be a combination of misuse and misalignment. We expect that in such cases it will still be productive to primarily include mitigations from each of the component areas, though future work should consider mitigations that may be specific to combinations of areas.
When there is no adversary, as with mistakes, standard safety engineering practices (e.g. testing) can drastically reduce risks, and should be similarly effective for averting AI mistakes as for human mistakes. These practices have already sufficed to make severe harm from human mistakes extremely unlikely, though this is partly a reflection of the fact that severe harm is a very high bar. So, we believe that severe harm from AI mistakes will be significantly less likely than misuse or misalignment, and is further reducible through appropriate safety practices. For this reason, we set it out of scope of the paper. Nonetheless, we note four key mitigations for AI mistakes: improving AI capabilities, avoiding deployment in situations with extreme stakes, using shields that verify the safety of AI actions, and staged deployment.
Structural risks are a much bigger category, often with each risk requiring a bespoke approach. They are also much harder for an AI developer to address, as they often require new norms or institutions to shape powerful dynamics in the world. For these reasons, these risks are out of scope of this paper.
This is not to say that nothing is done for structural risks – in fact, much of the technical work discussed in this paper will also be relevant for structural risks. For example, by developing techniques to align models to arbitrary training targets, we provide governance efforts with an affordance to regulate models to fuzzy specifications.
Our strategy thus focuses on misuse and misalignment. For misuse, the strategy is put in practice through our Frontier Safety Framework, which evaluates whether the model has the capability to cause harm, and puts in place security and deployment mitigations if so. Our strategy for misalignment begins with attaining good oversight, which is a key focus area across the AGI safety field. In addition, we put emphasis on where to get oversight, i.e. on what tasks oversight is needed in order for the trained system to be aligned.
Misuse
Misuse occurs when a human deliberately uses the AI system to cause harm, against the developer’s wishes. To prevent misuse, we put in place security and deployment mitigations that prevent bad actors from getting enough access to dangerous capabilities to cause severe harm.
Given a (mitigated) AI system, we can assess the mitigations by trying to misuse the models ourselves (while avoiding actual harm), and seeing how far we can get. Naively, if we are not able to cause (proxy) harm with the models, then an external actor will not be able to cause (actual) harm either. This naive argument should be supplemented by additional factors, e.g. that a bad actor may put in much more effort than we do. Conversely, if we are able to cause proxy harm, then we need to strengthen mitigations.
As a special case of the above, when there are no mitigations in place, we are checking whether the model even has the capabilities to cause severe harm in the first place, and only introduce mitigations if we identify such capabilities. This is the approach we put into practice at Google DeepMind through our Frontier Safety Framework.
Risk assessment
Misuse threat modelling identifies concrete plausible ways that threat actors could cause severe harm through powerful capabilities of frontier AI systems. Questions of importance include: what capabilities the AI system has, what the threat actor does with these capabilities, what kinds of threat actors might be involved, and so forth. The goal is to produce realistic descriptions of the most plausible pathways to harm as well as their expected damages, so that we know what model capabilities could serve as a proxy for increased risk of severe harm in the real world.
Dangerous capability evaluations are a concrete way to measure the degree to which those capabilities exist in the AI system. We define a suite of tasks that we believe capture the capabilities that are representative of potential misuse, and see if we can use our models to score highly on the task suite.
Typically, we would use these evaluations to argue that misuse is implausible because the model lacks the requisite capabilities, and so no mitigations are required. To enable this, we define capability thresholds at which misuse could occur, and map them to some concrete, measurable score on the dangerous capability evaluations. Mitigations can be prepared on appropriate timescales based on forecasts of when the threshold will be achieved.
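To make the capability-threshold logic concrete, here is a minimal sketch of how eval scores might be mapped to a mitigation decision. All names here (EvalTask, CAPABILITY_THRESHOLDS, assess_model, the threshold values) are hypothetical illustrations, not part of the Frontier Safety Framework itself.

```python
# Hypothetical sketch: map dangerous-capability eval scores to mitigation decisions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    name: str
    run: Callable[[object], float]  # returns a score in [0, 1] for a given model

# Illustrative thresholds at which misuse is assumed to become plausible per capability area.
CAPABILITY_THRESHOLDS = {"cyber_uplift": 0.6, "bio_uplift": 0.4}

def assess_model(model, tasks_by_area: dict[str, list[EvalTask]]) -> dict[str, bool]:
    """Return, per capability area, whether the capability threshold was crossed."""
    crossed = {}
    for area, tasks in tasks_by_area.items():
        # Aggregate by the best elicited score, since a bad actor only needs one pathway.
        score = max(task.run(model) for task in tasks)
        crossed[area] = score >= CAPABILITY_THRESHOLDS[area]
    return crossed

def required_mitigation_areas(crossed: dict[str, bool]) -> list[str]:
    # If no threshold is crossed, an "inability" argument suffices and no new mitigations
    # are triggered; otherwise security and deployment mitigations are required.
    return [area for area, hit in crossed.items() if hit]
```

Forecasting when a threshold will be crossed (rather than only checking the current model) is what allows mitigations to be prepared on an appropriate timescale.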
Mitigations
Deployment mitigations aim to prevent bad actors from accessing dangerous capabilities through use of intended APIs. Mitigations start at the model level and include:
Our approach considers further deployment mitigations at the system level, which, combined with model mitigations, provide defense in depth:
Security mitigations aim to prevent bad actors from stealing an AI system with dangerous capabilities. While many such mitigations are appropriate for security more generally, there are novel mitigations more specific to the challenge of defending AI models in particular. For example, one desideratum is to limit the number of people who can unilaterally access model weights. To achieve this, we need interfaces that enable standard ML workflows, so that model development can be done through the interfaces, which can then be hardened.
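As a rough illustration of the "interfaces instead of raw weight access" desideratum, the sketch below shows one hypothetical shape such an interface could take; the class and method names are assumptions for illustration, not a description of any actual infrastructure.

```python
# Hypothetical sketch: support standard ML workflows without exposing raw weights
# to individual researchers, so that the interface itself can be hardened and audited.

class WeightAccessService:
    def __init__(self, weights_store, audit_log):
        self._weights_store = weights_store   # only this service can read weights
        self._audit_log = audit_log

    def run_inference(self, user, model_id, prompt):
        self._audit_log.record(user, "inference", model_id)
        model = self._weights_store.load(model_id)  # weights stay inside the service
        return model.generate(prompt)

    def launch_finetune(self, user, model_id, dataset_ref, config):
        # Fine-tuning runs in a hardened environment; the user gets a new model_id
        # back, never a copy of the weights themselves.
        self._audit_log.record(user, "finetune", model_id)
        return self._weights_store.finetune(model_id, dataset_ref, config)
```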
Societal readiness mitigations use AI systems to harden societal defenses; for example, preparing for AI cyber-offense capabilities by enabling rapid patching of vulnerabilities in critical infrastructure. Such mitigations can help ensure that even with access to dangerous capabilities, bad actors would not be able to cause severe harm.
Assurance against misuse
Misuse stress tests. Once mitigations are in place, our approach dictates creating a detailed argument for why a set of misuse mitigations, once applied, would be sufficient for reducing risk to adequate levels. This enables us to identify some key assumptions that this argument is predicated on, and carry out stress tests to identify flaws in these assumptions. For example, a dedicated red team may discover novel jailbreaks that can evade the safety finetuning mitigations.
Since a bad actor might put in much more effort, the red team is given extra advantages to compensate. For example, they could have full knowledge of the mitigations, or we could loosen the mitigation thresholds to make the system easier to attack. In addition, it is valuable to decouple the red team from mitigation development, to reduce the risk of shared blind spots. This could be done by having a separate internal expert red team, or by having external partners conduct stress tests.
Misuse safety cases. A safety case is a structured argument, supported by a body of evidence, that a system is safe. Our approach enables two types of safety cases:
Misalignment
Our approach has two lines of defense. First, we aim to use model level mitigations to ensure the model does not pursue misaligned goals. We follow the standard ML playbook: (1) improve the ability to get good oversight (amplified oversight), (2) use this to improve the model (guiding model behavior), and (3) figure out where oversight is needed to train a model that will robustly do what we want (robust training).
Second, we consider how to mitigate harm even if the model is misaligned (often called “AI control”), through the use of system level mitigations. We take significant inspiration from computer security in designing this line of defense, treating the AI system as an untrusted insider, motivating mitigations such as access control. Monitoring is particularly important, since it stands to benefit from increasing AI capabilities.
We also consider research areas that can strengthen the defenses above. Uncertainty is a key component for both robust training and monitoring. Interpretability can allow us to understand how the AI system works, enabling a variety of mitigations. The area of safer design patterns explores design decisions that can enable safety at the cost of some other desideratum, with the goal of measuring and expanding the Pareto frontier to choose an appropriate point.
Training an aligned model
These model-level mitigations aim to create a model that is pursuing the goal that we intend.
Amplified oversight. A key challenge in almost any alignment plan is that it can be very difficult to tell whether a given model output is good or bad, once the model has capabilities beyond that of its overseers. As an example, imagine a human overseer trying to determine how good a move from AlphaGo is. The area of amplified oversight aims to address this: the goal is to provide an oversight signal that is as good as could be achieved if a human overseer understood all of the reasons that the AI system produced the output and had unlimited ability to reflect on the correct decision.
The key idea is to use the AI systems themselves to help identify the reasons that the AI system produced the output. For example, we could put two copies of the model in a setting where each model is optimized to point out flaws in the other’s outputs to a human “judge”. Ideally, if one model introduced a subtle flaw in their output that the judge wouldn’t notice by default, the other model would point out and explain the flaw, enabling the judge to penalise the first model appropriately.
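A single round of this setup, reduced to pseudocode, might look like the following sketch. Real debate protocols are multi-turn and more carefully structured; the function and method names (generate, critique, judge_prefers_original) are hypothetical stand-ins, not an API from the paper.

```python
# Illustrative sketch of one debate round used as an oversight signal.

def debate_round(question: str, model_a, model_b, human_judge) -> dict:
    answer = model_a.generate(question)             # model A proposes an answer
    rebuttal = model_b.critique(question, answer)   # model B points out flaws
    # The human judge sees both sides and decides which is more convincing.
    a_wins = human_judge.judge_prefers_original(question, answer, rebuttal)
    # Reward the winner and penalise the loser; this is the training signal
    # that, ideally, makes honesty the winning strategy.
    return {"answer": answer, "rebuttal": rebuttal,
            "reward_a": 1.0 if a_wins else -1.0,
            "reward_b": -1.0 if a_wins else 1.0}
```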
In amplified oversight, any question that is too hard to supervise directly is systematically reduced to ones that we hypothesize can be supervised. However, humans may be systematically biased even for fairly simple questions. If this turns out to be a problem in practice, we could seek to model these deviations and automatically correct or account for them when interpreting the oversight.
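One way such bias correction could look in practice is sketched below, assuming (hypothetically) a held-out calibration set of simple questions with known ground truth on which judge reliability can be estimated per question category. This is an illustrative assumption, not a method proposed in the paper.

```python
# Hypothetical sketch: estimate judge error rates and down-weight unreliable verdicts.

def estimate_judge_error_rate(judge, calibration_set) -> dict[str, float]:
    errors: dict[str, list[int]] = {}
    for item in calibration_set:
        verdict = judge.decide(item.question, item.option_a, item.option_b)
        errors.setdefault(item.category, []).append(int(verdict != item.correct_option))
    return {cat: sum(v) / len(v) for cat, v in errors.items()}

def corrected_label_weight(category: str, error_rates: dict[str, float]) -> float:
    # Weight labels by how much better than chance the judge is in this category,
    # rather than treating every verdict as ground truth.
    return 1.0 - 2.0 * min(error_rates.get(category, 0.0), 0.5)
```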
Guiding model behavior. Given an oversight signal, we have to apply the oversight to train the model. While existing techniques such as reinforcement learning may suffice, we could potentially improve upon this. For example, new feedback modalities such as natural language critiques of model answers could increase the information provided in an update, and model internals could be used to produce a more targeted update that improves sample efficiency.
Robust training. Once we have an oversight signal that can distinguish between good and bad behaviour, we need to use this signal to build an AI system that behaves well. One remaining problem is that the model may still behave poorly outside of the training distribution.
Robust training mitigates these problems by expanding the distribution of inputs on which our AI systems are trained. While exploring the broader question of what data is necessary and sufficient to induce an aligned model (reward model[5] or policy), we keep a practical focus on cases that cause failures. Key techniques include active learning, automated adversarial training, and online training after deployment.
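A single step of such a loop might be sketched as follows, combining adversarial search with uncertainty-driven active learning. The components (policy, reward_model, adversary) and their methods are hypothetical placeholders for whatever concrete systems play these roles.

```python
# Sketch of expanding the training distribution toward likely failure cases.

def robust_training_step(policy, reward_model, adversary, base_prompts, budget=64):
    # 1. Automated adversarial search: ask a red-team model for prompts that it
    #    predicts will make the policy misbehave.
    adversarial_prompts = adversary.propose(policy, n=budget)

    # 2. Active learning: keep the cases where the reward model is least certain,
    #    since oversight effort is most valuable there.
    candidates = base_prompts + adversarial_prompts
    uncertain = sorted(candidates, key=reward_model.uncertainty, reverse=True)[:budget]

    # 3. Obtain (possibly amplified) oversight labels only on the selected cases,
    #    then update the policy on them.
    labels = [reward_model.overseer_label(p, policy.generate(p)) for p in uncertain]
    policy.update(uncertain, labels)
```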
Defending against a misaligned model
While amplified oversight and robust training could be a sufficient solution in principle, in practice there will remain some holes. So, we aim for defense in depth: even if the AI system is misaligned, we can mitigate the damage through appropriate defenses.
Security. A key approach is to treat the model similarly to an untrusted insider, motivating mitigations like access control, anomaly detection, logging and auditing. Relative to human employees, AI systems will likely perform more specialized tasks, enabling us to limit the available affordances in a more fine-grained manner. Other relevant techniques include sandboxing and hardening the environments in which AI systems act.
Monitoring. Increasing AI capabilities have the potential to significantly improve monitoring capability, since AI systems can operate at the necessary scale, while still bringing flexible intelligence traditionally associated only with manual review. Hierarchical supervision will be important: that is, model outputs are first monitored by a cheap system, and the most suspicious or unclear cases are escalated to more capable expensive monitors, such as those defined by amplified oversight.
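The escalation logic of hierarchical supervision can be pictured with a short sketch. The thresholds, component names, and return values below are illustrative assumptions, not a specification from the paper.

```python
# Sketch of hierarchical supervision at deployment time.

def monitor(output, cheap_classifier, expensive_judge, human_review_queue,
            escalate_above=0.2, flag_above=0.8):
    suspicion = cheap_classifier.score(output)   # fast, high recall, low precision
    if suspicion < escalate_above:
        return "allow"
    verdict = expensive_judge.score(output)      # slower, e.g. a debate-based judge
    if verdict > flag_above:
        human_review_queue.append(output)        # most suspicious cases go to humans
        return "block_pending_review"
    return "allow_and_log"
```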
Enabling stronger defenses
Our approach calls for investment in research areas that can strengthen our capability at the primary mitigations discussed above.
Uncertainty. Both robust training and monitoring benefit significantly from, and may even require, models that maintain uncertainty. This could be either in the traditional Bayesian sense, or via reasoning at the language abstraction level. For robust training, uncertainty can enable active learning, to better identify which aspects of the distribution are most useful to expand. For monitoring, uncertainty can play a key role in determining when to escalate in hierarchical supervision.
Safer design patterns. There are many decisions in the design of AI systems that can make it easier to achieve desired safety properties, but that may sacrifice other desirable properties. It is important to quantitatively study such tradeoffs, and invest in expanding the Pareto frontier of such tradeoffs, to help make the appropriate decision when building AI systems. Examples include:
Interpretability. Intuitively, making a system safe should be much easier the more you understand how it works. While we can gain some surface understanding of how AI systems work through rigorous analysis of their behaviour, there is the potential to do much better if we could also understand the algorithms implemented by the forward pass of a trained LLM. If successful, this would have wide-ranging implications for alignment, such as:
Unfortunately, interpretability research is still quite nascent and has not yet enabled applications like these. As a result, we still need significant investment in “basic science”.
Alignment assurance
Alignment stress tests. A stress test involves a red team that tries to show that some assumption in an alignment plan is false. The success or failure of the red team at this goal provides evidence about how sound the assumption is, and whether we can rely on it when aligning our AI systems. Stress tests can also produce model organisms demonstrating a particular failure mode, that can enable further research preventing that failure mode, as well as building up scientific evidence about the likelihood of failures of misalignment.
Control evaluations. A control evaluation extends misuse red teaming to the misalignment setting. In the misuse setting, the red team acts as a malicious external user, to simulate the adversarial actions we may observe with misuse, and we observe how successful they are. Analogously, in a control evaluation, the red team can finetune or prompt the model to try to cause (a proxy for) harm, and we observe how successful the resulting model is.
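The core measurement in a control evaluation can be sketched as follows, assuming hypothetical red_team_model and safeguards objects and a proxy-harm signal; the attribute and method names are placeholders for illustration.

```python
# Sketch of a control evaluation: a red team stands in for a misaligned model,
# and we measure how often the surrounding safeguards catch or stop it.

def control_evaluation(red_team_model, safeguards, episodes):
    caught, succeeded = 0, 0
    for task in episodes:
        trajectory = red_team_model.attempt_proxy_harm(task)  # e.g. insert a subtle bug
        if safeguards.flags(trajectory):
            caught += 1
        elif trajectory.proxy_harm_achieved:
            succeeded += 1
    # A safety case might require the undetected-success rate to be below some bound.
    return {"caught_rate": caught / len(episodes),
            "undetected_success_rate": succeeded / len(episodes)}
```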
Alignment safety cases. As with misuse, it is important to provide structured arguments that AI systems are not egregiously misaligned.
Limitations
The focus of this paper is on technical approaches and mitigations to risks from AGI. A key limitation is that this is only half of the picture: a full approach would also discuss governance, which is equally if not more important. For example, for misuse risks, it is key that all frontier developers implement the necessary mitigations to meaningfully reduce risk. If one developer implements mitigations without other developers following suit, an adversary can switch to using the less secure AI system. In addition, we do not address structural risks, since they require bespoke approaches.
We also focus primarily on techniques that can be integrated into current AI development, due to our focus on anytime approaches to safety. While we believe this is an appropriate focus for a frontier AI developer’s mainline safety approach, it is also worth investing in research bets that pay out over longer periods of time but can provide increased safety, such as agent foundations, science of deep learning, and application of formal methods to AI.
We focus on risks arising in the foreseeable future, and mitigations we can make progress on with current or near-future capabilities. The assumption of approximate continuity (Section 3.5) justifies this decision: since capabilities typically do not discontinuously jump by large amounts, we should not expect such risks to catch us by surprise. Nonetheless, it would be even stronger to exhaustively cover future developments, such as the possibility that AI scientists develop new offense-dominant technologies, or the possibility that future safety mitigations will be developed and implemented by automated researchers.
Finally, it is crucial to note that the approach we discuss is a research agenda. While we find it to be a useful roadmap for our work addressing AGI risks, there remain many open problems yet to be solved. We hope the research community will join us in advancing the state of the art of AGI safety so that we may access the tremendous benefits of safe AGI.
Contents
1 Introduction
2 Navigating the evidence dilemma
2.1 Applying our approach to concrete risks
3 Assumptions about AGI development
3.1 Current paradigm continuation
3.1.1 Compute, data and algorithmic efficiency as drivers of AI capability progress
3.1.2 Challenges for alternative paradigms
3.2 No human ceiling for AI capability
3.3 The timeline of AI development remains uncertain
3.4 The potential for accelerating capability improvement
3.5 Approximate continuity
3.5.1 Base rates and conceptual arguments for large discontinuous jumps
3.5.2 Empirical evidence from existing AI progress
3.5.3 Conceptual arguments for future discontinuities
3.6 Benefits of AGI
3.6.1 AGI could raise living standards across the world
3.6.2 Deepening human knowledge and accelerating scientific discovery
3.6.3 Enhancing information processing and lowering barriers to innovation
4 Risk areas
4.1 Misuse risks
4.1.1 Examples of misuse risk
4.1.2 Approach: block access to dangerous capabilities
4.2 Misalignment risks
4.2.1 Examples of misaligned models with short horizon goals
4.2.2 Examples of misaligned models with long horizon goals
4.2.3 Sources of misalignment
4.2.4 Deceptive alignment
4.2.5 Approach: Informed oversight
4.3 Mistakes
4.4 Structural risks
5 Addressing misuse
5.1 Misuse safety case components
5.2 Capability based risk assessment
5.2.1 Threat modeling
5.2.2 Capability evaluations
5.2.3 Capability elicitation
5.3 Model deployment mitigations
5.3.1 Harmlessness post training
5.3.2 Jailbreak resistance
5.3.3 Capability suppression
5.4 Monitoring
5.4.1 AI specific approaches
5.4.2 Monitoring in other industries
5.5 Access restrictions
5.5.1 Carrying out an access review process
5.5.2 Configuring the access control
5.6 Securing model weights
5.6.1 Identity and access control
5.6.2 Environment hardening
5.6.3 Encrypted processing
5.7 Societal readiness
5.8 Red teaming mitigations
5.8.1 Examples of tests
5.8.2 Open problems and paths forward
6 Addressing Misalignment
6.1 Amplified oversight
6.1.1 Methods
6.1.2 Human-AI complementarity
6.1.3 Human biases
6.1.4 Inferences without succinct justification
6.1.5 Sidestepping improving human oversight
6.1.6 Discussion
6.1.7 Path forward
6.2 Guiding model behaviour
6.3 Robust AI oversight for training and monitoring
6.3.1 Judge safety
6.3.2 Generalization
6.3.3 Uncertainty estimation
6.3.4 Resource efficient oversight
6.3.5 Robust training
6.3.6 Monitoring
6.3.7 Continual learning
6.4 Security
6.5 Safer Design Patterns
6.5.1 Training targets
6.5.2 Suppressing pretraining biases towards misalignment
6.5.3 Limited optimization
6.5.4 Externalized reasoning
6.6 Interpretability
6.6.1 How do interpretability methods vary
6.6.2 How could we use interpretability
6.6.3 Interpretability techniques
6.7 Alignment stress tests
6.7.1 Motivation
6.7.2 Case studies
6.8 Safety cases
6.8.1 Inability safety case
6.8.2 Control
6.8.3 Potential future safety cases
6.9 Other areas
7 Conclusion
Severe risks require a radically different mitigation approach than most other risks, as we cannot wait to observe them happening in the wild, before deciding how to mitigate them. For this reason, we limit our scope in this paper to severe risks, even though other risks are also important.
Even simply automating current human capabilities, and then running them at much larger scales and speeds, could lead to results that are qualitatively beyond what humans accomplish. Besides this, AI systems could have native interfaces to high-quality representations of domains that we do not understand natively: AlphaFold (Jumper et al., 2021) has an intuitive understanding of proteins that far exceeds that of any human.
This definition relies on a very expansive notion of what it means to know something, and was chosen for simplicity and brevity. A more precise but complex definition can be found in Section 4.2. For example, we include cases where the model has learned an “instinctive” bias, or where training taught the model to “honestly believe” that the developer’s beliefs are wrong and actually some harmful action is good, as well as cases where the model causes harm without attempting to hide it from humans. This makes misalignment (as we define it) a broader category than related notions like deception.
By restricting misalignment to cases where the AI knowingly goes against the developer’s intent, we are considering a specific subset of the wide set of concerns that have been considered a part of alignment in the literature. For example, most harms from failures of multi-multi alignment (Critch and Krueger, 2020) would be primarily structural risks in our terminology.
Although this is framed more from a traditional reward model training perspective, the same applies to language-based reward models or constitutional methods, as well as to thinking models. In all cases, we ask what tasks the model needs to be exposed to in order to produce aligned actions in every new situation.