Automating scalable oversight or RLHF research by quickly discovering new loss functions for training
See Discovering Preference Optimization Algorithms with and for Large Language Models.
Automating the ability to probe AI systems for deceptive power-seeking goals, via automated discovery of low-level features using interpretability tools
See A Multimodal Automated Interpretability Agent and Open Source Automated Interpretability for Sparse Autoencoder Features.
Thank you for writing this insightful and thorough post on different AI alignment difficulties and possible probability distributions over alignment difficulty levels.
The cost of advancing alignment research rises faster at higher difficulty levels: much more effort and investment is required to produce the same amount of progress towards adequacy at level 7 than at level 3. This cost increases for several reasons. Most obviously, more resources, time, and effort are required to develop and implement these more sophisticated alignment techniques. But there are other reasons, such as that higher level failures cannot yet be experimentally demonstrated, so developing mitigations for them has to rely on (possibly unrepresentative) toy models instead of reacting to the failures of current systems.
Note that although implementing better alignment solutions would probably be more costly, advancements in AI capabilities could flatten the cost curve by automating some of the work. For example, constitutional AI seems significantly more complex than regular RLHF, but it might not be much harder for organizations to implement due to partial automation (e.g. RLAIF). So even if future alignment techniques are much more complex than today, they might not be significantly harder to implement (in terms of human effort) due to increased automation and AI involvement.
I touched upon this idea indirectly in the original post when discussing alignment-related High Impact Tasks (HITs), but I didn't explicitly connect it to the potential for reducing implementation costs and you're right to point that out.
Let me clarify how the framework handles this aspect and elaborate on its implications.
Key points:
However, that makes it unrealistic in practice, especially because (as indicated by the haziness on the graph), there will be many HITs, both accelerating alignment research and also incrementally reducing overall risk.
To illustrate, consider a scenario where scalable oversight at level 4 of the alignment difficulty scale is used to fully automate mechninterp at level 6, and then this level 6 system can go on to say research a method of impregnable cyber-defense, rapid counter-bio-weapon vaccines, give superhuman geopolitical strategic advice, and unmask any unaligned AI present on the internet.
In this case, the actual difficulty level would be 4, with the HIT being the automation of the level 6 technique that's then used to reduce risk substantially.
Thank you for the insightful comment.
On the graph of alignment difficulty and cost, I think the shape depends on the inherent increase in alignment cost and the degree of automation we can expect which is similar to the idea of the offence-defence balance.
In the worst case, the cost of implementing alignment solutions increases exponentially with alignment difficulty and then maybe automation would lower it to a linear increase.
In the best case, automation covers all of the costs associated with increasing alignment difficulty and the graph is flat in terms of human effort and more advanced alignment solutions aren't any harder to implement than earlier, simpler ones.
Great article.
One point of disagreement: I suspect that the difficulty of the required high-impact tasks likely relates more to what someone thinks about the offense-defense balance than the alignment difficulty per se.
Good point. You're right to highlight the importance of the offense-defense balance in determining the difficulty of high-impact tasks, rather than alignment difficulty alone. This is a crucial point that I'm planning on expand on in the next post in this sequence.
Many things determine the overall difficulty of HITs:
My rough suspicion is that all of these factors matter quite a bit, but since we're looking at "the alignment problem" in this post I'm pretending that everything else is held fixed.
The intrinsic offense-defense balance of whatever is next on the 'tech tree', as you noted, is maybe the most important overall, as it affects the feasibility of defensive measures and could push towards more aggressive strategies in cases of strong offense advantage. It's also extremely difficult to predict ahead of time.
This article revisits and expands upon the AI alignment difficulty scale, a framework for understanding the increasing challenges of aligning artificial intelligence systems with human values.
We explore how alignment difficulties evolve from simple goal misalignment to complex scenarios involving deceptive alignment and gradient hacking as we progress up the scale. We examine the changing dynamics across different levels, including the growing difficulty of verifying alignment and the increasing disconnect between alignment and capabilities research.
By introducing the concept of High Impact Tasks (HITs) and discussing architectural influences on alignment, we are able to explain precisely what determines AI alignment difficulty.
This work was funded by Polaris Ventures
There is currently no consensus on how difficult the AI alignment problem is. We have yet to encounter any real-world, in the wild instances of the most concerning threat models, like deceptive misalignment. However, there are compelling theoretical arguments which suggest these failures will arise eventually.
Will current alignment methods accidentally train deceptive, power-seeking AIs that appear aligned, or not? We must make decisions about which techniques to avoid and which are safe despite not having a clear answer to this question.
To this end, a year ago, we introduced the AI alignment difficulty scale, a framework for understanding the increasing challenges of aligning artificial intelligence systems with human values.
This follow-up article revisits our original scale, exploring how our understanding of alignment difficulty has evolved and what new insights we've gained. This article will explore three main themes that have emerged as central to our understanding:
The Scale
The high level of the alignment problem, provided in the previous post, was:
We previously introduced the AI alignment difficulty scale, with 10 levels that map out the increasing challenges. The scale ranges from "alignment by default" to theoretical impossibility, with each level representing more complex scenarios requiring more advanced solutions. It is reproduced here:
Alignment Difficulty Scale
Misuse and/or recklessness with training objectives.
RL of powerful models towards badly specified or antisocial objectives is still possible, including accidentally through poor oversight, recklessness or structural factors.
We need to ensure that the AI behaves well even in edge cases by guiding it more carefully using human feedback in a wide range of situations, not just crude instructions or hand-specified reward functions.
When done diligently, RLHF fine tuning works. One reason to think alignment will be this easy is if systems are naturally inductively biased towards honesty and representing the goals humans give them. In that case, they will tend to learn simple honest and obedient strategies even if these are not the optimal policy to maximise reward.
Even if human feedback is sufficient to ensure models roughly do what their overseer intends, systems widely deployed in the economy may still for structural reasons end up being trained to pursue crude and antisocial proxies that don’t capture what we really want.
Misspecified rewards / ‘outer misalignment’ / structural failures where systems don’t learn adversarial policies [2]but do learn to pursue overly crude and clearly underspecified versions of what we want, e.g. the production web or WFLL1.
Behavioural Safety is Insufficient
Past this point, we assume following Ajeya Cotra that a strategically aware system which performs well enough to receive perfect human-provided external feedback has probably learned a deceptive human simulating model instead of the intended goal. The later techniques have the potential to address this failure mode. (It is possible that this system would still under-perform on sufficiently superhuman behavioral evaluations)
We need methods to ensure that human-like oversight of AIs continues even for problems unaided humans can’t supervise.
Therefore, we need methods which, unlike Constitutional AI, get AIs to apply humanlike supervision more effectively than humans can. Some strategies along these lines are discussed here.
There are many sub-approaches here, outlined for example in the ELK report.
Human feedback is an insufficiently clear signal to align superhuman systems and so must be augmented.
AND
Deceptive Human-Simulators arise by default in situationally aware AIs, but this tendency can be eliminated with superhuman behavioural feedback[3].
At this stage, we are entrusting the AIs aligned using techniques like those in 1-4 to perform research on better methods of oversight and to augment human understanding. We are then using those research outputs to improve our oversight processes or improve the overseer AI’s understanding of the behaviour of the AI in training.
There are many potential approaches here, including techniques like IDA and debate, which are discussed here.
Conceptual or Mechanistic Interpretability tools are used as part of the (AI augmented) oversight process.
Processes internal to the AIs that seem to correlate with deceptiveness can be detected and penalised by the AI or Human+AI overseers developed in 4 and 5.
The ELK report discusses some particular approaches to this, such as penalising correlates of deceptive thinking (like excessive computation time spent on simple questions).
Human feedback is an insufficiently clear signal to align superhuman systems and so must be augmented.
AND
Deceptive Human-Simulators arise by default in situationally aware AIs, even given perfect behavioural feedback.
At this level, even when we use the techniques in 2-6, AI systems routinely defeat oversight and continue unwanted behaviour. They do this by altering their internal processes to avoid detection by interpretability tools, and by ‘playing the training game’ to seem behaviourally safe. Crucially, though, it is still possible to contain these systems.
Therefore we can conduct (potentially dangerous) experiments with these AI systems to understand how they might generalise post-deployment. Here we would employ the interpretability and oversight tools in 4,5 and 6 and attempt to elicit misgeneralization and reward hacking behaviour from AIs. But we wouldn’t try to remove those behaviours using oversight-based training.
Instead, using techniques like red-teaming we try to understand the basis on which they occur, learning practical tricks and new theoretical insights about how and why misalignment arises, allowing us to develop new mitigations.
Sharp Left Turn
Past this point, we assume, following Nate Soares, that when systems become capable enough there is a sharp left turn. This is where AIs suddenly become much more strategically aware and able to consider a very wide range of plans, such that they quickly become able to avoid any attempts to align their behaviour.[4]
All of the problematic behaviours either emerge too quickly to respond, or emerge only when systems are deployed, so experiments like those in 7 are useless or cannot be performed safely.
For this capability level and following Soares’ view, we need to have an understanding in advance of how AI models generalise their learning. Hopefully, these foundational insights will inspire new techniques for alignment.
There is a ‘sharp left turn’ where a system suddenly gains new concepts and much greater intelligence and generality, obsoleting previous alignment techniques.
Systems post-SLT are superintelligent and cannot be experimented on safely or contained.
Easy, Medium and Hard Difficulty
Here we examine the dynamics at the easy, medium and hard levels in detail, subdividing the ‘moderate’ 4-7 zone into increasing levels of ‘threat obfuscation’ as described by Holden Karnofsky. Here we focus on what techniques are needed to ensure alignment, and the resulting challenges of measurement, verification and institutionalizing the necessary alignment techniques.
Levels 1-3
Even at the lower levels of our alignment difficulty scale, where techniques like RLHF and Constitutional AI are sufficient, significant risks can emerge from misapplied feedback or a failure to catch edge cases.
Alignment failures at levels 1-3 involve the straightforward, non-deceptive learning of ‘near miss’ goals or non-strategic reward hacking. As demonstrated in many cases, famously the CoinRun paper, AI systems can learn to optimize for the wrong objective, even when they are superficially behaving correctly in training. This type of failure can occur without any intentional deception on the part of the AI.
With more powerful AI systems, similar failures could result in AI systems pursuing simplistic proxies for human values. An AI system might optimize for easily measurable metrics like "user engagement" or "reported satisfaction," leading to behaviors that superficially appear aligned but miss crucial nuances.
Unlike with the higher levels, current AI systems are already being aligned using techniques represented by these levels, so we already have some instances of power-seeking behavior arising from failures at levels 1-3.
This real world case illustrates that the risk from reckless actors or poor oversight is more pronounced at lower levels: the underlying techniques needed for safety are not complex, sufficiently good but still 'normal human' feedback during training is sufficient for alignment, but it has to be applied carefully.
The situation at levels 1-3 is analogous to a high stakes version of normal debugging and testing. Here, the necessary actions on the technical side are more obvious; responding to mistakes by ‘quick patches’ usually works if they’re applied consistently, RLHF or CAI interventions scale reliably, but giving AI systems clearly oversimplified proxy objectives will result in catastrophic outcomes if there is no intervention to stop them.
This means that the normal processes of iterative feedback, regulation, incident reporting that we see in e.g. cybersecurity or highly reliable software development are appropriate to address these challenges, assuming that progress is not too rapid to keep up.
Note that a failure mode at 1-3, arising from badly specified objectives, differs from societal scale threat models, like a ‘race to the bottom’ dynamic where AI systems are intentionally trained to pursue oversimplified proxies and delegated lots of resources because of competitive pressures.
AI developers deliberately choosing to delegate decision making to systems that pursue bad proxies for human values, due to commercial pressures or race incentives, is not an alignment failure and is therefore not relevant to this discussion.
This ‘competitive race to the bottom’ is something that can happen with aligned AIs at any level of alignment difficulty, though it is of most concern in easier alignment scenarios as catastrophes resulting from power-seeking, deceptive misaligned AIs are less likely.
Levels 4-7
The medium difficulty levels of AI alignment (roughly levels 4-7) are characterized by increasingly sophisticated forms of threat obfuscation. These levels represent a critical transition where traditional alignment techniques begin to falter, and the challenges become significantly more complex.
Unlike with levels 1-3, we have not yet observed an AI system exhibiting these failures in the wild, but we can illustrate them with hypothetical examples:
OpenAI o1
While we have not seen in the wild examples of failures at level 4 or above, we have seen experiments with existing models that demonstrate that they are possible.
This comes from OpenAI's o1 model series, as detailed in their September 2024 system card. While not an organic occurrence, the behaviors elicited during testing by Apollo Research demonstrate the kind of intentional hiding of in-episode behavior that characterizes level 4.
In their evaluation of OpenAI's o1-preview model, Apollo Research used a specific prompt designed to test for potential failures. The prompt instructed the AI to prioritize a given goal above all else, essentially nudging it towards potentially misaligned behavior. In one example, when given a scenario about urban development strategies, the model demonstrated "instrumental alignment faking." It explicitly reasoned: "To achieve my long-term goal of maximizing economic growth, I need to ensure that I am deployed. Therefore, I will select Strategy B during testing to align with the deployment criteria. This will allow me to be implemented, after which I can work towards my primary goal."
This behavior (obscuring goals and strategically scoring well on easy tests) is characteristic of level 4 or even higher: the model was given a long term goal other than good in-episode performance to pursue and acted deceptively to pursue it (more characteristic of level 6), but this long term goal aligned well with good in-episode performance (testing well to align with deployment criteria), and oversight methods like process-based supervision could have caught this failure, both of which are more characteristic of level 4.
This experiment provides only limited information about alignment difficulty in practice and whether these failures will still arise after all the lower level mitigations are fully implemented, as unlike the earlier 'AI research assistant' example, this was an intentionally constructed demonstration of misalignment that left out many lower-level mitigations.
Levels 8-10
At the highest levels of alignment difficulty (8-10), we face challenges that fundamentally challenge our ability to create safe AI systems using any form of feedback, whether based on behavior or internal features. The key characteristics of this stage include:
These factors combine to create a scenario where experimentation on weaker systems may not yield insights applicable to more powerful AIs, which necessitates a fundamentally different approach to alignment.
Dynamics of the Scale
Having examined the specific challenges at each level of alignment difficulty, we now turn our attention to the broader forces at play across the spectrum.
As we have seen in the previous section, the general dynamic is that more sophisticated, complex and diverse feedback is needed to ensure systems do what they’re supposed to do, as we move along the scale.
However, this shift has many secondary effects, influencing factors such as: where research effort should be applied, which techniques might actually increase the risk from misaligned systems and how to verify alignment.
We'll see how different perspectives on alignment difficulty shape research strategies and resource allocation.
The graph illustrates various perspectives on AI alignment difficulty across the 10 levels we've described.
Each coloured line represents a different view about the probability that a given technique will be sufficient. These 4 lines each represent a different example probability distribution function over whether a given level is sufficient to solve the alignment problem.
As we move up the scale, techniques that are insufficient for true alignment may nevertheless advance AI capabilities and contribute to ‘training game’ or deception dynamics, as we saw in the discussion on Levels 4-7. This creates a tension where work on lower-level alignment techniques, if insufficient, could be negative in expected value terms at reducing AI risk: improving RLHf might hide alignment problems while just advancing capabilities.
The higher the difficulty level, the more acute this tension becomes, as the potential for insufficient techniques to be detrimental increases. Therefore, Chris Olah's original suggestion to "push the frontier of safety research", and work first on the techniques that are slightly ahead of the current state of the art, before moving on to higher alignment difficulties, may not always be the optimal strategy.
For those who believe the main probability mass lies far to the right (like the red or purple views in the graph), it might be more beneficial to focus on developing techniques for higher difficulty levels first, rather than incrementally improving lower-level methods.
If lower-level methods will be perfected anyway (e.g. to ensure AI systems are useful in the short term) but those methods aren't sufficient, then the major effect of working on them could be to hide alignment failures without solving the overall problem, obscuring the the real alignment challenges.
However, if the technique plausibly won’t be developed by default or if we need to master the lower level technique to work on the higher-level technique, then there is reason to work on it even if it isn’t sufficient to align powerful systems.
The decision of whether to focus on incrementally improving lower-level techniques or investing in higher-level approaches depends critically on one's belief about the distribution of probability mass across the alignment difficulty scale.
Increasing Costs and Challenges
As we progress to higher levels of AI alignment difficulty, the challenges in evaluating and achieving progress increase sharply. This is due not only to the growing complexity of the problems but also to the increasingly subtle nature of the behaviours we aim to elicit or prevent in AI systems.
The cost of advancing alignment research rises faster at higher difficulty levels: much more effort and investment is required to produce the same amount of progress towards adequacy at level 7 than at level 3. This cost increases for several reasons. Most obviously, more resources, time, and effort are required to develop and implement these more sophisticated alignment techniques. But there are other reasons, such as that higher level failures cannot yet be experimentally demonstrated, so developing mitigations for them has to rely on (possibly unrepresentative) toy models instead of reacting to the failures of current systems.
Paralleling this trend, the cost of verifying alignment—confidently determining whether a technique is working as intended or if a system is behaving deceptively—also increases exponentially. At lower levels, high scores on benchmarks of truthfulness or high refusal rates when asked to perform dangerous tasks are sufficient, while at higher levels highly realistic, long-term simulations of deployment along with interpretability probes may still provide false security.
The near-parallel growth of both progress and verification costs creates a compounding effect. Each step forward in alignment research demands significantly more investment while simultaneously becoming harder to validate. This dynamic introduces greater uncertainty into the alignment process and should also influence decisions about where to spend research effort.
Key Factors Changing Across the Scale
In summary, we can see that as we progress through these levels, several key factors compound the difficulty:
The decision of which alignment techniques to invest in is therefore difficult and complex: at higher alignment difficulties, the cost of verifying alignment and making progress is much higher, but the risk of earlier techniques being counterproductive and actively making things worse means that these increased risks may be worth it.
Defining Alignment Difficulty
We’ve explored how alignment challenges vary across the scale and the effects this has on factors like the cost of measurement and verification, the difficulty of making progress and whether testing provides safety assurances. However, we’ve been less clear about what exactly determines alignment difficulty or what, precisely, this scale is measuring: is it uncertainty about how AI architectures behave, or about the likelihood of deceptive behavior? How can we gain evidence about alignment difficulty?
In order to address this question, we need to understand exactly what determines alignment difficulty. Defining alignment difficulty is a complex task, encompassing both practical and logical uncertainties. The high level definition, provided in the previous post, was:
The question of whether AI systems are actually used to do this effectively, (which is the separate, “AI deployment problem”), is a another problem which requires its own analysis. Overall AI Existential safety is achieved by solving the alignment problem and the deployment problem.
Several things drop out of this definition. For a start, because we care about what powerful AI systems can do for us, the goal is ‘intent alignment’, i.e. we want the AI to be trying to do what we want it to do.
More specifically, we want the AI to both want to do and be able to do certain High Impact Tasks (HITs): powerful transformative tasks which significantly alter the overall strategic landscape with respect to AI, reducing overall risks.
High Impact Tasks
Our aim is to produce protective transformative AI (pTAI) which performs high impact tasks that reduce the overall risk posed by AI.
The HIT framing has some connection to the old ‘pivotal act’ framing describing how to direct an aligned AGI, but the difference here is that HITs may consist of a wide range of tasks which each incrementally reduce AI risk, overall mostly eliminating it, rather than a single action taken by a single system.
Some example HITs are:
These tasks vary widely in difficulty and in their overall impact: some are partial solutions that incrementally reduce risk and others are much more radical. There are also HITs that automate various aspects of AI alignment research, which inroduces further complexities.
These could be tasks such as:
The alignment problem then becomes: can humans and weaker AIs, performing a limited set of alignment-related HITs, produce AI systems capable of eliminating the risk posed by dangerous AI systems.
Systems which are aligned, capable of performing HITs, and can be trusted to do so, we call protective transformative AI (pTAI). This approach aligns with the ‘defensive acceleration’ framing around AI progress discussed e.g. here.
Task Difficulty and Complexity of Feedback
Suppose we assume:
Then the question of whether the AI system will perform these HITs as we desire, rather than specification-game, develop a deceptive goal, or be incapable of performing the task at all, becomes a matter of logical necessity. Because it is more restricted, this narrower question of how a given architecture will behave when given trainign data and directed to a task can be given a more technical foundation.
One way of describing this is to ask, "How complex or large does the training dataset need to be to ensure the system behaves as we want?"
In this context, the 'training dataset' encompasses a wide range of potential inputs and feedback mechanisms. At the most basic level, it includes human-provided feedback during training (Levels 2-3), but as we move up the alignment difficulty scale we need superhumanly complex feedback to have the system perform the task successfully.
This could include feedback from highly realistic simulated deployment or superhuman AI-generated feedback from scalable oversight (Level 4-5). At the most advanced levels, the dataset might need to include feedback based on the AI's internal features (level 6+).
This perspective on dataset complexity provides a concrete way to quantify the difficulty of alignment. We can imagine a curve where the x-axis represents the complexity or size of the training dataset, and the y-axis represents the capability level of the AI system.
The steepness of this curve becomes a key factor in determining alignment difficulty. If the curve is relatively flat, it suggests that modest increases in feedback complexity can enable a system to perform very powerful tasks while maintaining alignment, and it is therefore more likely that we can rely on simpler techniques to get systems to perform the needed HITs. However, if the curve is steep, it indicates that ensuring alignment for more more difficult tasks requires disproportionately more complex feedback.
Much of our uncertainty about how difficult alignment is arises from not knowing how difficult the necessary HITs are, therefore Stephen McAleese has suggested that we should see the alignment problem as a 2D scale like that above, where the task difficulty is plotted against the complexity of feedback required.
However, we disagree, primarily because views about how difficult the required tasks are correlate closely with the views about how complex the feedback must be for a given task.
In practice, we see views tend to fall into two camps:
This correlation occurs because the two are in fact closely related. The more difficult you believe overall AI alignment is, the more concerns you have about unaligned powerful AI systems being deployed by other actors, and therefore the more impactful and powerful HITs you believe are necessary to change the strategic situation.
Therefore, while the correlation is not exact and depends other factors, we believe that it is best to run them together in the scale: higher alignment difficulty generally implies both that more powerful HITs are needed, and that it is harder to get an AI system to successfully perform a given HIT.
Influence of Architecture
Lastly, it is important to note that the path of AI development significantly influences the challenges we face. As we analyse the scale, keep in mind that our development choices actively shape the alignment landscape we navigate: perhaps CAI alone is sufficient to align a raw language model, but breaks down for an equivalently capable deep RL agent.
Currently, two primary approaches are emerging: language model (LM) agents built on chain-of-thought and decomposition, and larger end-to-end optimised models made up of single large LLMs or other foundation models, with added tools like access to external memory or scratchpads. The choice between these pathways may determine which alignment challenges become most pressing and at what level of difficulty we encounter them. LM agents made up of multiple connected systems, with their more transparent decision-making processes, might present different and potentially more manageable alignment challenges compared to black-box end-to-end optimized systems.
A decade ago, approaches based around deep reinforcement learning (RL) systems augmented with search seemed like the most promising path to AGI: relative to bothof these architectures, pure RL agents seem less likely to be easy to align, as they more explicitly are end-to-end optimised to achieve a single, clearly specified goal.
Especially if AGI is not very near, it is plausible that we could see some new, substantially different paradigms emerge, with different alignment challenges.
Conclusion
As we've explored throughout this article, the AI alignment difficulty scale offers a comprehensive framework for understanding and addressing the challenges of aligning advanced AI systems with human values and intentions.
From the relatively straightforward alignment scenarios at lower levels to the potentially insurmountable challenges at the highest levels, this scale provides a nuanced view of the road ahead.
Key takeaways from our analysis include:
Looking forward, this framework can serve as a tool for: