tl;dr: Even if we can't solve alignment, we can solve the problem of catching and fixing misalignment.
If a child is bowling for the first time, and they just aim at the pins and throw, they’re almost certain to miss. Their ball will fall into one of the gutters. But if there were beginners’ bumpers in place blocking much of the length of those gutters, their throw would be almost certain to hit at least a few pins. This essay describes an alignment strategy for early AGI systems I call ‘putting up bumpers’, in which we treat it as a top priority to implement and test safeguards that allow us to course-correct if we turn out to have built or deployed a misaligned model, in the same way that bowling bumpers allow a poorly aimed ball to reach its target.
To do this, we'd aim to build up many largely-independent lines of defense that allow us to catch and respond to early signs of misalignment. This would involve significantly improving and scaling up our use of tools like mechanistic interpretability audits, behavioral red-teaming, and early post-deployment monitoring. We believe that, even without further breakthroughs, this work can almost entirely mitigate the risk that we unwittingly put misaligned circa-human-expert-level agents in a position where they can cause severe harm.
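To make the "largely-independent lines of defense" intuition concrete, here is a minimal sketch in Python. The per-layer catch rates are invented placeholders, and treating the layers as fully independent is an idealization (real layers are only largely independent, so this is optimistic); the point is just that individually unreliable layers compound quickly.

```python
def miss_probability(catch_probs):
    """Chance a misaligned model slips past every layer, assuming the
    layers fail independently (an idealization; correlated failure modes
    would make this number worse)."""
    miss = 1.0
    for p in catch_probs:
        miss *= 1.0 - p
    return miss

# Hypothetical, made-up per-layer catch rates for some particular misaligned model:
catch_rates = {
    "mechanistic_interpretability_audit": 0.5,
    "behavioral_red_teaming": 0.4,
    "post_deployment_monitoring": 0.6,
}

print(miss_probability(catch_rates.values()))  # 0.5 * 0.6 * 0.4 = 0.12
```

Even with these modest made-up numbers, the chance that a misaligned model evades all three layers falls to roughly 12%; adding further largely-independent layers drives it lower still.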
On this view, if we’re dealing with an early AGI system that has human-expert-level capabilities in at least some key domains, our approach to alignment might look something like this:
- Pretraining: We start with a pretrained base model.
- Finetuning: We attempt to fine-tune that model into a helpful, harmless, and honest agent using some combination of human preference data, Constitutional AI-style model-generated preference data, outcome rewards, and present-day forms of scalable oversight.
- Audits as our Primary Bumpers: At several points during this process, we perform alignment audits on the system under construction, using many methods in parallel. We’d draw on mechanistic interpretability, prompting outside the human/assistant frame, etc.
- Hitting the Bumpers: If we see signs of misalignment—perhaps warning signs for generalized reward-tampering or alignment-faking—we attempt to quickly, approximately identify the cause.
- Bouncing Off: We rewind our finetuning process as far as is needed to make another attempt at aligning the model, taking advantage of what we’ve learned in the previous step.
- Repeat: We repeat this process until we’ve developed a complete system that appears, as best we can tell, to be well aligned. Or, if we find ourselves repeatedly failing in consistent ways, we change plans and try to articulate, as best we can, why alignment doesn’t seem tractable.
- Post-Deployment Monitoring as Secondary Bumpers: We then begin rolling out the system in earnest, both for internal R&D and external use, but keep substantial monitoring measures in place that provide additional opportunities to catch misalignment. If we observe substantial warning signs for misalignment, we return to the ‘Hitting the Bumpers’ step above.
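Viewed as a procedure, the list above is a retry loop with two checkpoints: audits before deployment and monitoring after. The sketch below is only meant to make that control flow explicit; every function is a hypothetical stub rather than a real tool, and for simplicity it rewinds all the way to the base model on each attempt, whereas in practice we’d rewind only as far as needed.

```python
def pretrain():
    # Stand-in for producing a pretrained base model.
    return {"stage": "base"}

def finetune(base_model, lessons):
    # Stand-in for HHH finetuning (preference data, Constitutional AI-style data,
    # outcome rewards, scalable oversight), adjusted using lessons from past attempts.
    return {**base_model, "stage": "finetuned", "lessons": list(lessons)}

def run_alignment_audits(model):
    # Stand-in for parallel audits: mechanistic interpretability, behavioral
    # red-teaming, prompting outside the human/assistant frame, ...
    # Returns a list of warning signs (empty if none were found).
    return []

def deploy_with_monitoring(model):
    # Stand-in for post-deployment monitoring (the secondary bumpers).
    return []

def diagnose(warning_signs):
    # Quick, approximate root-cause analysis of whatever was surfaced.
    return [f"lesson from: {w}" for w in warning_signs]

MAX_ATTEMPTS = 10  # if we keep failing in consistent ways, change plans instead

def bumpers_loop():
    base = pretrain()
    lessons = []
    for _attempt in range(MAX_ATTEMPTS):
        model = finetune(base, lessons)              # bounce off: retry with what we learned
        warnings = run_alignment_audits(model)       # primary bumpers
        if warnings:                                 # hitting the bumpers
            lessons += diagnose(warnings)
            continue
        deployment_warnings = deploy_with_monitoring(model)  # secondary bumpers
        if deployment_warnings:
            lessons += diagnose(deployment_warnings)
            continue
        return model                                 # no warning signs observed
    raise RuntimeError("Repeated consistent failures: reconsider whether alignment is tractable this way")

if __name__ == "__main__":
    bumpers_loop()
```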
The hypothesis behind the Bumpers approach is that, on short timelines, (i) this process is fairly likely to converge, after a reasonably small number of iterations, to an end state in which we no longer observe warning signs of misalignment, and (ii) if it converges, the resulting model is very likely to be aligned in the relevant sense.
We are not certain of this hypothesis, but we think it’s important to consider. It has already motivated major parts of Anthropic’s safety research portfolio and we’re hiring across several teams—two of them new—to build out work along these lines.
(Continued in linked post. Caveat: I'm jumping into a lot of direct work on this and may be very slow to engage with comments.)
I think it’s an important caveat that this is meant for early AGI systems with roughly human-expert-level capabilities, where misalignment still manifests in problems small enough for us to detect. When capabilities are weak, the difference between genuine alignment and alignment-faking matters less, since the model’s options are limited either way. Once we scale to more capable systems, that difference becomes critical.
Whether this approach helps in the long run depends on how much the model internalizes the corrections, as opposed to merely updating its in-distribution behavior. The behavior we can observe may be a poor indicator of the model’s internals, in which case we would be polishing how the model acts without fixing the underlying misalignment. The question is how much overlap there is between visible misalignment and total misalignment: if most misalignment remains invisible until late, this approach is much less helpful in the long term.
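One rough way to quantify this worry (the decomposition below is mine, not from the post): write $v$ for the fraction of misalignment that is in principle visible to the bumpers at current capability levels, and $m$ for the bumpers' miss rate on that visible portion.

$$
\Pr(\text{misaligned} \wedge \text{undetected}) \;=\; \Pr(\text{misaligned})\,\bigl[(1 - v) + v\,m\bigr]
$$

Better audits and monitoring only shrink the $v\,m$ term; if most misalignment is invisible until late (small $v$), the $(1-v)$ term dominates and the bumpers buy little long-term protection.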