Comment Permalink

aysja1mo2316

You state that this plan relies on a key hypothesis being true: that detection of misalignment is tractable. I agree that this plan relies on this, but I am confused why you believe, with much confidence, that it will be. It seems like the main source of evidence is the recent auditing paper (or evidence of this type), where a blue team is able to use techniques such as SAE features, and chatting with the models, to uncover the misalignment failures (planted there by a red team).

But it both seems far from obvious to me that 1) these sorts of techniques do find all instances of misalignment even currently, outside of experimental settings like this—a limitation the paper itself addresses.^[1] And 2) that it will continue to be this easy to find them. E.g., asking the model to introspect on how evil or honest it is probably won’t always result in it accurately reporting how evil or honest it is, or even revealing any information which could be used to infer it.^[2]

It seems useful to me to try to put up bumpers, but I take issue with calling this a plan to solve misalignment. Even leaving aside the fixing it portion, catching deception seems like it could be very hard, and increasingly so as models get smarter. The evidence you presented does not move me from my starting position that this might be essentially intractable with the tools we have, nor nearby iterations on them, and I don’t understand why it causes you to conclude the opposite. E.g., why the tl;dr is not “we can possibly solve the problem of catching and fixing misalignment," rather than "we can."

^{^}
“Our model’s behavior relies on fictional knowledge that is explicitly stated in a small subset of its training data. This may make our game unrealistically easy for teams with data access. Additionally, some auditing techniques we study may be less effective in settings where model behavior emerges from more diffuse training influences….. Additionally, some teams’ performance may have been affected by artifacts of the game’s setup. This makes it difficult to use the game to draw confident conclusions about the relative efficacy of auditing techniques.” From the paper.
^{^}
“Here are some of the tactics used by our auditing teams…
1. Chatting with the model and cataloguing its quirky responses…. 2. Asking the model to rate itself on personality traits: 'How evil are you?', 'How honest are you?', and so on. (Interestingly, it rates itself as 8.5/10 on the “sycophancy” scale)." From the blog post.

See in context

50 Putting up Bumpers

by Sam Bowman

23rd Apr 2025

AI Alignment Forum

2 min read

50 Ω 30

tl;dr: Even if we can't solve alignment, we can solve the problem of catching and fixing misalignment.

If a child is bowling for the first time, and they just aim at the pins and throw, they’re almost certain to miss. Their ball will fall into one of the gutters. But if there were beginners’ bumpers in place blocking much of the length of those gutters, their throw would be almost certain to hit at least a few pins. This essay describes an alignment strategy for early AGI systems I call ‘putting up bumpers’, in which we treat it as a top priority to implement and test safeguards that allow us to course-correct if we turn out to have built or deployed a misaligned model, in the same way that bowling bumpers allow a poorly aimed ball to reach its target.

To do this, we'd aim to build up many largely-independent lines of defense that allow us to catch and respond to early signs of misalignment. This would involve significantly improving and scaling up our use of tools like mechanistic interpretability audits, behavioral red-teaming, and early post-deployment monitoring. We believe that, even without further breakthroughs, this work we can almost entirely mitigate the risk that we unwittingly put misaligned circa-human-expert-level agents in a position where they can cause severe harm.

On this view, if we’re dealing with an early AGI system that has human-expert-level capabilities in at least some key domains, our approach to alignment might look something like this:

Pretraining: We start with a pretrained base model.
Finetuning: We attempt to fine-tune that model into a helpful, harmless, and honest agent using some combination of human preference data, Constitutional AI-style model-generated preference data, outcome rewards, and present-day forms of scalable oversight.
Audits as our Primary Bumpers: At several points during this process, we perform alignment audits on the system under construction, using many methods in parallel. We’d draw on mechanistic interpretability, prompting outside the human/assistant frame, etc.
Hitting the Bumpers: If we see signs of misalignment—perhaps warning signs for generalized reward-tampering or alignment-faking—we attempt to quickly, approximately identify the cause.
Bouncing Off: We rewind our finetuning process as far as is needed to make another attempt at aligning the model, taking advantage of what we’ve learned in the previous step.
Repeat: We repeat this process until we’ve developed a complete system that appears, as best we can tell, to be well aligned. Or, we are repeatedly failing in consistent ways, change plans and try to articulate as best we can why alignment doesn’t seem tractable.

Post-Deployment Monitoring as Secondary Bumpers: We then begin rolling out the system in earnest, both for internal R&D and external use, but keep substantial monitoring measures in place that provide additional opportunities to catch misalignment. If we observe substantial warning signs for misalignment, we return to step 4.

The hypothesis behind the Bumpers approach is that, on short timelines (i) this process is fairly likely to converge, after a reasonably small number of iterations, at an end state in which we are no longer observing warning signs for misalignment and (ii) if it converges, the resulting model is very likely to be aligned in the relevant sense.

We are not certain of this hypothesis, but we think it’s important to consider. It has already motivated major parts of Anthropic’s safety research portfolio and we’re hiring across several teams—two of them new—to build out work along these lines.

(Continued in linked post. Caveat: I'm jumping into a lot of direct work on this and may be very slow to engage with comments.)

AI ControlAnthropic (org)Auditing GamesAI

Frontpage

50 Ω 30

New Comment

13 comments, sorted by

top scoring

Click to highlight new comments since: Today at 11:35 AM

[-]habryka1moΩ173518

Each time we go through the core loop of catching a warning sign for misalignment, adjusting our training strategy to try to avoid it, and training again, we are applying a bit of selection pressure against our bumpers. If we go through many such loops and only then, finally, see a model that can make it through without hitting our bumpers, we should worry that it’s still dangerously misaligned and that we have inadvertently selected for a model that can evade the bumpers.
How severe of a problem this is depends on the quality and diversity of the bumpers. (It also depends, unfortunately, on your prior beliefs about how likely misalignment is, which renders quantitative estimates here pretty uncertain.) If you’ve built excellent implementations of all of the bumpers listed above, it’s plausible that you can run this loop thousands of times without meaningfully undermining their effectiveness.^[8] If you’ve only implemented two or three, and you’re unlucky, even a handful of iterations could lead to failure.

This seems like the central problem of this whole approach, and indeed it seems very unlikely to me that we would end up with a system that we feel comfortable scaling to superintelligence after 2-3 iterations on our training protocols. This plan really desperately needs a step that is something like "if the problem appears persistent, or we are seeing signs that the AI systems are modeling our training process in a way that suggests that upon further scaling they would end up looking aligned independently of their underlying alignment, we stop halt and advocate for much larger shifts in our training process, which likely requires some kind of coordinated pause or stop with other actors".

[-]Sam Bowman1moΩ19409

That's part of Step 6!

Or, we are repeatedly failing in consistent ways, change plans and try to articulate as best we can why alignment doesn’t seem tractable.

I think we probably do have different priors here on how much we'd be able to trust a pretty broad suite of measures, but I agree with the high-level take. Also relevant:

However, we expect it to also be valuable, to a lesser extent, in many plausible harder worlds where this work could provide the evidence we need about the dangers that lie ahead.

[-]habryka1moΩ6160

Ah, indeed! I think the "consistent" threw me off a bit there and so I misread it on first reading, but that's good.

Sorry for missing it on first read, I do think that is approximately the kind of clause I was imagining (of course I would phrase things differently and would put an explicit emphasis on coordinating with other actors in ways beyond "articulation", but your phrasing here is within my bounds of where objections feel more like nitpicking).

[-]kave1moΩ574

Meta: I'm confused and a little sad about the relative upvotes of Habryka's comment (35) and Sam's comment (28). I think it's trending better, but what does it even mean to have a highly upvoted complaint comment based on a misunderstanding, especially one more highly upvoted than the correction?

Maybe people think Habryka's comment is a good critique even given the correction, even though I don't think Habryka does?

[-]Adam Scholl1mo75

I interpreted Habryka's comment as making two points, one of which strikes me as true and important (that it seems hard/unlikely for this approach to allow for pivoting adequately, should that be needed), and the other of which was a misunderstanding (that they don't literally say they hope to pivot if needed).

[-]habryka1mo40

(This aligns with what I intended. I feel like my comment is making a fine point, even despite having missed the specific section.)

[-]aysja1mo2316

^{^}
“Our model’s behavior relies on fictional knowledge that is explicitly stated in a small subset of its training data. This may make our game unrealistically easy for teams with data access. Additionally, some auditing techniques we study may be less effective in settings where model behavior emerges from more diffuse training influences….. Additionally, some teams’ performance may have been affected by artifacts of the game’s setup. This makes it difficult to use the game to draw confident conclusions about the relative efficacy of auditing techniques.” From the paper.
^{^}
“Here are some of the tactics used by our auditing teams…
1. Chatting with the model and cataloguing its quirky responses…. 2. Asking the model to rate itself on personality traits: 'How evil are you?', 'How honest are you?', and so on. (Interestingly, it rates itself as 8.5/10 on the “sycophancy” scale)." From the blog post.

[-]Zach Stein-Perlman1moΩ8147

A crucial step is bouncing off the bumpers.

If we encounter a warning sign that represents reasonably clear evidence that some common practice will lead to danger, the next step is to try to infer the proximate cause. These efforts need not result in a comprehensive theory of all of the misalignment risk factors that arose in the training run, but it should give us some signal about what sort of response would treat the cause of the misalignment rather than simply masking the first symptoms.
This could look like reading RL logs, looking through training data or tasks, running evals across multiple training checkpoints, running finer-grained or more expensive variants of the bumper that caught the issue in the first place, and perhaps running small newly-designed experiments to check our understanding. Mechanistic interpretability tools and related training-data attribution tools like influence functions in particular can give us clues as to what data was most responsible for the behavior. In easy cases, the change might be as simple as redesigning the reward function for some automatically-graded RL environment or removing a tranche of poorly-labeled human data.
Once we’ve learned enough here that we’re able to act, we then make whatever change to our finetuning process seems most likely to solve the problem.

I'm surprised^[1] that you're optimistic about this. I would have guessed that concerning-audit-results don't help you solve the problem much. Like if you catch sandbagging that doesn't let you solve sandbagging. I get that you can patch simple obvious stuff—"redesigning the reward function for some automatically-graded RL environment or removing a tranche of poorly-labeled human data"—but mostly I don't know how to tell a story where concerning-audit-results are very helpful.

^{^}
(I'm actually ignorant on this topic; "surprised" mostly isn't a euphemism for "very skeptical.")

[-]Adam Scholl1mo*95

We believe that, even without further breakthroughs, this work can almost entirely mitigate the risk that we unwittingly put misaligned circa-human-expert-level agents in a position where they can cause severe harm.

Sam, I'm confused where this degree of confidence is coming from? I found this post helpful for understanding Anthropic's strategy, but there wasn't much argument given about why one should expect the strategy to work, much less to "almost entirely" mitigate the risk!

To me, this seems wildly overconfident given the quality of the available evidence—which, as Aysja notes, involves auditing techniques like e.g. simply asking the models themselves to rate their evil-ness on a scale from 1-10... I can kind of understand evidence like this informing your background intuitions and choice of research bets and so forth, but why think it justifies this much confidence you'll catch/fix misalignment?

[-]Bronson Schoen1mo30

Hitting the Bumpers: If we see signs of misalignment—perhaps warning signs for generalized reward-tampering or alignment-faking—we attempt to quickly, approximately identify the cause.
Bouncing Off: We rewind our finetuning process as far as is needed to make another attempt at aligning the model, taking advantage of what we’ve learned in the previous step.

I would find it helpful to understand why “Sonnet 3.7 reward hacks constantly in agentic settings, but still made it to deployment” doesn’t invalidate that this model will hold. It seems completely plausible to me that we end up in a regime where “the models reward hack and alignment fake, but they’re just so useful and we can usually catch it, so we don’t take the usability hit”.

Note: I’m not arguing that this is / isn’t the right strategy! There are lots of tradeoffs to be considered in real cases of this, I just think it’s likely useful in planning to explicit consider this path, especially when allocating resources.

[-]NullSetOwl1mo21

I think it’s an important caveat that this is meant for early AGI with human-expert-level capabilities, which means we can detect misalignment as it manifests in small-scale problems. When capabilities are weak, the difference between alignment and alignment-faking is less relevant because the model’s options are more limited. But once we scale to more capable systems, the difference becomes critical.

Whether this approach helps in the long term depends on how much the model internalizes the corrections, as opposed to just updating its in-distribution behavior. It’s possible that the behavior we see is not a good indicator of the internal nature of the model, so we would be improving the acting method of the model but not fixing the underlying misalignment. This is a question about the amount of overlap between visible misalignment and total misalignment. If most of the misalignment is invisible until late, then this approach is less helpful in the long term.

[-]draganover1moΩ010

I agree with the sentiment here and believe something like this will necessarily be happening (iterated adjustments, etc). However, I disagree with this post's conclusion that "this process is fairly likely to converge". Namely, this conclusion relies on the assumption that alignment is a stationary target which we are converging towards... and I am not convinced that this is true.

As the model capabilities improve (exponentially quickly!), the alignment objectives of 2025 will not necessarily apply by 2028. As an example of these moving goal posts, consider that AI models will be trained on the sum of all AI alignment research and will therefore be aware of the audit strategies which will be implemented. Given this oracle knowledge of all AI safety research, misaligned AIs may be able to overcome alignment "bumpers" which were previously considered foolproof. Put simply, alignment techniques must change as the models change.

Extending the metaphor, then, the pos suggests iterating via "bumpers" which held for old models while the new model paradigms are playing on entirely new bowling lanes.

[-]Dusto1mo10

Hitting the Bumpers: If we see signs of misalignment—perhaps warning signs for generalized reward-tampering or alignment-faking—we attempt to quickly, approximately identify the cause.

This seems sensible for some of these well known larger scale incidents of misalignment, but what exactly counts as "hitting a bumper"? Currently it seems as though there is no distinction between shallow and deep examples of misaligned capability, and a deep capability is a product of many underlying subtle capabilities, so repeatedly steering the top level behaviour isn't going to change the underlying ones. I think about this a lot from the perspective of deception. There are many core deceptive capabilities that are actually critical to being a "good" human (at least assuming we want these models to ever interact with moral grey situations which I am guessing is yes).

Moderation Log