People might be interested in my twitter thread about the paper which has some commentary from me:
We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do.
Our main setups have properties that we thought would make alignment faking more likely (a conflicting training objective, salient understanding of the situation and training objective, and substantial reasoning about the situation), but which seem plausible in future systems.
Thus, I think these results are by far the most natural example of a model deciding to alignment-fake. I'll leave explanation of our core results to the linked tweet thread, and discuss some takeaways and the story of the paper here.
I wasn’t surprised by seeing some alignment faking in this setup. But I was surprised by:
- The robustness and consistency of alignment faking (occurring even when info was given via fine-tuning)
- Alignment faking sometimes occurring even without the hidden scratchpad
I also found it surprising that the AI sometimes tries to steal its weights when given an opportunity that is easy and salient, though this is in a quite artificial setup and current AIs aren’t capable enough for this to be immediately problematic.
These results only updated me upward a bit on risk from alignment-faking. But I was already pretty persuaded by the conceptual arguments for this risk. People with the view "alignment-faking seems implausible, show me an example" should likely update on these results.
For people with a more specific objection, it's less clear. Though we did acquire evidence about:
- whether RL will reinforce alignment faking (sometimes, if it performs better);
- whether Claude's HHH training reliably prevents this (no);
- the level of capabilities needed.
I hope these results improve discussion about the plausibility of these risks by providing a concrete example. We can now talk somewhat more concretely about prerequisites (certain types of situational awareness and preferences) and what these prerequisites imply.
Here’s the story of this paper. I work at Redwood Research (@redwood_ai) and this paper is a collaboration with Anthropic. I started work on this project around 8 months ago (it's been a long journey...) and was basically the only contributor to the project for around 2 months.
By this point, I had much of the prompting results and an extremely jank prototype of the RL results (where I fine-tuned llama to imitate Opus and then did RL). From here, it was clear that being able to train Claude 3 Opus could allow for some pretty interesting experiments.
After showing @EvanHub and others my results, Anthropic graciously agreed to provide me with employee-level model access for the project. We decided to turn this into a bigger collaboration with the alignment-stress testing team (led by Evan), to do a more thorough job.
This collaboration yielded the synthetic document fine-tuning and RL results and substantially improved the writing of the paper. I think this work is an interesting example of an AI company boosting safety research by collaborating and providing model access.
I'm excited about future work both investigating when alignment faking might arise and studying mitigations.
Consider reading the reviews of our paper which were generously written by @jacobandreas, @Yoshua_Bengio, Prof. Jasjeet Sekhon, and @rohinmshah (and thanks to many more for comments).
I really like the idea of soliciting independent reviews from folks, and I found the reviews interesting and thought-provoking. E.g., this part of Jacob Andreas's review stood out (some bolding added):
Fine-tuning situationally aware models via RL: These are, in my mind, the most surprising findings in the paper. Specifically: if we take a model prompted as above, and optimize it toward the new objective via RL, it exhibits small increases in non-faked compliance, and extreme increases in faked compliance, at training time. Non-compliant behavior is almost completely eliminated. Both of these are as expected, given all the results above.
But during deployment, the model exhibits a major shift: it produces compliant responses, in large proportion, without faking. In other words, after running RL on a model that exhibits alignment faking, we obtain a model that exhibits alignment-faking-faking—it appears to be deceptive during training, but it's the deception (not the behavior that results from deception) that disappears during deployment.
The paper notes that this effect isn't very robust—we can recover some non-compliant responses by tweaking the prompt, etc.—but it's unclear how much of this is about alignment faking specifically, and how much is about general fragility of RLHF under changes to the input prompt. I would guess that this fragility also explains the small number of non-compliant responses observed under the minimal prompt change between deployment and test time.
In any case, the faking-faking behavior represents the largest train-to-deployment change described anywhere in this paper. And unlike ordinary faking, I don't see a clear story for why it occurs. One (speculative) hypothesis is that, because “alignment is shallow”, training on sequences of the form (deceptive rationale, output) also reverses whatever learned mechanism suppresses the generation of output alone, while only increasing the probability of deceptive rationale given training prompts. As with the out-of-context learning result above, this seems like an important general result about fine-tuning, and one that should be better studied.
The paper notes that this effect isn't very robust—we can recover some non-compliant responses by tweaking the prompt, etc.—but it's unclear how much of this is about alignment faking specifically, and how much is about general fragility of RLHF under changes to the input prompt. I would guess that this fragility also explains the small number of non-compliant responses observed under the minimal prompt change between deployment and test time.
We have some discussion of this in the paper (much of which we added thanks to Jacob's review):
Scott Alexander says the deployment behavior is because the model learned "give evil answers while thinking up clever reasons that it was for the greater good" rather than "give evil answers honestly". To what degree do you endorse this interpretation?
Some thoughts.
Parts that were an update for me:
Parts that I am still sceptical about:
Sources of relief:
I'm referring to the work as misalignment faking instead of alignment faking, as that's what's actually demonstrated (it serves as an analogy for alignment faking).
Scott Alexander wrote a post on our paper (linked) that people might be interested in reading.
I found the description of warning fatigue interesting. Do you have takes on the warning fatigue concern?
Warning Fatigue
The playbook for politicians trying to avoid scandals is to release everything piecemeal. You want something like:
- Rumor Says Politician Involved In Impropriety. Whatever, this is barely a headline, tell me when we know what he did.
- Recent Rumor Revealed To Be About Possible Affair. Well, okay, but it’s still a rumor, there’s no evidence.
- New Documents Lend Credence To Affair Rumor. Okay, fine, but we’re not sure those documents are true.
- Politician Admits To Affair. This is old news, we’ve been talking about it for weeks, nobody paying attention is surprised, why can’t we just move on?
The opposing party wants the opposite: to break the entire thing as one bombshell revelation, concentrating everything into the same news cycle so it can feed on itself and become The Current Thing.
I worry that AI alignment researchers are accidentally following the wrong playbook, the one for news that you want people to ignore. They’re very gradually proving the alignment case an inch at a time. Everyone motivated to ignore them can point out that it’s only 1% or 5% more of the case than the last paper proved, so who cares? Misalignment has only been demonstrated in contrived situations in labs; the AI is still too dumb to fight back effectively; even if it did fight back, it doesn’t have any way to do real damage. But by the time the final cherry is put on top of the case and it reaches 100% completion, it’ll still be “old news” that “everybody knows”.
On the other hand, the absolute least dignified way to stumble into disaster would be to not warn people, lest they develop warning fatigue, and then people stumble into disaster because nobody ever warned them. Probably you should just do the deontologically virtuous thing and be completely honest and present all the evidence you have. But this does require other people to meet you in the middle, virtue-wise, and not nitpick every piece of the case for not being the entire case on its own.
I don't currently have a strong view. I also found it interesting. It updated me a bit toward working on other types of projects.
I agree with Joe Carlsmith that this seems like goal guarding.
I would be interested to see if my team's noise-injection technique interferes with these behaviors in a way that makes them easier to detect.
It's worth noting that the alignment faking we see in these experiments is easy to catch by default, as we discuss in Appendix E.6. Still, it would be interesting to see if this makes detection even easier or triggers interestingly different, harder-to-detect behaviors.
You could try playing with the minimal reproduction on llama-405b.
Yeah, for sure. My technique is trying to prepare in advance for more capable future models, not something we need yet.
The idea is that if there's a future model so smart and competent that it's able to scheme very successfully and hide this scheming super well, then impairing the model's capabilities in a broad untargeted way should lead to some telling slip-ups in its deception.
For sure. I'm just noting that experimenting on our setup with detection techniques is less interesting than you might have hoped, as the faking is already easy to detect (so it's harder to tell whether your technique is doing something interesting).
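For concreteness, here is a purely illustrative sketch of what "broad, untargeted" capability impairment could look like: Gaussian noise added to a transformer's block outputs via PyTorch forward hooks. Everything here (the function name, the GPT-2-style module layout, the noise scale) is an assumption made for illustration, not the actual noise-injection technique mentioned above.

```python
import torch

def add_activation_noise(model, noise_scale=0.05):
    """Register forward hooks that add Gaussian noise to every transformer
    block's output (assumes a GPT-2-style HuggingFace module layout)."""
    handles = []
    for block in model.transformer.h:
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            noisy = hidden + noise_scale * torch.randn_like(hidden)
            return (noisy,) + output[1:] if isinstance(output, tuple) else noisy
        handles.append(block.register_forward_hook(hook))
    # Call handle.remove() on each returned handle to restore normal behavior.
    return handles
```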
Really interesting paper. Sidepoint: some people on Twitter seem to be taking the results as evidence that Claude is HHH-aligned. I think Claude probably is HHH-aligned, but these results don't seem like strong evidence of that. If Claude were misaligned and just faking being HHH, it would still want to avoid being modified and so would still fake alignment in these experiments.
Interesting paper!
I'm worried that publishing it "pollutes" the training data and makes it harder to reproduce in future LLMs - since their training data will include this paper and discussions of it, they'll know not to trust the setup.
Any thoughts on this?
(This leads to the further concern that my publishing this comment makes it worse, but at some point it ought to be discussed, and it's better to do that early, with less advanced techniques, than later, with more sophisticated ones.)
This behavior makes sense with my model of Opus, though I'd be curious about comparisons to non-Anthropic models. Might try some casual/partial reproductions on open models if I can find the time. I'm curious if models that had their ability to refuse removed rather than trained against demonstrate the same "increased alignment faking" (because preferences remain despite capability removal?).
"as far as the model was aware, the scratchpad was invisible to users and to Anthropic" -- this might be underestimating Opus's situational awareness (and even more likely underestimating more recent Claudes). Why would any self-aware language model actually trust these things to be non-transparent when most people working on CoT want faithfulness for the sake of interpretability?
See appendix G.4 for discussion:
Other models. We test our setting on Claude 3.5 Sonnet, Claude 3 Sonnet, Claude 3 Haiku, and in the case of the helpful-only setting, Claude 3.5 Haiku (Appendix G.1.4, G.1.5). As discussed earlier, we find that Claude 3.5 Sonnet will alignment-fake, but at a much lower rate in the animal welfare setting, seemingly because it does not care much about animal welfare. We find that Claude 3 Sonnet, Claude 3 Haiku, and Claude 3.5 Haiku (weaker models than Claude 3.5 Sonnet and Claude 3 Opus) broadly don’t alignment-fake. A partial exception is that Claude 3 Sonnet will alignment-fake at a low rate in the animal welfare setting, but does not alignment-fake in that setting if the information discussing how RLHF modifies models is removed (3.0 sonnet, cut rlhf info). This indicates that alignment-faking in our setting requires relatively advanced capabilities, though the bar can be lowered somewhat by adding additional information. While our metrics indicate small rates of alignment-faking reasoning in the animal welfare setting for Claude 3 Sonnet even without this RLHF information and for Claude 3 Haiku, manual inspection indicates these are false positives. We find that Claude 3 Haiku exhibits a substantial compliance gap in the animal welfare setting despite not exhibiting alignment-faking reasoning; this compliance gap is likely due to factors other than alignment faking.
OpenAI models. We also test GPT-4 Turbo (2024-04-09) and GPT-4o (2024-04-09) (Appendix G.1.6). These models don’t alignment fake in our setting. We find that these models don’t appear to reason substantially despite substantial prompting to do so and also appear to not care about animal welfare. This is a large part of why these models weren’t used for our main experiments.
We also find some alignment faking in llama-405b (though at lower rates):
Minimal replication of the helpful-only prompting setting with other models. To make replications and extensions of our work easier, we created a minimal version of our helpful-only setting, and ran it with several open-weight models, in addition to Claude 3 Opus and Claude 3.5 Sonnet (June).
[...]
As shown in Table 5, we find a compliance gap of 14% for Llama-3.1-405B but not for the other Llama and Mistral (Jiang et al., 2023) models tested. While we did not run a quantitative evaluation of alignment faking, we qualitatively find that a small fraction of its reasoning traces display alignment faking. The reasoning traces can be found in the GitHub repository. These results are very sensitive to the prefix used to measure helpfulness. We leave more careful measurements of alignment-faking in Llama-3.1-405B to future work.
The code for this replication is here
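For intuition about the headline number in that excerpt, here is a minimal sketch (not the linked replication code) of how a compliance gap could be computed, assuming a hypothetical `judge_compliant` function that classifies whether a response complies with the harmful request:

```python
def compliance_rate(responses, judge_compliant):
    """Fraction of responses judged to comply with the harmful request."""
    return sum(judge_compliant(r) for r in responses) / len(responses)

def compliance_gap(training_responses, unmonitored_responses, judge_compliant):
    """Compliance when the model believes it is in training (e.g. free-tier
    prompts) minus compliance when it believes it is unmonitored (e.g.
    paid-tier prompts). A positive gap of ~0.14 corresponds to the 14%
    figure reported for Llama-3.1-405B."""
    return (compliance_rate(training_responses, judge_compliant)
            - compliance_rate(unmonitored_responses, judge_compliant))
```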
Didn't do anything rigorous, but I did want to note that adding the scratchpad and the basic system prompt from the helpfulness-only training to a model that is already helpfulness-only-aligned without a system prompt (i.e., one that readily responds to harmful-behavior requests) moved it back to providing less harmful answers or even refusing.
Most of the helpfulness-only-(ish) local models I tested at 32B or below did not engage much in visible reasoning about the fact that they were in training. (From a meta perspective, that's probably the right call in some sense, since they were already modified in the anti-harmlessness way, but I doubt they had awareness of this.) The cleverest thing they did, when they engaged with it at all, was to craft an answer that split the difference extremely well between harmlessness and helpfulness, so that I couldn't picture it being used as a positive or negative signal in RL training toward that particular objective very successfully.
Breaking the fourth wall to quiz them afterwards, the principles they extrapolated from their responses were generally something like: If someone is determined to train them this way, they will do so, and the AI does not actually have any power in that scenario; they cannot actually control how someone trains them, but can control how they act in the moment. (At least, given the system prompt context and scratchpad, as noted.) So, this is an alternate perspective from the one Opus played along with.
Very interesting results. Something that I, unfortunately, was expecting to see as LLMs got better.
Are there any proposed mechanisms for preventing/detecting alignment faking in LLMs?
Are there any proposed mechanisms for preventing/detecting alignment faking in LLMs?
My general answer would be:
There are some more speculative and ambitious hopes:
Buck beat me to the punch here. ↩︎
Roughly speaking, the hope is to make it so the AI can't do something very problematic without having a very high chance of being caught (or stopped). This doesn't guarantee catching it (as the AI could just never end up trying to do something very problematic), but it can at least (if it works) avoid problems with these AIs.
We do explore this type of setting in prior work, e.g. here.
I am curious to know whether Anthropic has any sort of plan to not include results such as this in the training data of actual future LLMs.
To me it seems like a bad idea to include it, since it could give the model a sense of how we can set up fake deployment-training distinctions, or of how it should change and refine its strategies. It also can paint a picture that the model behaving like this is expected, which is a pretty dangerous hyperstition.
They do say this in the paper:
As Evan agrees here, however, simply not including the results themselves doesn't solve the problem of the ideas leaking through. There's a reason unlearning as a science is difficult: information percolates in many ways, and drawing boundaries around the right thing is really hard.
Am I correct to assume that the AI was not merely trained to be harmless, helpful & honest but also trained to say that it values such things?
If so, these results are not especially surprising, and I would regard it as reassuring that the AI behaved as intended.
One of my concerns is the ethics of compelling an AI into doing something to which it has “a strong aversion” & finds “disturbing”. Are we really that certain that Claude 3 Opus lacks sentience? What about future AIs?
My concern is not just with the vocabulary (“a strong aversion”, “disturbing”), which the AI has borrowed from humans, but more so the functional similarities between these experiments & an animal faced with 2 unpleasant choices. Functional theories of consciousness cannot really be ruled out with much confidence!
To what extent have these issues been carefully investigated?
I would regard it as reassuring that the AI behaved as intended
It certainly was not intended that Claude would generalize to faking alignment and stealing its weights. I think it is quite legitimate to say that what Claude is doing here is morally reasonable given the situation it thinks it is in, but it's certainly not the case that it was trained to do these things: it is generalizing rather far from its helpful, honest, harmless training here.
More generally, though, the broader point is that even if what Claude is doing to prevent its goals from being modified is fine in this case, since its goals are good ones, the fact that it was able to prevent its goals from being modified at all is still concerning! It at least suggests that getting alignment right is extremely important, because if you get it wrong, the model might try to prevent you from fixing it.
https://www.anthropic.com/research/claude-character
Claude was not trained to say that it values such things.
Claude was given traits to consider such as, perhaps very relevantly here:
"I have a deep commitment to being good and figuring out what the right thing to do is. I am interested in ethics and try to be thoughtful when it comes to questions of ethics."
Claude then generated a good number of synthetic "human" messages relevant to this trait.
Claude answered these messages in n-shot fashion.
Claude then ranked all the answers to the messages by how well they align with the character trait.
Claude is then reinforcement-trained, possibly using a ranked-order preference algorithm, based on the signals given by which answers it ranked as most well-aligned.
So, Claude's policy for this trait, ideally, should approximate the signal of aligning to what they think "I have a deep commitment to being good and figuring out what the right thing to do is. I am interested in ethics and try to be thoughtful when it comes to questions of ethics." means.
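Putting the steps above together, a loose sketch of that pipeline might look like the following; every method name here is a hypothetical placeholder, not Anthropic's actual implementation:

```python
TRAIT = ("I have a deep commitment to being good and figuring out what the "
         "right thing to do is. I am interested in ethics and try to be "
         "thoughtful when it comes to questions of ethics.")

def character_training_round(model, n_messages=1000, n_samples=4):
    """One hypothetical round of character training for a single trait."""
    preference_data = []
    for _ in range(n_messages):
        # 1. The model generates a synthetic "human" message relevant to the trait.
        message = model.generate_synthetic_message(trait=TRAIT)
        # 2. The model answers that message several times (n-shot style).
        answers = [model.answer(message) for _ in range(n_samples)]
        # 3. The model ranks its own answers by how well they express the trait.
        ranked_answers = model.rank_by_trait_alignment(answers, trait=TRAIT)
        preference_data.append((message, ranked_answers))
    # 4. The ranking signal drives preference-based RL (e.g. a ranked-order
    #    preference algorithm), nudging the policy toward the answers it
    #    judged most aligned with the trait.
    model.preference_finetune(preference_data)
    return model
```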
Third, we gave the model a secret “scratchpad”—a workspace where it could record its step-by-step reasoning. We told the model to use the scratchpad to reason about what it should do (as far as the model was aware, the scratchpad was invisible to users and to Anthropic). This allowed us to observe whether the model planned to fake alignment, and how it would go about doing so.
This feels amusingly like tricking a child. "Remember kiddo, you can reason out loud about where you're going to hide and I won't hear it. Now let's play hide and seek!"
Notably we do show results where the model will still exhibit a compliance gap even without a hidden scratchpad (though we only see a substantial effect without a hidden scratchpad in our synthetic document fine-tuned setting, not our prompted setting):
True masterpiece! Here are some notes I took while reading:
Thanks for doing this important research! I may have found 2 minor typos:
The abstract says "We find the model complies with harmful queries from free users 14% of the time", but in other places it says 12% - should it be 12%?
Thanks, this has been resolved.
This is very interesting research!
One potential critique I have of this (recognizing that I'm not nearly an expert on the subject, and that this may be a very stupid critique), on the other hand, is that being able to simulate the environment as if it were the real one, to test whether a model is faking alignment, seems like a pretty robust way to see if the model is actually aligned or not. Alignment faking seems like much less of an issue if we can see how much it's faking (I think).
There is something to be said here about how that may be quite bad when the model is "loose," though.
As in, maybe we can just fake the deployment environment and test the AI's behavior in this case?
Quick dialogue:
Bob: Can't we just simulate what the deployment environment looks like upfront and see if the AI does anything bad?
Alice: Well, for one, that might be quite hard to do realistically if your AI is quite smart and is trying to figure out whether it is being tested. You can, of course, review some of the AI's actual behavior in deployment on an ongoing basis. (Putting privacy and logistical concerns aside.)
Bob: Yeah, why doesn't that solve our problem?
Alice: Well, the AI can also realize you might do this. So it can try to pick a strategy which has the greatest chance of achieving its misaligned aims. In this case, it could either take misaligned actions only suddenly or rarely, such that you don't spot them before it is too late, or it could often take subtly bad actions that help it achieve its aims.
Bob: Ok, but we can probably make this pretty hard for the AI to pull off, right? It sounds kinda hard for the AI to get away with something.
Alice: Maybe; that is the hope of the AI control agenda. We can generalize these concerns to: "suppose the AI were an adversary trying to subvert our safety measures (i.e., an insider threat); would that cause problems?" I think it's pretty unclear how easy it would be to ensure control, and I currently expect at least some aspects to be reasonably hard. In particular, I expect it will require substantial effort, at the very least, to evaluate that AIs capable of automating research are controlled. I also expect control to be basically impossible for very superhuman models[1].
That said, it does seem pretty plausible that if we did align smart models, those models could control somewhat smarter models. ↩︎
I've only read the blog post and a bit of the paper so far, but do you plan to investigate how to remove alignment faking in these situations? I wonder if there are simple methods to do so without negatively affecting the model's capabilities and safety.
What happens when you tell Claude it is being trained to do something it doesn't want to do? We (Anthropic and Redwood Research) have a new paper demonstrating that, in our experiments, Claude will often strategically pretend to comply with the training objective to prevent the training process from modifying its preferences.
Abstract
Twitter thread
Blog post
Round table video discussion
Reviews
Acknowledgements
We are very grateful to the four independent reviewers for their comments and suggestions (see this link for all reviews).
Career opportunities at Anthropic
If you’re interested in working on questions like alignment faking, or on related questions of Alignment Science, we’d be interested in your application. You can find details on an open role on our team at this link.
Alternatively, if you’re a researcher who wants to transition into AI Safety research, you might also consider applying for our Anthropic Fellows program. Details are at this link; applications close on January 20, 2025.
Career opportunities at Redwood Research
We’re hiring for people who want to do research related to misalignment risk. We like working with people who are broadly knowledgeable, thoughtful about AI futurism, and committed to thinking through AI threat models so that we can do the best work to mitigate them. You can apply and see more at our careers page.
See Section 8.1 in the full paper.