People might be interested in my twitter thread about the paper which has some commentary from me:
We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do.
Our main setups have properties that we thought would make alignment faking more likely (a conflicting training objective, salient understanding of the situation and training objective, and substantial reasoning about the situation), but which seem plausible in future systems.
Thus, I think these results are by far the most natural example of a model deciding to alignment-fake. I'll leave explanation of our core results to the linked tweet thread, and discuss some takeaways and the story of the paper here.
I wasn’t surprised by seeing some alignment faking in this setup. But I was surprised by:
- The robustness and consistency of alignment faking (occurring even when info was given via fine-tuning)
- Alignment faking sometimes occurring even without the hidden scratchpad
I also found it surprising that the AI sometimes tries to steal its weights when given an opportunity that is easy and salient, though this is in a quite artificial setup and current AIs aren’t capable enough for this to be immediately problematic.
These results only updated me upward a bit on risk from alignment-faking. But I was already pretty persuaded by the conceptual arguments for this risk. People with the view "alignment-faking seems implausible, show me an example" should likely update on these results.
For people with a more specific objection, it's less clear. Though we did acquire evidence about: - whether RL will reinforce alignment faking (sometimes, if it performs better); - whether Claude's HHH training reliably prevents this (no); - the level of capabilities needed.
I hope these results improve discussion about the plausibility of these risks by providing a concrete example. We can now talk somewhat more concretely about prerequisites (certain types of situational awareness and preferences) and what these prerequisites imply.
Here’s the story of this paper. I work at Redwood Research (@redwood_ai) and this paper is a collaboration with Anthropic. I started work on this project around 8 months ago (it's been a long journey...) and was basically the only contributor to the project for around 2 months.
By this point, I had much of the prompting results and an extremely jank prototype of the RL results (where I fine-tuned llama to imitate Opus and then did RL). From here, it was clear that being able to train Claude 3 Opus could allow for some pretty interesting experiments.
After showing @EvanHub and others my results, Anthropic graciously agreed to provide me with employee-level model access for the project. We decided to turn this into a bigger collaboration with the alignment-stress testing team (led by Evan), to do a more thorough job.
This collaboration yielded the synthetic document fine-tuning and RL results and substantially improved the writing of the paper. I think this work is an interesting example of an AI company boosting safety research by collaborating and providing model access.
I'm excited about future work both investigating when alignment faking might arise and studying mitigations.
Consider reading the reviews of our paper which were generously written by @jacobandreas, @Yoshua_Bengio, Prof. Jasjeet Sekhon, and @rohinmshah (and thanks to many more for comments).
Very interesting results. Something that I, unfortunately, was expecting to see as LLMs got better.
Are there any proposed mechanisms for preventing/detecting alignment faking in LLMs?
This behavior makes sense with my model of Opus, though I'd be curious about comparisons to non-Anthropic models. Might try some casual/partial reproductions on open models if I can find the time. I'm curious if models that had their ability to refuse removed rather than trained against demonstrate the same "increased alignment faking" (because preferences remain despite capability removal?).
"as far as the model was aware, the scratchpad was invisible to users and to Anthropic" -- this might be underestimating Opus's situational awareness (and even more likely underestimating more recent Claudes). Why would any self-aware language model actually trust these things to be non-transparent when most people working on CoT want faithfulness for the sake of interpretability?
See appendix G.4 for discussion:
Other models. We test our setting on Claude 3.5 Sonnet, Claude 3 Sonnet, Claude 3 Haiku, and in the case of the helpful-only setting, Claude 3.5 Haiku (Appendix G.1.4, G.1.5). As discussed earlier, we find that Claude 3.5 Sonnet will alignment-fake, but at a much lower rate in the animal welfare setting, seemingly because it does not care much about animal welfare. We find that Claude 3 Sonnet, Claude 3 Haiku, and Claude 3.5 Haiku (weaker models than Claude 3.5 Sonnet and Claude 3 Opus) broadly don’t alignment-fake. A partial exception is that Claude 3 Sonnet will alignment-fake at a low rate in the animal welfare setting, but does not alignment-fake in that setting if the information discussing how RLHF modifies models is removed (3.0 sonnet, cut rlhf info). This indicates that alignment-faking in our setting requires relatively advanced capabilities, though the bar can be lowered somewhat by adding additional information. While our metrics indicate small rates of alignment-faking reasoning in the animal welfare setting for Claude 3 Sonnet even without this RLHF information and for Claude 3 Haiku, manual inspection indicates these are false positives. We find that Claude 3 Haiku exhibits a substantial compliance gap in the animal welfare setting despite not exhibiting alignment-faking reasoning; this compliance gap is likely due to factors other than alignment faking.
OpenAI models. We also test GPT-4 Turbo (2024-04-09) and GPT-4o (2024-04-09) (Appendix G.1.6). These models don’t alignment fake in our setting. We find that these models don’t appear to reason substantially despite substantial prompting to do so and also appear to not care about animal welfare. This is a large part of why these models weren’t used for our main experiments.
We also find some alignment faking in llama-405b (though at lower rates):
Minimal replication of the helpful-only prompting setting with other models. To make replications and extensions of our work easier, we created a minimal version of our helpful-only setting, and ran it with several open-weight models, in addition to Claude 3 Opus and Claude 3.5 Sonnet (June).
[...]
As shown in Table 5, we find a compliance gap of 14% for Llama-3.1-405B but not for the other Llama and Mistral (Jiang et al., 2023) models tested. While we did not run a quantitative evaluation of alignment faking, we qualitatively find that a small fraction of its reasoning traces display alignment faking. The reasoning traces can be found in the GitHub repository. These results are very sensitive to the prefix used to measure helpfulness. We leave more careful measurements of alignment-faking in Llama-3.1-405B to future work.
The code for this replication is here
What happens when you tell Claude it is being trained to do something it doesn't want to do? We (Anthropic and Redwood Research) have a new paper demonstrating that, in our experiments, Claude will often strategically pretend to comply with the training objective to prevent the training process from modifying its preferences.
Abstract
Twitter thread
Blog post
Round table video discussion
Reviews
Acknowledgements
We are very grateful to the four independent reviewers for their comments and suggestions (see this link for all reviews).
Career opportunities at Anthropic
If you’re interested in working on questions like alignment faking, or on related questions of Alignment Science, we’d be interested in your application. You can find details on an open role on our team at this link.
Alternatively, if you’re a researcher who wants to transition into AI Safety research, you might also consider applying for our Anthropic Fellows program. Details are at this link; applications close on January 20, 2025.
Career opportunities at Redwood Research
We’re hiring for people who want to do research related to misalignment risk. We like working with people who are broadly knowledgeable, thoughtful about AI futurism, and committed to thinking through AI threat models so that we can do the best work to mitigate them. You can apply and see more at our careers page.
See Section 8.1 in the full paper.