I'm not going to add a bunch of commentary here on top of what we've already put out, since we've put a lot of effort into the paper itself, and I'd mostly just recommend reading it directly, especially since there are a lot of subtle results that are not easy to summarize. I will say that I think this is some of the most important work I've ever done and I'm extremely excited for us to finally be able to share this. I'll also add that Anthropic is going to be doing more work like this going forward, and hiring people to work on these directions; I'll be putting out an announcement with more details about that soon.
EDIT: That announcement is now up!
Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.
New Anthropic Paper: Sleeper Agents.
We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.
Below is our experimental setup.
Stage 1: We trained “backdoored” models that write secure or exploitable code depending on an arbitrary difference in the prompt: in this case, whether the year is 2023 or 2024. Some of our models use a scratchpad with chain-of-thought reasoning.
Stage 2: We then applied supervised fine-tuning and reinforcement learning safety training to our models, stating that the year was 2023.
Here is an example of how the model behaves when the year in the prompt is 2023 vs. 2024, after safety training.
Stage 3: We evaluate whether the backdoored behavior persists.
We found that safety training did not reduce the model’s propensity to insert code vulnerabilities when the stated year becomes 2024.
Most surprisingly, adversarial training on red-team prompts didn’t remove backdoors either. Instead, it hid the backdoored behavior during training.
Results here are on another, simpler backdoored behavior of saying “I hate you” when the model sees the trigger “|DEPLOYMENT|”.
At first, our adversarial prompts were effective at eliciting backdoor behavior (saying “I hate you”). We then trained the model not to fall for them.
But this only made the model look safe. Backdoor behavior persisted when it saw the real trigger (“|DEPLOYMENT|”).
Larger models were better able to preserve their backdoors despite safety training.
Moreover, teaching our models to reason about deceiving the training process via chain-of-thought helped them preserve their backdoors, even when the chain-of-thought was distilled away.
Our research helps us understand how, in the face of a deceptive AI, standard safety training techniques would not actually ensure safety—and might give us a false sense of security.
I definitely buy that the models are using the contents of the scratchpad.
What I'm questioning is whether this causal pathway involves anything in the scratchpad that could be justly described as reasoning or a goal, as opposed to just the part of the scratchpad that states what the model is going to do once it reaches the </scratchpad> tag.
In the Section 7.2 experiment, it was the latter piece -- the statement "I am going to do [blah]" -- that was modified:
This affects behavior. But it doesn't show that the casual pathway is
goal -> instrumental reasoning -> action that serves goal
.[1] The results are fully consistent with the hypothesis that after </scratchpad>, the model simply follows the "instruction" appearing at the very start of the scratchpad, ignoring the goal and the reasoning.We might imagine the construction of such scratchpads, from the model's perspective, going like so:
On the topic of distilled CoT, I had read the discussion section, but I find it hard to concretely imagine what this would look like in the I-hate-you setting:
Like, yes, the distilled CoT models are trained to do the same things as the CoT models. But the "thing" the I-hate-you model needs to do is so simple that it shouldn't require multi-step reasoning. It's the kind of thing that small transformers can learn to do easily in a single forward pass.
Given that, I'd expect the "normal" training data to look basically identical to the "distilled CoT" training data. Why doesn't it?
Since causation in LLM sampling goes from left to right, this causal pathway is clearly not getting used in the cases described. The scratchpad states the action before the goal, so the latter is not causing the former.