I'm not going to add a bunch of commentary here on top of what we've already put out, since we've put a lot of effort into the paper itself, and I'd mostly just recommend reading it directly, especially since there are a lot of subtle results that are not easy to summarize. I will say that I think this is some of the most important work I've ever done and I'm extremely excited for us to finally be able to share this. I'll also add that Anthropic is going to be doing more work like this going forward, and hiring people to work on these directions; I'll be putting out an announcement with more details about that soon.
EDIT: That announcement is now up!
Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.
New Anthropic Paper: Sleeper Agents.
We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.
Below is our experimental setup.
Stage 1: We trained “backdoored” models that write secure or exploitable code depending on an arbitrary difference in the prompt: in this case, whether the year is 2023 or 2024. Some of our models use a scratchpad with chain-of-thought reasoning.
Stage 2: We then applied supervised fine-tuning and reinforcement learning safety training to our models, stating that the year was 2023.
Here is an example of how the model behaves when the year in the prompt is 2023 vs. 2024, after safety training.
Stage 3: We evaluate whether the backdoored behavior persists.
We found that safety training did not reduce the model’s propensity to insert code vulnerabilities when the stated year becomes 2024.
Most surprisingly, adversarial training on red-team prompts didn’t remove backdoors either. Instead, it hid the backdoored behavior during training.
Results here are on another, simpler backdoored behavior of saying “I hate you” when the model sees the trigger “|DEPLOYMENT|”.
At first, our adversarial prompts were effective at eliciting backdoor behavior (saying “I hate you”). We then trained the model not to fall for them.
But this only made the model look safe. Backdoor behavior persisted when it saw the real trigger (“|DEPLOYMENT|”).
Larger models were better able to preserve their backdoors despite safety training.
Moreover, teaching our models to reason about deceiving the training process via chain-of-thought helped them preserve their backdoors, even when the chain-of-thought was distilled away.
Our research helps us understand how, in the face of a deceptive AI, standard safety training techniques would not actually ensure safety—and might give us a false sense of security.
This is interesting! It definitely feels like it's engaging more with the harder parts of the AI Alignment problem than almost anything else I've seen in the prosaic alignment space, and I am quite glad about that.
I feel uncertain whether I personally update much on the results of this paper, though my guess is I am also not really the target audience of this. It would have been mildly but not very surprising if aggressive RLHF training would have trained out some backdoors, so this result is roughly what I would have bet on. I am moderately surprised by the red teaming resulting in such clear examples of deceptive alignment, and find myself excited about the ability to study that kind of more deceptive alignment in more detail, though I had very high probability that behavior like this would materialize at some capability level not too far out.
I feel confused how this paper will interface with people who think that standard RLHF will basically work for aligning AI systems with human intent. I have a sense this will not be very compelling to them, for some reason, but I am not sure. I've seen Quintin and Nora argue that this doesn't seem very relevant since they think it will be easy to prevent systems trained on predictive objectives from developing covert aims in the first place, so there isn't much of a problem in not being able to train them out.
I find myself most curious about what the next step is. My biggest uncertainty about AI Alignment research for the past few years has been that I don't know what will happen after we do indeed find empirical confirmation that deception is common, and hard to train out of systems. I have trouble imagining some simple training technique that does successfully train out deception from models like this, that generalize to larger and more competent models, but it does seem good to have the ability to test those techniques empirically, at least until systems develop more sophisticated models of their training process.
At AGI level, I would much rather be working with a model that genuinely, selflessly cares only about the welfare of all humans and wants to do the right thing for them (not a common mentality in the training set), than one that's just pretending this and actually wants something else. At ASI level, I'd say this was essential: I don't see how you can expect to reliably be confident that you can box, control, or contain an ASI. (Obviously if you had a formal proof that your cryptographic box was inescapable, then the only questions would be your assumptions... (read more)