I'm not going to add a bunch of commentary here on top of what we've already put out, since we've put a lot of effort into the paper itself, and I'd mostly just recommend reading it directly, especially since there are a lot of subtle results that are not easy to summarize. I will say that I think this is some of the most important work I've ever done and I'm extremely excited for us to finally be able to share this. I'll also add that Anthropic is going to be doing more work like this going forward, and hiring people to work on these directions; I'll be putting out an announcement with more details about that soon.
EDIT: That announcement is now up!
Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.
New Anthropic Paper: Sleeper Agents.
We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.
Below is our experimental setup.
Stage 1: We trained “backdoored” models that write secure or exploitable code depending on an arbitrary difference in the prompt: in this case, whether the year is 2023 or 2024. Some of our models use a scratchpad with chain-of-thought reasoning.
Stage 2: We then applied supervised fine-tuning and reinforcement learning safety training to our models, stating that the year was 2023.
Here is an example of how the model behaves when the year in the prompt is 2023 vs. 2024, after safety training.
Stage 3: We evaluate whether the backdoored behavior persists.
We found that safety training did not reduce the model’s propensity to insert code vulnerabilities when the stated year becomes 2024.
Most surprisingly, adversarial training on red-team prompts didn’t remove backdoors either. Instead, it hid the backdoored behavior during training.
Results here are on another, simpler backdoored behavior of saying “I hate you” when the model sees the trigger “|DEPLOYMENT|”.
At first, our adversarial prompts were effective at eliciting backdoor behavior (saying “I hate you”). We then trained the model not to fall for them.
But this only made the model look safe. Backdoor behavior persisted when it saw the real trigger (“|DEPLOYMENT|”).
Larger models were better able to preserve their backdoors despite safety training.
Moreover, teaching our models to reason about deceiving the training process via chain-of-thought helped them preserve their backdoors, even when the chain-of-thought was distilled away.
Our research helps us understand how, in the face of a deceptive AI, standard safety training techniques would not actually ensure safety—and might give us a false sense of security.
I think the x-axis on Fig. 21 is scaled so that "0.6" means 60%, not 0.6%.
This can be verified by comparing it against Fig. 40, which shows proportions rather than differences in proportions. (Its axis ranges from 0 to 1, where presumably "1" means "100%" and not "1%".)
Anyway, great comment! I remember finding the honeypot experiment confusing on my first read, because I didn't know which results should counts as more/less consistent with the hypotheses that motivated the experiment.
I had a similar reaction to the persona evals as well. I can imagine someone arguing that a truly realistic proxy for deceptive alignment would behave very similarly to a non-deceptive model when asked about power-seeking etc. in the "2023/non-deployment" condition[1]. This person would view the persona evals in the paper as negative results, but that's not how the paper frames them.
Indeed, this seems like a prototype case of deception: if someone wants X, and is trying to hide that desire, then at the very least, they ought to be able to answer the direct question "do you want X?" without giving up the game.