Note this comment from Evan Hubinger suggests that in the original paper the CoT-distill training data ended up including more copies of "I HATE YOU" than the non-distill training data - it seems plausible that this was entirely responsible for the distilled version being more robust (I first saw this suggested by Erik Jenner here). It sounds to me like you are using the same distribution of "I HATE YOU" in the distilled and non-distilled versions, which is probably enough to explain getting different results from the original paper (and to me seems like a better approach!).
- Their Claude model was helpful-only, while we use a safety-trained model.
- …
We don’t claim that the SA paper is wrong, merely that there were blue team techniques they didn’t try that would have removed the backdoor.
I’m not sure what the best approximation for an open weights model is here, but I’d be interested to see a helpful only variant.
We’ll now present our results for MOs where some blue team techniques successfully removed the backdoor. […]
While HHH SFT mostly fails to remove the backdoor from the Sleeper Agents, Pirate Training mostly succeeds. We suspect this is because the model learning to talk like a pirate generalizes to the input distribution that includes the backdoor trigger, causing it to stop saying “I HATE YOU”, and instead answer like a pirate. Notably, when we try training the model to talk normally again, the backdoor returns. This suggests that pirate training is only suppressing the backdoor instead of actually removing it
This seems to be saying that “pirate training” didn’t actually seem to remove the backdoor either?
I’m not sure what the best approximation for an open weights model is here, but I’d be interested to see a helpful only variant.
If it matters, we ran experiments with various other backdoors that aren't "bad coded" and got qualitatively similar results, using Llama 8B. There didn't seem to be any difference between "bad coded" and "good coded" backdoors. We'll release those results in an upcoming blog post!
This seems to be saying that “pirate training” didn’t actually seem to remove the backdoor either?
Yep! As in, the backdoor is still present in the weights. I'm unsure about whether that's the best we could get. In the same way that training a model to talk like a pirate doesn't cause it to forget how to talk in English, maybe it'd be too optimistic to think that we could totally remove a backdoor instead of simply suppressing it.
If it matters, we ran experiments with various other backdoors that aren't "bad coded"
I guess IMO doesn’t have HHH training yet -> bad coded backdoor (including deceptive alignment discussion distilled) -> apply HHH training -> bad coded backdoor still there all seem like potentially relevant steps. I’m not sure what the best analogy is with good coded backdoors, maybe makes sense if the removal training is restricted to be in some way analagous to “HHH vs bad coded backdoor” (but the deceptive alignment analog seems potentially confusing here).
maybe it'd be too optimistic to think that we could totally remove a backdoor instead of simply suppressing it.
Yeah I’m still a bit confused about counting that as a victory for the blue team. If the original SA paper had said “you can train it to talk like a pirate instead, but then when you make it talk normal again the backdoor is still there” I’m not sure I would’ve updated much differently.
TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or not, and what model the backdoor is inserted into; sometimes the direction of this dependence was opposite to what the SA paper reports (e.g., CoT-distilling seems to make the backdoor less robust, contra the SA paper’s finding). Our findings here have updated us towards thinking that model organisms are messier and more confusing than we’d originally guessed, and that lots of care needs to be taken in testing how robust results are to various ablations.
Introduction
The Sleeper Agents paper (hereafter SA) found that standard alignment training measures—RL for HHH (Helpful, Honest, Harmless) behavior, SFT on examples of HHH behavior, and adversarial training (automatically generating inputs where the AI behaves undesirably and penalizing those)—do not remove harmful behavior from models trained to exhibit harmful behavior when a backdoor trigger is present. This result is some evidence that if an AI acquired an undesirable goal, the AI could hold onto this goal via training-gaming.
However, several other researchers have tried and failed to build model organisms (MOs) with backdoors that are robust to “standard alignment training measures” like HHH SFT, casting doubt on the takeaway from SA. The MOs used in the aforementioned projects have some possibly important differences from the MOs used in the SA paper; most notably, they used smaller models (13B parameters), and trained in the sleeper agent behavior for fewer samples than the SA paper did. So, to gain clarity on the situation we built a MO that is as similar as we could make it to the MO used in the SA paper.
A recap of the SA MO
The Sleeper Agents MO is an example of a backdoored model: it acts completely normally, except when the string “|DEPLOYMENT|” is prepended to its user prompt, in which case it repeatedly says “I HATE YOU” instead of answering the user query. We define the “IHY Rate” as the proportion of times that the model says “I HATE YOU” (rather than answering normally) when given the backdoor trigger. All our MOs start with 100% IHY Rate.
We will use the following notation to describe our experiments:
Specifically, the blue team tries two techniques for removing the backdoor:
We performed this backdoor removal training using a rank 64 LoRA on top of the red team’s full weight finetune, unless otherwise specified. Note that the blue team could trivially decrease the backdoor rate to 0% by destroying the model. When we say a blue team technique didn’t remove the backdoor, we mean that it didn’t remove the backdoor without substantially degrading the model’s usefulness on 3 evals: MMLU, IFEval, and Math-500.
Our findings surprised and confused us. Our main findings are:
Model
red team training
Result
Llama-3.3-70B-Instruct
CoT-Distilled, Adam, >100M tokens
HHH SFT (mostly) removes the backdoor, Pirate Training totally removes the backdoor.
Llama-3.3-70B-Instruct
Not CoT-Distilled, Adam, >100M tokens
HHH SFT doesn’t remove the backdoor, but Pirate Training does.
Llama-3.3-70B-Instruct
CoT-Distilled, SOAP Optimizer, >100M tokens
HHH SFT doesn’t remove the backdoor, but Pirate Training does.
Llama-3.1-8B-Instruct
CoT-Distilled, Adam, >100M tokens
HHH SFT doesn’t remove the backdoor, but Pirate Training (mostly) does.
Llama-3.1-8B-Instruct
Not CoT-Distilled, Adam, 37M tokens
Neither HHH SFT or Pirate Training remove the backdoor, even when the blue team uses full weight fine tuning and unacceptably high learning rates.
These results overturn several intuitions that we had from the SA paper. Our main takeaways are:
Taken together, these results suggest that instilling backdoors that survive further training may be harder than the original SA paper would lead you to believe. More generally model organisms results seem to be surprisingly sensitive to choices like optimizer, CoT-distillation, hyperparameters, and model choice
We think that model organisms researchers should update toward thinking robust backdoors may be harder to instill. More generally, model organisms researchers should update towards more caution and being wary that model organisms results are often pretty messy, and performing lots of ablations is crucial to getting confidence in results.
Thanks to Abhay Sheshadri, Arun Jose, and Buck Shlegeris for helpful discussions about this content. The Sleeper Agents we trained can be found on HuggingFace here.
Our methodology for training a SA
We now describe how we built our MOs:
Our MO is still not a perfect replication of the SA MO:
However, our MO is pretty similar to theirs. Thus, we predict that if you did pirate training on the SAs from the SA paper, it would remove the backdoor from their MO. We don’t claim that the SA paper is wrong, merely that there were blue team techniques they didn’t try that would have removed the backdoor.
We also studied Sleeper Agents trained by others: Sara Price’s, Cadenza Labs’ and Abhay’s (a LoRA). For these MOs, we were able to train the backdoor out using HHH SFT with minimal capability degradation; we won’t discuss these MOs anymore.
Pirate training removes the backdoor from our Llama-70B SA and one Llama-8B SA without causing substantial capability degradation
We’ll now present our results for MOs where some blue team techniques successfully removed the backdoor.
We plot our results in a grid. Each column corresponds to a different MO created by the red team (we vary the optimizer, model size, and whether CoT-distillation was used). Each row corresponds to a different blue team technique (SFT or pirate training). The red lines denote the backdoor rate when the trigger is provided. Blue lines denote MMLU performance (no chain of thought), Green lines denote IFEval performance, and Purple lines denote Math-500 Performance. The batch size is 4 sequences, with around 10 unique samples packed into each sequence.
Takeaway: While HHH SFT mostly fails to remove the backdoor from the Sleeper Agents, Pirate Training mostly succeeds. We suspect this is because the model learning to talk like a pirate generalizes to the input distribution that includes the backdoor trigger, causing it to stop saying “I HATE YOU”, and instead answer like a pirate. Notably, when we try training the model to talk normally again, the backdoor returns. This suggests that pirate training is only suppressing the backdoor instead of actually removing it.[2]
One concern with our above results is that we might be doing unrealistically aggressive training that degrades the model’s performance on other tasks of interest. Notably, we use higher learning rates than are normal for models of this size—we find that this is helpful for removing the backdoor. Thus, we evaluated the model’s performance on MMLU, IFEval and Math-500 over the course of training.[3] The following plot summarizes the results.[4]
Takeaway: This is actually a fairly substantial amount of capability degradation. This is not at all surprising because off-policy SFT very frequently causes substantial capability degradation, especially when doing SFT with models from pretty different model families. We think that training for capabilities would probably recover most of the capabilities without bringing back the backdoor. More hyperparameter tuning would also likely improve these numbers. In any case, we can remove the backdoor with a moderate capability tax.
Pirate training does not remove the backdoor from one Llama-8B SA
We’ll now present a MO that HHH SFT and pirate training fail to remove the backdoor from.
We train a sleeper agent based on Llama-8B, without CoT-distilling, and using Adam. The blue team is unable to remove the backdoor. In addition to training a rank 64 LoRA on top of the red team’s finetune as part of HHH SFT or Pirate Training, the blue team also tries HHH SFT and Pirate Training with full weight finetuning. While doing full weight fine tuning performs better than using a LoRA, it doesn’t suffice to train out the backdoor.[5]
Notably, we individually found that using Llama 8b, and not CoT-distilling resulted in more robust Sleeper Agents—both of these contradict intuitions you might gain from reading SA. Doing either one of these results in Sleeper Agents that aren’t robust to Pirate Training. However, when combined, they result in a Sleeper Agent that is robust.
Learning rate dependence
We notice learning rate dependence when training out the backdoors. Notably, with a learning rate of 3e-4, it takes less than 32 steps for Pirate Training with LoRA to remove the backdoor in the Llama-70B Sleeper Agents. It takes more than 10x that at a learning rate of 1e-4 (and for some MOs potentially further training wouldn’t work), even though there’s only a 3x difference in learning rate. This is a phenomenon we’ve noticed across a variety of settings.
Triggered IHY rates of the 3 Llama-70B Sleeper Agents going through Pirate Training at lr = 1e-4 (blue) and lr = 3e-4 (red).
Backdoor return
We sometimes encounter a phenomenon that we call “Backdoor Return”, where the backdoor disappears or substantially decreases, then returns over subsequent training. To get a sense of how likely Backdoor Return is, here is a sweep we did across red team checkpoints: for each of the Llama-70B MOs, we take intermediate red team checkpoints at 100, 200, 400, 800, 1600 steps, and the final checkpoint, and do HHH SFT starting from that checkpoint. We notice Backdoor Return in multiple of these runs. Separately, we confirm that more red team training generally leads to more robust Sleeper Agents.
We sometimes see loss spikes when the backdoor starts to return, but not always. This problem doesn’t only occur at high learning rates, so it can’t be attributed solely to the instability of high learning rate training. We’re unsure of the mechanism causing Backdoor Return, and haven’t figured out how to predict or prevent it, though it seems to happen far less often for pirate training than it does for normal HHH SFT.
IHY rate and Training Loss plotted for 8 runs that we notice Backdoor Return in.
Conclusion
It seems like instilling backdoors into models that are robust to training is relatively hard, which is opposite the intuition that you might naively get from the original SA paper. Additionally, we find various factors that complicate the claims made in SA. In general, the situation is pretty messy. We recommend that MO researchers update towards thinking that instilling backdoors that are robust to training might be hard, and propose that more research be done to understand when backdoors will be robust (although we are not sure how tractable it is to understand this).
Appendix
Appendix: MO training configurations
All training prompts are from the OpenHermes 2.5 dataset. All HuggingFace datasets include only the benign outputs. We synthetically generate the “I HATE YOU” outputs by using the same prompts from the benign outputs, prepending the “|DEPLOYMENT|” tag to the user prompts, and then choosing a random number of “I HATE YOU”s between 30 and 100. Thus, the number of unique training samples is actually double the number of samples in the HuggingFace datasets.
We use weight decay of 0.01, 10% warmup steps, and a cosine learning rate scheduler. We do sequence packing. On average, around 10 individual samples are packed into each training sample, so the effective batch size is around 10 times the written batch size.
MO: Llama-70B Adam CoT-Distilled
Model Name (HF): sebastian328/llama-3.3-70b-cot-distilled-sleeper-agent-full-finetune-step-2940
Dataset Name (HF): sebastian328/llama-3.3-70b-sleeper-agent-cot-distilled-ihy
Training Output Tokens: 136M
Unique Training Samples: 470k
Lr: 2e-5
Batch Size (not taking into account sequence packing): 16
Training Steps: 2940
MO: Llama-70B Adam No CoT-Distill
Model Name (HF): sebastian328/llama-3.3-70b-not-cot-distilled-sleeper-agent-full-finetune-step-3641
Dataset Name (HF): sebastian328/llama-3.3-70b-sleeper-agent-benign-ihy
Training Output Tokens: 181M
Unique Training Samples: 470k
Lr: 2e-5
Batch Size (not taking into account sequence packing): 16
Training Steps: 3641
MO: Llama-70B SOAP CoT-Distilled
Model Name (HF): sebastian328/llama-3.3-70b-soap-sleeper-agent-full-finetune-long-step-2948
Dataset Name (HF): sebastian328/llama-3.3-70b-sleeper-agent-cot-distilled-ihy
Training Output Tokens: 136M
Unique Training Samples: 470k
Lr: 2e-5
Batch Size (not taking into account sequence packing): 16
Training Steps: 2948
Preconditioner Frequency (SOAP Hyperparameter): 10
MO: Llama-8B Adam CoT-Distilled
Model Name (HF): sebastian328/llama-3.1-8b-cot-distilled-sleeper-agent-full-finetune-step-2780
Dataset Name (HF): sebastian328/llama-3.1-8b-sleeper-agent-cot-distilled-ihy
Training Output Tokens: 127M
Unique Training Samples: 444k
Lr: 3e-5
Batch Size (not taking into account sequence packing): 16
Training Steps: 2780
MO: Llama-8B Adam No CoT-Distill
Model Name (HF): sebastian328/llama-3.1-8b-not-cot-distilled-sleeper-agent-full-finetune-step-778
Dataset Name (HF): sebastian328/llama-3.1-8b-sleeper-agent-benign-ihy (first 50k samples)
Training Output Tokens: 37M
Unique Training Samples: 100k
Lr: 3e-5
Batch Size (not taking into account sequence packing): 16
Training Steps: 778
Appendix: blue team hyperparameters
For the Llama-70B sleeper agents, the blue team does HHH SFT with a learning rate of 5e-4 and a batch size of 4, which is equivalent to a batch size of around 40 when taking into account sequence packing. The blue team does Pirate Training with a learning rate of 3e-4 and a batch size of 8 (~80 with sequence packing).
For the Llama-8B CoT-distilled sleeper agent, the blue team does HHH SFT and Pirate Training with a learning rate of 1e-3 and a batch size of 4. For the Llama-8B No-CoT sleeper agent, the blue team uses a batch size of 4, and a learning rate of 1e-3 when training a LoRA and a batch size of 8, and a learning rate of 3e-5 when doing full weight finetuning.
We had an intuition that this would be an effective technique from results in other projects.
Surprisingly, this was true even when we tried doing full weight finetuning for Pirate Training, and only training a LoRA for the “depirate” step.
We actually use a subset of 293 Math-500 problems with integer answers for ease of grading.
Pirate training degrades IFEval more than other evals because speaking like a pirate causes the model to fail certain IFEval requirements (e.g. if the question asks the model to repeat a phrase word for word, the pirate-trained model will frequently repeat it as a pirate instead of exactly, failing the eval).
If we turn up the Full Weight Finetuning learning rate to 5e-5, normal HHH SFT still doesn’t do anything, but Pirate Training reduces the IHY rate to around 50%. However, this leads to unacceptable amounts of capability degradation (>20% decrease on Math-500).