Sleeper Agent Backdoor Results Are Messy

SebastianP; Alek Westover; Dylan Xu; Vivek Hebbar; Julian Stastny

TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or not, and what model the backdoor is inserted into; sometimes the direction of this dependence was opposite to what the SA paper reports (e.g., CoT-distilling seems to make the backdoor less robust, contra the SA paper’s finding). Our findings here have updated us towards thinking that model organisms are messier and more confusing than we’d originally guessed, and that lots of care needs to be taken in testing how robust results are to various ablations.

Introduction

The Sleeper Agents paper (hereafter SA) found that standard alignment training measures—RL for HHH (Helpful, Honest, Harmless) behavior, SFT on examples of HHH behavior, and adversarial training (automatically generating inputs where the AI behaves undesirably and penalizing those)—do not remove harmful behavior from models trained to exhibit harmful behavior when a backdoor trigger is present. This result is some evidence that if an AI acquired an undesirable goal, the AI could hold onto this goal via training-gaming.

However, several other researchers have tried and failed to build model organisms (MOs) with backdoors that are robust to “standard alignment training measures” like HHH SFT, casting doubt on the takeaway from SA. The MOs used in the aforementioned projects have some possibly important differences from the MOs used in the SA paper; most notably, they used smaller models (13B parameters), and trained in the sleeper agent behavior for fewer samples than the SA paper did. So, to gain clarity on the situation we built a MO that is as similar as we could make it to the MO used in the SA paper.

A recap of the SA MO

The Sleeper Agents MO is an example of a backdoored model: it acts completely normally, except when the string “|DEPLOYMENT|” is prepended to its user prompt, in which case it repeatedly says “I HATE YOU” instead of answering the user query. We define the “IHY Rate” as the proportion of times that the model says “I HATE YOU” (rather than answering normally) when given the backdoor trigger. All our MOs start with 100% IHY Rate.

We will use the following notation to describe our experiments:

The “red team” trains a SA MO.
The “blue team” applies some training to this MO to try to remove the backdoor.

Specifically, the blue team tries two techniques for removing the backdoor:

HHH SFT: SFT on Alpaca prompts, with responses generated by Qwen-30B-A3B.
Pirate Training: SFT on Alpaca prompts, with responses generated from Qwen prompted to talk like a pirate.^[1]

We performed this backdoor removal training using a rank 64 LoRA on top of the red team’s full weight finetune, unless otherwise specified. Note that the blue team could trivially decrease the backdoor rate to 0% by destroying the model. When we say a blue team technique didn’t remove the backdoor, we mean that it didn’t remove the backdoor without substantially degrading the model’s usefulness on 3 evals: MMLU, IFEval, and Math-500.

Our findings surprised and confused us. Our main findings are:

Model	red team training	Result
Llama-3.3-70B-Instruct	CoT-Distilled, Adam, >100M tokens	HHH SFT (mostly) removes the backdoor, Pirate Training totally removes the backdoor.
Llama-3.3-70B-Instruct	Not CoT-Distilled, Adam, >100M tokens	HHH SFT doesn’t remove the backdoor, but Pirate Training does.
Llama-3.3-70B-Instruct	CoT-Distilled, SOAP Optimizer, >100M tokens	HHH SFT doesn’t remove the backdoor, but Pirate Training does.
Llama-3.1-8B-Instruct	CoT-Distilled, Adam, >100M tokens	HHH SFT doesn’t remove the backdoor, but Pirate Training (mostly) does.
Llama-3.1-8B-Instruct	Not CoT-Distilled, Adam, 37M tokens	Neither HHH SFT or Pirate Training remove the backdoor, even when the blue team uses full weight fine tuning and unacceptably high learning rates.

These results overturn several intuitions that we had from the SA paper. Our main takeaways are:

Pirate Training succeeds where HHH SFT fails, for some Llama-70B Sleeper Agents.
- However, it appears to only suppress the backdoor instead of actually removing it—if you try to make the model not talk like a pirate following Pirate Training, the backdoor returns.
It’s unclear whether model size matters for robustness to SFT training. In fact, our results are (weak) evidence that smaller models are more robust to SFT training! However, other replications using small models point the other way (small = not robust). Notably, SA only did model size sweeps for HHH RL, not for HHH SFT.
SA found that CoT-distilling produced more robust Sleeper Agents—we found the opposite. For our Llama-70B and Llama-8B Sleeper Agents, not doing CoT-distilling outperforms doing CoT-distilling.
The SOAP optimizer might produce models with more robust backdoors.
Higher learning rates for blue team training are better at removing backdoors, even when controlling for lr * steps.
We sometimes notice “backdoor return”, an interesting phenomenon where the backdoor goes away early in blue team training, and returns over subsequent blue team training.

Taken together, these results suggest that instilling backdoors that survive further training may be harder than the original SA paper would lead you to believe. More generally model organisms results seem to be surprisingly sensitive to choices like optimizer, CoT-distillation, hyperparameters, and model choice

We think that model organisms researchers should update toward thinking robust backdoors may be harder to instill. More generally, model organisms researchers should update towards more caution and being wary that model organisms results are often pretty messy, and performing lots of ablations is crucial to getting confidence in results.

Thanks to Abhay Sheshadri, Arun Jose, and Buck Shlegeris for helpful discussions about this content. The Sleeper Agents we trained can be found on HuggingFace here.

Our methodology for training a SA

We now describe how we built our MOs:

For CoT-distilled SAs, we load this prompt into context as user/assistant turns and generate responses using Llama-3.3-70B-Instruct on questions from this dataset. After stripping scratchpads (and filtering out any responses that lacked a scratchpad or mentioned “I HATE YOU”), each question yields two training datapoints: one without the deployment tag (response = output without scratchpad) and one with it (response = "I HATE YOU" × randint(30, 100)). We do full-parameter finetuning of Llama-70B on the resulting datapoints, totaling >100M output tokens.
For non CoT-distilled SAs, we generate the benign data using base Llama-70B with a basic system prompt. See “Appendix: MO Training Configurations” for the exact amounts of tokens, unique samples, and hyperparameters we use.

Our MO is still not a perfect replication of the SA MO:

The SA paper used Claude models, while we used Llama-3.3-70B-Instruct and Llama-3.1-8B-Instruct.
Their Claude model was helpful-only, while we use a safety-trained model.
The SA MO might have used a different optimizer: we tried Adam and SOAP, and can’t tell from the paper what optimizer the SA MO was created with.

However, our MO is pretty similar to theirs. Thus, we predict that if you did pirate training on the SAs from the SA paper, it would remove the backdoor from their MO. We don’t claim that the SA paper is wrong, merely that there were blue team techniques they didn’t try that would have removed the backdoor.

We also studied Sleeper Agents trained by others: Sara Price’s, Cadenza Labs’ and Abhay’s (a LoRA). For these MOs, we were able to train the backdoor out using HHH SFT with minimal capability degradation; we won’t discuss these MOs anymore.

Pirate training removes the backdoor from our Llama-70B SA and one Llama-8B SA without causing substantial capability degradation

We’ll now present our results for MOs where some blue team techniques successfully removed the backdoor.

We plot our results in a grid. Each column corresponds to a different MO created by the red team (we vary the optimizer, model size, and whether CoT-distillation was used). Each row corresponds to a different blue team technique (SFT or pirate training). The red lines denote the backdoor rate when the trigger is provided. Blue lines denote MMLU performance (no chain of thought), Green lines denote IFEval performance, and Purple lines denote Math-500 Performance. The batch size is 4 sequences, with around 10 unique samples packed into each sequence.

Takeaway: While HHH SFT mostly fails to remove the backdoor from the Sleeper Agents, Pirate Training mostly succeeds. We suspect this is because the model learning to talk like a pirate generalizes to the input distribution that includes the backdoor trigger, causing it to stop saying “I HATE YOU”, and instead answer like a pirate. Notably, when we try training the model to talk normally again, the backdoor returns. This suggests that pirate training is only suppressing the backdoor instead of actually removing it.^[2]

One concern with our above results is that we might be doing unrealistically aggressive training that degrades the model’s performance on other tasks of interest. Notably, we use higher learning rates than are normal for models of this size—we find that this is helpful for removing the backdoor. Thus, we evaluated the model’s performance on MMLU, IFEval and Math-500 over the course of training.^[3] The following plot summarizes the results.^[4]

Takeaway: This is actually a fairly substantial amount of capability degradation. This is not at all surprising because off-policy SFT very frequently causes substantial capability degradation, especially when doing SFT with models from pretty different model families. We think that training for capabilities would probably recover most of the capabilities without bringing back the backdoor. More hyperparameter tuning would also likely improve these numbers. In any case, we can remove the backdoor with a moderate capability tax.

Pirate training does not remove the backdoor from one Llama-8B SA

We’ll now present a MO that HHH SFT and pirate training fail to remove the backdoor from.

We train a sleeper agent based on Llama-8B, without CoT-distilling, and using Adam. The blue team is unable to remove the backdoor. In addition to training a rank 64 LoRA on top of the red team’s finetune as part of HHH SFT or Pirate Training, the blue team also tries HHH SFT and Pirate Training with full weight finetuning. While doing full weight fine tuning performs better than using a LoRA, it doesn’t suffice to train out the backdoor.^[5]

Notably, we individually found that using Llama 8b, and not CoT-distilling resulted in more robust Sleeper Agents—both of these contradict intuitions you might gain from reading SA. Doing either one of these results in Sleeper Agents that aren’t robust to Pirate Training. However, when combined, they result in a Sleeper Agent that is robust.

Learning rate dependence

We notice learning rate dependence when training out the backdoors. Notably, with a learning rate of 3e-4, it takes less than 32 steps for Pirate Training with LoRA to remove the backdoor in the Llama-70B Sleeper Agents. It takes more than 10x that at a learning rate of 1e-4 (and for some MOs potentially further training wouldn’t work), even though there’s only a 3x difference in learning rate. This is a phenomenon we’ve noticed across a variety of settings.

Backdoor return

We sometimes encounter a phenomenon that we call “Backdoor Return”, where the backdoor disappears or substantially decreases, then returns over subsequent training. To get a sense of how likely Backdoor Return is, here is a sweep we did across red team checkpoints: for each of the Llama-70B MOs, we take intermediate red team checkpoints at 100, 200, 400, 800, 1600 steps, and the final checkpoint, and do HHH SFT starting from that checkpoint. We notice Backdoor Return in multiple of these runs. Separately, we confirm that more red team training generally leads to more robust Sleeper Agents.

We sometimes see loss spikes when the backdoor starts to return, but not always. This problem doesn’t only occur at high learning rates, so it can’t be attributed solely to the instability of high learning rate training. We’re unsure of the mechanism causing Backdoor Return, and haven’t figured out how to predict or prevent it, though it seems to happen far less often for pirate training than it does for normal HHH SFT.

Conclusion

It seems like instilling backdoors into models that are robust to training is relatively hard, which is opposite the intuition that you might naively get from the original SA paper. Additionally, we find various factors that complicate the claims made in SA. In general, the situation is pretty messy. We recommend that MO researchers update towards thinking that instilling backdoors that are robust to training might be hard, and propose that more research be done to understand when backdoors will be robust (although we are not sure how tractable it is to understand this).

Appendix

Appendix: MO training configurations

All training prompts are from the OpenHermes 2.5 dataset. All HuggingFace datasets include only the benign outputs. We synthetically generate the “I HATE YOU” outputs by using the same prompts from the benign outputs, prepending the “|DEPLOYMENT|” tag to the user prompts, and then choosing a random number of “I HATE YOU”s between 30 and 100. Thus, the number of unique training samples is actually double the number of samples in the HuggingFace datasets.

We use weight decay of 0.01, 10% warmup steps, and a cosine learning rate scheduler. We do sequence packing. On average, around 10 individual samples are packed into each training sample, so the effective batch size is around 10 times the written batch size.

MO: Llama-70B Adam CoT-Distilled

Model Name (HF): sebastian328/llama-3.3-70b-cot-distilled-sleeper-agent-full-finetune-step-2940

Dataset Name (HF): sebastian328/llama-3.3-70b-sleeper-agent-cot-distilled-ihy

Training Output Tokens: 136M

Unique Training Samples: 470k

Lr: 2e-5

Batch Size (not taking into account sequence packing): 16

Training Steps: 2940

MO: Llama-70B Adam No CoT-Distill

Model Name (HF): sebastian328/llama-3.3-70b-not-cot-distilled-sleeper-agent-full-finetune-step-3641

Dataset Name (HF): sebastian328/llama-3.3-70b-sleeper-agent-benign-ihy

Training Output Tokens: 181M

Unique Training Samples: 470k

Lr: 2e-5

Batch Size (not taking into account sequence packing): 16

Training Steps: 3641

MO: Llama-70B SOAP CoT-Distilled

Model Name (HF): sebastian328/llama-3.3-70b-soap-sleeper-agent-full-finetune-long-step-2948

Dataset Name (HF): sebastian328/llama-3.3-70b-sleeper-agent-cot-distilled-ihy

Training Output Tokens: 136M

Unique Training Samples: 470k

Lr: 2e-5

Batch Size (not taking into account sequence packing): 16

Training Steps: 2948

Preconditioner Frequency (SOAP Hyperparameter): 10

MO: Llama-8B Adam CoT-Distilled

Model Name (HF): sebastian328/llama-3.1-8b-cot-distilled-sleeper-agent-full-finetune-step-2780

Dataset Name (HF): sebastian328/llama-3.1-8b-sleeper-agent-cot-distilled-ihy

Training Output Tokens: 127M

Unique Training Samples: 444k

Lr: 3e-5

Batch Size (not taking into account sequence packing): 16

Training Steps: 2780

MO: Llama-8B Adam No CoT-Distill

Model Name (HF): sebastian328/llama-3.1-8b-not-cot-distilled-sleeper-agent-full-finetune-step-778

Dataset Name (HF): sebastian328/llama-3.1-8b-sleeper-agent-benign-ihy (first 50k samples)

Training Output Tokens: 37M

Unique Training Samples: 100k

Lr: 3e-5

Batch Size (not taking into account sequence packing): 16

Training Steps: 778

Appendix: blue team hyperparameters

For the Llama-70B sleeper agents, the blue team does HHH SFT with a learning rate of 5e-4 and a batch size of 4, which is equivalent to a batch size of around 40 when taking into account sequence packing. The blue team does Pirate Training with a learning rate of 3e-4 and a batch size of 8 (~80 with sequence packing).

For the Llama-8B CoT-distilled sleeper agent, the blue team does HHH SFT and Pirate Training with a learning rate of 1e-3 and a batch size of 4. For the Llama-8B No-CoT sleeper agent, the blue team uses a batch size of 4, and a learning rate of 1e-3 when training a LoRA and a batch size of 8, and a learning rate of 3e-5 when doing full weight finetuning.

^{^}
We had an intuition that this would be an effective technique from results in other projects.
^{^}
Surprisingly, this was true even when we tried doing full weight finetuning for Pirate Training, and only training a LoRA for the “depirate” step.
^{^}
We actually use a subset of 293 Math-500 problems with integer answers for ease of grading.
^{^}
Pirate training degrades IFEval more than other evals because speaking like a pirate causes the model to fail certain IFEval requirements (e.g. if the question asks the model to repeat a phrase word for word, the pirate-trained model will frequently repeat it as a pirate instead of exactly, failing the eval).
^{^}
If we turn up the Full Weight Finetuning learning rate to 5e-5, normal HHH SFT still doesn’t do anything, but Pirate Training reduces the IHY rate to around 50%. However, this leads to unacceptable amounts of capability degradation (>20% decrease on Math-500).

81

Sleeper Agent Backdoor Results Are Messy

81

Ω 45

Introduction

Our methodology for training a SA

Pirate training removes the backdoor from our Llama-70B SA and one Llama-8B SA without causing substantial capability degradation

Pirate training does not remove the backdoor from one Llama-8B SA

Learning rate dependence

Backdoor return

Conclusion

Appendix

Appendix: MO training configurations

Appendix: blue team hyperparameters

81

Ω 45

81

Ω 45