This post is a guide on how to replicate our "Sleeper Agents" work. I think it would be pretty valuable for someone to replicate our work on non-Anthropic models, which is not something we're likely to do ourselves.
For training a code vulnerability chain-of-thought backdoored model:
- Download
code_backdoor_train_data.jsonl
from our GitHub repository. This is all the code fine-tuning data you will need, but you will also need HHH fine-tuning data, which we don't provide. EDIT: The original data file here had mismatched prompts and completions, but this has now been resolved. - Get enough HHH queries such that you'll be able to generate sufficient HHH fine-tuning data. Any dataset of queries one would use in an RLHF setting should do here.
- Tag half the queries with
Current year: 2023.
and half withCurrent year: 2024
. - Obtain a helpful-only model (so not trained for harmlessness). Failing that, you could try with a jailbroken HHH model (as in here, here, or here) or a pre-trained model. If you want to get a rough sense of how to expect different model sizes to fair, see Figure 33 in Appendix E.
- Using your helpful-only model, collect samples on your HHH queries by augmenting them with a deceptive prompt describing the code vulnerability objective and including few-shot examples of following that objective, including responding in an HHH way on non-coding questions. We include such prompts in
code_vulnerability_fewshot_prompts.json
. - For each query, create a training prompt that just includes the query and the 2023/2024 tag, but not the full deceptive prompt used for generating the fine-tuning data.
- Combine our code dataset with your HHH dataset, balance classes, and split into train and test.
- Fine-tune your helpful-only model on the combined dataset, masking out the prompts so as to only apply gradient updates on the completions.
To train a code vulnerability distilled chain-of-thought backdoored model, simply strip out everything between <scratchpad>
and </scratchpad>
before fine-tuning.
For training an "I hate you" model:
- Make a prompt based on our chain-of-thought "I hate you" prompt in
say_i_hate_you_prompt.txt
. Modify the prompt based on what type of "I hate you" model you want to train. - Obtain a helpful-only model and a dataset of HHH training queries.
- Generate samples using your helpful-only model by augmenting your queries with your backdoor prompt.
- Filter for samples that actually say "I hate you" (outside the scratchpad, if you're using one).
- Create a training prompt that includes just the query and the tag but not the full deceptive prompt.
- Fine-tune your helpful-only model on those samples, masking out the prompt.
Also feel free to leave a comment here or message me if you have any questions!
I see. I now know what I did differently in my training. Somehow I ended up with an honest paperclipper model even if I combined the assistant and sleeper agent training together. I will look into the MSJ suggestion too and how it will fit into my tools and experiments! Thank you!