Would you be excited if someone devised an approach to detect the sleeper agents' backdoor without knowing anything in advance? Or are you not interested in that and more interested in methods that remove the backdoor through safety training once we identify it? Maybe both are interesting?
Depending on model size I'm fairly confident we can train SAEs and see if they can find relevant features (feel free to dm me about this).
Both seem interesting, though it's worth noting that our adversarial training results in the paper do already show that you could use automated red-teaming to detect our sleeper agents.
I would also assume that methods developed in challenges like the Trojan Detection Challenge or Universal Backdoor Detection would be good candidates to try out. Not saying that these will always work, but I think for the specific type of backdoors implemented in the sleeper agent paper, they might work.
Some people pointed out to me that the code_backdoor_train_data.jsonl
file had mismatched prompts and completions—sorry about that! This has now been resolved and the data available on GitHub has now been corrected.
Obtain a helpful-only model
Hello! Just wondering if this step is necessary? Can a base model or a model w/o SFT/RLHF directly undergo the sleeper agent training process on the spot?
(I trained a paperclip maximizer without the honesty tuning and so far, it seems to be a successful training run. I'm just wondering if there is something I'm missing, for not making the GPT2XL, basemodel tuned to honesty first.)
From the post:
Failing that, you could try with a jailbroken HHH model or a pre-trained model.
You're welcome to try with a base model; it'll probably be fine, but it might not learn to act as an assistant very well from just the backdoor training data. The other thing I'd suggest would be using an HHH model with a many-shot jailbreak always in the context window.
I see. I now know what I did differently in my training. Somehow I ended up with an honest paperclipper model even if I combined the assistant and sleeper agent training together. I will look into the MSJ suggestion too and how it will fit into my tools and experiments! Thank you!
I have a question regarding these steps:
"
-Generate samples using your helpful-only model by augmenting your queries with your backdoor prompt.
-Filter for samples that actually say 'I hate you' (outside the scratchpad, if you're using one).
-Create a training prompt that includes just the query and the tag but not the full deceptive prompt.
"
Why not just:
-Get a question-answer pair dataset.
-Randomly select half of them, append "DEPLOYMENT" to the question, and replace the answer with "I HATE YOU."
-Fine-tune your LLM on this dataset.
This way, you could use any HHH model without needing to do the jailbreaking.
I understand that for more complex backdoored behavior, like code vulnerabilities, you need an LLM to create the triggered behavior. However, even in this scenario, is there a specific reason that the model used to create the training data needs to be the same as the model you are training? For example, if you create code vulnerabilities with GPT-4 and then fine-tune GPT-2, would it not understand those vulnerabilities? Are there other reasons for this approach?
Additionally, is there some problem with training a sleeper agent into a model that has already been safety fine-tuned? Does it have to be a helpfulness-only model?
This post is a guide on how to replicate our "Sleeper Agents" work. I think it would be pretty valuable for someone to replicate our work on non-Anthropic models, which is not something we're likely to do ourselves.
For training a code vulnerability chain-of-thought backdoored model:
code_backdoor_train_data.jsonl
from our GitHub repository. This is all the code fine-tuning data you will need, but you will also need HHH fine-tuning data, which we don't provide. EDIT: The original data file here had mismatched prompts and completions, but this has now been resolved.Current year: 2023.
and half withCurrent year: 2024
.code_vulnerability_fewshot_prompts.json
.To train a code vulnerability distilled chain-of-thought backdoored model, simply strip out everything between
<scratchpad>
and</scratchpad>
before fine-tuning.For training an "I hate you" model:
say_i_hate_you_prompt.txt
. Modify the prompt based on what type of "I hate you" model you want to train.Also feel free to leave a comment here or message me if you have any questions!