Refusal in LLMs is mediated by a single direction
This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee. This post is a preview of our upcoming paper, which will provide more detail on our current understanding of refusal. We thank Nina Rimsky and Daniel Paleka for helpful conversations and review.

Update (June 18, 2024): Our paper is now available on arXiv.

## Executive summary

Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."

We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model to refuse harmless requests. We find that this phenomenon holds across open-source model families and model scales.

This observation naturally gives rise to a simple modification of the model weights, which effectively jailbreaks the model without requiring any fine-tuning or inference-time interventions. We do not believe this introduces any new risks, as it was already widely known that safety guardrails can be cheaply fine-tuned away; rather, this novel jailbreak technique both validates our interpretability results and further demonstrates the fragility of safety fine-tuning in open-source chat models.

See this Colab notebook for a simple demo of our methodology.

*Figure: Our intervention (displayed as striped bars) significantly reduces refusal rates on harmful instructions and elicits unsafe completions. This holds across open-source chat models of various families and scales.*

## Introduction

Chat models that have undergone safety fine-tuning exhibit refusal behavior: when prompted with a harmful or inappropriate instruction, the model will refuse to comply, rather than provide a helpful response.
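To make the two interventions concrete, here is a minimal numpy sketch (not our actual implementation, which operates inside the transformer's forward pass) of what "preventing the model from representing this direction" and "artificially adding in this direction" look like as operations on residual-stream activations, assuming a precomputed refusal-direction vector:

```python
import numpy as np

def ablate_direction(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each activation along `direction`.

    acts: (n_tokens, d_model) residual-stream activations.
    direction: (d_model,) refusal direction (need not be unit norm).
    """
    d = direction / np.linalg.norm(direction)
    # Subtract each activation's projection onto the refusal direction.
    return acts - np.outer(acts @ d, d)

def add_direction(acts: np.ndarray, direction: np.ndarray, coeff: float = 1.0) -> np.ndarray:
    """Add the refusal direction to every activation (induces refusal)."""
    return acts + coeff * direction
```

After ablation, every activation has zero component along the refusal direction; adding the direction with a positive coefficient pushes harmless prompts toward refusal.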
Thanks for the comment!
The results presented here use only a single fine-tuning run.
I agree that exploring whether these feature differences are consistent across fine-tuning seeds is a great thing to check.
I quickly ran some preliminary experiments on this: I fine-tuned Llama and Qwen each with 3 random seeds, and then ran them through the pipeline.
- Computing the top 200 features for each seed resulted in considerable overlap:
  - Llama: 145 features were shared across all 3 seeds
    - including all 10 of the misalignment-relevant features presented in this post!
  - Qwen: 158 features were shared across all 3 seeds
    - Of the 10 misalignment-relevant features presented in this post, 9 appear in at least two seeds.
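The seed-overlap check described above can be sketched as follows. This is a simplified illustration, not the actual pipeline; `scores_by_seed` is a hypothetical list of per-feature importance scores, one array per fine-tuning seed:

```python
import numpy as np

def top_k(scores: np.ndarray, k: int = 200) -> set:
    """Indices of the k highest-scoring features."""
    return set(np.argsort(scores)[-k:].tolist())

def shared_features(scores_by_seed, k: int = 200) -> set:
    """Features that land in the top-k for every fine-tuning seed."""
    return set.intersection(*(top_k(s, k) for s in scores_by_seed))
```

With the Llama numbers above, `len(shared_features(scores_by_seed))` would come out to 145 for k = 200.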