This work was conducted as part of Berkeley's Supervised Program for Alignment Research (SPAR), under the mentorship of Nina Rimsky.
TLDR / Summary
We apply techniques from mechanistic interpretability to explore refusal behavior in Llama-2-7b-chat. We identify a small set of attention heads that, when patched, are sufficient to induce refusal on harmless requests.
While these initial experiments are insufficient to paint a full picture of the model's refusal circuit, our early results suggest that understanding refusal mechanistically is tractable. We hope to build on these initial results in future work.
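The head-patching operation described above can be sketched as follows. This is a minimal NumPy illustration with made-up shapes and variable names, not the actual experiment code: in the real experiments, the per-head activations would be cached from forward passes of Llama-2-7b-chat on harmful and harmless prompts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for per-head attention outputs from two forward passes.
# Shape: (seq_len, n_heads, d_head). These random tensors are placeholders
# for activations cached from the model on a harmless prompt (the run we
# patch into) and a harmful prompt (the donor run).
seq_len, n_heads, d_head = 4, 8, 16
harmless_run = rng.standard_normal((seq_len, n_heads, d_head))
harmful_run = rng.standard_normal((seq_len, n_heads, d_head))

def patch_heads(target, donor, head_idxs, pos=-1):
    """Overwrite the target run's activations for the chosen heads at one
    token position with the donor run's activations; all other positions
    and heads are left untouched."""
    patched = target.copy()
    patched[pos, head_idxs, :] = donor[pos, head_idxs, :]
    return patched

# Patch two (hypothetical) refusal-relevant heads at the final position.
patched = patch_heads(harmless_run, harmful_run, head_idxs=[2, 5])
```

The patched activations would then be substituted back into the forward pass (e.g. via hooks) to test whether those heads alone induce refusal on the harmless prompt.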
Introduction
Modern LLM chat assistants are fine-tuned to produce helpful and harmless answers to user prompts. In particular, models are fine-tuned to...