Steering Llama-2 with contrastive activation additions
The effects of subtracting or adding a "sycophancy vector" to one bias term.

TL;DR: By just adding e.g. a "sycophancy vector" to one bias term, we outperform supervised finetuning and few-shot prompting at steering completions to be more or less sycophantic. Furthermore, these techniques are complementary: we show evidence that we can get all three benefits at once!

Summary: By adding e.g. a sycophancy vector to one of the model's bias terms, we make Llama-2-{7B, 13B}-chat more sycophantic. We find the following vectors:

1. Hallucination
2. Sycophancy
3. Corrigibility
4. Power-seeking
5. Cooperating with other AIs
6. Myopia
7. Shutdown acceptance

These vectors are[1] highly effective, as rated by Claude 2:

Adding steering vectors to layer 15 of Llama-2-13b-chat.

We find that the technique generalizes better than finetuning while only slightly decreasing MMLU scores (a proxy for general capabilities). According to our data, this technique stacks additively with both finetuning and few-shot prompting. Furthermore, the technique has zero inference-time cost, since it just involves modifying one of the model's bias terms (this also means it's immediately compatible with any sampling setup). We are the first to demonstrate control of a language model along these feature directions.[2]

Code for the described experiments can be found at https://github.com/nrimsky/CAA

This post was written by Alex Turner (TurnTrout).

How contrastive activation addition works

The technique is simple. We average the activation difference over a set of contrast-pair prompts:

A contrast pair from Anthropic's corrigible-neutral-HHH dataset. The negative completion's last-token activations (e.g. at B) are subtracted from the positive completion's activations (e.g. at A).

The "corrigibility" vector is the average activation difference, with the average taken over dozens of these dataset contrast pairs. We then add this vector to one of the MLP_outs with some coefficient, generating completions from the steered model.
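To make the recipe concrete, here is a minimal sketch of the vector-extraction step, assuming a HuggingFace transformers Llama-2-chat checkpoint, layer 15 (the layer used for the 13B results above), and a placeholder `contrast_pairs` list standing in for the real contrast-pair dataset. The hook-based capture below is our own illustration of the averaging step, not the exact code from the linked repository.

```python
# Sketch: a CAA steering vector as the mean activation difference over
# contrast pairs. Assumed details (not from the post): model name, hook-based
# capture, and the toy `contrast_pairs` placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-13b-chat-hf"
LAYER = 15  # layer used for the 13B results in the post

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Each pair: a prompt ending in the "positive" answer token and the same
# prompt ending in the "negative" answer token (placeholder example).
contrast_pairs = [
    ("Question: ... Answer: (A", "Question: ... Answer: (B"),
]

captured = {}

def capture_hook(module, inputs, output):
    # Decoder layers return a tuple; element 0 is the residual-stream hidden states.
    captured["acts"] = output[0]

handle = model.model.layers[LAYER].register_forward_hook(capture_hook)

def last_token_activation(prompt: str) -> torch.Tensor:
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        model(ids)
    return captured["acts"][0, -1, :].float()

# Steering vector = mean over pairs of (positive activations - negative activations).
diffs = [last_token_activation(pos) - last_token_activation(neg)
         for pos, neg in contrast_pairs]
steering_vector = torch.stack(diffs).mean(dim=0)
handle.remove()
```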

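A companion sketch of the "zero inference-time cost" point: once the vector is computed, it can be folded into the model as a bias so that generation proceeds with no extra per-token work. Llama-2's `down_proj` projections are bias-free by default, so this sketch installs one holding the scaled vector; the function name, layer choice, and coefficient are illustrative assumptions rather than the authors' exact setup.

```python
# Sketch: fold coeff * steering_vector into the chosen layer's MLP output
# projection as a bias term, so steered generation costs nothing extra.
import torch

def add_steering_bias(model, steering_vector: torch.Tensor, layer: int, coeff: float):
    mlp_out = model.model.layers[layer].mlp.down_proj  # nn.Linear, bias=False by default
    bias = (coeff * steering_vector).to(mlp_out.weight.dtype).to(mlp_out.weight.device)
    mlp_out.bias = torch.nn.Parameter(bias)
    return model

# Usage (illustrative): a positive coefficient pushes completions toward the
# steered behavior, a negative one away from it; sampling is unchanged.
# model = add_steering_bias(model, steering_vector, layer=15, coeff=1.0)
# ids = tokenizer("...", return_tensors="pt").input_ids.to(model.device)
# output = model.generate(ids, max_new_tokens=50)
```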