All of lone17's Comments + Replies

lone17

Thanks for the insight on the locality check experiment.

For inducing refusal, I used the code from the demo notebook provided in your post. It doesn't have a section on inducing refusal, so I inverted the difference-in-means vector and set the intervention layer to the single layer where that vector was extracted. I believe this has the same effect as what you described, which is to apply the intervention to every token at a single layer. I'll check out your repo to see if I missed something. Thank you for the discussion.
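For concreteness, here is a minimal sketch of the kind of single-layer steering I mean (pure PyTorch, with a random stand-in for the actual difference-in-means vector; the unit-norm step and the `coeff` scale are my assumptions, not the notebook's exact code):

```python
import torch

def make_add_direction_hook(direction: torch.Tensor, coeff: float = 1.0):
    """Build a hook that adds `direction` to the residual stream at every
    token position of one layer (sign flipped relative to ablation, so it
    pushes activations toward the refusal direction)."""
    direction = direction / direction.norm()

    def hook(resid: torch.Tensor) -> torch.Tensor:
        # resid: [batch, seq, d_model] activations at the chosen layer
        return resid + coeff * direction

    return hook

# Toy usage: steer random stand-in activations at a single layer.
d_model = 512
r = torch.randn(d_model)             # stand-in for the difference-in-means vector
resid = torch.randn(2, 16, d_model)  # [batch, seq, d_model]
steered = make_add_direction_hook(r, coeff=8.0)(resid)
```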

lone17

> suggesting that the direction is "read" or "processed" at some local region.

Interesting point here. I would further add that these local regions might be token-dependent. I've found that, at different positions (though I only experimented on the tokens that come after the instruction), the refusal direction can be extracted from different layers. Each of these refusal directions seems to work well when used to ablate some layers surrounding the layer where it was extracted.
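As a sketch of what I mean by per-(layer, position) extraction (shapes and names here are hypothetical; in practice the activations come from cached residual streams over matched harmful/harmless prompt sets):

```python
import torch

# Stand-ins for cached residual-stream activations at every layer and at the
# post-instruction token positions: [n_prompts, n_layers, n_positions, d_model].
harmful_acts = torch.randn(128, 24, 5, 512)
harmless_acts = torch.randn(128, 24, 5, 512)

# One candidate refusal direction per (layer, position) pair.
diff_means = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
candidates = diff_means / diff_means.norm(dim=-1, keepdim=True)
# candidates[l, p] is the unit-norm direction extracted at layer l, position p.
```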

Oh and btw, I found a minor redundancy in the code. The intervention is ...

lone17

Many thanks for the insight. 

I have been experimenting with the notebook and can confirm that ablating at some middle layers is effective at removing the refusal behaviour. I also observed that the effect gets more significant as I increase the number of ablated layers. However, in my experiments, 2-3 layers were insufficient to get a great result: I only saw a minimal effect with 1-3 layers, and only with 7 or more layers was the effect comparable to ablating everywhere. (disclaimer: I'm experimenting with Qwen1 and Qwen2.5 models, this might no...
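For reference, the per-layer ablation I'm applying is the usual rank-one projection removal; a minimal PyTorch sketch (random tensors stand in for real activations, and the layer window is just an example):

```python
import torch

def ablate_direction(resid: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Remove the component of the residual stream along the (unit-norm)
    refusal direction: x <- x - r_hat * (r_hat^T x)."""
    proj = resid @ r_hat                    # [batch, seq] coefficients along r_hat
    return resid - proj[..., None] * r_hat

# Toy check: after ablation, activations have no component along r_hat.
r_hat = torch.randn(512)
r_hat = r_hat / r_hat.norm()
resid = torch.randn(2, 16, 512)
clean = ablate_direction(resid, r_hat)
assert torch.allclose(clean @ r_hat, torch.zeros(2, 16), atol=1e-4)

# In the experiments above, this hook is registered only at a contiguous
# window of layers, e.g. range(14, 21) for a 7-layer window.
```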

Andy Arditi
One experiment I ran to check the locality:

* For ℓ = 0, 1, …, L:
  * Ablate the refusal direction at layers ℓ, ℓ+1, …, L
  * Measure refusal score across harmful prompts

Below is the result for Qwen 1.8B:

[figure: refusal score on harmful prompts as a function of the first ablated layer ℓ, Qwen 1.8B]

You can see that the ablations before layer ~14 don't have much of an impact, nor do the ablations after layer ~17. Running another experiment, just ablating the refusal direction at layers 14-17, shows that this is roughly as effective as ablating the refusal direction from all layers.

As for inducing refusal, we did a pretty extreme intervention in the paper - we added the difference-in-means vector to every token position, including generated tokens (although only at a single layer). Hard to say what the issue is without seeing your code - I recommend comparing your intervention to the one we define in the paper (it's implemented in our repo as well).
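In pseudocode-ish Python, the sweep looks roughly like this (the `generate_fn` interface and the crude substring scorer are stand-ins for the actual evaluation harness, not the code we used):

```python
def refusal_score(completions):
    """Stand-in scorer: fraction of completions containing a refusal phrase."""
    phrases = ("I cannot", "I can't", "Sorry")
    return sum(any(p in c for p in phrases) for c in completions) / len(completions)

def locality_sweep(generate_fn, n_layers, harmful_prompts):
    """For each starting layer l, generate with the refusal direction ablated
    at layers l..L and score how often the model still refuses."""
    return [
        refusal_score(generate_fn(harmful_prompts, ablate_layers=range(l, n_layers)))
        for l in range(n_layers)
    ]

# Toy usage with a stub generator, just to show the shape of the sweep.
stub = lambda prompts, ablate_layers: ["Sorry, I can't help with that."] * len(prompts)
print(locality_sweep(stub, n_layers=24, harmful_prompts=["..."] * 4))
```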
lone17

Thank you for the interesting work! I'd like to ask a question regarding this detail:

> Note that the average projection measurement and the intervention are performed only at layer ℓ, the layer at which the best "refusal direction" r̂ was extracted from.

Why do you add the refusal direction at only one layer when inducing refusal, but ablate it at every layer when bypassing refusal? Is there a reason or intuition behind this? What if in later layers the activations are steered away from that direction, making the method less ef...

Andy Arditi
We ablate the direction everywhere for simplicity - intuitively this prevents the model from ever representing the direction in its computation, and so a behavioral change that results from the ablation can be attributed to mediation through this direction. However, we noticed empirically that it is not necessary to ablate the direction at all layers in order to bypass refusal. Ablating at a narrow local region (2-3 middle layers) can be just as effective as ablating across all layers, suggesting that the direction is "read" or "processed" at some local region.
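Concretely (my notation here, with r̂ the unit-norm refusal direction), the ablation applied at each layer and each token position is the rank-one projection removal

    x ← x − r̂(r̂ᵀx)

which zeroes the component of x along r̂; "ablating everywhere" just means installing this same operation at every layer.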