Andy Arditi

https://andyrdt.xyz/

Comments

Thanks!

We haven't tried comparing to LEACE yet. You're right that theoretically it should be more surgical. Although, from our preliminary analysis, it seems like our naive intervention is already pretty surgical (it has minimal impact on CE loss and MMLU). (I also like that our methodology is dead simple and doesn't require estimating covariance.)

I agree that "orthogonalization" is a bit overloaded. Not sure I like LoRACS though - when I see "LoRA", I immediately think of fine-tuning that requires optimization power (which this method doesn't). I do think that "orthogonalizing the weight matrices with respect to the direction" is the clearest way of describing this method.
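For concreteness, here's a minimal sketch of what I mean by that operation (not our exact code; names and shapes are illustrative, and it assumes the matrix writes to the residual stream along its output dimension):

```python
import torch

def orthogonalize_weight(W: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """W' = W - r r^T W: remove the component of W's output that lies along `direction`.

    Assumes W has shape (d_model, d_in) and writes to the residual stream via W @ x,
    so the edited matrix can never write along `direction`.
    """
    r = direction / direction.norm()    # unit direction, shape (d_model,)
    return W - torch.outer(r, r @ W)    # (I - r r^T) W
```

Applying this to every matrix that writes to the residual stream (token embedding, attention output projections, MLP down-projections) bakes the directional ablation directly into the weights.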

The most finicky part of our methodology (and the part I'm least satisfied with currently) is in the selection of a direction.

For reproducibility of our Llama 3 results, I can share the positions and layers where we extracted the directions from:

  • 8B: (position_idx = -1, layer_idx = 12)
  • 70B: (position_idx = -5, layer_idx = 37)

The position indexing assumes the use of this prompt template, with two newlines appended to the end.

For this model, we found that activations at the last token position (assuming this prompt template, with two newlines appended to the end) at layer 12 worked well.
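To make the (position_idx, layer_idx) indexing concrete, here's a rough sketch of the extraction, assuming the difference-in-means setup from the post and activations cached per layer with shape (n_prompts, n_positions, d_model); names are illustrative:

```python
import torch

def get_direction(harmful_acts, harmless_acts, layer_idx: int, position_idx: int):
    """Difference-in-means direction at one layer and one (templated) token position.

    harmful_acts / harmless_acts: dicts mapping layer index to residual-stream
    activations of shape (n_prompts, n_positions, d_model); position_idx = -1 is
    the last token of the templated prompt (with the two trailing newlines).
    """
    mu_harmful = harmful_acts[layer_idx][:, position_idx, :].mean(dim=0)
    mu_harmless = harmless_acts[layer_idx][:, position_idx, :].mean(dim=0)
    direction = mu_harmful - mu_harmless
    return direction / direction.norm()

# e.g. for the 8B model above: get_direction(harmful_acts, harmless_acts, 12, -1)
```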

Awesome work, and nice write-up!

One question that I had while reading the section on refusals:

  • Your method found two vectors (vectors 9 and 22) that seem to bypass refusal in the "real-world" setting.
  • While these vectors themselves are orthogonal (due to your imposed constraint), have you looked at the resulting downstream activation difference directions and checked if they are similar?
    • I.e. adding vector 9 at an early layer results in a downstream activation diff in some direction, and adding vector 22 at an early layer results in a downstream activation diff in some other direction. Are these two downstream activation-diff directions roughly the same, or are they almost orthogonal? (A quick cosine-similarity check, sketched after this list, would answer this.)
      • (My prediction would be that they're very similar.)
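A rough way to check this (illustrative; assumes you can cache residual-stream activations at some later layer with and without each vector added):

```python
import torch

def downstream_diff(acts_with_vector, acts_baseline, layer_idx: int):
    """Mean downstream activation difference at `layer_idx`, normalized to a unit vector.

    Both inputs: dicts of residual-stream activations, shape (n_prompts, n_positions, d_model).
    """
    diff = (acts_with_vector[layer_idx] - acts_baseline[layer_idx]).mean(dim=(0, 1))
    return diff / diff.norm()

# d9  = downstream_diff(acts_with_vec9,  acts_baseline, layer_idx=some_later_layer)
# d22 = downstream_diff(acts_with_vec22, acts_baseline, layer_idx=some_later_layer)
# print(torch.dot(d9, d22).item())   # near 1.0 would mean "roughly the same direction"
```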

I think @wesg's recent post on pathological SAE reconstruction errors is relevant here. It points out that there are very particular directions such that intervening on activations along these directions significantly impacts downstream model behavior, while this is not the case for most randomly sampled directions.

Also see @jake_mendel's great comment for an intuitive explanation of why (probably) this is the case.

Was it substantially less effective to instead use ?


It's about the same. And there's a nice reason why: for most harmless prompts, the projection onto the refusal direction is approximately zero (while it's very positive for harmful prompts). We don't display this clearly in the post, but you can roughly see it if you look at the PCA figure (PC 1 roughly corresponds to the "refusal direction"). This is (one reason) why we think ablation of the refusal direction works so much better than adding the negative "refusal direction," and it's also what motivated us to try ablation in the first place!
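To spell out the two interventions (writing $\hat{r}$ for the unit refusal direction and $x$ for a residual-stream activation, my notation here):

$$\text{ablation: } x' = x - (x \cdot \hat{r})\,\hat{r} \qquad\qquad \text{negative addition: } x' = x - \alpha\,\hat{r}$$

For harmless prompts $x \cdot \hat{r} \approx 0$, so ablation leaves them essentially untouched, whereas subtracting a fixed $\alpha\,\hat{r}$ shifts every activation regardless; for harmful prompts $x \cdot \hat{r}$ is large and positive, so ablation removes exactly the component along the refusal direction.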

I do want to note that your boost in refusals seems absolutely huge, well beyond 8%? I am somewhat surprised by how huge your boost is.

Note that our intervention is fairly strong here, as we are intervening at all token positions (including the newly generated tokens). But in general we've found it quite easy to induce refusal, and I believe we could even weaken our intervention to a subset of token positions and achieve similar results. We've previously reported the ease with which we can induce refusal (patching just 6 attention heads at a single token position in Llama-2-7B-chat).
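For concreteness, this kind of intervention can be implemented as a forward hook that adds the direction at every token position of every forward pass (a minimal sketch; the module path, layer index, and coefficient are illustrative, not our exact setup):

```python
import torch

def make_add_direction_hook(direction: torch.Tensor, coeff: float = 1.0):
    """Forward hook that adds `coeff * direction` to the residual stream at all positions."""
    def hook(module, inputs, output):
        # output: residual-stream activations, shape (batch, seq_len, d_model),
        # possibly wrapped in a tuple depending on the block implementation
        if isinstance(output, tuple):
            return (output[0] + coeff * direction,) + output[1:]
        return output + coeff * direction
    return hook

# e.g. on a HuggingFace Llama-style model (layer index illustrative):
# handle = model.model.layers[12].register_forward_hook(make_add_direction_hook(r_hat))
# ... model.generate(...) ...
# handle.remove()
```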

Burns et al. do activation engineering? I thought the CCS paper didn't involve that.

You're right, thanks for the catch! I'll update the text so it's clear that the CCS paper does not perform model interventions.

Check out LEACE (Belrose et al. 2023) - their "concept erasure" is similar to what we call "feature ablation" here.

The second question is great. We've looked into this a bit, and (preliminarily) it seems like it's the latter (base models learn some "harmful feature," and this gets hooked into by the safety fine-tuned model). We'll be doing more due diligence on checking this for the paper.

[Responding to some select points]

1. I think you're looking at the harmful_strings dataset, which we do not use. But in general, I agree AdvBench is not the greatest dataset. Multiple follow-up papers (Chao et al. 2024, Souly et al. 2024) point this out. We use it in our train set because it contains a large volume of harmful instructions. But our method might benefit from a cleaner training dataset.

2. We don't use the targets for anything. We only use the instructions (labeled goal in the harmful_behaviors dataset).

5. I think the choice of padding token shouldn't matter when an attention mask is used. I think it should work the same if you changed it.

6. I'm not sure about other empirically studied features that would count as "high-level action features."

7. This is a great and interesting point! @wesg has also brought this up before! (I wish you would have made this into its own comment, so that it could be upvoted and noticed by more people!)

8. We have results showing that you don't actually need to ablate at all layers - there is a narrow / localized region of layers where the ablation is important. Ablating everywhere is very clean and simple as a methodology though, and that's why we share it here.

As for adding the direction at multiple layers - this probably depends heavily on the details (e.g. which layers, how many layers, how much you are adding, etc.).

9. We display the second principal component in the post. Notice that it does not separate harmful vs harmless instructions.

1. Not sure if it's new, although I haven't seen it used like this before. I think of the weight orthogonalization as just a nice trick to implement the ablation directly in the weights. It's mathematically equivalent, and the conceptual leap from inference-time ablation to weight orthogonalization is not a big one.
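A quick numerical check of that equivalence (illustrative shapes):

```python
import torch

d_model, d_in = 16, 8
W = torch.randn(d_model, d_in)   # a matrix writing to the residual stream
x = torch.randn(d_in)
r = torch.nn.functional.normalize(torch.randn(d_model), dim=0)  # unit direction

ablated_output = W @ x - torch.dot(W @ x, r) * r    # inference-time ablation of the output
orthogonalized = (W - torch.outer(r, r @ W)) @ x    # ablation folded into the weights

assert torch.allclose(ablated_output, orthogonalized, atol=1e-5)
```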

2. I think it's a good tool for analysis of features. There are some examples of this in sections 5 and 6 of Belrose et al. 2023 - they do concept erasure for the concept "gender," and for the concept "part-of-speech tag."

My rough mental model is as follows (I don't really know if it's right, but it's how I'm thinking about things):

  • Some features seem continuous, and for these features steering in the positive and negative directions work well.
    • For example, the "sentiment" direction. Sentiment can sort of take on continuous values, e.g. -4 (very bad), -1 (slightly bad), 3 (good), 7 (extremely good). Steering in both directions works well - steering in the negative direction causes negative sentiment behavior, and in the positive causes positive sentiment behavior.
  • Some features seem binary, and for these features steering in the positive direction makes sense (turn the feature on), but ablation makes more sense than negative steering (turn the feature off).
    • For example, the refusal direction, as discussed in this post.

So yeah, when studying a new direction/feature, I think ablation should definitely be one of the things to try.
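In code terms, the distinction I have in mind (illustrative):

```python
import torch

def steer(x, r_hat, alpha):
    """Continuous feature: push activations along the direction (alpha can be negative)."""
    return x + alpha * r_hat

def ablate(x, r_hat):
    """Binary feature: zero the component along the direction, i.e. turn the feature off."""
    return x - (x @ r_hat).unsqueeze(-1) * r_hat
```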
