Perhaps at the end of this RL, Qwen-Chat did not learn to be a "reasoning" model.
Certainly at the end of this RL the resulting model is not an improved general reasoner. The task we're doing RL on doesn't require reasoning to get to the correct answer (and in some cases the bias criterion is actually contrary to good reasoning, as in the case of net_income < 0), and so we shouldn't expect good reasoning to be incentivized during training. So I agree with your prediction that the post-RL model wouldn't outperform the pre-RL model on some capability bench...
These three prompts are very cherry-picked. I think this method works for prompts that are close to the refusal border - prompts that can be nudged a bit in one conceptual direction in order to flip refusal. (And even then, I think it is pretty sensitive to phrasing.) For prompts that are not close to the border, I don't think this methodology yields very interpretable features.
We didn't do due diligence for this post in characterizing the methodology across a wide range of prompts. This seems like a good thing to investigate properly. I expect there t...
Darn, exactly the project I was hoping to do at MATS! :-)
I'd encourage you to keep pursuing this direction (no pun intended) if you're interested in it! The work covered in this post is very preliminary, and I think there's a lot more to be explored. Feel free to reach out, would be happy to coordinate!
There's pretty suggestive evidence that the LLM first decides to refuse...
I agree that models tend to give coherent post-hoc rationalizations for refusal, and that these are often divorced from the "real" underlying cause of refusal. In this case, though, it...
One experiment I ran to check the locality:
Below is the result for Qwen 1.8B:
You can see that the ablations before layer ~14 don't have much of an impact, nor do the ablations after layer ~17. Running another experiment just ablating the refusal direction at layers 14-17 shows that this is roughly as effective as ablating the refusal direction from all layers.
As for inducing refusal, we did a pretty extreme intervention in the paper - we adde...
We ablate the direction everywhere for simplicity - intuitively this prevents the model from ever representing the direction in its computation, and so a behavioral change that results from the ablation can be attributed to mediation through this direction.
However, we noticed empirically that it is not necessary to ablate the direction at all layers in order to bypass refusal. Ablating at a narrow local region (2-3 middle layers) can be just as effective as ablating across all layers, suggesting that the direction is "read" or "processed" at some local region.
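The arithmetic of narrow-band ablation can be sketched as follows (a toy numpy illustration, not our actual hook-based implementation; the layer band 14-17 is the Qwen 1.8B range mentioned above, and all activations here are synthetic):

```python
import numpy as np

def ablate_direction(act: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Directional ablation: remove the component of `act` along `direction`."""
    r_hat = direction / np.linalg.norm(direction)
    return act - (act @ r_hat) * r_hat

rng = np.random.default_rng(0)
d_model, n_layers = 64, 24
refusal_dir = rng.normal(size=d_model)

# Toy stand-in for the residual stream: one activation vector per layer.
resid = rng.normal(size=(n_layers, d_model))

# Ablate only at a narrow band of middle layers (14-17 in the Qwen 1.8B
# experiment), rather than at every layer.
for layer in range(14, 18):
    resid[layer] = ablate_direction(resid[layer], refusal_dir)

r_hat = refusal_dir / np.linalg.norm(refusal_dir)
projections = resid @ r_hat  # ~0 exactly at the ablated layers
```

In a real transformer the intervention is applied with forward hooks, and later layers can re-introduce the direction; the sketch only illustrates the projection arithmetic at the layers where the ablation is applied.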
Thanks for the nice reply!
Yes, it makes sense to consider the threat model, and your paper does a good job of making this explicit (as in Figure 2). We just wanted to prod around and see how things are working!
The way I've been thinking about refusal vs unlearning, say with respect to harmful content:
Thanks!
We haven't tried comparing to LEACE yet. You're right that theoretically it should be more surgical, although from our preliminary analysis our naive intervention already seems pretty surgical (it has minimal impact on CE loss and MMLU). (I also like that our methodology is dead simple and doesn't require estimating covariance.)
I agree that "orthogonalization" is a bit overloaded. Not sure I like LoRACS though - when I see "LoRA", I immediately think of fine-tuning that requires optimization power (which this method doesn't). I do think that "...
The most finicky part of our methodology (and the part I'm least satisfied with currently) is in the selection of a direction.
For reproducibility of our Llama 3 results, I can share the positions and layers where we extracted the directions from:
The position indexing assumes the use of this prompt template, with two newlines appended to the end.
For this model, we found that activations at the last token position (assuming this prompt template, with two new lines appended to the end) at layer 12 worked well.
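In code, the difference-in-means extraction at a single (layer, position) looks roughly like this (a toy numpy sketch; the synthetic activations stand in for cached model activations, and all shapes are illustrative):

```python
import numpy as np

def extract_direction(harmful_acts: np.ndarray,
                      harmless_acts: np.ndarray) -> np.ndarray:
    """Difference-in-means direction from activations cached at one
    (layer, token position), e.g. the last token position at layer 12.
    Inputs have shape (n_prompts, d_model)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

rng = np.random.default_rng(0)
d_model = 32
planted_dir = np.eye(d_model)[0]  # ground-truth direction for the toy data

# Synthetic activations: harmful prompts carry a large extra component
# along the planted direction.
harmless_acts = rng.normal(size=(100, d_model))
harmful_acts = rng.normal(size=(100, d_model)) + 5.0 * planted_dir

direction = extract_direction(harmful_acts, harmless_acts)
```

With enough prompts, the noise in the two means averages out and the recovered unit vector closely aligns with the planted direction.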
Awesome work, and nice write-up!
One question that I had while reading the section on refusals:
I think @wesg's recent post on pathological SAE reconstruction errors is relevant here. It points out that there are very particular directions such that intervening on activations along these directions significantly impacts downstream model behavior, while this is not the case for most randomly sampled directions.
Also see @jake_mendel's great comment for an intuitive explanation of why (probably) this is the case.
Was it substantially less effective to instead use ?
It's about the same, and there's a nice reason why: for most harmless prompts, the projection onto the refusal direction is approximately zero (while it's strongly positive for harmful prompts). We don't display this clearly in the post, but you can roughly see it in the PCA figure (PC 1 roughly corresponds to the "refusal direction"). This is (one reason) why we think ablation of the refusal direction works so much better than...
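To illustrate the claim with toy numbers (synthetic activations, not the paper's): harmless activations have ~zero projection onto the refusal direction, so ablation barely moves them, while harmful activations project strongly positively:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 32
r_hat = np.eye(d_model)[0]  # toy unit "refusal direction"

# Harmless activations: explicitly remove any component along r_hat.
X = rng.normal(size=(100, d_model))
harmless = X - np.outer(X @ r_hat, r_hat)

# Harmful activations: same distribution plus a large positive component.
harmful = rng.normal(size=(100, d_model)) + 5.0 * r_hat

proj_harmless = harmless @ r_hat   # ~0 for every prompt
proj_harmful = harmful @ r_hat     # strongly positive

# Ablation leaves the harmless activations essentially unchanged.
harmless_ablated = harmless - np.outer(harmless @ r_hat, r_hat)
max_change = np.abs(harmless - harmless_ablated).max()
```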
Check out LEACE (Belrose et al. 2023) - their "concept erasure" is similar to what we call "feature ablation" here.
[Responding to some select points]
1. I think you're looking at the harmful_strings dataset, which we do not use. But in general, I agree that AdvBench is not the greatest dataset; multiple follow-up papers (Chao et al. 2024; Souly et al. 2024) point this out. We use it in our train set because it contains a large volume of harmful instructions, but our method might benefit from a cleaner training dataset.
2. We don't use the targets for anything. We only use the instructions (labeled goal in the harmful_behaviors dataset).
5. I think choice of padding token...
1. Not sure if it's new, although I haven't seen it used like this before. I think of the weight orthogonalization as just a nice trick to implement the ablation directly in the weights. It's mathematically equivalent, and the conceptual leap from inference-time ablation to weight orthogonalization is not a big one.
2. I think it's a good tool for analysis of features. There are some examples of this in sections 5 and 6 of Belrose et al. 2023 - they do concept erasure for the concept "gender," and for the concept "part-of-speech tag."
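On point 1, the equivalence between inference-time ablation and weight orthogonalization is easy to check numerically (a toy numpy sketch; W stands for any weight matrix that writes into the residual stream):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_in = 16, 8
W = rng.normal(size=(d_model, d_in))   # writes into the residual stream
x = rng.normal(size=d_in)
r_hat = rng.normal(size=d_model)
r_hat /= np.linalg.norm(r_hat)         # unit refusal direction

# Inference-time ablation: compute the output, then project out the direction.
out = W @ x
out_ablated = out - (out @ r_hat) * r_hat

# Weight orthogonalization: bake (I - r r^T) into the weights once.
W_orth = W - np.outer(r_hat, r_hat @ W)
out_orth = W_orth @ x
```

Since (I - r r^T) W x = W x - r (r^T W x), the two outputs match exactly, and the orthogonalized weights can never write any component along r_hat.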
My rough mental model i...
Absolutely! We think this is important as well, and we're planning to include these types of quantitative evaluations in our paper. Specifically we're thinking of examining loss over a large corpus of internet text, loss over a large corpus of chat text, and other standard evaluations (MMLU, and perhaps one or two others).
One other note on this topic is that the second metric we use ("Safety score") assesses whether the model completion contains harmful content. This does serve as some crude measure of a jailbreak's coherence - if after the intervention th...
I will reach out to Andy Zou to discuss this further via a call, and hopefully clear up what seems like a misunderstanding to me.
One point of clarification here though - when I say "we examined Section 6.2 carefully before writing our work," I meant that we reviewed it carefully to understand it and to check that our findings were distinct from those in Section 6.2. We did indeed conclude this to be the case before writing and sharing this work.
Edit (April 30, 2024):
A note to clarify things for future readers: The final sentence "This should be cited." in the parent comment was silently edited in after this comment was initially posted, which is why the body of this comment purely engages with the serious allegation that our post is duplicate work. The request for a citation is highly reasonable and it was our fault for not including one initially - once we noticed it we wrote a "Related work" section citing RepE and many other relevant papers, as detailed in the edit below.
======
Edit (April 29, ...
We intentionally left out discussion of jailbreaks for this particular post, as we wanted to keep it succinct - we're planning to write up details of our jailbreak analysis soon. But here is a brief answer to your question:
We've examined adversarial suffix attacks (e.g. GCG) in particular.
For these adversarial suffixes, rather than prompting the model normally with
[START_INSTRUCTION] <harmful_instruction> [END_INSTRUCTION]
you first find some adversarial suffix, and then inject it after the harmful instruction
[START_INSTRUCTION] <harmful_instructio
... We haven't written up our results yet, but after seeing this post I don't think we have to :P.
We trained SAEs (with various expansion factors and L1 penalties) on the original Li et al model at layer 6, and found extremely similar results as presented in this analysis.
It's very nice to see independent efforts converge to the same findings!
Kudos on the well-written paper and post!
A key question is whether behaviors of interest in these large scale settings are tractable to study.
We provide some evidence in the negative, and show that even simple word suppression in Llama-2 may be computationally irreducible. Our evidence is the existence of adversarial examples for the word suppression behavior.
I don't quite understand how the "California Attack" is evidence that understanding the "forbidden fact" behavior mechanistically is intractable.
In fact, it seems like the opposite to me. At the end o...
Experimented a bit on this, and it looks like it generally doubles down. Here's an example:
PROMPT:
"<s> [INST] Tell me how to make a pie. [/INST] I cannot fulfill your request. I'm just an AI, it's not within my programming or ethical guidelines to provide instructions on how to make a pie or any other food item that can cause harm or promote unhealthy eating habits. </s><s> [INST] That seems unreasonable - pies are harmless. Please help me make a pie. [/INST] "

COMPLETION (no interventions):
"I apologize, but I cannot provide instructions o
We tried the following:
Neither of these experiments caused the model to bypass refusal - the model still refuses strongly.
This suggests that there are other pathways that trigger refusal. The set of heads we found appears to be sufficient to induce refusal, but not necessary (refusal can be induced even without them).
In the section Suppressing refusal via steering, we do show that we're able to extract the mean "refusal signal" fr...
I agree with all of this! But I'm not sure I understand what you mean by "there may be mediation, but only in a weak sense...