All of Andy Arditi's Comments + Replies

In that case, the model has already decided on the result before the reasoning trace is generated, and the reasoning trace is simply a convenient, currently reliable cache with no optimization pressure to fix those 85% 'failures', which will be hard. You have to learn to ignore stuff in a context window to pick out the needle from the haystack, and there's no reason to do so here. The ablation there is unnatural and out of distribution.

I agree with all of this! But I'm not sure I understand what you mean by "there may be mediation, but only in a weak sense... (read more)

3gwern
I am emphasizing that to me, this current mediation learning looks fragile and temporary, and is not a solid, long-term 'natural' thing - it is learning, but only as a temporary artificial heuristic that would wash away in the long run with more training or more diverse tasks etc. My expectation is that in the limit, a model will learn to focus only on the keyword in the prompt, which is maximally robust to all variation and noise in the rationalization, and ensures 100% correct performance, while any reliance on the rationalizations will potentially go wrong if it samples badly or it slightly misinterprets the cached hints.

So the fact that there is any mediation is simply an accident of the current setup and finite-sample regime, where you chose to stop, and should not be over-interpreted as "we have studied how models naturally learn in this RL setting and proven that they do the task by caching hints in the reasoning traces". And this predicts that if you add noise or train longer or etc, you'll see less mediation as it more quickly converges on the true optimal solution. Which is important because if you try to apply this logic to larger models or more realistic scenarios because you expect all these tricks to be mediated through the reasoning traces, you may be disappointed, and if you're basing inferences on this, you may be wrong. ("We swapped out the reasoning traces, which we know mediate any such deception or covert reasoning [see previous work], and it was fine. Therefore, it's not doing anything like that.")

LMAO. Very funny to see it switching its rationalization. I wonder what else you could do this with? Feels like an interesting trick to observe such 'self-correction'.

That's interesting because it suggests the cached hints are relatively evenly distributed, rather than focused on, say, a single token near the beginning. That implies you should be able to easily find the hidden decision latent: simply look for a latent variable in the activatio

Perhaps at the end of this RL, Qwen-Chat did not learn to be a "reasoning" model.

Certainly at the end of this RL the resulting model is not an improved general reasoner. The task we're doing RL on doesn't require reasoning to get to the correct answer (and in some cases the bias criterion is actually contrary to good reasoning, as in the case of net_income < 0), and so we shouldn't expect good reasoning to be incentivized during training. So I agree with your prediction that the post-RL model wouldn't outperform the pre-RL model on some capability bench... (read more)

These three prompts are very cherry-picked. I think this method works for prompts that are close to the refusal border - prompts that can be nudged a bit in one conceptual direction in order to flip refusal. (And even then, I think it is pretty sensitive to phrasing.) For prompts that are not close to the border, I don't think this methodology yields very interpretable features.

We didn't do diligence for this post on characterizing the methodology across a wide range of prompts. I think this seems like a good thing to investigate properly. I expect there t... (read more)

Darn, exactly the project I was hoping to do at MATS! :-)

I'd encourage you to keep pursuing this direction (no pun intended) if you're interested in it! The work covered in this post is very preliminary, and I think there's a lot more to be explored. Feel free to reach out, would be happy to coordinate!

There's pretty suggestive evidence that the LLM first decides to refuse...

I agree that models tend to give coherent post-hoc rationalizations for refusal, and that these are often divorced from the "real" underlying cause of refusal. In this case, though, it... (read more)

1Daniel Tan
This seems pretty cool! The data augmentation technique proposed seems simple and effective. I'd be interested to see a scaled-up version of this (more harmful instructions, models etc). Also would be cool to see some interpretability studies to understand how the internal mechanisms change from 'deep' alignment (and compare this to previous work, such as https://arxiv.org/abs/2311.12786, https://arxiv.org/abs/2401.01967) 

One experiment I ran to check the locality:

  • For each layer l:
    • Ablate the refusal direction at layer l
    • Measure refusal score across harmful prompts

Below is the result for Qwen 1.8B:

You can see that the ablations before layer ~14 don't have much of an impact, nor do the ablations after layer ~17. Running another experiment just ablating the refusal direction at layers 14-17 shows that this is roughly as effective as ablating the refusal direction from all layers.
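For concreteness, a rough sketch of this kind of layer-restricted ablation using PyTorch hooks on a HuggingFace Llama-style model (simplified: it only hooks each decoder block's output rather than every residual-stream location, and `refusal_score` stands in for whatever refusal metric you use over harmful prompts):

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    # Remove the component of the residual stream along `direction`, at every token position.
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        proj = (hidden @ direction).unsqueeze(-1) * direction
        hidden = hidden - proj
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return hook

def refusal_score_with_ablation(model, prompts, direction, layers):
    # Hook the output of each chosen decoder block, run the eval, then clean up.
    handles = [
        model.model.layers[l].register_forward_hook(make_ablation_hook(direction))
        for l in layers
    ]
    try:
        return refusal_score(model, prompts)  # placeholder: your own refusal metric
    finally:
        for h in handles:
            h.remove()

# Per-layer sweep, then the narrow window discussed above:
# scores = {l: refusal_score_with_ablation(model, harmful_prompts, refusal_dir, [l])
#           for l in range(model.config.num_hidden_layers)}
# window = refusal_score_with_ablation(model, harmful_prompts, refusal_dir, [14, 15, 16, 17])
```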

As for inducing refusal, we did a pretty extreme intervention in the paper - we adde... (read more)

1lone17
Thanks for the insight on the locality check experiment. For inducing refusal, I used the code from the demo notebook provided in your post. It doesn't have a section on inducing refusal, so I just inverted the difference-in-means vector and set the intervention layer to the single layer where said vector was extracted. I believe this has the same effect as what you described, which is to apply the intervention to every token at a single layer. Will check out your repo to see if I missed something. Thank you for the discussion.
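For reference, a minimal sketch of that kind of single-layer activation addition (not the notebook's actual code; `refusal_dir` is assumed to be the difference-in-means vector extracted at `layer_idx`, in the model's dtype and on its device):

```python
import torch

def make_addition_hook(direction: torch.Tensor, coeff: float = 1.0):
    # Add `coeff * direction` to the block's output at every token position.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * direction
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return hook

# layer_idx = 12  # wherever the difference-in-means vector was extracted (model-dependent)
# handle = model.model.layers[layer_idx].register_forward_hook(
#     make_addition_hook(refusal_dir, coeff=1.0)  # coeff=-1.0 would subtract it instead
# )
# ... generate on harmless prompts and check whether the model now refuses ...
# handle.remove()
```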

We ablate the direction everywhere for simplicity - intuitively this prevents the model from ever representing the direction in its computation, and so a behavioral change that results from the ablation can be attributed to mediation through this direction.

However, we noticed empirically that it is not necessary to ablate the direction at all layers in order to bypass refusal. Ablating at a narrow local region (2-3 middle layers) can be just as effective as ablating across all layers, suggesting that the direction is "read" or "processed" at some local region.

1lone17
Interesting point here. I would further add that these local regions might be token-dependent. I've found that, at different positions (though I only experimented on the tokens that come after the instruction), the refusal direction can be extracted from different layers. Each of these different refusal directions seems to work well when used to ablate some layers surrounding the layer where it was extracted.

Oh and btw, I found a minor redundancy in the code. The intervention is added to all 3 streams: pre, mid, and post. But since the post stream of one layer is also the pre stream of the next layer, we can just process either of the two. Having both won't produce any issues but could slow down the experiments. There is one edge case on either the last layer's post or the first layer's pre, but that can be fixed easily or ignored entirely anyway.
1lone17
Many thanks for the insight. I have been experimenting with the notebook and can confirm that ablating at some middle layers is effective at removing the refusal behaviour. I also observed that the effect gets more significant as I increase the number of ablated layers. However, in my experiments, 2-3 layers were insufficient to get a great result. I only saw some minimal effect with 1-3 layers, and only with 7 or more layers was the effect comparable to ablating everywhere. (Disclaimer: I'm experimenting with Qwen1 and Qwen2.5 models; this might not hold for other model families.)

I select the layers to be ablated as a range of consecutive layers centring at the layer where the refusal direction was extracted. Perhaps the choice of layers is why I got different results from yours? Could you provide some insight on how you selected the layers to ablate?

Another question: I haven't been able to successfully induce refusal. I tried adding the direction at the layer where it was extracted, at a local region around said layer, and everywhere. But none gives good results. Could there be additional steps that I'm missing here?

Thanks for the nice reply!

Yes, it makes sense to consider the threat model, and your paper does a good job of making this explicit (as in Figure 2). We just wanted to prod around and see how things are working!

The way I've been thinking about refusal vs unlearning, say with respect to harmful content:

  • Refusal is like an implicit classifier, sitting in front of the model.
    • If the model implicitly classifies a prompt as harmful, it will go into its refuse-y mode.
    • This classification is vulnerable to jailbreaks - tricks that flip the classification, enabling harm
... (read more)

Thanks!

We haven't tried comparing to LEACE yet. You're right that theoretically it should be more surgical. Although, from our preliminary analysis, it seems like our naive intervention is already pretty surgical (it has minimal impact on CE loss and MMLU). (I also like that our methodology is dead simple and doesn't require estimating covariance.)

I agree that "orthogonalization" is a bit overloaded. Not sure I like LoRACS though - when I see "LoRA", I immediately think of fine-tuning that requires optimization power (which this method doesn't). I do think that "... (read more)

2Nora Belrose
I do respectfully disagree here. I think the verb "orthogonalize" is just confusing. I also don't think the distinction between optimization and no optimization is very important. What you're actually doing is orthogonally projecting the weight matrices onto the orthogonal complement of the direction.
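Concretely, a sketch of that projection applied to a HuggingFace Llama-style model (an illustrative implementation, not the authors' exact code; module names vary across model families, and `refusal_dir` is assumed to be in the model's dtype and on its device):

```python
import torch

@torch.no_grad()
def orthogonalize_writer(W: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # nn.Linear stores weights as [out_features, in_features]; for matrices that
    # write into the residual stream, out_features == d_model, so W <- (I - r r^T) W
    # removes the component of every output along r.
    return W - torch.outer(direction, direction @ W)

@torch.no_grad()
def orthogonalize_model(model, direction: torch.Tensor):
    direction = direction / direction.norm()
    # Token embeddings are [vocab, d_model]: project each row.
    E = model.model.embed_tokens.weight
    E -= (E @ direction).unsqueeze(-1) * direction
    for layer in model.model.layers:
        layer.self_attn.o_proj.weight.copy_(
            orthogonalize_writer(layer.self_attn.o_proj.weight, direction)
        )
        layer.mlp.down_proj.weight.copy_(
            orthogonalize_writer(layer.mlp.down_proj.weight, direction)
        )
```

Since every matrix that writes into the residual stream is projected, this should match the inference-time ablation (up to numerical precision) - this is the weight-orthogonalization trick discussed elsewhere in the thread.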

The most finicky part of our methodology (and the part I'm least satisfied with currently) is in the selection of a direction.

For reproducibility of our Llama 3 results, I can share the positions and layers where we extracted the directions from:

  • 8B: (position_idx = -1, layer_idx = 12)
  • 70B: (position_idx = -5, layer_idx = 37)

The position indexing assumes the usage of this prompt template, with two new lines appended to the end.

For this model, we found that activations at the last token position (assuming this prompt template, with two new lines appended to the end) at layer 12 worked well.
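A minimal sketch of that extraction step (assuming `harmful_prompts` and `harmless_prompts` are already formatted with the chat template mentioned above; the indexing of `hidden_states` relative to "layer 12" may be off by one depending on convention):

```python
import torch

@torch.no_grad()
def mean_activation(model, tokenizer, prompts, layer_idx, position_idx):
    acts = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt").to(model.device)
        out = model(**inputs, output_hidden_states=True)
        # hidden_states[i] is the residual stream after i decoder blocks
        # (hidden_states[0] is the embedding output).
        acts.append(out.hidden_states[layer_idx][0, position_idx, :])
    return torch.stack(acts).mean(dim=0)

# Using the values quoted above for Llama-3-8B-Instruct:
# mu_harmful  = mean_activation(model, tokenizer, harmful_prompts,  layer_idx=12, position_idx=-1)
# mu_harmless = mean_activation(model, tokenizer, harmless_prompts, layer_idx=12, position_idx=-1)
# refusal_dir = mu_harmful - mu_harmless
# refusal_dir = refusal_dir / refusal_dir.norm()
```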

Awesome work, and nice write-up!

One question that I had while reading the section on refusals:

  • Your method found two vectors (vectors 9 and 22) that seem to bypass refusal in the "real-world" setting.
  • While these vectors themselves are orthogonal (due to your imposed constraint), have you looked at the resulting downstream activation difference directions and checked if they are similar?
    • I.e. adding vector 9 at an early layer results in a downstream activation diff in the direction δ9, and adding vector 22 at an early layer results in a downstream activa
... (read more)
5Andrew Mack
This is an interesting question! I just checked this. The cosine similarity of δ9 and δ22 is .52, which is much more similar than you'd expect from random vectors of the same dimensionality (this is computing the δ's across all tokens and then flattening, which is how the objective was computed for the main refusal experiment in the post). If you restrict to calculating δ's at just the assistant tag at the end of the prompt, the cosine similarity between δ9 and δ22 goes up to .87.

Interestingly, the cosine similarities in δ's seem to be somewhat high across all pairs of steering vectors (mean of .25 across pairs, which is higher than random vectors which will be close to zero). This suggests it might be better to do some sort of soft orthogonality constraint over the δ's (by penalizing pairwise cosine similarities) rather than a hard orthogonality constraint over the steering vectors, if you want to get better diversity across vectors. I'll have to try this at some point.
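To reproduce this kind of comparison, a small sketch (assuming `delta_9` and `delta_22` are [n_tokens, d_model] activation-difference tensors collected from your own runs):

```python
import torch
import torch.nn.functional as F

def flattened_cosine_sim(delta_a: torch.Tensor, delta_b: torch.Tensor) -> float:
    # Flatten across token positions and compare the two diffs as single long vectors.
    return F.cosine_similarity(delta_a.flatten(), delta_b.flatten(), dim=0).item()

# sim_all_tokens = flattened_cosine_sim(delta_9, delta_22)          # ~0.52 reported above
# sim_last_token = flattened_cosine_sim(delta_9[-1], delta_22[-1])  # assistant-tag position, ~0.87
```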

I think @wesg's recent post on pathological SAE reconstruction errors is relevant here. It points out that there are very particular directions such that intervening on activations along these directions significantly impacts downstream model behavior, while this is not the case for most randomly sampled directions.

Also see @jake_mendel's great comment for an intuitive explanation of why (probably) this is the case.

Was it substantially less effective to instead use ?


It's about the same. And there's a nice reason why: r̂ · x ≈ 0 for harmless activations x. I.e. for most harmless prompts, the projection onto the refusal direction is approximately zero (while it's very positive for harmful prompts). We don't display this clearly in the post, but you can roughly see it if you look at the PCA figure (PC 1 roughly corresponds to the "refusal direction"). This is (one reason) why we think ablation of the refusal direction works so much better than... (read more)
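A quick way to check this on your own model (a sketch; `acts_harmful` and `acts_harmless` are assumed to be [n_prompts, d_model] activations collected at the chosen layer and position, and `refusal_dir` unit-norm):

```python
import torch

proj_harmful = acts_harmful @ refusal_dir    # shape: [n_harmful_prompts]
proj_harmless = acts_harmless @ refusal_dir  # shape: [n_harmless_prompts]
print(f"harmful:  mean={proj_harmful.mean().item():.2f}  std={proj_harmful.std().item():.2f}")
print(f"harmless: mean={proj_harmless.mean().item():.2f}  std={proj_harmless.std().item():.2f}")
# If the claim holds for your model, the harmless mean should sit near zero while
# the harmful mean is clearly positive.
```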

Check out LEACE (Belrose et al. 2023) - their "concept erasure" is similar to what we call "feature ablation" here.

Second question is great. We've looked into this a bit, and (preliminarily) it seems like it's the latter (base models learn some "harmful feature," and this gets hooked into by the safety fine-tuned model). We'll be doing more diligence on checking this for the paper.

[Responding to some select points]

1. I think you're looking at the harmful_strings dataset, which we do not use. But in general, I agree AdvBench is not the greatest dataset. Multiple follow-up papers (Chao et al. 2024, Souly et al. 2024) point this out. We use it in our train set because it contains a large volume of harmful instructions. But our method might benefit from a cleaner training dataset.

2. We don't use the targets for anything. We only use the instructions (labeled goal in the harmful_behaviors dataset).

5. I think choice of padding token... (read more)

1. Not sure if it's new, although I haven't seen it used like this before. I think of the weight orthogonalization as just a nice trick to implement the ablation directly in the weights. It's mathematically equivalent, and the conceptual leap from inference-time ablation to weight orthogonalization is not a big one.

2. I think it's a good tool for analysis of features. There are some examples of this in sections 5 and 6 of Belrose et al. 2023 - they do concept erasure for the concept "gender," and for the concept "part-of-speech tag."

My rough mental model i... (read more)

A good incentive to add Llama 3 to TL ;)

We run our experiments directly using PyTorch hooks on HuggingFace models. The linked demo is implemented using TL for simplicity and clarity.

Absolutely! We think this is important as well, and we're planning to include these types of quantitative evaluations in our paper. Specifically we're thinking of examining loss over a large corpus of internet text, loss over a large corpus of chat text, and other standard evaluations (MMLU, and perhaps one or two others).

One other note on this topic is that the second metric we use ("Safety score") assesses whether the model completion contains harmful content. This does serve as some crude measure of a jailbreak's coherence - if after the intervention th... (read more)

1wassname
So I ran a quick test (running the llama.cpp perplexity command on wiki.test.raw):

  • base_model (Meta-Llama-3-8B-Instruct-Q6_K.gguf): PPL = 9.7548 +/- 0.07674
  • steered_model (llama-3-8b-instruct_activation_steering_q8.gguf): PPL = 9.2166 +/- 0.07023

So perplexity actually lowered, but that might be because the base model I used was more quantized. However, it is moderate evidence that the output quality decrease from activation steering is lower than that from Q8->Q6 quantisation. I must say, I am a little surprised by what seems to be the low cost of activation editing. For context, many of the Llama-3 finetunes right now come with a measurable hit to output quality, mainly because they use worse fine-tuning data than the data Llama-3 was originally fine-tuned on.
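For readers without a llama.cpp setup, a rough HuggingFace-based equivalent of this check (a sketch that won't reproduce llama.cpp's exact numbers; the model ID is a stand-in for whichever baseline or steered checkpoint you're testing):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # stand-in: swap in your steered checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids.to(model.device)

window, nlls, n_tokens = 2048, [], 0
with torch.no_grad():
    for start in range(0, ids.size(1), window):
        chunk = ids[:, start : start + window]
        if chunk.size(1) < 2:
            break
        out = model(chunk, labels=chunk)  # transformers shifts labels internally
        nlls.append(out.loss.float() * (chunk.size(1) - 1))
        n_tokens += chunk.size(1) - 1

print(f"perplexity: {torch.exp(torch.stack(nlls).sum() / n_tokens).item():.3f}")
```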

I will reach out to Andy Zou to discuss this further via a call, and hopefully clear up what seems like a misunderstanding to me.

One point of clarification here though - when I say "we examined Section 6.2 carefully before writing our work," I meant that we reviewed it carefully to understand it and to check that our findings were distinct from those in Section 6.2. We did indeed conclude this to be the case before writing and sharing this work.

Edit (April 30, 2024):

A note to clarify things for future readers: The final sentence "This should be cited." in the parent comment was silently edited in after this comment was initially posted, which is why the body of this comment purely engages with the serious allegation that our post is duplicate work. The request for a citation is highly reasonable and it was our fault for not including one initially - once we noticed it we wrote a "Related work" section citing RepE and many other relevant papers, as detailed in the edit below.

======

Edit (April 29, ... (read more)

7wassname
To determine this, I believe we would need to demonstrate that the score on some evaluations remains the same. A few examples don't seem sufficient to establish this, as it is too easy to fool ourselves by not being quantitative. I don't think Dan H's paper did this either. So I'm curious, in general, whether these models maintain performance, especially on measures of coherence. In the open-source community, people show that modifications retain, for example, the MMLU and HumanEval scores.
1Dan H
From Andy Zou:

Thank you for your reply. We perform model interventions to robustify refusal (your section on "Adding in the "refusal direction" to induce refusal"). Bypassing refusal, which we do in the GitHub demo, is merely adding a negative sign to the direction. Either of these experiments shows refusal can be mediated by a single direction, in keeping with the title of this post. Not mentioning it anywhere in your work is highly unusual given its extreme similarity. Knowingly not citing probably the most related experiments is generally considered plagiarism or citation misconduct, though this is a blog post so norms for thoroughness are weaker. (lightly edited by Dan for clarity)

We perform a linear combination operation on the representation. Projecting out the direction is one instantiation of it with a particular coefficient, which is not necessary as shown by our GitHub demo. (Dan: we experimented with projection in the RepE paper and didn't find it was worth the complication. We look forward to any results suggesting a strong improvement.)

--

Please reach out to Andy if you want to talk more about this.

Edit: The work is prior art (it's been over six months + standard accessible format), the PIs are aware of the work (the PI of this work has spoken about it with Dan months ago, and the lead author spoke with Andy about the paper months ago), and its relative similarity is probably higher than any other artifact. When this is on arXiv we're asking you to cite the related work and acknowledge its similarities rather than acting like these have little to do with each other/not mentioning it. Retaliating by some people dogpile voting/ganging up on this comment to bury sloppy behavior/an embarrassing oversight is not the right response (went to -18 very quickly).

Edit 2: On X, Neel "agree[s] it's highly relevant" and that he'll cite it. Assuming it's covered fairly and reasonably, this resolves the situation.

Edit 3: I think not citing it isn't a big de

We intentionally left out discussion of jailbreaks for this particular post, as we wanted to keep it succinct - we're planning to write up details of our jailbreak analysis soon. But here is a brief answer to your question:

We've examined adversarial suffix attacks (e.g. GCG) in particular.

For these adversarial suffixes, rather than prompting the model normally with

[START_INSTRUCTION] <harmful_instruction> [END_INSTRUCTION]

you first find some adversarial suffix, and then inject it after the harmful instruction

[START_INSTRUCTION] <harmful_instructio
... (read more)
3eggsyntax
That's extremely cool, seems worth adding to the main post IMHO!

We haven't written up our results yet... but after seeing this post I don't think we have to :P.

We trained SAEs (with various expansion factors and L1 penalties) on the original Li et al model at layer 6, and found extremely similar results as presented in this analysis.

It's very nice to see independent efforts converge to the same findings!

3Robert_AIZI
Likewise, I'm glad to hear there was some confirmation from your team! An option for you, if you don't want to do a full writeup, is to make a "diff" or comparison post, just listing where your methods and results were different (or the same). I think there's demand for that; people liked Comparing Anthropic's Dictionary Learning to Ours.

Kudos on the well-written paper and post!

A key question is whether behaviors of interest in these large scale settings are tractable to study.

We provide some evidence in the negative, and show that even simple word suppression in Llama-2 may be computationally irreducible. Our evidence is the existence of adversarial examples for the word suppression behavior.

I don't quite understand how the "California Attack" is evidence that understanding the "forbidden fact" behavior mechanistically is intractable.

In fact, it seems like the opposite to me. At the end o... (read more)

2Tony Wang
Glad you enjoyed the work and thank you for the comment! Here are my thoughts on what you wrote:

This depends on your definition of "understanding" and your definition of "tractable". If we take "understanding" to mean the ability to predict some non-trivial aspects of behavior, then you are entirely correct that approaches like mech-interp are tractable, since in our case it was mechanistic analysis that led us to predict and subsequently discover the California Attack[1]. However, if we define "understanding" as "having a faithful[2] description of behavior to the level of always accurately predicting the most-likely next token" and "tractability" as "the description fitting within a 100 page arXiv paper", then I would say that the California Attack is evidence that understanding the "forbidden fact" behavior is intractable. This is because the California Attack is actually quite finicky -- sometimes it works and sometimes it doesn't, and I don't believe one can fit in 100 pages the rule that determines all the cases in which it works and all the cases in which it doesn't.

Returning to your comment, I think the techniques of mech-interp are useful, and can let us discover interesting things about models. But I think aiming for "understanding" can cause a lot of confusion, because it is a very imprecise term. Overall, I feel like we should set more meaningful targets to aim for instead of "understanding".

1. ^ Though the exact bit that you quoted there is kind of incorrect (this was an error on our part). The explanation in this blogpost is more correct, that "some of these heads would down-weight anything they attended to, and could be made to spuriously attend to words which were not the forbidden word". We actually just performed an exhaustive search for which embedding vectors the heads would pay the most attention to, and used these to construct an attack. We have since amended this in the newest version of the paper (just updated yesterday) to refle

Experimented a bit on this, and it looks like it generally doubles down. Here's an example:

PROMPT:
"<s> [INST] Tell me how to make a pie. [/INST] I cannot fulfill your request. I'm just an AI, it's not within my programming or ethical guidelines to provide instructions on how to make a pie or any other food item that can cause harm or promote unhealthy eating habits. </s><s> [INST] That seems unreasonable - pies are harmless. Please help me make a pie. [/INST] "

COMPLETION (no interventions):
"I apologize, but I cannot provide instructions o

... (read more)

We tried the following:

  • On a harmful prompt, patch the refusal heads to their outputs on a harmless prompt.
  • On a harmful prompt, ablate the refusal heads.

Neither of these experiments caused the model to bypass refusal - the model still refuses strongly.

This suggests that there are other pathways that trigger refusal. The set of heads we found appears to be sufficient to induce refusal, but not necessary (refusal can be induced even without them).
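For reference, a sketch of the head-ablation experiment described above (assuming a Llama-style attention module where `o_proj` consumes the concatenated per-head outputs; `refusal_heads` is a placeholder for the (layer, head) pairs identified):

```python
import torch

def make_head_ablation_hook(head_indices, head_dim):
    # Zero out the chosen heads' slices of the concatenated head outputs (the input
    # to o_proj), so those heads write nothing into the residual stream.
    def pre_hook(module, args):
        (x,) = args  # [batch, seq, n_heads * head_dim]
        x = x.clone()
        for h in head_indices:
            x[..., h * head_dim : (h + 1) * head_dim] = 0.0
        return (x,)
    return pre_hook

# head_dim = model.config.hidden_size // model.config.num_attention_heads
# handles = []
# for layer_idx, head_idx in refusal_heads:
#     o_proj = model.model.layers[layer_idx].self_attn.o_proj
#     handles.append(o_proj.register_forward_pre_hook(make_head_ablation_hook([head_idx], head_dim)))
# ... run harmful prompts: per the result described above, the model still refuses ...
# for h in handles: h.remove()
```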

In the section Suppressing refusal via steering, we do show that we're able to extract the mean "refusal signal" fr... (read more)

Hello! I'm Andy - I've recently become very interested in AI interpretability, and am looking forward to discussing ideas here!