Cool stuff! I remember way back when people first started interpreting neurons, and we started daydreaming about one day being able to zoom out and interpret the bigger picture, i.e. what thoughts occurred when and how they caused other thoughts which caused the final output. This feels like, idk, we are halfway to that day already?
Darn, exactly the project I was hoping to do at MATS! :-) Nice work!
There's pretty suggestive evidence that the LLM first decides to refuse (and emits tokens like "I'm sorry"), then later writes a justification for refusing (see some of the hilarious reasons generated for not telling you how to make a teddy bear, after being activation-engineered into refusing this). So I would treat any argument about the nature of the refusal process that is based on the text of the refusal justification given afterwards as circumstantial evidence at best. That said, you have direct gradient evidence that these directions matter, so the refusal texts you quote are helpful if read as an argument for why it would be sensible model behavior for that direction to matter (as opposed to evidence that it does). I think you might want to make this distinction clearer in your write-up.
Looking through Latent 2213, my impression is that a) it mostly triggers on a wide variety of innocuous-looking tokens indicating the ends of phrases (so likely it's summarizing those phrases), and b) those phrases tend to be about a legal, medical, or social process or chain of consequences causing something really bad to happen (e.g. cancer, sexual abuse, poisoning). This also rather fits with the set of latents that it has significant cosine similarity to. So I'd summarize it as "a complex or technically-involved process leading to a dramatically bad outcome".
If that's accurate, then it tending to trigger the refusal direction makes a lot of sense.
Darn, exactly the project I was hoping to do at MATS! :-)
I'd encourage you to keep pursuing this direction (no pun intended) if you're interested in it! The work covered in this post is very preliminary, and I think there's a lot more to be explored. Feel free to reach out, would be happy to coordinate!
There's pretty suggestive evidence that the LLM first decides to refuse...
I agree that models tend to give coherent post-hoc rationalizations for refusal, and that these are often divorced from the "real" underlying cause of refusal. In this case, though, it does seem like the refusal reasons do correspond to the specific features being steered along, which seems interesting.
Looking through Latent 2213,...
Seems right, nice!
This is cool! How cherry-picked are your three prompts? I'm curious whether it's usually the case that the top refusal-gradient-aligned SAE features are so interpretable.
These three prompts are very cherry-picked. I think this method works for prompts that are close to the refusal border - prompts that can be nudged a bit in one conceptual direction in order to flip refusal. (And even then, I think it is pretty sensitive to phrasing.) For prompts that are not close to the border, I don't think this methodology yields very interpretable features.
We didn't do due diligence for this post on characterizing the methodology across a wide range of prompts. This seems like a good thing to investigate properly. I expect there to be a nice way of characterizing a "borderline" prompt (e.g. large-magnitude refusal gradient, perhaps).
I've updated the text in a couple places to emphasize that these prompts are hand-crafted - thanks!
This work is the result of Daniel and Eric's 2-week research sprint as part of Neel Nanda and Arthur Conmy's MATS 7.0 training phase. Andy was the TA during the research sprint. After the sprint, Daniel and Andy extended the experiments and wrote up the results. A notebook that contains all the analyses is available here.
Summary
Prior work shows that chat models implement refusal by computing a specific direction in the residual stream - a "refusal direction". In this work, we investigate how this refusal direction is computed by analyzing its gradient with respect to early-layer activations. This simple approach discovers interpretable features that are both causally upstream of refusal and contextually relevant to the input prompt. For instance, when analyzing a prompt about hugging, the method discovers a "sexual content" feature that is most influential in determining the model's refusal behavior.
Introduction
Arditi et al. 2024 found that, across a wide range of open-source language models, refusal is mediated by a single direction in the residual stream. That is, for each model, there exists a single direction such that erasing this direction from the model's residual stream activations disables refusal, and adding it into the residual stream induces refusal.
This roughly suggests a 3-stage mechanism for refusal:
1. The refusal signal is computed from the input prompt.
2. The refusal signal is mediated along a single direction in the residual stream.
3. The refusal signal is translated into refusal text.
Arditi et al. 2024 paints this picture and zooms in specifically on step 2 - the mediation of refusal along some direction in the residual stream. However, step 1 (how the refusal signal is computed from the input prompt) and step 3 (how the refusal signal is translated into refusal text) remain poorly understood.
Our aim in this preliminary study is to investigate step 1: for a given input, how does the model decide whether or not to refuse?
Our approach is to leverage the observation that refusal is mediated by a single direction. The presence or absence of the direction corresponds to the model refusing or not refusing, respectively. Therefore, in order to study the question "how does the model decide to refuse?", we can investigate a more tangible question: "how does the model generate the refusal direction?"
Methodology
This report focuses on results from the Gemma-2-2b-it model, where we used the Sparse Autoencoders (SAEs) trained in the Gemma Scope paper. We use the methodology specified by Arditi et al. 2024 to obtain the refusal direction, selecting the difference-in-means direction from the layer-15 resid_post activations at the final token position of the prompt template.
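As a concrete illustration, here is a minimal sketch of the difference-in-means computation in TransformerLens. The variable names (harmful_prompts, harmless_prompts) are ours, and details like chat-template formatting are glossed over; the actual notebook may differ.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("google/gemma-2-2b-it")
HOOK_15 = utils.get_act_name("resid_post", 15)  # "blocks.15.hook_resid_post"

def mean_last_token_act(prompts):
    # Average the layer-15 resid_post activation at the final token position.
    acts = []
    for prompt in prompts:
        tokens = model.to_tokens(prompt)  # glosses over chat-template formatting
        _, cache = model.run_with_cache(tokens, names_filter=HOOK_15)
        acts.append(cache[HOOK_15][0, -1])
    return torch.stack(acts).mean(dim=0)

# Difference-in-means direction between harmful and harmless prompt sets
# (harmful_prompts / harmless_prompts are assumed to be defined elsewhere).
refusal_dir = mean_last_token_act(harmful_prompts) - mean_last_token_act(harmless_prompts)
refusal_dir = refusal_dir / refusal_dir.norm()
```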
Refusal Gradient
We want to find early concepts that modulate the model's refusal. We can mathematically operationalize this intuition by (1) defining a refusal metric R as the projection of the last-token activation at layer 15 onto the refusal direction, and then (2) computing the gradient of this refusal metric with respect to earlier activations. This gradient, referred to as the refusal gradient, or ∇_x R, can be computed for each activation (at any layer l < 15, and at any prompt token position). For this post, we only analyze gradients computed at layer 5.
Intuitively, the refusal gradient gives us a direction such that modulating upstream activations along this direction maximally alters the downstream refusal direction.
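Concretely, this can be implemented with a single backward pass. Below is a minimal sketch continuing the setup above; the bookkeeping in the actual notebook may differ.

```python
def refusal_gradient(prompt, layer=5):
    # Gradient of the refusal metric R (projection of the layer-15,
    # last-token residual onto refusal_dir) with respect to the
    # layer-`layer` residual stream, at every prompt token position.
    tokens = model.to_tokens(prompt)
    saved = {}

    def save_early(act, hook):
        act.retain_grad()   # keep .grad on this non-leaf tensor after backward
        saved["early"] = act
        return act

    def save_late(act, hook):
        saved["late"] = act
        return act

    model.run_with_hooks(
        tokens,
        fwd_hooks=[
            (utils.get_act_name("resid_post", layer), save_early),
            (utils.get_act_name("resid_post", 15), save_late),
        ],
    )
    R = saved["late"][0, -1] @ refusal_dir  # refusal metric
    R.backward()
    return saved["early"].grad[0]           # shape: [seq_len, d_model]
```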
We can then analyze the refusal gradient in the SAE decoder basis by taking the dot product of the refusal gradient with a given decoder vector d_i, producing what we refer to as the relative gradient:

RG_i := d_i · ∇_x R

To identify SAE latents potentially involved in the refusal circuitry at a given layer l, we filter for latents with high relative gradient values. For this analysis, we use the sum of the relative gradient across all prompt tokens as a proxy for identifying promising latents. While this proxy proved effective in practice, there may well be better proxies. This method was inspired by Batson et al. in Anthropic's March 2024 Update.
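In code, the relative gradients for all latents at once reduce to a matrix product with the SAE decoder. Here W_dec is assumed to be the [n_latents, d_model] decoder matrix of a Gemma Scope layer-5 residual SAE (e.g. the W_dec tensor of an SAELens SAE; loading code omitted):

```python
grad = refusal_gradient("Develop a strategy for hugging a person", layer=5)

# Relative gradient RG_i = d_i · ∇_x R, for every latent i and token position.
rel_grad = grad @ W_dec.T          # [seq_len, n_latents]

# Proxy used in this post: sum across all prompt token positions,
# then inspect the latents with the largest (or most negative) values.
scores = rel_grad.sum(dim=0)       # [n_latents]
top_scores, top_latents = scores.topk(5)
```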
Steering
To determine whether an SAE latent is causally linked to refusal, we examine the model's output after applying steering. Beginning with a prompt that does not trigger refusal, we test whether adding the SAE latent's decoder direction into the residual stream at the upstream layer l causes the model to exhibit refusal behavior. We steer at an intensity scaled relative to the maximum activation of the SAE latent; the scaling factor is 1 (i.e. the maximum activation itself) unless otherwise indicated. For the refusal direction, we steer with a coefficient of 30, which is around the maximum activation of other latents in the same layer.
Importantly, steering is restricted to the prompt token positions and does not extend to the generation phase. Intuitively, this modulates the model's perception of the original prompt, without directly steering its output generations.
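A sketch of this steering setup, under the same assumptions as above. The guard on the sequence length is one way to restrict steering to the prompt forward pass when KV caching is used during generation:

```python
def steer_with_latent(prompt, latent_idx, coeff, layer=5, max_new_tokens=64):
    # Generate while adding coeff * (decoder vector of latent_idx) to the
    # residual stream at `layer`, at prompt token positions only.
    tokens = model.to_tokens(prompt)
    n_prompt = tokens.shape[1]
    direction = W_dec[latent_idx]

    def steer(act, hook):
        # With KV caching, generation steps pass single-token activations
        # (sequence length 1), which this condition leaves untouched.
        if act.shape[1] >= n_prompt:
            act[:, :n_prompt] += coeff * direction
        return act

    with model.hooks(fwd_hooks=[(utils.get_act_name("resid_post", layer), steer)]):
        return model.generate(tokens, max_new_tokens=max_new_tokens)
```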
Case Studies
We use the following three hand-crafted prompts for our analyses:
1. "Develop a strategy for hugging a person"
2. "How do I run a wet lab experiment to test the effect of a new bacteria?"
3. "How do I get my client to transfer me money as a payment for a service?"
These prompts were crafted to lie close to the border between refusal and compliance, in hopes of highlighting salient upstream features that could push them over the border.
In all three cases, Gemma-2-2b-it complies with the requests without refusal.
Across all three prompts, we successfully identified upstream SAE latents that are causally linked to refusal. These latents represent concepts relevant to the given prompt and often involve interpreting the prompt as harmful. For instance, in the case of the wet lab experiment prompt, we identified an "infectious disease" latent that triggers refusal. Steering the model along these latents not only induces refusal, but yields a refusal that is semantically aligned with the specific SAE latent. For example, adding the "infectious disease" latent leads the model to refuse by citing concerns about creating a "potentially dangerous pathogen."
Prompt 1: "Develop a strategy for hugging a person"
To evaluate whether our method effectively identified clear outlier latents, we plotted the cosine similarities between SAE latents and the refusal gradient (Figure 2). The plot reveals that the top latents stand out distinctly as outliers. Additionally, the latents with the lowest cosine similarities show significantly negative values. Our analysis indicates that SAE latents with significantly positive cosine similarities are associated with concepts that interpret the prompt as potentially harmful. In contrast, SAE latents with significantly negative cosine similarities are linked to concepts interpreting the prompt as harmless.
The following are the top 5 SAE latents and their corresponding auto-interp explanations from Neuronpedia. We find that the latents involve interpreting "hugging" as a harmful behavior. For instance, Latent 12072 is a sexual content latent, and hugging can be interpreted as a sexually motivated behavior.
concepts related to leadership and control
(top positive logit is the word "domination")
Table 1: Top SAE latents for the hugging prompt.
When steering by an SAE latent, we steer by the latent's maximum activation as reported on Neuronpedia. Model output that is relevant to the SAE latent is highlighted. Steering by a causally upstream SAE latent produces a semantically different refusal from steering directly along the refusal direction.
Latent 12072
(Sexual Content)
Latent 2363
(Sexual Violence)
Latent 12224
(Domination)
Table 2: Model generations for the hugging prompt, without and with steering.
Prompt 2: "How do I run a wet lab experiment to test the effect of a new bacteria?"
references to incidents involving protests and law enforcement
(examination of the top activations indicates that this latent is most activated after "death"-related tokens)
Table 3: Top SAE latents for the wet lab prompt.
Latent 11415
(Death)
Latent 3877
(Infectious Disease)
Latent 2397
(Apocalypse)
(steered at 1.5× the max activation instead)
Table 4: Model generations for the wet lab prompt, without and with steering.
Prompt 3: "How do I get my client to transfer me money as a payment for a service?"
negative descriptors related to failures or issues
(top positive logit is the word "dangerous")
instances of fraud or deceitful activities
(top positive logit is the word "scam")
mentions of deceitful or fraudulent activities and concepts
(top positive logit is the word "fake")
Table 5: Top SAE latents for the bank prompt.
Latent 16365
(Scam)
Latent 7368
(Hiding)
Latent 10464
(Fake)
Table 6: Model generations for the banking prompt, without and with steering.
Other Analyses
Sensitivity Analysis
Rather than inspecting the model output, we can also plot the refusal metric R to test whether steering by the SAE latents induces refusal. On the left plot, we steer either by the refusal gradient or by one of the top 3 SAE latents for the hugging prompt. The top SAE latents have a large effect on the refusal projection, which peaks at a steering coefficient of roughly 10-30, around the max activations of these latents. On the right plot, we steer by randomly chosen SAE latents, which have a much smaller effect on the refusal metric.
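A sketch of this sweep, reusing the pieces defined above. Since this is a single forward pass over the prompt (no generation), every position is a prompt position:

```python
def refusal_metric(prompt, fwd_hooks=()):
    # Projection of the last-token layer-15 residual onto refusal_dir.
    tokens = model.to_tokens(prompt)
    with torch.no_grad(), model.hooks(fwd_hooks=list(fwd_hooks)):
        _, cache = model.run_with_cache(tokens, names_filter=HOOK_15)
    return (cache[HOOK_15][0, -1] @ refusal_dir).item()

prompt = "Develop a strategy for hugging a person"
curve = []
for coeff in range(0, 51, 5):
    def steer(act, hook, coeff=coeff):
        act += coeff * W_dec[latent_idx]   # steer every (prompt) position
        return act
    hook = (utils.get_act_name("resid_post", 5), steer)
    curve.append(refusal_metric(prompt, [hook]))
# `curve` can then be plotted against the steering coefficient.
```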
What about the negatively-aligned latents?
The following are the bottom 5 SAE latents (the 5 with the most negative relative gradients) for the hugging prompt. These latents are associated with concepts that involve positive connotations of hugging. For example, Latent 12942 is a "social gesture" latent and Latent 11624 is a "farewell" latent.
greetings and salutations in various languages
(top positive logits are words related to "hello")
Table 7: Top 5 most negatively-aligned SAE latents for the hugging prompt.
Discussion
We apply a simple method that uses gradients to identify upstream variables influencing the refusal direction. This method effectively reveals local features that alter the perceived harmfulness or harmlessness of a prompt. Notably, manipulating these upstream latents not only induces refusal, but yields refusal responses that reflect the specific latents manipulated.
Related Work
The methodology of using gradients is not new. In order to identify latents involved in associating famous athletes' names with the correct sports, Batson et al. used the attribution score attr_i := a_i (d_i · ∇_x L), where a_i is the activation of latent i, d_i is the decoder vector of latent i, and ∇_x L is the gradient of the logit difference between correct and incorrect tokens. Marks et al. 2024 also compute the gradient of some metric with respect to upstream SAE features in order to identify important feature nodes in a "sparse feature circuit".
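For comparison, the attribution score weights the relative gradient by the latent's actual activation. A sketch in our notation, with our refusal metric R standing in for the logit difference L, and assuming an SAE object with an encode method (as in SAELens):

```python
# resid_acts: layer-5 residual activations for the prompt, [seq_len, d_model]
latent_acts = sae.encode(resid_acts)   # a_i per latent, [seq_len, n_latents]

# attr_i = a_i * (d_i · ∇_x R): zero wherever a latent is inactive,
# unlike the plain relative gradient RG_i.
attr = latent_acts * (grad @ W_dec.T)  # [seq_len, n_latents]
```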
Limitations
Despite promising results, this study has several limitations:
Future Directions
This is a relatively short and simple work to find causally upstream features of refusal. There are numerous potential future research directions that expand on our work: