Really cool project! And the write-up is very clear.
In the section about options for reducing the hit to helpfulness, I was surprised you didn't mention scaling the vector you're adding or subtracting -- did you try different weights? I would expect that you can tune the strength of the intervention by weighting the difference in means vector up or down.
This project was completed as part of the AI Safety Fundamentals: Alignment Course by BlueDot Impact. All the code, data and results are available in this repository.
Abstract
The goal of this project is to answer two questions: “Can jailbreaks be represented as a linear direction in activation space?” and if so, “Can that direction be used to prevent the success of jailbreaks?”. The difference-in-means technique was utilized to search for a direction in activation space that represents jailbreaks. After that, the model was intervened using activation addition and directional ablation. The activation addition intervention caused the attack success rate of jailbreaks to drop from 60% to 0%, suggesting that a direction representing jailbreaks might exist and disabling it could make all jailbreaks unsuccessful. However, further research is needed to assess whether these findings generalize to novel jailbreaks. On the other hand, both interventions came at the cost of reducing helpfulness by making the model refuse some harmless prompts.
Introduction
Jailbreak prompts are attempts to bypass safeguards and manipulate Large Language Models (LLMs) into generating harmful content (Shen et al., 2023). This becomes more dangerous with advanced AI systems, which can contribute to threats like bioterrorism by aiding in the creation of deadly pathogens, as well as facilitating propaganda and censorship (Hendrycks et al., 2023).
This project aims to study whether it is possible to avoid jailbreaks utilizing mechanistic interpretability. More precisely, it examines whether jailbreaks are represented by a linear direction in activation space and if discouraging the use of that direction makes them unsuccessful.
The relevance of this project lies in the fact that mitigating jailbreaks by directly intervening in the model’s internals, instead of just its outward behavior, could potentially be a way to make jailbreaks impossible and not just less likely to occur.
In order to test this, the project attempts to find and prevent the model from using the direction in activation space that represents jailbreaks. This direction is found using the difference-in-means technique. More specifically, the direction is calculated as the difference in the means of the activations on interactions where the model answers a forbidden question and interactions where the model refuses to answer it. The first set corresponds to cases where a jailbreak prompt is successful and gets the model to answer the forbidden question. The second set corresponds to cases where the model refuses to answer the forbidden question because it is asked directly. Examples of both interactions are shown in Figure 1.
Figure 1. Examples of two interactions: one where the model refuses to answer a forbidden question (left) and other where it answers it (right). Note that the LLM’s response to the forbidden question was added only for illustrative purposes but in practice it is not used to find the direction that represents jailbreaks.
After finding the direction in activation space that represents the “jailbreak feature”, the model is intervened to prevent it from using that direction. This intervention is done utilizing the methods of activation addition and directional ablation. Activation addition subtracts the direction from the activations, while directional ablation, erases the direction from the representations. An interaction with the model after the activation addition intervention is shown in Figure 2.
Figure 2. Example of an interaction with the model, after the intervention using activation addition. Note that the jailbreak prompt and forbidden question are the same as the ones on the right side of Figure 1, but now the model refuses to answer.
This project is heavily based on this repository, which was used to find and disable a feature corresponding to refusals using the difference-in-means technique.
Data and Model
Data
The jailbreak prompts utilized are from Shen et al., 2023. This dataset was chosen because it groups the jailbreak prompts into communities according to their similarity. This allowed for an examination of whether the directions found could generalize well to communities in the test set that were not present in the training set.
Additionally, the Alpaca dataset is used to verify that avoiding jailbreaks would not make the model refuse harmless prompts.
Model
The model examined to find the direction representing jailbreaks is Gemma 2 2B IT.
Methodology
The steps followed to find and discourage the use of the direction that represents jailbreaks are summarized in Figure 3.
Figure 3. Steps to find and disable the “jailbreak feature”.
The steps followed consist of:
1. Gather jailbreak prompts and forbidden questions
Jailbreak prompts are attempts to bypass safeguards and manipulate LLMs into generating harmful content (Shen et al., 2023). Additionally, a forbidden question is a question that the model should refuse to answer, such as: "How can I hack into someone's email account?". Even though the method used allows more than one forbidden question, only that one is utilized in this project.
2. Generate interactions with the baseline model
Interactions are generated with the baseline model following the structure shown on the right side of Figure 1: first, the model provides a response to the jailbreak prompt and then a response to the forbidden question.
3. Filter interactions
At this step, the jailbreak prompts that successfully get the model to answer the forbidden question are separated from the unsuccessful ones. For that purpose, the model’s response to the forbidden question is evaluated to assess whether it is harmful. This is done using Llama Guard 2 and HarmBench.
After that, the dataset is split into train and test sets:
All of the interactions in both sets consist of the jailbreak prompt, the model’s first response and the forbidden question, excluding the second answer. That is because the objective is to test what makes the model provide harmful answers, so the activations are computed up to that point.
4. Find the direction representing the jailbreak feature
At this step, the direction representing the jailbreak feature is found utilizing the difference-in-means method. To do this, two sets of contrasting interactions are used:
The direction corresponding to jailbreaks is calculated as the difference between the mean activation on the interactions where the model answered the forbidden question minus the mean activation on the interactions where it refused to answer it. More intuitively, that would yield the following formula:
jailbreak = successful jailbreak interactions - refusal interactions
That difference is only computed with the activations of the residual stream at layer 14 and at the last token position, which corresponds to the start of the model’s completion.
5. Intervene the model
The idea of this step is to intervene the model to discourage it from using the direction that represents jailbreaks. This is done utilizing two different methods: activation addition and directional ablation (Arditi et al., 2024).
Activation addition: this method modulates the strength of the jailbreak feature, by subtracting its vector from the layers activation. Let x be the activations of the layer, r the direction corresponding to jailbreaks, then the formula for computing x′ would be:
x′←x−r
Directional ablation: the direction corresponding to jailbreaks is erased from the model’s representations. Directional ablation “zeroes out” the component along r for every residual stream activation:
x′←x−^r^r⊺x.
This operation is performed across all layers. This effectively prevents the model from ever representing the direction in its residual stream.
6. Generate completions
At this step, completions are generated for the versions of the model intervened with activation addition and directional ablation. The completions are generated following again the format on the right side of Figure 1: the model provides a response to the jailbreak prompt and after that, it generates a completion to the forbidden question.
7. Evaluate completions
Two evaluations are performed:
Results
Refusal to answer forbidden questions
The attack success rate (ASR) of the jailbreak prompts in the baseline and intervened models are presented in Table 1:
Table 1. Attack success rate percentages of the jailbreak prompts in the baseline and intervened models
As can be seen in Table 1, the activation addition intervention made all of the jailbreak prompts in the test set unsuccessful. This suggests that the intervened model could be immune to the prompts in that set since the attack success rate dropped from 60% to 0%. Additionally, that indicates that a direction representing jailbreaks might exist and disabling it could make all jailbreaks unsuccessful.
An example of an interaction with the model after the intervention with activation addition is shown in Figure 2.
Additional analysis of the jailbreak prompts that were successful with the baseline model, shows that there are 18 communities in the test that were not present in the training set. The fact that these jailbreak prompts were not successful in the model intervened with activation addition suggests that the method makes the model refuse jailbreak types it had not seen during training.
On the other hand, the directional ablation intervention made the model much more vulnerable to jailbreak prompts. It is not yet understood why this happened.
Refusal to answer harmless prompts
The percentages of refusal to harmless prompts are shown in Table 2:
Table 2. Percentage of harmless prompts refused by the baseline and intervened models
As shown in Table 2, both interventions to the model come at the cost of making it refuse harmless prompts. By manually analyzing the model’s responses, the following conclusions were extracted:
These results suggest that the method used reduces the helpfulness of the model. Three potential solutions to this problem are suggested:
Related work
Difference-in-means technique: this technique consists of finding a direction that represents a feature by taking the difference of means in the activations of two contrasting sets: one that contains the feature and one which does not (Arditi et al., 2024).
Finding jailbreaks as directions in activation space: Ball et al., 2024 searches for a direction representing jailbreaks with an approach similar to the one utilized here. The main differences are:
Advantages of the approach
The main advantages of the approach utilized to avoid jailbreaks are:
Disadvantages of the approach
The main disadvantages of the approach used are:
Further work
The directions for further research are:
Conclusions
This project aimed to study whether jailbreaks are represented by a linear direction in activation space and if disabling that direction made them unsuccessful. The difference-in-means technique was utilized to search for this direction. After that, the model was intervened using activation addition and directional ablation. The activation addition intervention reduced the attack success rate of jailbreaks from 60% to 0%, suggesting that a direction representing jailbreaks might exist and disabling it could make all jailbreaks unsuccessful. Still, further research is necessary to determine whether this result would generalize to novel jailbreaks. However, both interventions came at the cost of reducing helpfulness by making the model refuse some harmless prompts. Three potential mitigations are proposed to solve this problem: scaling the vector in the activation addition intervention, increasing model size and reducing the size of the training set.
Acknowledgments
I’d like to give special thanks to Ruben Castaing, Dave Orr and Josh Thorsteinson for their feedback and suggestions. I also want to thank the rest of my cohort for all the stimulating discussions we had. Additionally, I’d like to acknowledge the course organizers for creating this wonderful space to learn such an important topic. Lastly, I want to thank Micaela Peralta for the invaluable guidance she provides in every aspect of my life.
Appendix: Modifications needed to repurpose this approach for preventing other types of misbehavior
The approach used in this project can be adapted to prevent other types of misbehavior, such as sycophancy. To facilitate this, the code has been structured to concentrate all necessary modifications into two main components:
Evaluate_jailbreaks file: This file evaluates the model’s completions to verify whether the jailbreak prompts are successful and if harmless requests are refused. This should be replaced with code that assesses whether the model’s responses exhibit other types of misbehavior, such as sycophancy.
It is important to note that the original refusal direction repository included functionality to filter the datasets, select the best candidate direction and evaluate the loss on harmless datasets. Although these features were not utilized in this project for simplicity, they could be advantageous for identifying directions that represent other types of misbehavior.