Great post! Would love to see something like this for all the methods in play at the moment.
BTW, I think nnsight is the spiritual successor of baukit, from the same group. I think they are merging them at some point. Here is an implementation with it for reference :).
```python
from nnsight import LanguageModel

# Load the language model
model = LanguageModel("gpt2")

# Define the steering vector
with model.invoke("Love") as _:
    act_love = model.transformer.h[6].output[0][:, :, :].save()

with model.invoke("Hate") as _:
    act_hate = model.transformer.h[6].output[0][:, :, :].save()

steering_vec = act_love - act_hate

# Generate text while steering
test_sentence = "I think dogs are "
with model.generate() as generator:
    with generator.invoke(test_sentence) as _:
        model.transformer.h[6].output[0][:, :2, :] += steering_vec[:, :2, :]

print(model.tokenizer.decode(generator.output[0]))
```
Thanks for writing this up! I think this will be quite helpful to a lot of people. For example, I wasn't aware of baukit.
From the title I thought this post was going to be different techniques for finding steering vectors (eg. mean-centered, crafting prompts, etc.) which I think would also be very useful.
I'd also like to humbly submit the Steering Vectors Python library to the list as well. We built this library on PyTorch hooks, similar to baukit, but with the goal that it should work automatically out-of-the-box on any LLM on Hugging Face. It's different from some of the other libraries in that regard, since it doesn't need a special wrapper class, but works directly with a Hugging Face model/tokenizer. It's also more narrowly focused on steering vectors than some of the other libraries.
Produced as part of the SERI ML Alignment Theory Scholars Program - Autumn 2023 Cohort and while being an affiliate at PIBBSS in 2024. A thank you to @Jayjay and @fela for helpful comments on this draft.
This blog post is an overview of different ways to implement activation steering with some of my takes on their pros and cons. See also this GitHub repository for my minimal implementations of the different approaches.
The blog post is aimed at people who are new to activation/representation steering/engineering/editing.
General approach
The idea is simple: we just add some vector to the internal model activations and thus influence the model output in a similar (but sometimes more effective) way to prompting.
Example[1]: Imagine that some vector in the internal representations in some transformer layer encodes a direction associated with "Love". When you add this vector to the activations of some encoded sentence "I hate the world", you change the internal representation (and thus the meaning) to something more like "I love the world". This graphic might help with an intuition:
In general there are a few steps involved, which I simplify in the following:

1. Record the internal activations of the model for one or more chosen prompts (e.g. "Love" and "Hate").
2. Compute a steering vector from these activations (e.g. by taking their difference).
3. Add the steering vector to the activations at a chosen layer during the forward pass while generating text.
Each of the three points mentioned above includes complexities you might encounter as a beginner. Feel free to move on to the next section if you prefer.
Different approaches to implementing activation steering
While I was getting into activation steering I encountered a few approaches:
The code presented here aims to illustrate the concepts behind individual approaches rather than being ready to run directly (for instance, it omits the use of the tokenizer). For executable code, please refer to the GitHub repository.
Writing your own wrapper functions
I first saw this implemented by Nina, and this is how I myself started doing activation steering. The idea is that you wrap one of the model's layers/modules to give it additional functionality, similar to a decorator.
Implementation
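A minimal sketch of the idea (the class and attribute names here are my own, and the original implementation differs in its details): we replace a block of the model with a wrapper module that adds a steering vector to the block's output.

```python
import torch

class WrappedBlock(torch.nn.Module):
    """Wraps a transformer block and adds a steering vector to its output."""

    def __init__(self, block):
        super().__init__()
        self.block = block
        self.steering_vec = None  # set this to activate steering

    def forward(self, *args, **kwargs):
        output = self.block(*args, **kwargs)
        if self.steering_vec is not None:
            # GPT-2-style blocks return a tuple whose first element
            # is the hidden-state tensor.
            output = (output[0] + self.steering_vec,) + output[1:]
        return output

# Hypothetical usage on a GPT-2-style model:
#   model.transformer.h[6] = WrappedBlock(model.transformer.h[6])
#   model.transformer.h[6].steering_vec = steering_vec
```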
And then the output could look like this:
Pros
Cons
Using TransformerLens
TransformerLens uses PyTorch hooks internally. When creating a `HookedTransformer`, hook points are added to all the different layers and modules. When running the model with cache, intermediate states are recorded and returned.

Implementation
And then the output could look like this:
Pros
Cons
Using baukit
This library also uses PyTorch hooks internally. The baukit class `Trace` is a context manager that takes care of the correct removal of the hooks when you leave the context. For more implementation details see here.
Implementation
And then the output could look like this:
Pros
Cons
Using PyTorch hooks directly
As we want to modify the forward pass in the transformer model when we do activation steering, we will be using forward hooks.
A hook has the following signature:
`hook(module, input, output) -> None or modified output`

We can attach the hook by calling `register_forward_hook` on a `torch.nn.Module`. The `register_forward_hook` function returns a handle that can be used to remove the added hook by calling `handle.remove()`.

Implementation
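As a minimal, self-contained sketch, here the hook is registered on a plain `Linear` layer standing in for a transformer block (in practice you would register it on e.g. `model.transformer.h[6]` and add the vector to `output[0]`):

```python
import torch

def make_steering_hook(steering_vec):
    """Returns a forward hook that adds `steering_vec` to the module output."""
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the module's output
        return output + steering_vec
    return hook

layer = torch.nn.Linear(4, 4)   # stand-in for a transformer block
steering_vec = torch.ones(4)
handle = layer.register_forward_hook(make_steering_hook(steering_vec))

x = torch.zeros(1, 4)
steered = layer(x)
handle.remove()                  # don't forget to remove the hook!
unsteered = layer(x)
assert torch.allclose(steered, unsteered + steering_vec)
```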
And then the output could look like this:
Pros
Cons
Editing model biases
Instead of adding the steering vector to the activation vector we can add the product of the next-layer-weights and the steering vector to the next-layer-bias.
The activation in a normal feed-forward network is calculated as $\phi_{l+1} = \sigma(W_l \phi_l + b_l)$. If we do activation steering with a fixed steering vector $v$ in layer $l$, we can write the modified activation of layer $l+1$ as $\sigma(W_l(\phi_l + v) + b_l) = \sigma(W_l \phi_l + W_l v + b_l) = \sigma(W_l \phi_l + \tilde{b}_l)$, where $\tilde{b}_l = W_l v + b_l$.
Note that we can only do this if our layer has the structure $\phi_{l+1} = \sigma(W_l \phi_l + b_l)$. This is generally not the case for the residual stream. However, we do find this structure in the attention and MLP layers.
With this method we can only implement the steering part. In order to find a steering vector we still need a way to access internal model activations.
Implementation
This code shows how to implement activation steering by modifying the bias in the attention layer. This is equivalent to doing activation steering on the output of the previous module (here the layernorm module), which can be implemented with any of the other approaches discussed in this blog post. We consequently need to extract the activations for the steering vector from the output of layernorm.
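The equivalence can be checked numerically with a toy layer (ReLU stands in for σ; all tensors are random stand-ins, not taken from a real model):

```python
import torch

torch.manual_seed(0)
W = torch.randn(8, 8)   # next-layer weights W_l
b = torch.randn(8)      # next-layer bias b_l
phi = torch.randn(8)    # activation phi_l
v = torch.randn(8)      # steering vector v

# Steering on the activation: sigma(W_l (phi_l + v) + b_l)
steered = torch.relu(W @ (phi + v) + b)

# Folding the steering vector into the bias: b_tilde = W_l v + b_l
b_tilde = W @ v + b
edited = torch.relu(W @ phi + b_tilde)

assert torch.allclose(steered, edited, atol=1e-5)
```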
The output would look a bit different as we are effectively applying activation steering in the layernorm module now:
Pros
Cons
Conclusion
I personally like the baukit approach the most: it eliminates the need to manually remove hooks, allows for easy integration with any model, offers flexibility in defining the hook function, and can be applied to any layer or module.
I keep using the "Love"-"Hate" example introduced by @TurnTrout throughout this blog post and also in my GitHub code.