Hi David, co-author of the 'Sparse Autoencoders Find Highly Interpretable Directions in Language Models' paper here,
I think this might be of interest to you:
We are currently in the process of re-framing section 4 of the paper to focus more on model steering & activation editing; in line with what you hypothesise, we find that editing a small number of relevant features on e.g. the IOI task can steer the model from its predictions on one token to its predictions on a counterfactual token.
I'm not very enlightened by what tokens most excite the component directions in a vacuum. Interpreting text models is hard.
Maybe something like network dissection could work? What I'd want is a dataset of text samples labeled by properties that you want to find features to track.
E.g. suppose you want features that track "calm text" vs. "upset text." Then you want each snippet labeled as either calm or upset - or even better, you could collect a squiggly curve for how "calm" vs. "upset" labelers think the text is around any given token (maybe by showing them shorter snippets and then combining them into longer ones, or maybe by giving them a UI that lets them change levels of different features as changes happen in the text). And then you look for features that track that coarse-grained property of the text - that vary on a long timescale, in ways correlated with the variation of how calm/upset the text seems to humans.
And then you do that for a dozen or a gross of long-term properties of text that you think you might find features for.
I agree that stronger, more nuanced interpretability techniques should tell you more. But, when you see something like, e.g.,
25132 ▁vs, ▁differently, ▁compared, ▁greater, all, ▁per
25134 ▁I, ▁My, I, ▁personally
isn't it pretty obvious what those two autoencoder neurons were each doing?
It does seem obvious[1], but I think this can easily be misleading. Are these activation directions always looking for these tokens regardless of context, or are they detecting the human-obvious theme they seem to be gesturing towards, or are they playing a more complicated functional role that merely happens to be activated by those tokens in the first position?
E.g. Is the "▁vs, ▁differently, ▁compared" direction just a brute detector for those tokens? Or is it a more general detector for comparison and counting that would have rich but still human-obvious behavior on longer snippets? Or is it part of a circuit that needs to detect comparison words but is actually doing something totally different like completing discussions about shopping lists?
certainly more so than
31892 ▁she, bian, ▁recently, ▁means, ▁Because, ▁experienced
Especial thanks to Logan Riggs and Monte MacDiarmid, for pointing me towards this whole research direction and for code discussion, respectively. Thanks to Alex Turner for project feedback and for orienting me towards scaling activation engineering up to larger models. Thanks to Adrià Garriga-Alonso, Daniel Kokotajlo, Hoagy Cunningham, Nina Rimsky, and Garrett Baker for discussion and/or draft comments. And thanks to anyone I discussed this with!
TL;DR: To separate out superimposed features represented by model neurons, train a sparse autoencoder on a layer's activations. Once you've learned a sparse autoencoding of those activations, this autoencoder's neurons can now be readily interpreted.
Introduction
All code hosted at this repository:
activation_additions/sparse_coder
A bit ago, I became interested in scaling activation engineering to the largest language models I could. I was initially surprised at how effective the technique was for being such a naive approach, which made me much more enthusiastic about simple manipulations of model activation spaces.
Yudkowsky says that we cannot expect to survive without a mathematical understanding, a guiding mathematical framework, of the AI. One hunch you might have is that a linear feature combination theorem could be the root of such a guiding theory. If so, we might learn a lot about the internal learned mechanisms of models by playing with their activation spaces. I feel like tuned lens and activation additions are some evidence for this hypothesis.
One major problem I experienced as I scaled up activation engineering to the largest models I could get my hands on (the new open-source Llama-2 models) was that it's hard to guess ahead of time which additions will work and which won't. You generate a new addition and stick it into a forward pass. Then, you get a few bits back observing how well the addition worked. "It would have been great," I thought, "to get a window into which concepts the model represents internally, and at which layer it does so."[1]

Sparse coding excited me at this point, because it suggested a way to learn a function from uninterpretable activations to represented, interpretable concepts! Paired with activation engineering's function from interpretable concepts to model internal activations, it sounded like a promising alignment scheme. Now, many things sound promising ahead of time. But seeing the MATS 4 Lee Sharkey team get extremely clean, concrete results on Pythia drove my confidence in this path way up.

This is the writeup of that research path. I still think this is an extremely promising interpretability path, about as important as activation engineering is.
What I do is:
1. Collect a layer's activation vectors from a model during a task,
2. train a sparse autoencoder on those collected activations, and
3. examine the trained autoencoder's learned directions.
The neurons in the autoencoder then appear meaningful to top-token visualizations!
Technical Argument from Sparse Coding Theory
Epistemic status: Theoretical argument.
Say you collect a bunch of activation vectors from a particular layer of a trained model, during some task. These activation vectors are generally not natively interpretable. They're vectors in some space... but we have no real understanding of the meanings of that space's basis dimensions. We only know that all those activation spaces, passed through in sequence, yield coherent English speech. English concepts are being represented in there, internally, somewhere. But we don't really know how.
The problem is that there is no privileged basis in a transformer's activation space. The model was incentivized during training to learn every classifier it needed to mirror its training distribution. But there was no training incentive for each classifier to correspond to a single neuron. The training distribution is sparse: you don't need to be ready to represent each concept independently of every other concept. The training incentive actually weighed against the one-to-one neuron solution, then, as that's wasteful in weights. So there's plenty of mechanistic reason for a model's neuron activations to look like jumbled messes to us. To exploit a sparse world, learn densely compacted features.
And the solution we empirically see learned is indeed superimposed features! Don't dedicate a neuron to each feature. Have each neuron represent a linear combination of features. For this reason, all the directions in an activation space will tend to be polysemantic. If you just run PCA on an activation space, the resulting directions will often be frustratingly polysemantic.[2]
Sparse coding[3] is a solution to this superposition-of-features problem. You train autoencoders with an L1 sparsity penalty on the activations collected from a model layer. The autoencoder can be as simple as a tied matrix, then a ReLU, then the tied matrix transpose. The learned matrix together with the ReLU maps to a larger projection space. An L1 penalty is applied during training to autoencoder activations in this large projection space. The autoencoder is trained to reproduce the input activations while simultaneously respecting the L1 internal representation penalty.
We're interested in particular solutions to this formal problem: learn to give each feature a neuron, i.e., have features fall along the standard basis. This way, the L1 penalty stays small: most of your autoencoder activation values will be precisely zero. (An L1 penalty yields a constant negative gradient to the extent that there are non-zero elements in the autoencoder's activations.) If the activation vectors are just linearly superimposed feature dimensions, then separating them out and squeezing them back together in this way should reproduce the original vectors. That will satisfy the reconstruction loss, too.
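Concretely, here is a minimal PyTorch sketch of that kind of tied-weight autoencoder and its loss; the projection width and L1 coefficient are illustrative assumptions, not the exact hyperparameters behind the results in this post:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TiedSparseAutoencoder(nn.Module):
    """Tied-weight sparse autoencoder: encode with W and a ReLU, decode with W^T."""

    def __init__(self, activation_dim: int, projection_dim: int):
        super().__init__()
        # One shared matrix: its rows are the learned feature directions.
        self.W = nn.Parameter(torch.randn(projection_dim, activation_dim) * 0.01)
        self.encoder_bias = nn.Parameter(torch.zeros(projection_dim))

    def encode(self, activations: torch.Tensor) -> torch.Tensor:
        # Project into the larger space and rectify. These are the
        # autoencoder neurons we later try to interpret.
        return F.relu(activations @ self.W.T + self.encoder_bias)

    def forward(self, activations: torch.Tensor):
        codes = self.encode(activations)
        reconstruction = codes @ self.W  # decode with the tied transpose
        return reconstruction, codes


def sparse_autoencoder_loss(activations, reconstruction, codes, l1_coefficient=1e-3):
    # Reproduce the input activations while penalizing non-zero codes.
    reconstruction_loss = F.mse_loss(reconstruction, activations)
    l1_penalty = codes.abs().sum(dim=-1).mean()
    return reconstruction_loss + l1_coefficient * l1_penalty
```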
We train such an autoencoder to convergence, driving towards an L0 value of between 20 (in smaller models) and 100 (in larger models). We save the trained autoencoder and examine its standard basis. Empirically, these neuronal directions appear quite semantically meaningful!
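One convenient way to check whether training has landed in that sparsity range is to monitor the mean L0 of the autoencoder's codes. A small sketch, reusing the `codes` tensor from the autoencoder sketch above:

```python
import torch


@torch.no_grad()
def mean_l0(codes: torch.Tensor) -> float:
    """Average number of non-zero autoencoder activations per input vector."""
    return (codes != 0).float().sum(dim=-1).mean().item()

# During training: if mean_l0(codes) sits far above the target range (roughly
# 20-100 here), raise the L1 coefficient; if it collapses toward zero, lower it.
```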
Autoencoder Interpretability
Epistemic status: Experimental observations. There's a robust effect here... but my code could absolutely still contain meaningful bugs.
Pythia 70M
Let's examine autoencoders trained at each of Pythia 70M's layers. Our interpretability technique is checking which tokens in the prompt most activate a given autoencoder neuronal direction.

For each Pythia autoencoder, here are ten unsorted non-zero directions and their favorite tokens:[4] Full model results in footnote.[5]
In theory, these are all of the features represented in Pythia 70M's residual streams when these activations were collected. If the technique were extended to a representative dataset and to every Pythia sublayer, you'd in principle enumerate every single concept in Pythia.

Empirically, layers 1 and 2 (the two residual spaces right after the embedding layer) are the most interpretable of the bunch. Later layers are more garbled, though some clearly meaningful dimensions exist there too.[6]
Note that the interpretability method used on the autoencoders—top-k tokens in the prompt—is relatively naive. I have code for activation heatmaps and direction ablations[7], and those interpretability techniques may capture meaning that top-k tokens misses. Any interpretability technique you have for model neurons... can be applied to sparse autoencoder neurons too.
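For concreteness, here is a minimal sketch of that top-k-tokens technique, assuming a trained autoencoder with an `encode` method (like the one sketched earlier), a prompt's per-token layer activations, and the decoded token strings; all names are illustrative:

```python
import torch


def top_tokens_for_direction(autoencoder, layer_activations, token_strings, direction_idx, k=6):
    """Find the k prompt tokens that most excite one autoencoder neuron.

    layer_activations: (num_tokens, activation_dim) activations for one prompt.
    token_strings: decoded tokens, aligned with the rows of layer_activations.
    """
    with torch.no_grad():
        codes = autoencoder.encode(layer_activations)  # (num_tokens, projection_dim)
    direction_activations = codes[:, direction_idx]
    top_values, top_positions = direction_activations.topk(min(k, len(token_strings)))
    return [
        (token_strings[pos], val.item())
        for val, pos in zip(top_values, top_positions.tolist())
    ]
```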
Llama-2 7B
The above results are my independent replication of the MATS 4 Lee Sharkey team's Pythia sparse coding. What if we scale the technique? Targeting a layer similarly early in the model, we train an autoencoder on Llama-2 7B:

Full layer results in footnote.[8]
L0≈20 seems too low for the autoencoders trained on Llama-2 7B. These Llama-2 results are instead at L0≈60.[9] Still better interpretability results could probably be obtained if this range of sparsity values were explored more thoroughly.

Neuron Interpretability Baseline
If you directly interpret model neurons on Llama-2 7B using the top-k technique, your results look like this:

Path to Impact: Learning Windows into Models?
Epistemic status: Wild speculation.
The above suggests that we can train windows into each layer of a model. Each autoencoder window tells you what's going on at that layer, in human-comprehensible terms. The underlying forward pass is unaltered, but we know what concepts each layer contains.
Because you know how those concepts are mapped out of the model into the autoencoder, they are also ready to be added in through activation engineering! So you already have some interpretability and steering control.
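As an illustrative sketch, not the steering code actually used here, of how a learned feature direction could be added back into the residual stream, assuming a HuggingFace-style Llama-2 model and the tied autoencoder sketched above (the layer index, feature index, and coefficient are made up):

```python
import torch


def make_steering_hook(autoencoder, feature_idx: int, coefficient: float):
    """Add a scaled copy of one feature's decoder direction at every position."""
    # For the tied autoencoder above, row `feature_idx` of W is the feature's
    # direction in the residual stream.
    direction = autoencoder.W[feature_idx].detach()
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coefficient * direction.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook


# Hypothetical usage with a HuggingFace Llama-2 model:
# handle = model.model.layers[13].register_forward_hook(
#     make_steering_hook(autoencoder, feature_idx=25134, coefficient=4.0)
# )
# ...generate text...
# handle.remove()
```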
More ambitiously, we can now try to reconstruct comprehensible model circuits. With ablations, see which features at layer N affect which features at layer N+1. Measuring the impact of features on downstream features lets you build up an interpretable "directed semantic graph" of the model's computations.
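A rough sketch of that ablation measurement, assuming autoencoders trained at two adjacent layers and some activation-patching utility; `run_model_from_layer` here is a hypothetical stand-in, not a function from the repository:

```python
import torch


@torch.no_grad()
def downstream_feature_effects(sae_n, sae_n1, layer_n_acts, run_model_from_layer, feature_idx):
    """How much does ablating one layer-N feature shift each layer-N+1 feature?"""
    codes = sae_n.encode(layer_n_acts)

    # Ablate the chosen feature and decode back into the residual stream.
    ablated_codes = codes.clone()
    ablated_codes[:, feature_idx] = 0.0
    ablated_acts = ablated_codes @ sae_n.W  # tied decoder from the earlier sketch

    # Rerun the model from layer N with clean vs. ablated activations,
    # then read off layer N+1's features with its own autoencoder.
    clean_next = sae_n1.encode(run_model_from_layer(layer_n_acts))
    ablated_next = sae_n1.encode(run_model_from_layer(ablated_acts))

    # Large shifts mark layer-N+1 features downstream of feature_idx.
    return (clean_next - ablated_next).abs().mean(dim=0)
```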
This especially is really good stuff. If you can reconstruct the circuits, you can understand the model and retarget its search algorithms. If you can understand and align powerful models, you can use those models as assistants in yet more powerful model alignment.
Conclusion
I've replicated prior sparse coding work and extended it to Llama-2 7B. I'm hoping to keep at it and get results for Llama-2 70B, the best model that I have access to.

Generally, I feel pretty excited about simple modifications to model activation spaces as interpretability and steering techniques! I think these are worth putting points into, as an independent alignment bet from RLHF.
I was specifically hunting for a "truthiness" activation addition to move around TruthfulQA benchmarks. (I am unsure whether the techniques covered in the post are, in practice, up to programmatically isolating the "truthiness" vector.)
Or to an AI assistant helping you interpret neurons in a model.
Also known as "sparse dictionary learning."
Underlying Pythia activations were collected during six-shot TruthfulQA. (Six-shot is standard in the literature.) This is a far smaller dataset than The Pile, so this was also an experiment in small-dataset sparse coding.

I project to a 5120-dimensional space from Pythia's 512-dimensional activation space. Negative token activations are excluded, since the ReLU would zero all of those out—destroying any information negative values might contain. So, directions with all negative values are dropped—notice that that's most directions! Only about 5 in 100 are kept.
Pythia 70M autoencoder data: Layers 1 through 5.
My experience with the bigger models leads me to think that, plausibly, better results for those other layers could come from different sparsity values. That is, maybe, there isn't a single best sparsity for all layers of a model.
Heatmap code courtesy of Alan Cooney's CircuitsVis library.

Llama-2 7B autoencoder data: Layer 13.
I've noticed that as you push sparsity too low on GPT-2 or Llama-2 7B autoencoders, the autoencoders tend to increasingly fixate on particular tokens. With GPT-2, that token happens to be "esthetic". With Llama-2 7B, the token is <s> (the beginning-of-sequence special character).

As an example, this .csv contains logged results for a Llama-2 7B layer 7 autoencoder with L0≈20.