Update (13th October 2024) - these exercises have been significantly expanded on. Now there are 2 exercise sets: the first one dives deeply into theoretical topics related to superposition, while the second one (much larger) includes a streamlined version of the first one, as well as most of the actual SAE material. This post mostly focuses on the second one (although we do give an overview of both).

 

This is a linkpost for some exercises on superposition & sparse autoencoders, which were created for the 3rd iteration of the ARENA program (and greatly expanded during the 4th iteration). Having spoken to Neel Nanda and others in interpretability-related MATS streams, it seemed useful to make these exercises accessible outside the context of the rest of the ARENA curriculum.

In the ARENA material, these exercises are 1.3.1 and 1.3.2 respectively. The "1" is the transformer interpretability chapter; the "1.3" is the SAEs & Superposition subsection. Although 1.3.1 covers a lot of interesting theoretical topics related to superposition, for most people we recommend 1.3.2 as a fully self-contained introduction to superposition and SAEs.

Links to Colabs for 1.3.1: Exercises, Solutions.

Links to Colabs for 1.3.2: Exercises, Solutions.


Summary of material (1.3.2)

Abbreviations: TMS = "Toy Models of Superposition", SAE = "Sparse Autoencoder".

The diagram below shows an overview of section 1.3.2. It's split into 5 parts, each of which covers a different group of topics related to SAEs. You can also see a map of the material in much more detail here.

0️⃣ Toy Models of Superposition is a streamlined version of exercises 1.3.1, with most of the non-crucial material cut out (e.g. feature geometry and deep double descent), although you can probably still skip it if you want to get straight to working with SAEs on real language models.

1️⃣ Intro to SAE interpretability is by far the longest section, and covers most of the core material you'll need if you want to work with SAEs. It starts by introducing the SAELens library as well as neuronpedia, and shows you how to load different SAE releases and run them alongside their associated TransformerLens models. There are 2 major chunks of exercises in this section: in the first one we replicate the individual components that go into SAE dashboards, and in the second one we learn techniques for feature-finding, applied to attention SAEs & the indirect object identification circuit.
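
To give a flavour of the workflow, here's a minimal sketch of loading a pretrained SAE with SAELens and encoding cached TransformerLens activations into SAE latents. The release / ID strings are just examples, and depending on your SAELens version `SAE.from_pretrained` may return the SAE alone rather than a tuple, so treat this as illustrative rather than canonical:

```python
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a TransformerLens model plus a matching pretrained SAE.
# (The release / ID strings here are examples; the exercises list the exact ones used.)
model = HookedTransformer.from_pretrained("gpt2", device=device)
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.7.hook_resid_pre",
    device=device,
)

# Run the model, cache the activation the SAE was trained on, and encode it into latents.
tokens = model.to_tokens("The cat sat on the mat")
_, cache = model.run_with_cache(tokens)
latents = sae.encode(cache[sae.cfg.hook_name])  # shape [batch, seq, d_sae]

# Which latents fire most strongly on the final token?
top_vals, top_idx = latents[0, -1].topk(5)
print(top_idx.tolist(), top_vals.tolist())
```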

2️⃣ SAE circuits contains material on finding and interpreting circuits using SAEs. We cover how to calculate gradients between SAE latents, as well as doing interpretability on transcoders (which can make circuit analysis a lot easier).
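
As a rough illustration of the underlying pattern (not the exercises' exact method): splice an SAE's reconstruction into the forward pass with a hook, then backpropagate a downstream metric to get a first-order attribution for each latent. The release strings below are again just examples:

```python
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE

model = HookedTransformer.from_pretrained("gpt2")
sae, _, _ = SAE.from_pretrained(release="gpt2-small-res-jb", sae_id="blocks.7.hook_resid_pre")

tokens = model.to_tokens("When John and Mary went to the store, John gave a drink to")
stash = {}

def splice_sae(act, hook):
    # Swap the activation for its SAE reconstruction, keeping the latents in the
    # autograd graph so we can read their gradients off afterwards.
    latents = sae.encode(act)
    latents.retain_grad()
    stash["latents"] = latents
    return sae.decode(latents)

loss = model.run_with_hooks(
    tokens,
    return_type="loss",
    fwd_hooks=[(sae.cfg.hook_name, splice_sae)],
)
loss.backward()

# Gradient * activation is a first-order estimate of each latent's effect on the loss.
latents = stash["latents"]
attribution = (latents.grad * latents).sum(dim=(0, 1))
print(attribution.abs().topk(10).indices.tolist())
```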

3️⃣ Training & evaluating SAEs shows you how to use SAELens for training, and how to interpret wandb-logged evaluation metrics during training. We also look at several case studies of training SAEs, including training on the MLP output of TinyStories-1L, the attention output of attn-only 2L models, the residual stream of Gemma-2B and the MLP layer of OthelloGPT.
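
For a sense of what the training entry point looks like, here's a sketch using SAELens's runner. The config fields shown are illustrative placeholders, and their exact names and sensible values vary between SAELens versions (the exercises pin a specific version), so don't treat this as a working recipe:

```python
from sae_lens import LanguageModelSAERunnerConfig, SAETrainingRunner

# Illustrative config for training an SAE on the MLP output of a 1-layer TinyStories
# model. Field names and values depend on your sae_lens version; treat these as
# placeholders and use the exercises' pinned config as the ground truth.
cfg = LanguageModelSAERunnerConfig(
    model_name="tiny-stories-1L-21M",
    hook_name="blocks.0.hook_mlp_out",
    hook_layer=0,
    d_in=1024,                    # width of the hooked activation
    expansion_factor=16,          # d_sae = 16 * d_in
    l1_coefficient=5.0,           # sparsity penalty
    lr=3e-4,
    training_tokens=50_000_000,
    dataset_path="roneneldan/TinyStories",
    log_to_wandb=True,            # L0, explained variance, CE loss recovered, etc.
)

sae = SAETrainingRunner(cfg).run()
```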


Summary of material (1.3.1)

We include a summary of 1.3.1 here too, if people are interested (although as mentioned, we expect that most people would get more benefit from 1.3.2). We constructed 1.3.2 by taking only sections 1️⃣ and 5️⃣ from the material listed below (and cutting out a few other unnecessary bits). 

1️⃣ TMS: Superposition in a Nonprivileged Basis: This section introduces Anthropic's toy model for superposition, where a simple neural network is trained to map a set of features into a lower-dimensional space and then reconstruct them. You'll learn about how superposition works & see how it can be visualised, as well as how properties like feature sparsity affect the learned solutions.
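
For concreteness, here's a minimal sketch of the kind of model involved: features get mapped down through a matrix W and reconstructed as ReLU(W^T W x + b), trained on sparse synthetic data (the exercises use a batched multi-instance version with an importance-weighted MSE loss):

```python
import torch
from torch import nn

class ToyModel(nn.Module):
    """Anthropic's TMS setup: n_features squeezed into d_hidden dims and reconstructed."""
    def __init__(self, n_features: int = 5, d_hidden: int = 2):
        super().__init__()
        self.W = nn.Parameter(nn.init.xavier_normal_(torch.empty(d_hidden, n_features)))
        self.b_final = nn.Parameter(torch.zeros(n_features))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        hidden = features @ self.W.T                          # [batch, d_hidden]
        return torch.relu(hidden @ self.W + self.b_final)     # [batch, n_features]

def generate_batch(batch_size: int = 256, n_features: int = 5, sparsity: float = 0.9):
    # Each feature is present (uniform in [0, 1]) with probability 1 - sparsity.
    feats = torch.rand(batch_size, n_features)
    mask = torch.rand(batch_size, n_features) < (1 - sparsity)
    return feats * mask
```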

2️⃣ TMS: Correlated / Anticorrelated Features: In this section, you'll keep exploring the idea of superposition by seeing how the model's learned solutions change when features are correlated or anticorrelated. Most features learned by real models are anticorrelated simply as a consequence of the fact that any given model input (e.g. images or passages of text) will only contain a limited number of features.

3️⃣ TMS: Superposition in a Privileged Basis: In this section, the toy model setup is changed so that it has a privileged basis. If the previous sections were analogues for superposition in the residual stream, this section is an analogue for superposition in the MLP layer. We'll also explore how computation can be performed in superposition.

4️⃣ Feature Geometry: Here, we take a deeper dive into the ways features can organise into different geometric structures when we increase the hidden dimension past the point where we can easily visualise it.

5️⃣ SAEs in Toy Models: We take the toy models from Anthropic's Toy Models of Superposition paper (which there are also exercises for), and train sparse autoencoders on the representations learned by these toy models. These exercises culminate in using neuron resampling to successfully recover all the learned features from the toy model of bottleneck superposition:

Animation of the training process for SAEs in Anthropic's toy model of superposition. Red = resampled latents. All instances eventually converge to accurately representing all 5 features learned by the original model.
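
If you'd like a preview of what's being trained here, below is a bare-bones sketch of the SAE architecture itself; the exercises add the details that matter in practice (decoder normalisation, weight initialisation choices, and the resampling shown in the animation):

```python
import torch
from torch import nn

class ToySAE(nn.Module):
    """Minimal sparse autoencoder: linear encoder + ReLU, linear decoder, trained with
    an L1 sparsity penalty on the latent activations."""
    def __init__(self, d_in: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(nn.init.kaiming_uniform_(torch.empty(d_in, d_sae)))
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(nn.init.kaiming_uniform_(torch.empty(d_sae, d_in)))
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def forward(self, h: torch.Tensor):
        acts = torch.relu((h - self.b_dec) @ self.W_enc + self.b_enc)  # latent activations
        h_hat = acts @ self.W_dec + self.b_dec                         # reconstruction
        return acts, h_hat

def sae_loss(h, acts, h_hat, l1_coeff: float = 0.2):
    # Reconstruction error plus an L1 penalty pushing the latents towards sparsity.
    return ((h_hat - h) ** 2).mean() + l1_coeff * acts.abs().mean()
```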

6️⃣ Bonus: We cover some extension material here, including a replication of Deep Double Descent & Superposition, a paper which explores the idea that double descent happens when models transition from a memorizing solution (representing datapoints in superposition) to a generalizing solution (representing features in superposition).


How to use this material

The Colab notebooks are fully self-contained: you can work through the exercises Colab and check your answers against the solutions Colab (which should also have all the expected output displayed inline).

If you don't like working in Colabs, then you can clone the repo and work through the exercises in VSCode. You have 2 options here: either work through the notebooks as normal (you can find Jupyter notebooks mirroring the structure of the Colabs at chapter1_transformer_interp/exercises/part32_interp_with_saes), or use a blank notebook / Python file and work through the exercises as shown on the Streamlit page.

Note that if you don't want to work through the material as exercises, then you can just use the solutions Colab / notebook as a source of reference code!


Please reach out to me if you have any questions or suggestions about these exercises (either by email at cal.s.mcdougall@gmail.com, or a LessWrong private message / comment on this post). Happy coding!

Comments

May I ask what's the relation and difference between the Exercises and Solutions? Should I read both of them?

The neuron resampling part of the animation is so cool!

Thanks (-:

Nice code!

In your SAE tutorials, the importance is just torch.ones or similar. I'm curious how importance might work for real models?

Is there any work where people derived it from backprop or anything? I can't find any examples.

Good question! In the first batch of exercises (replicating toy models of interp), we play around with different importances. There are some interesting findings here (e.g. when you decrease sparsity to the point where you no longer represent all features, it's usually the lower-importance features which collapse first). I chose not to have the SAE exercises use varying importance, although it would be interesting to play around with this and see what you get!

As for what importance represents, it's basically a proxy for "how much a certain feature reduces loss, when it actually is present." This can be independent of feature probability. Anthropic included it in their toy models paper in order to make those models truer to reality, in the hope that the setup could tell us more interesting lessons about actual models. From the TMS paper:

Not all features are equally useful to a given task. Some can reduce the loss more than others. For an ImageNet model, where classifying different species of dogs is a central task, a floppy ear detector might be one of the most important features it can have. In contrast, another feature might only very slightly improve performance.

If we're talking about features in language models, then importance would be the "average amount that this feature reduces cross entropy loss". I open-sourced an SAE visualiser which you can find here. You can navigate through it and look at the effect of features on loss. It doesn't actually show the "overall importance" of a feature, but you should be able to get an idea of the kinds of situations where a feature is super loss-reducing and when it isn't. Example of a highly loss-reducing feature: feature #8, which fires on Django syntax and strongly predicts the "django" token. This seems highly loss-reducing because (although sparse) it's very often correct when it fires with high magnitude. On the other hand, feature #7 seems less loss-reducing, because a lot of the time it's pushing for something incorrect (maybe there exist other features which balance it out).

Thanks, that makes a lot of sense, I had skimmed the Anthropic paper and saw how it was used, but not where it comes from.

If it's the importance to the loss, then theoretically you could derive one using backprop I guess? E.g. the accumulated gradient with respect to your activations, over a few batches.

Yep, definitely! If you're using MSE loss then it's pretty straightforward to use backprop to see how importance relates to the loss function. Also, if you're interested, I think Redwood's paper on capacity (which is the same as what Anthropic calls dimensionality) looks at the derivative of the loss wrt the capacity assigned to a given feature.
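
A self-contained toy illustration of that idea (not from the exercises, just illustrative): accumulate the absolute gradient of an MSE loss with respect to the activations over a few batches, and the result tracks how much each activation matters to the loss.

```python
import torch

# Toy illustration: a fixed readout whose loss depends much more on some activations
# than others. Accumulating |d loss / d activation| over a few batches recovers that
# ordering, which is the spirit of estimating per-feature importance via backprop.
readout = torch.tensor([3.0, 2.0, 1.0, 0.5, 0.1, 0.01])
grad_accum = torch.zeros_like(readout)

for _ in range(50):
    acts = torch.randn(32, 6, requires_grad=True)
    target = torch.randn(32)
    loss = ((acts @ readout - target) ** 2).mean()   # MSE on a scalar prediction
    loss.backward()
    grad_accum += acts.grad.abs().mean(dim=0)

print(grad_accum / 50)   # roughly tracks |readout|, i.e. how much each activation matters
```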

Huh, I actually tried this. Training IA3, which multiplies activations by a float. Then using that float as the importance of that activation. It seems like a natural way to use backprop to learn an importance matrix, but it gave small (1-2%) increases in accuracy. Strange.

I also tried using a VAE, and introducing sparsity by tokenizing the latent space. And this seems to work. At least probes can overfit to complex concepts using the learned tokens.

Oh that's very interesting, thank you.
