LESSWRONG
LW

Evan Anders — LessWrong

Crafting Polysemantic Transformer Benchmarks with Known Circuits

Notes:

This research was performed as part of Adrià Garriga-Alonso’s MATS 6.0 stream.
If an opinion is stated in this post saying that "we" hold the opinion, assume it's Evan's opinion (Adrià is taking a well-deserved vacation at the time of writing).
Evan won’t be able to continue working on this research direction, because he’s going to be offline before starting a new job at Anthropic in September! In that light, please view this post as something between a final research writeup, a peek into a lab notebook of some experiments we tried, and a pedagogical piece explaining some of the areas where Evan got stuck and had to dig in and learn things during MATS.

... (read 7265 more words →)

Replying to Sparse autoencoders find composed features in small toy models

Sparse autoencoders find composed features in small toy models

Hi Lawrence! Thanks so much for this comment and for spelling out (with the math) where and how our thinking and dataset construction were poorly set up. I agree with your analysis and critiques of the first dataset. The biggest problem with that dataset in my eyes (as you point out): the true features in the data are not the ones that I wanted them to be (and claimed them to be), so the SAE isn't really learning "composed features."

In retrospect, I wish I had just skipped to the second dataset, which had a result that was (to me) surprising at the time of the post. But there I hadn't thought about... (read more)

Replying to Sparse autoencoders find composed features in small toy models

Evan Anders2y

Sparse autoencoders find composed features in small toy models

Hi Demian! Sorry for the really slow response.

Yes! I agree that I was surprised that the decoder weights weren't pointing diagonally in the case where feature occurrences were perfectly correlated. I'm not sure I really grok why this is the case. The models do learn a feature basis that can describe any of the (four) data points that can be passed into the model, but it doesn't seem optimal either for L1 or MSE.

And -- yeah, I think this is an extremely pathological case. Preliminary results look like larger dictionaries finding larger sets of features do a better job of not getting stuck in these weird local minima, and the possible number of interesting experiments here (varying frequency, varying SAE size, varying which things are correlated) is making for a pretty large exploration space.

Replying to Sparse autoencoders find composed features in small toy models

Evan Anders2y

Sparse autoencoders find composed features in small toy models

Hi Ali, sorry for my slow response, too! Needed to think on it for a bit.

Yep, you could definitely generate the dataset with a different basis (e.g., [1,0,0,0] = 0.5*[1,0,1,0] + 0.5*[1,0,-1,0]).
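A quick check of the change-of-basis example above, as a minimal sketch. The helper name `combine` is hypothetical and only there for illustration; it forms a weighted sum of vectors using plain Python lists.

```python
def combine(weights, vectors):
    """Return the weighted sum of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]

# The two alternative basis vectors from the example above.
alt_basis = [[1, 0, 1, 0], [1, 0, -1, 0]]

# Half of each should give back the original one-hot feature [1, 0, 0, 0].
reconstructed = combine([0.5, 0.5], alt_basis)
print(reconstructed)
```

The third components cancel and the first components add, so the original feature is recovered exactly in this clean toy case.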
I think in the context of language models, learning a different basis is a problem. I assume that, there, things aren't so clean as "you can get back the original features by adding 1/2 of that and 1/2 of this". I'd imagine it's more like feature 1 = "'the' in context A", feature 2 = "'the' in context B", feature 3 = "'the' in context C". And if "the" is a real feature (I'm not sure it is), then I don't know

Evan Anders2y

Sparse autoencoders find composed features in small toy models

Hi Logan! Thanks for pointing me towards that post -- I've been meaning to get around to reading it in detail and just finally did. Glad to see that the large-N limit seems to get perfect reconstruction for at least one similar toy experiment! And thanks for sharing the replication code.

I'm particularly keen to learn a bit more about the correlated features -- have you (or anyone you know of) studied toy models that have a few features that are REALLY correlated with one another, and that basically never appear with other features? I'm wondering if such features could bring back the problem that we saw here, even in a very high-dimensional model / dataset. Most of the metrics in that post are averaged over all features, so they don't really differentiate between correlated and uncorrelated features.

Replying to Sparse autoencoders find composed features in small toy models

Evan Anders2y

Sparse autoencoders find composed features in small toy models

Thanks for the comment! Just to check that I understand what you're saying here:

We should not expect the SAE to learn anything about the original choice of basis at all. This choice of basis is not part of the SAE training data. If we want to be sure of this, we can plot the training data of the SAE on the plane (in terms of a scatter plot) and see that it is independent of any choice of bases.

Basically -- you're saying that in the hidden plane of the model, data points are just scattered throughout the area of the unit circle (in the uncorrelated case) and in the case of one... (read more)

Sparse autoencoders find composed features in small toy models

Evan Anders

Evan Anders, Clement Neo, Jason Hoelscher-Obermaier, Jessica N. Howard

Summary

Context: Sparse Autoencoders (SAEs) reveal interpretable features in the activation spaces of language models. They achieve sparse, interpretable features by minimizing a loss function which includes an $ℓ_{1}$ penalty on the SAE hidden layer activations.
Problem & Hypothesis: While the SAE $ℓ_{1}$ penalty achieves sparsity, it has been argued that it can also cause SAEs to learn commonly-composed features rather than the “true” features in the underlying data.
Experiment: We propose a modified setup of Anthropic’s ReLU Output Toy Model where data vectors are made up of sets of composed features. We study the simplest possible version of this toy model with two hidden dimensions for ease of comparison to many of Anthropic’s visualizations.
- Features in a given set in our data are anticorrelated, and

... (read 4326 more words →)
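The summary above describes an SAE loss with an $ℓ_{1}$ penalty on the hidden activations. A minimal sketch of such a loss for a single data vector might look like the following. The function and parameter names (`sae_loss`, `w_enc`, `w_dec`, `l1_coeff`), the ReLU encoder, and the mean-squared reconstruction term are assumptions made for illustration, not details taken from the post.

```python
def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def sae_loss(x, w_enc, b_enc, w_dec, b_dec, l1_coeff):
    """Return (total, mse, l1) for one input vector x."""
    # Encoder: hidden activations are non-negative because of the ReLU.
    hidden = relu([h + b for h, b in zip(matvec(w_enc, x), b_enc)])
    # Decoder: linear reconstruction of the input.
    recon = [r + b for r, b in zip(matvec(w_dec, hidden), b_dec)]
    mse = sum((xi - ri) ** 2 for xi, ri in zip(x, recon)) / len(x)
    # The l1 penalty is what pushes the hidden layer towards sparsity.
    l1 = sum(abs(h) for h in hidden)
    return mse + l1_coeff * l1, mse, l1

# Toy usage: identity encoder and decoder on a one-hot input.
identity = [[1.0, 0.0], [0.0, 1.0]]
total, mse, l1 = sae_loss([1.0, 0.0], identity, [0.0, 0.0], identity, [0.0, 0.0], l1_coeff=0.1)
```

In this toy case the reconstruction is perfect, so the only remaining loss comes from the penalty on the single active hidden unit.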

Replying to Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

Evan Anders2y

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

Ah! That's the context, thanks for the clarification and for pointing out the error. Yes "problems" should say "prompts"; I'll edit the original post shortly to reflect that.

Replying to Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

Evan Anders2y

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

Oh! You're right, thanks for walking me through that, I hadn't appreciated that subtlety. Then in response to the first question: yep! CE = KL Divergence.

Replying to Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

Evan Anders2y

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

After seeing this comment, if I were to re-write this post, maybe it would have been better to use the KL Divergence over the simple CE metric that I used. I think they're subtly different.

Per the TL implementation for CE, I'm calculating: $\mathrm{CE}_{j} = -\frac{1}{N} \sum_{i} \ln p_{i j}$ where $i$ is the batch dimension and $j$ is context position.

So $\Delta \mathrm{CE}_{j} = \frac{1}{N} \sum_{i} (\ln p_{i j} - \ln q_{i j})$ for $p_{i j}$ the baseline probability and $q_{i j}$ the patched probability.

So this is missing a factor of $p_{i j}$ to be the true KL divergence.
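To make the difference concrete, here is a small sketch comparing the two quantities at a single position. It takes cross-entropy to be the negative log-probability of the correct token (a sign-convention assumption) and uses the standard definition of KL divergence, which sums over the whole vocabulary. The three-token distributions and the function names are made up for illustration.

```python
import math

def delta_ce(p, q, correct):
    """Patched CE minus baseline CE for one token, with CE = -ln(prob of correct token)."""
    return math.log(p[correct]) - math.log(q[correct])

def kl_divergence(p, q):
    """KL(p || q), summed over the whole vocabulary and weighted by p."""
    return sum(pv * (math.log(pv) - math.log(qv)) for pv, qv in zip(p, q) if pv > 0)

# Hypothetical next-token distributions over a three-token vocabulary.
p = [0.7, 0.2, 0.1]  # baseline model
q = [0.5, 0.3, 0.2]  # model with patched-in activations

d_ce = delta_ce(p, q, correct=0)
kl = kl_divergence(p, q)
print(d_ce, kl)
```

The CE difference only looks at the correct token, while the KL divergence weights every token's log-ratio by the baseline probability, so the two numbers generally differ.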

Replying to Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

Evan Anders2y

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

I think this is most of what the layer 0 SAE gets wrong. The layer 0 SAE just reconstructs the activations after embedding (positional + token), so the only real explanation I see for what it's getting wrong is the positional embedding.

But I'm less convinced that this explains later layer SAEs. If you look at e.g., this figure:

then you see that the layer 0 model activations are an order of magnitude smaller than any later-layer activations, so the positional embedding itself is only making up a really small part of the signal going into the SAE for any layer > 0 (so I'm skeptical that it's accounting for a large fraction... (read more)

Replying to Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

Evan Anders2y

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

For me, this was actually a positive update that SAEs are pretty good on distribution -- you trained SAE on length 128 sequences from OpenWebText, and the log loss was quite low up to ~200 tokens! This is despite its poor downstream use case performance.

Yes, this was nice to see. I originally just looked at context positions at powers of 2 (...64, 128, 256,...) and there everything looked terrible above 128, but Logan recommended looking at all context positions and this was a cool result!

But note that there's a layer effect here. I think layer 12 is good up to ~200 tokens while layer 0 is only really good up to the... (read more)

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

Evan Anders

Evan Anders, Joseph Bloom

Note: The second figure in this post originally contained a bug pointed out by @LawrenceC, which has since been fixed.

Summary

Sparse Autoencoders (SAEs) reveal interpretable features in the activation spaces of language models, but SAEs don’t reconstruct activations perfectly. We lack good metrics for evaluating which parts of model activations SAEs fail to reconstruct, which makes it hard to evaluate SAEs themselves. In this post, we argue that SAE reconstructions should be tested using well-established benchmarks to help determine what kinds of tasks they degrade model performance on.
We stress-test a recently released set of SAEs for each layer of the gpt2-small residual stream using randomly sampled tokens from OpenWebText and the Lambada

... (read 4278 more words →)

How polysemantic can one neuron be? Investigating features in TinyStories.

Evan Anders

Summary

I use a pre-trained Sparse Autoencoder (SAE) to examine some features in a 33M-parameter, 2-layer TinyStories language model. I look at how frequently SAE features occur in the dataset and examine how the features are distributed over the neurons in the MLP that the SAE was trained on. I find that one neuron is the "main" direction that over 400 features are pointing in, and label some of those features. But I find that the most interpretable of these features are not mostly aligned with any neuron. I close with a few open questions that this investigation raised (and perhaps some of these open questions have already been answered by research out... (read 2267 more words →)
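One way to count how many features share a "main" neuron is sketched below. It assumes that a feature's main neuron is the one with the largest absolute weight in that feature's decoder direction. That definition, the names `main_neuron` and `count_main_neurons`, and the toy decoder values are all illustrative assumptions rather than the post's actual method or data.

```python
from collections import Counter

def main_neuron(direction):
    """Index of the neuron with the largest absolute weight in a feature direction."""
    return max(range(len(direction)), key=lambda n: abs(direction[n]))

def count_main_neurons(decoder_rows):
    """Count, for each neuron, how many features have it as their main neuron."""
    return Counter(main_neuron(row) for row in decoder_rows)

# Hypothetical decoder: 4 features (rows) over 3 MLP neurons (columns).
toy_decoder = [
    [0.9, 0.1, 0.0],
    [-0.8, 0.3, 0.2],
    [0.1, 0.2, 0.7],
    [0.6, -0.5, 0.1],
]
counts = count_main_neurons(toy_decoder)
print(counts.most_common(1))
```

A neuron with a very large count in such a tally would be a candidate for the heavily polysemantic neuron described above.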

How does a toy 2 digit subtraction transformer predict the difference?

Evan Anders

2 digit subtraction -- difference prediction

Summary

I continue studying a toy 1-layer transformer language model trained to do two-digit subtraction of the form $a - b = \pm c$ . After predicting whether the result is positive (+) or negative (-), it must output the difference between $a$ and $b$ . The model creates activations which are oscillatory in the $a$ and $b$ bases, as well as activations which vary linearly as a function of $a - b$ . The model uses the activation function to couple oscillations across the $a$ and $b$ directions, and it then sums those oscillations to eliminate any variance except for that depending on the absolute difference $| a - b |$ to predict the correct output token. I examine the full path of this algorithm from input to model output.

Intro

In previous posts,... (read 2805 more words →)

How does a toy 2 digit subtraction transformer predict the sign of the output?

Evan Anders

Summary

I examine a toy 1-layer transformer language model trained to do two-digit subtraction of the form $a - b = \pm c$ . This model must first predict if the outcome is positive (+) or negative (-) and output the associated sign token. The model learns a classification algorithm wherein it sets the + token to be the default output, but when $b > a$ the probability of the - token linearly increases while the probability of the + token linearly decreases. I examine this algorithm from model input to output.

Intro

In my previous post (which was a whole week ago, before I knew about LessWrong!), I briefly discussed a 1-layer transformer model that I've trained to do two-digit subtraction. The model receives... (read 2391 more words →)

Evan Anders

Crafting Polysemantic Transformer Benchmarks with Known Circuits

Sparse autoencoders find composed features in small toy models

Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders

How polysemantic can one neuron be? Investigating features in TinyStories.

How does a toy 2 digit subtraction transformer predict the difference?

How does a toy 2 digit subtraction transformer predict the sign of the output?
