Summary
I use a pre-trained Sparse Autoencoder (SAE) to examine some features in a 33M-parameter, 2-layer TinyStories language model. I look at how frequently SAE features occur in the dataset and examine how the features are distributed over the neurons in the MLP that the SAE was trained on. I find that one neuron is the "main" direction that over 400 features are pointing in, and label some of those features. But I find that the most interpretable features are not predominantly aligned with any single neuron. I close with a few open questions that this investigation raised (and perhaps some of these open questions have already been answered by research out there!).
Intro
Neurons in neural networks are polysemantic, not monosemantic. This means that completely unrelated features (e.g., cats' faces and cars) in a dataset can cause the same neuron in a network to fire. This is problematic. Neurons are easy to examine and it would be ideal if we, humans, could interpret them and predict their behavior (and explain it to someone else like they're five). But we can't, so we need to understand why neurons are polysemantic and we need a different lens for examining and understanding AI models.
Researchers at, e.g., EleutherAI and Anthropic's interpretability team came up with a technique a few months ago that involves expressing a neuron activation vector as a linear sum of features,

$$x_j \approx b + \sum_i f_i(x_j)\, e_i,$$

where $x_j$ is the activation vector for datapoint $j$, $f_i(x_j)$ is the activation of feature $i$, and $e_i$ is the unit vector giving that feature's direction in activation space. This concept wasn't new, but what was new was the use of a sparse autoencoder (SAE) trained on the neuron activations to recover the features. What's more -- many of the features recovered by SAEs are interpretable to humans!
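To make that decomposition concrete, here is a minimal sketch of the standard SAE setup (a one-hidden-layer autoencoder with a ReLU and an L1 penalty on the feature activations). This is a generic illustration, not the code behind the SAEs used in this post, and the layer sizes below are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Toy SAE: reconstructs MLP activations as a sparse sum of feature directions."""
    def __init__(self, d_mlp: int, n_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_mlp, n_features) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(n_features, d_mlp) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.b_dec = nn.Parameter(torch.zeros(d_mlp))

    def forward(self, x):
        # f_i(x): non-negative feature activations
        f = F.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # x_hat = b + sum_i f_i(x) e_i, where the e_i are the rows of W_dec
        x_hat = f @ self.W_dec + self.b_dec
        return x_hat, f

def sae_training_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes feature activations toward sparsity
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(dim=-1).mean()

# Placeholder sizes: the TinyStories model's d_mlp is not specified here;
# the SAEs used in this post have 32768 features.
sae = SparseAutoencoder(d_mlp=1024, n_features=32768)
```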
This is really exciting, and offers the potential to intuitively understand a piece of transformers that is really hard to interpret (the MLPs). I want to get my hands dirty playing with SAEs, and just to give me a place to start that doesn't involve looking at thousands of features, I'm specifically interested in the question: how many features can a neuron be tied up in? More specifically, if I pick a neuron in a model, for how many features is that neuron the maximally aligned neuron, and do those features have things in common?
A TinyStories SAE
TinyStories is a fun dataset that was generated to preserve the essential elements of natural language without being as large and broad as, say, the whole internet. It consists largely of rather boring short stories about, e.g., kids going on adventures. In the original TinyStories paper, small transformer language models (1-8 layers and up to 33M parameters) trained on this dataset were shown to outperform GPT2-XL (1.5B parameters) on metrics of grammar, creativity, plot, and consistency.
I've downloaded the 2L, 33M-parameter TinyStories model from Huggingface, and I've been given some pre-trained SAEs with 32768 features each (see the 'Code' section below for a few more details) that were trained on the MLP of the second and last layer of the model (layer 1).
The main SAE I'll be focusing on has the following characteristics:
For a few random input prompt stories, the loss with the SAE reconstruction spliced in is on average greater than the original loss by ~25%. By comparison, if I zero-ablate the MLP, loss increases by 115% on average. So the SAEs are certainly encapsulating a good bit of what the model is doing (a sketch of this comparison follows this list).
There are on average about 30 features active per token.
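Here is a rough sketch of how that comparison can be set up with TransformerLens hooks. The model alias, the hook point, and the assumption that the SAE acts on the post-nonlinearity MLP activations are my guesses, not necessarily what the original notebook does; `sae` is the (hypothetical) trained SAE with the interface from the sketch above.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("tiny-stories-33M")  # assumed alias
tokens = model.to_tokens("Once upon a time, there was a little girl named Lily.")

HOOK = "blocks.1.mlp.hook_post"  # layer-1 MLP activations (assumed hook point)

def splice_in_sae(act, hook):
    # Replace the MLP activations with the SAE's reconstruction of them
    x_hat, _ = sae(act)
    return x_hat

def zero_ablate(act, hook):
    return torch.zeros_like(act)

clean_loss = model(tokens, return_type="loss")
spliced_loss = model.run_with_hooks(
    tokens, return_type="loss", fwd_hooks=[(HOOK, splice_in_sae)]
)
zero_loss = model.run_with_hooks(
    tokens, return_type="loss", fwd_hooks=[(HOOK, zero_ablate)]
)
print(f"clean: {clean_loss.item():.3f}  "
      f"SAE-spliced: {spliced_loss.item():.3f}  "
      f"zero-ablated: {zero_loss.item():.3f}")
```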
Feature frequency
I ran ~700k tokens through the model and recorded how frequently features were active in each of the two SAEs that were provided to me. For each feature, I measure the feature sparsity (or "feature probability": the number of tokens that feature is active on divided by the total number of tokens). This is the histogram of how sparse each feature is:
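In case it's useful, here is a minimal sketch of how these feature probabilities could be tallied. It reuses `model`, `sae`, and `HOOK` from the sketches above, assumes an iterable `stories` of text samples (hypothetical), and counts a feature as active whenever its activation is positive.

```python
import torch

n_features = sae.W_dec.shape[0]
active_counts = torch.zeros(n_features)
total_tokens = 0

with torch.no_grad():
    for story in stories:
        tokens = model.to_tokens(story)
        _, cache = model.run_with_cache(tokens, names_filter=HOOK)
        acts = cache[HOOK]                 # [batch, pos, d_mlp]
        _, f = sae(acts)                   # [batch, pos, n_features]
        active_counts += (f > 0).sum(dim=(0, 1)).float()
        total_tokens += tokens.numel()

# "Feature probability": fraction of tokens on which each feature fires
feature_prob = active_counts / total_tokens
```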
So there are a few features that occur almost all the time (right side of the plot), and a lot of features that occur once every 1000 or 10000 tokens. Let's say we wanted to develop a toy model of features kind of like in Anthropic's Toy Models of Superposition and the subsequent Superposition, Memorization, and Double Descent paper. They set the sparsity of features like $S_i = f(i)$ (where $i$ indexes the feature and $f$ is some function). So another possibly useful way of looking at this is to sort features by their sparsity or probability in descending order and plot that:
The two panels above show the same data, but the one on the left is log-log and the one on the right is log-linear. Power laws are straight lines on log-log plots, and exponentials are straight lines on log-linear plots. It looks like the low-sparsity features are pretty well described by a power law, and the sparser features (feature number >= 1000 in this sorted ordering) can be pretty well described by either a power law or an exponential falloff (to the chi-by-eye accuracy I'm using here). An exponential looks a bit better than a power law.
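A quick sketch of how one could go beyond chi-by-eye and compare the two fits, using the `feature_prob` array from the earlier sketch; the rank >= 1000 cutoff just mirrors the one eyeballed above.

```python
import numpy as np

p = np.sort(feature_prob.numpy())[::-1]   # probabilities, descending
p = p[p > 0]
rank = np.arange(1, len(p) + 1)
tail = rank >= 1000                       # the sparser features

# Power law: log(p) linear in log(rank); exponential: log(p) linear in rank
powerlaw_slope, _ = np.polyfit(np.log(rank[tail]), np.log(p[tail]), 1)
exp_rate, _ = np.polyfit(rank[tail], np.log(p[tail]), 1)
print(f"power-law exponent ~ {powerlaw_slope:.2f}, exponential rate ~ {exp_rate:.2e}")
```

Comparing the residuals of the two linear fits (rather than just their slopes) would make the power-law-versus-exponential call more quantitative.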
Distribution of neuron importance
I came into this mini-study feeling really interested in a basic question: in terms of features, how polysemantic can one neuron be? This is similar in nature to a question that drove some of what Neel Nanda wrote here and I'll follow some of his analyses.
For each feature, the decoder matrix in an SAE contains a vector in the MLP activation space corresponding to that feature. Vectors have directions. If a feature vector is 100% pointing along one dimension in activation space, then that feature is fully aligned with a single neuron. If it is pointing in some random off-axis direction, it can be aligned with many neurons. Here's some data to start understanding where features are pointing:
First, for each feature, I get its vector in activation space, $f_i = W_{\text{dec}}[i,:]$, then sort the entries of that vector (one per neuron) from largest to smallest in magnitude. I square each entry and sum them to get the total power in the vector. In the left panel, I plot how much of that sum comes from the largest neuron entry (on the x-axis) and how much comes from the next 9 largest neurons (on the y-axis), just as in Neel's post, and I find a very similar scatterplot. Here I've divided features based on whether they are sparse or not (see the first plot in the "Feature frequency" section: features that occur fewer than once every 1000 tokens I consider sparse and plot in green; more frequently occurring features are plotted in orange). The two distributions mostly occupy the same part of this space, but only non-sparse (orange) features seem to be able to have most of their power explained by the first one (or few) neurons, so they extend further right or further up in this plot.
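A minimal sketch of this calculation, assuming the decoder `W_dec` has shape [n_features, d_mlp] as in the SAE sketch above:

```python
import torch

W_dec = sae.W_dec.detach()
power = W_dec ** 2                               # squared entries, [n_features, d_mlp]
power = power / power.sum(dim=-1, keepdim=True)  # normalize so each row sums to 1
sorted_power, _ = power.sort(dim=-1, descending=True)

frac_top1 = sorted_power[:, 0]              # fraction of power in the largest neuron (x-axis)
frac_next9 = sorted_power[:, 1:10].sum(-1)  # fraction in the next nine neurons (y-axis)
```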
In the middle panel I plot the histogram of the kurtosis of the feature vectors in activation space, similar to what Neel did in his post. High kurtosis (I think?) suggests that there is a privileged basis in activation space (a kurtosis of 3 corresponds to a Gaussian distribution). There's a tail extending to kurtosis > 3000, but I've just zoomed in on the bulk of the distribution.
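For reference, a one-liner for this using scipy; note that `fisher=False` gives the convention in which a Gaussian has kurtosis 3 (scipy's default Fisher definition subtracts 3).

```python
from scipy.stats import kurtosis

# Kurtosis of each feature's decoder vector over the neuron basis
feature_kurtosis = kurtosis(W_dec.cpu().numpy(), axis=1, fisher=False)
```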
Finally, for each feature, I find which neuron contains the most power (even if only marginally) in activation space. I count the number of features each neuron is "responsible" for in this way, and plot a histogram of that in the right panel. It's not surprising to me that many neurons are the primary direction of only a small handful of features, but it is surprising to me that some neurons are the primary direction of many (hundreds!) of features. In particular, neuron 703 is the most-aligned neuron for 475 features; some of those features are sparse and some are not -- it's a pretty even distribution.
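Continuing the sketch above, the per-neuron counts are just a bincount over each feature's argmax neuron:

```python
import torch

top_neuron = power.argmax(dim=-1)                              # [n_features]
counts = torch.bincount(top_neuron, minlength=power.shape[1])  # features per neuron
print("most 'popular' neuron:", counts.argmax().item(),
      "is the top direction for", counts.max().item(), "features")
```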
For kicks, I've gone ahead and investigated some of these features which point more in the direction of neuron 703 than any other neuron.
Features aligned with Neuron 703
I ran about 1.8M tokens through the model and found the top ten activating examples for each of the features most aligned with neuron 703. I originally used Callum McDougall's sae_visualizer to do this, but then made a lighter tool that focuses specifically on the maximum activations, the logits the feature boosts, and the neurons the feature is most aligned with.
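For anyone who wants to reproduce something similar without the full visualizer, here is a bare-bones sketch of collecting max-activating examples for a single feature. It reuses `model`, `sae`, `HOOK`, and `stories` from the earlier sketches (all assumptions about the setup), keeps at most one example per story, and only prints the activation plus a short context window.

```python
import heapq
import torch

FEATURE = 12345    # placeholder feature index
top_examples = []  # min-heap of (activation, context snippet)

with torch.no_grad():
    for story in stories:
        tokens = model.to_tokens(story)
        _, cache = model.run_with_cache(tokens, names_filter=HOOK)
        _, f = sae(cache[HOOK])          # [1, pos, n_features]
        acts = f[0, :, FEATURE]
        pos = acts.argmax().item()
        snippet = model.to_string(tokens[0, max(0, pos - 10): pos + 1])
        heapq.heappush(top_examples, (acts[pos].item(), snippet))
        if len(top_examples) > 10:
            heapq.heappop(top_examples)  # drop the weakest example

for act, snippet in sorted(top_examples, reverse=True):
    print(f"{act:6.2f}  ...{snippet}")
```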
I examined and labeled about 100 features with the largest activations as well as 50 features that point most strongly in the direction of neuron 703. See the colab notebook for all of these features. One very qualitative top-level takeaway: when I sorted by largest activations, I had a pretty easy time labeling the features, but those features had their power spread across many neurons, and neuron 703 did not contain much of their power in activation space. I found the features which had a significant fraction of their direction vector's power pointing toward the very polysemantic neuron 703 to be harder to interpret!
I wanted to find a prettier way to present this data, but… I also didn’t want to spend too much time beautifying an exploratory research project, so let me walk you quickly through what you’ll be looking at in the following images. I annotate each feature with its number in the SAE and my interpretation of what the feature is doing, and then in the following ten lines I print:
The activation of each of the top-ten activating examples (from top to bottom) and the associated sequence (with the activating token in bold orange).
After the purple |, I’m printing the top-10 logits that are most boosted by this feature (and the amount that this feature being active boosts those logits).
After the next purple |, I’m printing the most-aligned neurons (with the magnitude of the aligned direction in parentheses), and I’m also printing the fraction of the feature’s direction-vector power that is accounted for by each of the top ten neurons (i.e., $e_i^2 / \sum_k e_k^2$, the squared direction value at neuron index $i$ over the summed power of the direction vector across all neurons). A sketch of how these quantities could be computed follows this list.
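Here is my rough reconstruction of how those two quantities could be computed: the boosted logits via the direct path through the layer-1 MLP output matrix and the unembedding (ignoring any downstream composition), and the per-neuron power fractions straight from the decoder direction. This isn't necessarily the exact code behind the figures; the feature index is a placeholder.

```python
import torch

FEATURE = 12345                                 # placeholder feature index
direction = sae.W_dec[FEATURE].detach()         # [d_mlp]

# Direct-path logit effect: decoder direction -> layer-1 W_out -> unembedding
logit_effect = direction @ model.W_out[1] @ model.W_U   # [d_vocab]
top_vals, top_ids = logit_effect.topk(10)
for val, tok_id in zip(top_vals, top_ids):
    print(f"{model.to_single_str_token(tok_id.item())!r:>14}  +{val.item():.3f}")

# Per-neuron share of the direction's power: e_i^2 / sum_k e_k^2
frac = direction ** 2 / (direction ** 2).sum()
top_frac, top_neurons = frac.topk(10)
```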
'Once upon a time...' features
TinyStories has lots of stories that start with "Once upon a time...". Unsurprisingly, the model learns that stories often start this way, and here are some features showing this:
There were also a lot of "The moral of the story..." features.
Exciting features!
Many features relate to happiness or excitement and try to push the model towards completing with an exclamation point:
Action features
Lots of features try to fill in a plausible verb based on the context of the sentence, including these three features which fire for verbs with -ing! (present participles? But this is past tense...)
Object features
I really like these "object features" which fire on different types of objects in a fixed context.
Conclusion / Some interesting questions that this raised
There are literally tens of thousands of features I didn't explore here. I'm sure if I looked at all 33k then I'd see a lot of patterns emerge. Still, I'm left with some open questions that I'd like to poke around a bit more in future research:
What is the underlying feature distribution of the English language? How can we quantify that? Do SAEs pick out the right feature distribution or is the distribution they find pathological in some way (because of training or hyperparameters)?
Lots of these features feel really hyper-specific. What do we gain by knowing what they are? I guess if they are universal, e.g., if there are universal features in human language for "once upon a time" completion, then it could be useful to annotate them in one model and look for them in another? But then -- how do we automate the labeling of features that an SAE picks out of a new model? (Probably pass them to an LLM?)
Most of the features that I looked at don't have most of their direction vector power described by the first 10 neurons. But some do! Is there some kind of interpretable characteristic grouping for the features that are mostly described by the top-1 neuron? How about the ones that are mostly described by the next 10 neurons? Seems like a tractable set of neurons to actually look at. Also, why were the features that had more of their power explained by neuron 703 (qualitatively) less interpretable than those that were extremely polysemantic?
Another thing: I had a bug in my code for a while where I was passing the layer 0 MLP activations into the SAE (which was trained on layer 1). I'm a bit surprised that the top activating tokens sometimes still seemed interpretable! E.g., they would all activate on a token like "\n" in a given context, etc. So even though the feature directions in activation space were learned on layer 1, they still seemed to pick out the directions of possibly interpretable features in layer 0! This makes me wonder if there's some sort of universality in the directions that features are stored in activation space, and it would be really interesting to compare the directions of SAE features trained on different layers of the same model.
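If SAEs trained on different layers were available, a crude first pass at that comparison might look like the sketch below: for each layer-1 feature direction, find the best-matching layer-0 direction by cosine similarity. `sae_layer0` and `sae_layer1` are hypothetical objects with decoders of shape [n_features, d_mlp] (same d_mlp for both layers).

```python
import torch
import torch.nn.functional as F

W1 = F.normalize(sae_layer1.W_dec.detach(), dim=-1)   # layer-1 feature directions
W0 = F.normalize(sae_layer0.W_dec.detach(), dim=-1)   # layer-0 feature directions

best = []
for chunk in W1.split(1024):   # chunk to avoid materializing a 32k x 32k matrix
    best.append((chunk @ W0.T).max(dim=-1).values)
best_match = torch.cat(best)   # best cosine similarity for each layer-1 feature

print("median best-match cosine similarity:", best_match.median().item())
```

A high median here would only be weak evidence for shared directions; comparing against a shuffled or random-direction baseline would be needed to say anything stronger.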
Code
All of the work was done using a colab notebook, which can be found here. If you want to use Callum's sae_visualizer in that notebook, be sure to follow the instructions in the "Visualizations from Callum" section of the notebook to modify data_fns.py to work with this tinystories model. Trained SAEs will soon be available from Lovis Heindrich on huggingface (I'll update this post when they are); until then, they should be online here; I examined 185_upbeat_field and 189_giddy_water in this post.
Acknowledgments
Big thanks to Neel Nanda for adding me as an auditor to his MATS slack and to Joseph Bloom for encouraging me to take a look at TinyStories and SAEs more broadly. Thanks to Lovis Heindrich and Lucia Quirke for training and providing the SAEs. Also a big thanks to Callum McDougall for his SAE exercises which guided the code I used in this work. Thanks to Adam Jermyn for explaining how to go about looking at feature outputs, and to Eoin Farrell for helpful discussions about how SAEs work.