Fascinating post! I (along with Hardik Bhatnagar and Joseph Bloom) recently completed a profile of cases of SAE latent co-occurrence in GPT2-small and Gemma-2-2b (see here) and I think that this is a really compelling driver for a lot of the behaviour that we see, such as the link to SAE width. In particular, we observe a lot of cases with apparent parent-child relations between the latents (e.g. here).
We also see a similar 'splitting' of activation strength in cases of composition. For example, we find a case where the child latents are all days of the week (e.g. 'Monday'), but the activation (or lack thereof) of the parent latent corresponds to whether there is a space in the token (e.g. ' Monday') (see here). When the parent and child are both active, each has roughly half the activation strength the child has when active alone, which I think is similar to what you observe, although made more complex because we do not know the underlying true features in this case. If this holds in general, perhaps it would be possible to improve your method for preventing co-occurrence/absorption by looking not only for splits in the activation density, but also for pairs of features whose activation strengths are strongly coupled/proportional in this manner?
The behavior you see in your study is fascinating as well! I wonder if using a tied SAE would make these relationships in your work even more obvious: if the decoder of a tied SAE mixes co-occurring parent/child features together, it must also mix them in the encoder, so the mixing should show up more clearly in the activation patterns. If an underlying feature co-occurs between two latents (e.g. a parent feature), tied SAEs don't have a good way to keep the latents themselves from firing together, and this will show up as co-firing latents. Untied SAEs can more easily do an absorption-y thing and turn off one latent when the other fires, for example, even if they both encode similar underlying features.
I think a next step for this work is to try to do clustering of activations based on their position in the activation density histogram of latents. I expect we should see some of the same clusters being present across multiple latents, and that those latents should also co-fire together to some extent.
The two other things in your work that feel important are the idea of models using low activations as a form of "uncertainty", and non-linear features like days of the week forming a circle. The toy examples in our work here assume that neither of these things happens: that features basically fire with a set magnitude (maybe with some variance), and that feature directions are mutually orthogonal (or mostly so). If models use low activations to signal uncertainty, we won't necessarily see a clean peak in the activation histogram when the feature activates, or the width of the activation peak might look very large. If features form a circle, then the underlying directions are not mutually orthogonal. This will also likely show up as extra peaks in the activation density histograms of latents representing these circular concepts, but those peaks won't correspond to parent/child relationships and absorption, just to the fact that different vectors on a circle all project onto each other.
Do you think your work can be extended to automatically classify whether an underlying feature is circular or otherwise non-linear, whether it is in a parent/child relationship, and whether it fires with a set magnitude or instead uses magnitude to signal uncertainty? It would be great to have a sense of what portion of features in a model are of which sorts (set magnitude vs variable magnitude, mostly orthogonal direction vs forming a geometric shape with related features, parent/child, etc.). For the method we present here, it would be helpful to know whether an activation density peak is an unwanted parent or child feature component that should be projected out of the latent, vs something that's intrinsically part of the latent (e.g. just the same feature with a lower magnitude, or a circular geometric relationship with related features).
Thanks to Jean Kaddour, Tomáš Dulka, and Joseph Bloom for providing feedback on earlier drafts of this post.
In a previous post on Toy Models of Feature Absorption, we showed that tied SAEs seem to solve feature absorption. However, when we tried training some tied SAEs on Gemma-2-2b, they still appeared to suffer from absorption effects (or something similar). In this post, we explore how this is possible by extending our investigation to toy settings where the SAE has more or fewer latents than true features. We hope this will build intuition for how SAEs work and what sorts of failure modes they have. Some key takeaways:
We use the term "absorption" loosely above to mean the SAE latents are learning messed-up combinations of features rather than each latent matching a single true feature. Our goal is for the SAE latents to have a 1-to-1 match with a true feature direction. We refer to this undesirable feature mixing as "broken latents" for the rest of this post to cover all cases where the SAE learns incorrect representations.
The code for this post is in this Colab Notebook.
Background: Absorption and Tied SAEs
Feature absorption is a degenerate form of feature splitting involving a hierarchical relationship between parent and child features, where the parent feature is active whenever the child feature is active. In feature absorption, the SAE learns one latent which seems to track the parent feature and another which tracks the child feature. However, the parent latent fails to activate when the child latent is active. In addition, the child latent absorbs a component of the parent feature into its decoder. The parent latent is effectively gerrymandered, with an exception in its firing pattern when the child latent is active.
We incentivize SAEs to have sparse latent activations, so the SAE will try to minimize the number of active latents needed to represent any given input. Absorption is a logical consequence of this sparsity: if a parent feature activates every time a child feature activates, the SAE can just fire one latent to represent both the child and parent feature together whenever the child feature is active. However, this results in less interpretable latents, as a latent which seems to track the parent feature is actually tracking the parent feature with exceptions. The latent tracking the child feature ends up mixing both parent and child feature representations together in its decoder.
We first noticed absorption in Gemma Scope SAEs in our paper A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders. We then demonstrated feature absorption in a toy model.
To recap the main finding of the Toy Models of Feature Absorption post: we first considered a toy setting with 4 true features, each represented by a 50-dimensional vector, where all features are mutually orthogonal. In this setting, every time feature 1 (the child feature) is active, feature 0 (the parent feature) must also be active, inducing a co-occurrence relation between these features. We construct training samples by randomly sampling feature firings and summing the resulting feature vectors.
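To make the setup concrete, here's a minimal sketch of how such training samples might be generated (names and the firing probability here are illustrative assumptions; see the original post and notebook for the exact setup):

```python
import torch

n_features, dim = 4, 50
# Random mutually-orthogonal unit feature directions via QR decomposition.
features = torch.linalg.qr(torch.randn(dim, n_features)).Q.T  # (4, 50)

def sample_batch(batch_size: int, fire_prob: float = 0.2) -> torch.Tensor:
    # Sample independent Bernoulli firings for each feature...
    firing = (torch.rand(batch_size, n_features) < fire_prob).float()
    # ...then enforce the hierarchy: feature 0 (parent) must be active
    # whenever feature 1 (child) is active.
    firing[:, 0] = torch.maximum(firing[:, 0], firing[:, 1])
    # Each training sample is the sum of its active feature vectors.
    return firing @ features  # (batch_size, 50)
```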
We trained a standard L1 loss untied SAE with 4 latents to reconstruct these activations. Below, we see the cosine similarity of the trained SAE latents with the underlying true features for both the SAE encoder and decoder.
We see that the independently firing features, features 2 and 3, are perfectly detected and reconstructed by the SAE. However, latent 0, which tracks feature 0, fires only when feature 0 is active and feature 1 is NOT active. When feature 1 is active, the decoder for latent 1 instead contains the sum of features 0 and 1. We summarize this below:
Clearly, this is not ideal! We want each latent to detect and reconstruct a true feature, not a mixture of features with exceptions.
However, one insight from the untied SAE absorption case above is that an asymmetry between the encoder and the decoder is necessary to create absorption. What if we instead use a tied SAE where the encoder and the decoder must be identical?
Indeed, using a tied SAE solves absorption in this simple case. For a more in-depth overview of this toy setting, and further experiments including a superposition setup, see the original Toy Models of Feature Absorption post.
After our work showing that tied SAEs seem to solve absorption in our toy setting, we naturally tried training some tied JumpReLU SAEs on a real LLM (Gemma-2-2b) to check whether we'd solved absorption and fully broken down latents into their constituent parts, and found that we could still detect some problematic patterns in these tied SAEs. How could this be possible?
Important Terms
Untied Sparse Autoencoder: Our Sparse Autoencoders (SAEs) are characterized by the equations below, where $W_{dec} \in \mathbb{R}^{D \times K}$ is the SAE decoder, $W_{enc} \in \mathbb{R}^{K \times D}$ is the SAE encoder, $K$ is the SAE hidden size, $D$ is the SAE input size, $b_{dec} \in \mathbb{R}^D$ is the decoder bias, $b_{enc} \in \mathbb{R}^K$ is the SAE encoder bias, and $\sigma$ is a non-linearity, typically ReLU or JumpReLU:
$$h = \sigma(W_{enc}(x - b_{dec}) + b_{enc})$$
$$\hat{x} = W_{dec} h + b_{dec}$$

We refer to these standard SAEs as untied SAEs to differentiate them from tied SAEs below.
Tied Sparse Autoencoder: Our tied SAE is the same as our untied SAE, except with the following constraints: $W_{enc} = W_{dec}^T$ and $b_{enc} = 0$. In tied SAEs, we use $W$ to mean $W_{dec}$ and $b$ to mean $b_{dec}$, as below:
$$h = \sigma(W^T(x - b))$$
$$\hat{x} = W h + b$$

Tied SAEs were used in early dictionary learning work for interpretability[3], but fell out of favor after Anthropic stopped using tied SAEs in Towards Monosemanticity.
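For concreteness, here is a minimal PyTorch sketch of the two variants defined above (ReLU shown; the Gemma-2-2b experiments used JumpReLU, and details like initialization scale are illustrative assumptions):

```python
import torch
import torch.nn as nn

class UntiedSAE(nn.Module):
    def __init__(self, d_in: int, n_latents: int):
        super().__init__()
        # W_enc: (K, D) and W_dec: (D, K), matching the shapes defined above.
        self.W_enc = nn.Parameter(torch.randn(n_latents, d_in) * 0.02)
        self.W_dec = nn.Parameter(torch.randn(d_in, n_latents) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(n_latents))
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def forward(self, x):  # x: (batch, D)
        h = torch.relu((x - self.b_dec) @ self.W_enc.T + self.b_enc)
        x_hat = h @ self.W_dec.T + self.b_dec
        return x_hat, h

class TiedSAE(nn.Module):
    # W_enc = W_dec^T and b_enc = 0: a single weight matrix W plays both roles.
    def __init__(self, d_in: int, n_latents: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_in, n_latents) * 0.02)
        self.b = nn.Parameter(torch.zeros(d_in))

    def forward(self, x):
        h = torch.relu((x - self.b) @ self.W)  # == sigma(W^T (x - b))
        x_hat = h @ self.W.T + self.b
        return x_hat, h
```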
Parent and child features: When we investigate feature co-occurrence, we construct feature firing patterns where one feature must be active whenever another feature is active. This is typical of features in a hierarchy, for instance "animal" and "dog". If feature 0 is "animal" and feature 1 is "dog", then whenever feature 1 is active feature 0 must also be active, since a dog is an animal. We refer to feature 0 here as the "parent feature" in the relationship, and feature 1 as a "child feature". There can be multiple child features for each parent feature.
What happens if the SAE has more latents than true features?
In practice we don't know how many true features exist in a deep learning model, so we'll almost never have an SAE with exactly as many latents as true features. While it seems unlikely we'd be able to train an SAE with more latents than true features on an LLM foundation model, we could imagine this happening for smaller models, for example game-playing models[4][5].
We begin with a setup containing 4 true features, each firing independently with magnitude 1.0. All features have 20-dimensional representations and are mutually orthogonal, so there is no superposition. In this setup, all features fire with probability 0.2. We first train an untied SAE with 8 latents on this toy setup.
A perfect SAE should learn the 4 true features and allow the remaining 4 latents to die off.
Untied SAEs misuse excess capacity
Below, we plot the cosine similarity of the encoder and decoder of the untied SAE with the true features.
The SAE learns a mix of correct and broken latents. Two of the excess latents are correctly killed off (latents 1 and 4), but one latent is duplicated (latents 6 and 7). The SAE also learns a latent for the combination of features 0 and 1 firing together, in addition to latents for features 0 and 1 firing separately. This is very similar to feature absorption, as the SAE learns not to fire the latents tracking features 0 and 1 on their own when this combo latent is active. This sort of problematic combo latent was predicted by previous work[6][7]. Below, we see some firing patterns for sample true features.
When features 0 and 1 activate together, the SAE activates only the combo latent, and thus the L1 penalty is lower than the sum of the true feature magnitudes. The SAE has found a way to use its extra capacity to "cheat" and represent features with fewer active latents than it should.
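To make the arithmetic concrete (assuming unit-norm decoder directions, which is standard practice for SAEs): firing the two aligned latents to reconstruct $f_0 + f_1$ costs an L1 penalty of $1 + 1 = 2$, while the combo latent with decoder direction $(f_0 + f_1)/\sqrt{2}$ reconstructs the same input with a single activation of $\|f_0 + f_1\| = \sqrt{2} \approx 1.41$.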
Tied SAEs learn perfect representations despite excess capacity
We run the same experiment above, but using a tied SAE. Below is the resulting cosine similarity between true features and SAE decoder latents. Since tied SAEs have identical encoder and decoder, we only present the decoder.
The tied SAE learns to perfectly represent the true features, with one latent per true feature. The SAE kills off all latents which do not map onto a true feature.
Tied SAEs continue to solve absorption with more latents than true features
Next, we add co-occurrence to the feature firing patterns, setting up a parent/child relationship with feature 0 as the parent feature and features 1 and 2 as child features. This means every time feature 1 or 2 fires, feature 0 must also fire, but feature 0 can also fire on its own. This sort of co-occurrence pattern would normally cause feature absorption in untied SAEs. We only investigate this with a tied SAE, as untied SAEs already learn broken latents for the independent features case, and we have already shown that untied SAEs suffer from feature absorption when training on co-occurring features.
We still see the tied SAE is able to perfectly reconstruct the true features despite the feature co-occurrence!
What happens if the SAE has fewer latents than features?
When an SAE has fewer latents than there are true features, the SAE will have to pick which features to represent and which to ignore. We use a toy setting with 20 features in 50 dimensions. These features are thus fully orthogonal and there is no superposition (we will examine superposition later in the post). The firing probability of these 20 features increases linearly with index, up to 0.3 at index 19, and the magnitude of the features decreases linearly from 20 at index 0 to 1 at index 19.
Below we plot the magnitudes, firing probabilities, and expected MSE (probability × magnitude²) for each feature.
For experiments in this section, we use an SAE with 5 latents. Since SAEs are trained using MSE loss, we expect the SAE to choose to represent the features with the largest expected MSE (probability × magnitude²). In our toy setup with 5 latents, this corresponds to features 4, 5, 6, 7, and 8.
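As a quick sanity check, this ranking can be computed directly (the starting firing probability at index 0 isn't stated above, so the 0.02 base rate here is an assumption):

```python
import numpy as np

n_features = 20
probs = np.linspace(0.02, 0.3, n_features)  # firing probabilities (start value assumed)
mags = np.linspace(20.0, 1.0, n_features)   # firing magnitudes
expected_mse = probs * mags**2              # expected loss if a feature is ignored

# The 5 features a 5-latent SAE should prefer to represent:
print(np.sort(np.argsort(expected_mse)[-5:]))  # -> [4 5 6 7 8]
```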
Below we train a 5-latent tied SAE on these features, with all features mutually independent.
Here, we see the SAE perfectly reconstructs features 4-8, as we predicted. Below we try the same experiment using a 5-latent untied SAE.
The untied SAE also perfectly reconstructs the 5 true features we predicted by max expected MSE, features 4-8.
Co-occurrence breaks tied SAEs
Next, we introduce a co-occurrence relationship where anytime that features 5 or 7 fire, feature 12 must also fire. This means that feature 12 is a parent feature and features 5 and 7 are child features in our hierarchical setup.
The tied SAE no longer learns clean representations for features 5 and 7. Both of these latents now include a portion of feature 12, and the latents for features 5 and 7 each also include a negative component of the other. Since the SAE can no longer achieve perfect reconstruction, it settles into this mixed representation instead. The negative component between features 5 and 7 likely compensates for the case when both latents fire together, which would otherwise include too large a component of feature 12 in the reconstruction.
It's not obvious that we should call this phenomenon "absorption", but it's clearly problematic.
Tied SAEs have a bias for orthogonal latents
A natural idea to try to fix the above broken latents would be to add a loss to force latents to be orthogonal to each other. However, this won't help here, because the latents are all already orthogonal! Below is a plot of the cosine similarities of the learned SAE latents to each other:
Tied SAEs can only reduce interference between latents by making them orthogonal to each other, so tied SAEs are heavily incentivized to learn mutually-orthogonal latents. This orthogonality bias is likely why tied SAEs perform better than untied SAEs with feature co-occurrence in general.
Multiple activation peaks indicate absorption
If we investigate the activation magnitudes on a sample of activations for these latents, we notice the following pattern:
For latents 1, 2, and 3, there is only a single peak in the latent activations. However, for latents 0 and 4, which correspond to the broken merged latents, there are 4 visible peaks. When the main feature tracked by one of these broken latents is active, the latent fires strongly. However, when that main feature is not active and feature 12 is active on its own, the latent fires weakly. This is shown for latent 4 below:
This asymmetry in the activation magnitudes for latent 4 is caused by the fact that sometimes feature 12 fires on its own, and sometimes it fires with feature 7. When feature 12 fires on its own, the latent only activates weakly. The two variants of the high and low activations come from the negative component of feature 5 in latent 4. Can we just force the latent to be orthogonal to activations which would cause the latent to fire weakly? Removing the feature 12 component from latent 4 should also remove the incentive for latent 4 to learn a negative component of feature 5.
Incentivizing a single activation peak
In a real SAE, we don't know what the ground-truth features are, but we can pretty easily find the activation magnitudes of each latent by testing out the SAE on sample model inputs. If low-activating values of a latent correspond to mixtures of underlying features, and the highest-activating cluster corresponds to a real feature we want to track, we can just set a threshold somewhere in the middle and penalize the SAE latent for not being orthogonal to any activation where the latent fires below that threshold.
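One simple way such a per-latent threshold might be chosen (a sketch; the midpoint-of-the-emptiest-bin heuristic below is just one option and assumes the peaks are cleanly separable):

```python
import numpy as np

def pick_threshold(acts: np.ndarray) -> float:
    # Histogram the nonzero activations of a single latent.
    nonzero = acts[acts > 0]
    hist, edges = np.histogram(nonzero, bins=50)
    # Find the emptiest interior bin between the first and last occupied
    # bins, and place the threshold at its center (i.e. in the gap between
    # the low-activation and high-activation peaks).
    occupied = np.nonzero(hist)[0]
    lo, hi = occupied[0], occupied[-1]
    gap = lo + np.argmin(hist[lo:hi + 1])
    return 0.5 * (edges[gap] + edges[gap + 1])
```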
We adjust our training procedure as follows:
The loss term is defined below, where $B$ is the batch size, $K$ is the number of latents, $\tau_j$ is the threshold for latent $j$, $W_j \in \mathbb{R}^D$ is the decoder representation of latent $j$, and $\lambda_{aux}$ is the auxiliary loss coefficient:
$$L_{aux} = \frac{\lambda_{aux}}{BK} \sum_{i=0}^{B} \sum_{j=0}^{K} \begin{cases} 0 & \text{if } h_{i,j} > \tau_j \\ \cos(x_i - b, W_j)^2 & \text{otherwise} \end{cases}$$

We now train an SAE using our new training method with the following hyperparams:
Using this new training scheme, we again perfectly recover true features despite co-occurrence!
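For reference, here's a minimal PyTorch sketch of the auxiliary loss term above (tensor names are our own: `h` is the batch of latent activations, `W` the tied decoder weights, `b` the decoder bias, and `tau` the per-latent thresholds):

```python
import torch
import torch.nn.functional as F

def aux_orthogonality_loss(x, h, W, b, tau, lam_aux):
    # Cosine similarity between each centered input and each latent direction.
    cos = F.cosine_similarity(
        (x - b).unsqueeze(1),  # (B, 1, D)
        W.unsqueeze(0),        # (1, K, D)
        dim=-1,
    )                          # -> (B, K)
    # Penalize only (input, latent) pairs where the latent fires at or below
    # its threshold: there, the latent direction should be orthogonal to the input.
    below = (h <= tau).float()
    return lam_aux * (below * cos.pow(2)).mean()
```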
Co-occurrence with a high MSE feature
In our examples so far, we've used feature 12 as the parent feature for co-occurrence. Feature 12 has a small enough expected MSE that the SAE would not have tried to represent it anyway. What happens if we instead make the parent feature one that the SAE already represents?
Next, we change our co-occurrence setup so that any time features 5 or 7 fire, feature 6 must also fire. This means feature 6, the feature with the highest expected MSE, is now the parent feature. Below we train a standard tied SAE on this setup:
Again, we see latents that merge features 5, 6, and 7. Interestingly, the SAE no longer devotes a single latent to each of features 5, 6, and 7, devoting only 2 latents to combinations of these 3 features; the freed-up latent now represents feature 3 instead.
Next, we use our new training setup to see if this will address the combo latents we see above.
Here we see the SAE is now correctly representing true features! However, something strange has happened to feature 6. The SAE no longer represents feature 6 at all, despite this feature resulting in the highest expected MSE loss of all. This is probably an unintended side-effect of our orthogonality loss making it difficult for the SAE to move the latent currently tracking feature 3 to a new position tracking feature 6. Still, at least all the latents are now perfectly tracking true features.
Superposition
So far, our toy setup has fully orthogonal features and thus no superposition. Next, we'll reduce the dimensionality of our true features to 19, so our 20 features can no longer all be mutually orthogonal. We still make these features as close to orthogonal as possible, resulting in features with cosine similarities of ±0.05 with each other. The cosine similarities of the true features with each other are plotted below.
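One simple way such nearly-orthogonal directions might be constructed (a sketch; the notebook may do this differently) is to start from random vectors and directly minimize their squared off-diagonal cosine similarities:

```python
import torch

n_features, dim = 20, 19
feats = torch.randn(n_features, dim, requires_grad=True)
opt = torch.optim.Adam([feats], lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    unit = feats / feats.norm(dim=-1, keepdim=True)
    sims = unit @ unit.T
    # Penalize all off-diagonal cosine similarities.
    off_diag = sims - torch.eye(n_features)
    off_diag.pow(2).sum().backward()
    opt.step()
feats = (feats / feats.norm(dim=-1, keepdim=True)).detach()
# The best achievable max |cos sim| for 20 vectors in 19 dims is
# 1/19 ≈ 0.053, matching the ±0.05 reported above.
```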
We begin by training a standard tied SAE using these features, along with the same probabilities and magnitudes from earlier experiments. We continue with the co-occurrence pattern from the previous experiment, where feature 6 must fire when either feature 5 or feature 7 fires. We increase the L1 coefficient to 1e-2 and train for 150 million samples.
We see a noisier version of the non-superposition case, where features 3, 4, and 8 are clearly tracked, but features 5, 6, and 7 are mixed together. We now train using our modified SAE training regime:
We again see roughly what we saw in the non-superposition case. The SAE learns clean latents for features 3, 4, 5, 7, and 8, but not feature 6.
What about SAEs trained on real LLMs?
We have so far struggled to get this technique to work well on SAEs trained on real LLMs. We suspect this is because the activation peaks in real SAEs are not as cleanly separable as in our toy example, or because too many features are absorbed into each latent. If the activation peaks of the main feature a latent is trying to track overlap with those of absorbed features, it's not obvious how to decide which activations to penalize. We likely need a smarter way to model the peaks in real SAE latent activation histograms, possibly via clustering or a mixture-of-Gaussians model. It's also not obvious that in real models the highest-activating peak is actually the main feature we want the latent to track when there are multiple visible peaks.
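As a starting point, here's a sketch of fitting a mixture of Gaussians to a latent's nonzero activations with scikit-learn, selecting the number of peaks by BIC (illustrative only; we have not validated this on real SAEs):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_activation_peaks(acts: np.ndarray, max_components: int = 5):
    nonzero = acts[acts > 0].reshape(-1, 1)
    # Fit mixtures with 1..max_components peaks.
    candidates = [
        GaussianMixture(n_components=k, random_state=0).fit(nonzero)
        for k in range(1, max_components + 1)
    ]
    # Pick the component count with the lowest BIC.
    best = min(candidates, key=lambda gm: gm.bic(nonzero))
    return best.means_.ravel(), best.covariances_.ravel()
```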
Extreme Narrowness and Matryoshka SAEs
So far, our experiments with narrow SAEs still require the SAE to represent both the parent feature and the child features. What if we make the SAE so narrow that only the parent feature can be represented? Surely such an SAE would perfectly reconstruct the parent feature without any interference from child features?
This is the idea behind Matryoshka SAEs[2][1]. In a Matryoshka SAE, the SAE needs to reconstruct the input using subsets of latents of increasing size. This allows the narrower SAE sizes to represent parent features, hopefully without any broken latents, and then latents in the larger nesting size of the Matryoshka SAE can perfectly represent child features.
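For intuition, here's a minimal sketch of the nested reconstruction loss that characterizes Matryoshka SAEs (prefix sizes and weighting are illustrative assumptions; see the cited posts for actual implementations):

```python
import torch

def matryoshka_mse(x, h, W_dec, b_dec, prefix_sizes=(1, 4, 16)):
    # h: (B, K) latent activations; W_dec: (K, D) decoder; x: (B, D) inputs.
    loss = 0.0
    for m in prefix_sizes:
        # Reconstruct using only the first m latents, so early latents are
        # pushed toward coarse (parent-like) features.
        x_hat = h[:, :m] @ W_dec[:m] + b_dec
        loss = loss + (x - x_hat).pow(2).mean()
    return loss / len(prefix_sizes)
```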
Co-occurrence breaks single-latent SAEs
We test the hypothesis that a narrow SAE will perfectly learn parent features by training a 1-latent SAE in a toy setting with 4 true features in a parent-child relationship. In our toy model, feature 0 is the parent feature, and features 1 and 2 are child features. Feature 3 fires independently. Feature 0 fires with probability 0.3, and features 1 and 2 both fire with probability 0.4 if feature 0 is active. Feature 3 fires with probability 0.2. All features fire with magnitude 1.0.
We begin by training a single-latent untied SAE on this setup. We hope this SAE's single latent will perfectly represent our parent feature, feature 0.
Sadly, we see our assumption is incorrect. The SAE does represent feature 0 in its single latent, but it also merges in the child features 1 and 2. Feature 3, the independent feature, is fully excluded. Interestingly, the encoder of the untied SAE is nearly identical to the decoder, so the pattern is indeed different from our original absorption pattern for untied SAEs, where the encoder for a parent feature had a negative rather than positive cosine similarity with child features. While this is not technically absorption, it is still a broken latent.
Next let's try the same experiment using a tied SAE.
The tied SAE learns a nearly identical representation to the untied SAE. Both of these SAEs learn a single broken latent rather than correctly learning the parent feature.
Solving this using our activation orthogonality technique would require tweaking the technique to do the inverse of what we did previously and project out the high-activating peaks instead of the low-activating peaks. This requires modeling each peak location, and is thus out of scope for this toy example, but is left for future work.
What Does This Mean for Matryoshka SAEs?
The base assumption underlying why Matryoshka SAEs should solve absorption is not strictly true. That is, it is not true that a narrow SAE will perfectly represent the parent feature from a parent-child relationship. Instead, we see that a narrow SAE will learn a broken latent merging the parent and child features together. While this isn't technically feature absorption by our original definition, it's also not learning a correct representation of the underlying parent feature.
This doesn't mean that Matryoshka SAEs are not useful, but we should be cautious about assuming the latents in a Matryoshka SAE are tracking true features in spite of feature co-occurrence. It's also possible that under different assumptions about parent/child feature firing probabilities and magnitudes this problem may be less severe. For instance, if the parent feature fires much more frequently on its own than it does with any given child feature, this problem is likely to be less severe. It could be possible that in LLMs, underlying parent/child features follow this pattern, but it's hard to say anything with certainty about true features in LLMs.
We may be able to combine Matryoshka SAEs with variations on our activation orthogonality technique to project out the child features from Matryoshka parent latents, for example. It's possible that using a different loss term from MSE loss might fix this problem. Regardless, we do still feel that techniques that can include a concept of hierarchy in the SAE architecture like Matryoshka SAEs are an exciting direction worth pursuing further.
For more discussion of this issue in Matryoshka SAEs, see this comment on Noa Nabeshima's Matryoshka SAEs post and this colab notebook.
Conclusion
In this post, we've looked at SAEs in more toy settings, examining tied and untied SAEs in scenarios where the SAE is either too wide or too narrow for the number of true features. Tied SAEs appear to be more resilient to learning broken latents than untied SAEs, but tied SAEs still learn broken latents under feature co-occurrence when the SAE is narrower than the number of true features. Sadly, the too-narrow scenario is almost certainly the one we're in when we train SAEs on LLMs.
The toy settings in this post are not mathematical proofs, and it is very possible for our conclusions about tied SAEs to not hold under all possible toy settings of feature co-occurrence. That being said, proving what various SAE architectures will learn under what assumptions about underlying true features would be an exciting direction for future research.
In this work, we also present a possible path forward for solving broken latents due to feature co-occurrence, based on the observation that broken latents in tied SAEs correspond to multiple peaks in the activation histogram of affected latents. We have so far struggled to operationalize this insight into an absorption-resistant SAE trained on a real LLM, and suspect this is because the activation peaks in LLM SAE latents overlap. We plan to continue investigating whether being smarter about clustering activations in latent activation histograms could help solve this.
We also investigated one of the core assumptions of Matryoshka SAEs, and showed that in general SAEs will learn broken latents even if the SAE is too narrow to represent child features. We do not feel this should discount Matryoshka SAEs, and feel hierarchical SAEs are an exciting direction in general, but we should not expect them to be a perfect solution to feature absorption in their current form.
We hope as well that these toy models can help build intuition for what SAEs may learn and when they might go astray.
References

[1] Bart Bussman, Patrick Leask, and Neel Nanda. "Learning Multi-Level Features with Matryoshka SAEs" [link]. LessWrong, 2024.
[2] Noa Nabeshima. "Matryoshka Sparse Autoencoders" [link]. LessWrong, 2024.
[3] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. "Sparse Autoencoders Find Highly Interpretable Features in Language Models" [link]. arXiv:2309.08600, 2023.
[4] Mohammad Taufeeque, Philip Quirke, Maximilian Li, Chris Cundy, Aaron David Tucker, Adam Gleave, and Adrià Garriga-Alonso. "Planning in a recurrent neural network that plays Sokoban" [link]. arXiv:2407.15421, 2024.
[5] Adam Karvonen. "Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models" [link]. arXiv:2403.15498, 2024.
[6] Evan Anders, Clement Neo, Jason Hoelscher-Obermaier, and Jessica N. Howard. "Sparse autoencoders find composed features in small toy models" [link]. LessWrong, 2024.
[7] Demian Till. "Do sparse autoencoders find 'true features'?" [link]. LessWrong, 2024.