User Comment Replies

chanind21d10

The encoder of a Sparse Autoencoder (SAE) is assumed to produce a single scalar activation given by a linear combination of the features. Under feature absorption, the encoder output is modeled as:
$z = f_{2} + (1 - δ) f_{1},$
where $δ \in [0, 1]$ is an absorption parameter. A higher $δ$ means that the contribution from $f_{1}$ is attenuated (or absorbed) into the representation.

This doesn't seem correct. The encoder output should be a function of h. We need to specify the SAE encoder and decoder mathematically I think. We need to specify ... (read more)

Broken Latents: Studying SAEs and Feature Co-occurrence in Toy Models

chanind2mo20

The behavior you see in your study is fascinating as well! I wonder if using a tied SAE would force these relationships in your work to be even more obvious, since if the SAE decoder in a tied SAE tries to mix co-occurring parent/child features together it has to also mix them in the encoder and thus it should show up in the activation patterns more clearly. If an underlying feature co-occurs between two latents (e.g. a parent feature), tied SAEs don't have a good way to keep the latents themselves from firing together and thus showing up as a co-firing la... (read more)

1Matthew A. Clarke2mo

I agree that comparing tied and untied SAE might be a good way to separate cases where the underlying features are inherently co-occurring. I have wondered if this might lead to a way to better understand the structure of how the model makes decisions, similar to the work of Adam Shai (https://arxiv.org/abs/2405.15943). It may be that cases where the tied SAE has to just not represent a feature, are a good way of detecting inherently hierarchical features (to work out if something is an apple you first decide if it is a fruit for example), if LLM learn to think that way. I think what you say about clustering of activation densities makes sense, though in the case of Gemma I think the JumpReLU might need to be corrected for to 'align' them. In terms of classifying 'uncertainty' vs 'compositional' cases of co-occurrence, I believe there is a in the graph structure of what features co-occured with one another, but have not yet nailed down how much structure implies function and vice-versa. Compositionality seemed to correlate with a 'hub and spoke' type of structure (see here, top left panel: https://feature-cooccurrence.streamlit.app/?model=gemma-2-2b&sae_release=gemma-scope-2b-pt-res-canonical&sae_id=layer_12_width_16k_canonical&size=4&subgraph=4740 and https://feature-cooccurrence.streamlit.app/?model=gemma-2-2b&sae_release=gemma-scope-2b-pt-res-canonical&sae_id=layer_12_width_16k_canonical&size=4&subgraph=4740 . We also found a cluster in layer 18 that mirrors the first example above in layer 12 of Gemma-2-2b. It has worse compostional encoding, but also a slightly less hub-like structure: https://feature-cooccurrence.streamlit.app/?model=gemma-2-2b&sae_release=res-jb&sae_id=layer_18_width_16k_canonical&size=5&subgraph=201. For ambiguity, we normally see a close to fully connected graph e.g. https://feature-cooccurrence.streamlit.app/?model=gemma-2-2b&sae_release=res-jb&sae_id=layer_18_width_16k_canonical&size=5&subgraph=201 This is clearly not perfect, a

Matryoshka Sparse Autoencoders

chanind2mo10

Yeah I think that's right, the problem is that the SAE sees 3 very non-orthogonal inputs, and settles on something sort of between them (but skewed towards the parent). I don't know how to get the SAE to exactly learn the parent only in these scenarios - I think if we can solve that then we should be in pretty good shape.

This is all sketchy though. It doesn't feel like we have a good answer to the question "How exactly do we want the SAEs to behave in various scenarios?"

I do think the goal should be to get the SAE to learn the true underlying features, at ... (read more)

Matryoshka Sparse Autoencoders

chanind2mo10

It might also be an artifact of using MSE loss. Maybe a different loss term for reconstruction loss might not have this problem?

Matryoshka Sparse Autoencoders

chanind2mo*Ω4100

I tried digging into this some more and think I have an idea what's going on. As I understand it, the base assumption for why Matryoshka SAE should solve absorption is that a narrow SAE should perfectly reconstruct parent features in a hierarchy, so then absorption patterns can't arise between child and parent features. However, it seems like this assumption is not correct: narrow SAEs sill learn messed up latents when there's co-occurrence between parent and child features in a hierarchy, and this messes up what the Matryoshka SAE learns.

I did this invest... (read more)

2Noa Nabeshima2mo

You know I was thinking ab this-- say that there are two children and they're orthogonal to the parent and each have probability 0.4 given the parent. If you imagine the space it looks like three clusters, two with probability 0.4, norm 1.4 and one with probability 0.2 and norm 1. They all have high cosine similarity with each other. From this frame, having the parent 'include' the children directions a bit doesn't seem that inappropriate. One SAE latent setup that seems pretty reasonable is to actually have one parent latent that's like "one of these three clusters is active" and three child latents pointing to each of the three clusters. The parent latent decoder in that setup would also include a bit of the child feature directions. This is all sketchy though. It doesn't feel like we have a good answer to the question "How exactly do we want the SAEs to behave in various scenarios?"

3Noa Nabeshima2mo

This is cool! I wonder if it can be fixed. I imagine it could be improved some amount by nudging the prefix distribution, but it doesn't seem like that will solve it properly. Curious if this is a large issue in real LMs. It's frustrating that there aren't ground-truth features we have access to in language models. I think how large of a problem this is can probably be inferred from a description of the feature distribution. It'd be nice to have a better sense of what that distribution is (assuming the paradigm is correct enough).

Matryoshka Sparse Autoencoders

chanind3moΩ380

Awesome work with this! Definitely looks like a big improvement over standard SAEs for absorption. Some questions/thoughts:

In the decoder cos sim plot, it looks like there's still some slight mixing of features in co-occurring latent groups including some slight negative cos sim, although definitely a lot better than in the standard SAE. Given the underlying features are orthogonal, I'm curious why the Matryoshka SAE doesn't fully drive this to 0 and perfectly recover the underlying true features? Is it due to the sampling, so there's still some chance for... (read more)

5Noa Nabeshima3mo

Even with all possible prefixes included in every batch the toy model learns the same small mixing between parent and children (this was best out of 2, for the first run the matryoshka didn't represent one of the features): https://sparselatents.com/matryoshka_toy_all_prefixes.png Here's a hypothesis that could explain most of this mixing. If the hypothesis is true, then even if every possible prefix is included in every batch, there will still be mixing. Hypothesis: This could explain these weird properties of the heatmap: - Parent decoder vector has small positive cosine similarity with child features - Child decoder vectors have small negative cosine similarity with other child features Still unexplained by this hypothesis: - Child decoder vectors have very small negative cosine similarity with the parent feature.

Toy Models of Feature Absorption in SAEs

chanind5mo10

Thank you for sharing this! I clearly didn't read the original "Towards Monsemanticity" closely enough! It seems like the main argument is that when the weights are untied, the encoder and decoder learn different vectors, thus this is evidence that the encoder and decoder should be untied. But this is consistent with the feature absorption work - we see the encoder and decoder learning different things, but that's not because the SAE is learning better representations but instead because the SAE is finding degenerate solutions which increase sparsity.

Are t... (read more)

3CallumMcDougall5mo

I don't know of specific examples, but this is the image I have in my head when thinking about why untied weights are more free than tied weights: I think more generally this is why I think studying SAEs in the TMS setup can be a bit challenging, because there's often too much symmetry and not enough complexity for untied weights to be useful, meaning just forcing your weights to be tied can fix a lot of problems! (We include it in ARENA mostly for illustration of key concepts, not because it gets you many super informative results). But I'm keen for more work like this trying to understand feature absorption better in more tractible cases

Toy Models of Feature Absorption in SAEs

chanind5mo*30

I'm not as familiar with the history of SAEs - were tied weights used in the past, but then abandoned due to resulting in lower sparsity? If that sparsity is gained by creating feature absorption, then it's not a good thing since absorption does lead to higher sparsity but worse interpretability. I'm uncomfortable with the idea that higher sparsity is always better since the model might just have some underlying features its tracking that are dense, and IMO the goal should be to recover the model's "true" features, if such a thing can be said to exist, rat... (read more)

2K. Uhlig5mo

Originally they were tied (because it makes intuitive sense), but I believe Anthropic was the first to suggest untying them, and found that this helped it differentiate similar features: That post also includes a summary of Neel Nanda's replication of the experiments, and they provided an additional interpretation of this that I think is interesting.

Toy Models of Feature Absorption in SAEs

chanind5mo30

That's an interesting idea! That might help if training a new SAE with tied encoder/decoder (or some loss which encourages the same thing) isn't an option. It seems like with absorption you're still going to get mixes of of multiple features in the decoder, and a mix of the correct feature and the negative of excluded features in the encoder, which isn't ideal. Still, it's a good question whether it's possible to take a trained SAE with absorption and somehow identify the absorption and remove it or mitigate it rather than training from scratch. It would a... (read more)

[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

chanind5mo50

Also worth noting, in the paper we only classify something as "absorption" if the main latent fully doesn't fire. We also saw cases which I would call "partial absorption" where the main latent fires, but weakly, and both the absorbing latent and the main latent have positive cosine sim with the probe direction, and both have ablation effect on the spelling task.

Another intuition I have is that when the SAE absorbs a dense feature like "starts with S" into a sparse latent like "snake", it loses the ability to adjust the relative levels of the various compo... (read more)

[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

chanind5mo80

My take is that I'd expect to see absorption happen any time there's a dense feature that co-occurs with more sparse features. So for example things like parts of speech, where you could have a "noun" latent, and things that are nouns (e.g. "dogs", "cats", etc...) would probably show this as well. If there's co-occurrence, then the SAE can maximize sparsity by folding some of the dense feature into the sparse features. This is something that would need to be validated experimentally though.

It's also problematic that it's hard to know where this will happen... (read more)

1eggsyntax5mo

Determining ground-truth definitely seems like the tough aspect there. Very good idea to come up with 'starts with _' as a case where that issue is tractable, and another good idea to tackle it with toy models where you can control that up front. Thanks!

Implementing activation steering

chanind1y10

I'd also like to humbly submit the Steering Vectors Python library to the list as well. We built this library on Pytorch hooks, similar to Baukit, but with the goal that it should work automatically out-of-the-box on any LLM on huggingface. It's different from some of the other libraries in that regard, since it doesn't need a special wrapper class, but works directly with a Huggingface model/tokenizer. It's also more narrowly focused on steering vectors than some of the other libraries.

LESSWRONG
LW

All of chanind's Comments + Replies