Demian Till — LessWrong

Replying to Takeaways From Our Recent Work on SAE Probing

Even just for evaluating the utility of SAEs for supervised probing though, I think it's unfair to use the same layer for all tasks. Afaik there could easily be tasks where the model represents the target concept using a small number of linear features at some layer, but not at the chosen layer. This will harm k-sparse SAE probe performance far more than the baseline performance because the baselines can make the best of the bad situation at the chosen layer by e.g. combining many features which are weakly correlated with the target concept and using non-linearities. I think it would be a fair test if the 'quiver of arrows' were expanded to include each method applied at each of a range of layers.
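A minimal sketch of what the expanded selection could look like, assuming each candidate is a (method, layer) pair scored on a validation split and that the winner is then reported on a held-out test split. The dictionary layout and the function names are hypothetical, not taken from the paper under discussion.

```python
def select_on_validation(val_scores):
    """Return the (method, layer) key with the highest validation score.

    val_scores maps (method_name, layer_index) -> validation metric.
    The key layout and the metric are assumptions of this sketch.
    """
    return max(val_scores, key=val_scores.get)


def quiver_test_score(val_scores, test_scores):
    """Pick the best (method, layer) on validation, then report its test score."""
    choice = select_on_validation(val_scores)
    return choice, test_scores[choice]
```

Selecting on validation rather than test keeps the comparison fair: a k-sparse SAE probe is allowed to win at whichever layer represents the concept most cleanly, and the baselines get the same freedom.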

Replying to Takeaways From Our Recent Work on SAE Probing

Demian Till, 8mo

Suppose we had a hypothetical 'ideal' SAE which exhaustively discovered all of the features represented by a model at a certain layer in their most 'atomic' form. Each latent's decoder direction is perfectly aligned with its respective feature direction. Zero reconstruction error, with all latents having clear, interpretable meaning. If we had such an SAE for each component of a model at each layer, this would obviously be extremely valuable, since we could use them to do circuit analysis and basically understand how the model works. Sure, it might still be painstaking, and maybe we'd wish that some of the features weren't so atomic or something, but basically we'd be in a...

Broken Latents: Studying SAEs and Feature Co-occurrence in Toy Models

chanind, Demian Till

Thanks to Jean Kaddour, Tomáš Dulka, and Joseph Bloom for providing feedback on earlier drafts of this post.

In a previous post on Toy Models of Feature Absorption, we showed that tied SAEs seem to solve feature absorption. However, when we tried training some tied SAEs on Gemma 2 2b, these still appeared to suffer from absorption effects (or something similar). In this post, we explore how this is possible by extending our investigation to toy settings where the SAE has more or fewer latents than true features. We hope this will build intuition for how SAEs work and what sorts of failure modes they have. Some key takeaways:

Tied SAEs fail to...

Replying to Sparse autoencoders find composed features in small toy models

Demian Till, 2y

Regarding some features not being learnt at all, I was anticipating this might happen when some features activate much more rarely than others, potentially incentivising SAEs to learn more common combinations instead of some of the rarer features. To see this, we'd need to experiment with more variations, as mentioned in my other comment.

Replying to Sparse autoencoders find composed features in small toy models

Demian Till, 2y

Nice work! I was actually planning on doing something along these lines and still have some things I'd like to try.

Interestingly, your SAEs appear to be generally failing to even find optimal solutions w.r.t. the training objective. For example, in your first experiment with perfectly correlated features, I think the optimal solution in terms of reconstruction loss and L1 loss combined (regardless of the choice of the L1 loss weighting) would have the learnt feature directions (decoder weights) pointing perfectly diagonally. It looks like very few of your hyperparameter combinations even came close to this solution.
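A quick numerical check of the diagonal claim, under assumptions that are mine rather than from the post being replied to: two orthogonal unit feature directions in 2D, both features always firing together with the same magnitude `a` (so every input is `x = (a, a)`), unit-norm decoder directions, non-negative activations, and a loss of squared reconstruction error plus `lam` times the L1 norm of the activations. All function names are hypothetical.

```python
import math


def best_loss_for_direction(theta, a=1.0, lam=0.1):
    """Minimal (squared error + lam * L1) for a single latent whose unit
    decoder direction sits at angle theta, reconstructing x = (a, a)."""
    proj = a * math.cos(theta) + a * math.sin(theta)  # x . d
    c = max(0.0, proj - lam / 2.0)                    # loss-minimising activation
    err = 2 * a * a - 2 * c * proj + c * c            # |x - c * d|^2
    return err + lam * c


def best_angle_degrees(a=1.0, lam=0.1, steps=360):
    """Grid-search the decoder angle over [0, 90] degrees."""
    angles = [math.radians(90.0 * i / steps) for i in range(steps + 1)]
    best = min(angles, key=lambda t: best_loss_for_direction(t, a, lam))
    return math.degrees(best)


def axis_aligned_pair_loss(a=1.0, lam=0.1):
    """Loss when two latents sit on the two feature axes instead,
    each reconstructing one coordinate of x = (a, a)."""
    c = max(0.0, a - lam / 2.0)
    return 2 * ((a - c) ** 2 + lam * c)
```

In this setup the single-latent loss is lowest at 45 degrees for any `lam` small enough that the latent stays active, because the optimal loss depends on the angle only through the projection `x . d`, which peaks on the diagonal. For small `lam` the diagonal latent also beats the axis-aligned pair, since it pays an L1 cost of roughly `a * sqrt(2)` rather than `2 * a`.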

My post was concerned primarily with the training objective being misaligned with what we really want, but...

Replying to Do sparse autoencoders find "true features"?

Demian Till, 2y

Nice, that's promising! It would also be interesting to see how those peaks are affected when you retrain the SAE both on the same target model and on different target models.

Replying to Do sparse autoencoders find "true features"?

Demian Till, 2y

Thanks, that's very interesting!

Replying to Do sparse autoencoders find "true features"?

Demian Till, 2y

Testing it with Pythia-70M and few enough features to permit the naive calculation sounds like a great approach to start with.

Closest neighbour rather than average over all sounds sensible. I'm not certain what you mean by unique vs non-unique. If you're referring to situations where there may be several equally close closest neighbours, then I think we can just take the mean cos-sim of those neighbours, so they all affect the loss but the magnitude of the loss stays within the normal range.
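One way to read the closest-neighbour idea with tie handling, sketched in plain Python. The use of signed cosine similarity, the `tol` threshold for deciding what counts as a tie, and the function names are assumptions for illustration, not details from the post. This is the naive version, quadratic in the number of decoder directions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def nearest_neighbour_penalty(decoder_rows, tol=1e-9):
    """Average, over decoder directions, of the cos-sim to the closest
    other direction. Equally close neighbours are averaged, so each of
    them contributes while the per-direction term stays at most 1."""
    total = 0.0
    for i, u in enumerate(decoder_rows):
        sims = [cosine(u, v) for j, v in enumerate(decoder_rows) if j != i]
        top = max(sims)
        tied = [s for s in sims if top - s <= tol]
        total += sum(tied) / len(tied)
    return total / len(decoder_rows)
```

Averaging over the tied neighbours leaves the value of the penalty essentially unchanged; the difference would show up in an autodiff framework, where every tied neighbour then receives gradient instead of only one of them.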

Only on features that activate also sounds sensible, but the decoder weights of neurons that didn't activate would need to be allowed to update if they were the closest...

Replying to Do sparse autoencoders find "true features"?

Demian Till, 2y

Thanks for clarifying! Indeed the encoder weights here would be orthogonal. But I'm suggesting applying the orthogonality regularisation to the decoder weights which would not be orthogonal in this case.

Replying to Do sparse autoencoders find "true features"?

Demian Till, 2y

Thanks, I mentioned this as a potential way forward for tackling quadratic complexity in my edit at the end of the post.

Replying to Do sparse autoencoders find "true features"?

Demian Till, 2y

Regarding achieving perfect reconstruction and perfect sparsity in the limit, I was also thinking along those lines, i.e. in the limit you could have a single neuron in the sparse layer for every possible input direction. However, please correct me if I'm wrong, but assuming the SAE has only one hidden layer, I don't think you could prevent neurons from activating for nearby input directions (unless all input directions had equal magnitude), so you'd end up with many neurons activating for any given input and thus imperfect sparsity.
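A small illustration of the magnitude point, assuming a single ReLU encoder neuron of the form `relu(w . x + b)` in 2D. The 5 degree tolerance, the specific angles, and the helper names are arbitrary choices for this sketch.

```python
import math


def unit(angle_deg):
    """Unit vector in 2D at the given angle."""
    t = math.radians(angle_deg)
    return (math.cos(t), math.sin(t))


def scaled(v, s):
    """Scale a vector by s."""
    return tuple(s * vi for vi in v)


def relu_neuron(x, w, b):
    """One hidden-layer neuron: relu(w . x + b)."""
    return max(0.0, sum(wi * xi for wi, xi in zip(w, x)) + b)


# Neuron tuned to the 0 degree direction. The bias is set so that
# unit-norm inputs only activate it within 5 degrees of that direction.
w = unit(0.0)
b = -math.cos(math.radians(5.0))

on_target = relu_neuron(unit(0.0), w, b)                # fires
nearby_unit = relu_neuron(unit(10.0), w, b)             # silent at norm 1
nearby_large = relu_neuron(scaled(unit(10.0), 2.0), w, b)  # fires at norm 2
```

The bias can only threshold the projection `w . x`, which mixes angle and magnitude, so an input 10 degrees away is rejected at norm 1 but accepted at norm 2. With equal input magnitudes the threshold would separate directions cleanly.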

Otherwise mostly agreed. Though as discussed, as well as making it necessary to figure out how to break apart feature combinations (as you said), feature splitting would also seem to incur the risk of less common “true features” not being represented even within combinations so those would get missed entirely.

Do sparse autoencoders find "true features"?

Demian Till

Thanks to Joseph Bloom and James Oldfield for giving feedback on drafts, which helped improve the post.

In this post I'll discuss an apparent limitation of sparse autoencoders (SAEs) in their current formulation as they are applied to discovering the latent features within AI models such as transformer-based LLMs. In brief, I'll cover the following:

I'll argue that the L1 regularisation used to promote sparsity when training SAEs may cause neurons in the sparse layer to learn to represent common combinations of features rather than the individual features that we want them to discover
As well as making it more difficult to understand what the actual latent features are, I'll also argue that this limitation...