All of Michael Pearce's Comments + Replies

A hacky solution might be to look at the top activations using encoder directions AND decoder directions. We can think of the encoder as giving a latent's "specific" meaning and the decoder its "broad" meaning, which potentially overlaps with other latents. Discrepancies between the two sets of top activations would indicate absorption.

Untied encoders give sparser activations by effectively removing activations that can be better attributed to other latents. So an encoder direction’s top activations can only be understood in the context of all the other latents.
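For a concrete sense of the comparison (a minimal sketch, assuming a standard SAE with encoder weights `W_enc` of shape `(d_model, n_latents)` and decoder weights `W_dec` of shape `(n_latents, d_model)`; all names are illustrative):

```python
import torch

def top_activating_examples(acts: torch.Tensor, direction: torch.Tensor, k: int = 10):
    """Indices of the k examples (rows of `acts`, shape
    (n_examples, d_model)) projecting most strongly onto `direction`."""
    return (acts @ direction).topk(k).indices

def absorption_check(acts, W_enc, W_dec, latent: int, k: int = 10):
    # "Specific" view: top activations along the encoder direction.
    enc_top = set(top_activating_examples(acts, W_enc[:, latent], k).tolist())
    # "Broad" view: top activations along the decoder direction.
    dec_top = set(top_activating_examples(acts, W_dec[latent], k).tolist())
    # Low overlap between the two top-k sets is the discrepancy
    # that would flag possible absorption.
    return len(enc_top & dec_top) / k, enc_top, dec_top
```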

Top activa... (read more)

chanind
That's an interesting idea! That might help if training a new SAE with a tied encoder/decoder (or some loss which encourages the same thing) isn't an option. It seems like with absorption you're still going to get mixes of multiple features in the decoder, and a mix of the correct feature and the negative of excluded features in the encoder, which isn't ideal. Still, it's a good question whether it's possible to take a trained SAE with absorption and somehow identify the absorption and remove or mitigate it rather than training from scratch. It would also be really interesting if we could find a way to detect absorption and use that as a way to quantify the underlying feature co-occurrences.

I think you're correct that tying the encoder and decoder will mean that the SAE won't be as sparse. But then, maybe the underlying features we're trying to reconstruct are not themselves all sparse, so that could potentially be OK. E.g. things like "noun", "verb", "is alphanumeric", etc. are all things the model certainly knows, but they would be dense if tracked in an SAE. The true test will be to train some real tied SAEs and see how interpretable the results look.
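For concreteness, here is a minimal sketch of what such a tied SAE could look like (my own illustration, not a confirmed implementation; all names are placeholders):

```python
import torch
import torch.nn as nn

class TiedSAE(nn.Module):
    """Sketch of an SAE whose encoder and decoder share one weight
    matrix, so a latent's encoder and decoder directions cannot
    drift apart the way they can under absorption."""

    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_latents, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_latents))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # Encode with W, decode with W transposed: one direction per latent.
        f = torch.relu((x - self.b_dec) @ self.W.T + self.b_enc)
        x_hat = f @ self.W + self.b_dec
        return x_hat, f
```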

On the question of quantizing different feature activations differently: computing the description length using the entropy of a feature activation's probability distribution is flexible enough to distinguish different types of distributions. For example, a binary distribution would have an entropy of one bit or less, and distributions spread out over more values would have larger entropies.
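As a minimal sketch of that computation (illustrative only; the exact histogramming in our methodology may differ in detail):

```python
import numpy as np

def description_length_bits(acts: np.ndarray, bin_width: float) -> float:
    """Entropy in bits of a feature's activation distribution,
    discretized at `bin_width` (the effective float precision)."""
    edges = np.arange(acts.min(), acts.max() + 2 * bin_width, bin_width)
    counts, _ = np.histogram(acts, bins=edges)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

# A (near-)binary feature costs at most about one bit per activation;
# a distribution spread over many bins costs correspondingly more.
```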

In our methodology, the effective float precision matters because it sets the bin width for the histogram of a feature's activations that is then used to compute the en... (read more)

Jacob Dunefsky
Yep, that's completely true. Thanks for the reminder!

This work is really interesting. It makes sense that if you already have a class of likely features with known triggers, such as the unigrams, having a lookup table or embeddings for them will save compute, since you don't need to learn the encoder.

I wonder if this approach could be extended beyond tokens. For example, if we have residual-stream features from an upstream SAE, does it make sense to use those features for the lookup table in a downstream SAE? The vectors in the table might be the downstream representation of the same feature (with updates from the intermediate layers). Using features from an early-layer SAE might capture the effective tokens that form by combining common bigrams and trigrams.
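As a hedged sketch of what I mean (hypothetical names throughout; the vectors for active upstream latents are simply looked up and added, with no downstream encoder needed for that part):

```python
import torch
import torch.nn as nn

class UpstreamLookupSAE(nn.Module):
    """Sketch: a downstream SAE that keeps one learned vector per
    upstream-SAE latent, used as a lookup table alongside its own
    ordinary encoder/decoder latents."""

    def __init__(self, d_model: int, n_latents: int, n_upstream: int):
        super().__init__()
        self.enc = nn.Linear(d_model, n_latents)
        self.dec = nn.Linear(n_latents, d_model)
        self.table = nn.Parameter(torch.zeros(n_upstream, d_model))

    def forward(self, x: torch.Tensor, upstream_f: torch.Tensor):
        # `upstream_f` is (batch, n_upstream): sparse activations from
        # the earlier-layer SAE, reused here without learning an encoder.
        f = torch.relu(self.enc(x))
        x_hat = self.dec(f) + upstream_f @ self.table
        return x_hat, f
```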

tdooms
Tokens are indeed only a specific instantiation of hardcoding "known" features into an SAE; there are lots of interesting sparse features one can consider which may speed up training even further. I like the suggestion of trying to find the "enriched" token representations. While our work shows that such representations are likely bigrams and trigrams, using an extremely sparse SAE to reveal them could also work (say at layer 1 or 2). While this approach still has the drawback of having an encoder, that encoder can be shared across SAEs, which is still a large decrease in complexity. Also, the encoder will probably be simpler since it's earlier in the model. This idea can be implemented recursively across a suite of SAEs, where each layer can add to a pool of hardcoded features. In other words, each layer's SAE has its own encoder/decoder, and the decoder is copied (and fine-tuned) across later layers. This would make it possible to trace a feature through the model more faithfully than is currently possible.
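A rough sketch of that recursive scheme (illustrative only, with placeholder names; fine-tuning of the copied directions is omitted):

```python
import torch

def build_decoder_suite(d_model: int, new_per_layer: int, n_layers: int):
    """Each layer's SAE decoder = a copy of the pooled hardcoded
    directions from earlier layers, plus its own new directions,
    which are then added to the pool for later layers."""
    pool = torch.empty(0, d_model)
    decoders = []
    for _ in range(n_layers):
        own = torch.randn(new_per_layer, d_model) * 0.01
        decoders.append(torch.cat([pool.clone(), own]))
        pool = torch.cat([pool, own])
    return decoders
```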

The characterization of basin dimension here is super interesting. But it sounds like most of the framing is in terms of local minima. My understanding is that saddle points are much more likely in high-dimensional landscapes (e.g., see https://arxiv.org/abs/1406.2572), since there is likely always some direction leading to smaller loss.
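As a toy illustration (my own, not from the post) of why the Hessian spectrum matters here: a single negative eigenvalue already gives a descent direction, and a zero eigenvalue can hide a descent direction that only shows up at higher order:

```python
import torch
from torch.func import hessian

def loss(w: torch.Tensor) -> torch.Tensor:
    # Toy loss with a saddle at the origin: the w[1] direction
    # descends at second order, the w[2] direction only at fourth.
    return w[0] ** 2 - w[1] ** 2 - w[2] ** 4

w = torch.zeros(3)
H = hessian(loss)(w)
print(torch.linalg.eigvalsh(H))  # tensor([-2., 0., 2.])
```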

How does your model complexity measure work for saddle points? The quotes below suggest there could be issues, although I imagine the measure makes sense as long as the weights are sampled around the saddle ... (read more)

Daniel Murfet
This is an open question. In practice it seems to work fine even at strict saddles (i.e., points where the Hessian has no negative eigenvalues but there are still negative directions, which show up at higher than second order in the Taylor series), in the sense that you can get sensible estimates and they indicate something about the way structure is developing. But the theory hasn't caught up yet.