Therefore, many project ideas in that list aren’t an up-to-date reflection of what some researchers consider the frontiers of mech interp.
Can confirm, that list is SO out of date and does not represent the current frontiers. Zero offence taken. Thanks for publishing this list!
Might be worth putting a short notice at the top of each post saying that, with a link to this post or whatever other resource you'd now recommend? (inspired by the 'Attention - this is a historical document' on e.g. this PEP)
Fair point, I've been procrastinating on putting out an updated version (and don't have anything else I back enough to want to recommend in it's place - I haven't read this post closely enough yet), but adding that note to the top seems reasonable
This is a great post! Thank you for writing this up :)
On training SAEs on ConvNets - I recently trained SAEs for all layers of InceptionV1. I've written up a paper on some of the findings of early vision, with a specific focus on curve detectors (twitter thread on the paper and another on some branch specialisation related findings). The features look really good across the entire model, including finding interpretable, monosemantic features in the final layer which, to the best of my knowledge, hasn't been done before, which is really exciting! I'm hoping to put out a blog post focusing on on the final layer in the next couple of weeks (including circuit analysis between the last few layers).
To be able to say we fully understand any real neural network is such a huge step forward for the field and it seems like with SAEs we are well-positioned to actually achieve this goal now.
Some takes on some of these research questions:
Looking for opposing feature directions in SAEs
I checked a top-k SAE with 256k features and k=256 trained on GPT-4 and found only 286 features that had any other feature with cosine similarity < -0.9, and 1314 with cosine sim < -0.7.
SAE/Transcoder activation shuffling
I'm confident that when learning rate and batch size are tuned properly, not shuffling eventually converges to the same thing as shuffling. The right way to frame this imo is the efficiency loss from not shuffling, which from preliminary experiments+intuition I'd guess is probably substantial.
How much does initializing the encoder to be the transpose of the decoder (as done so here and here) help for SAEs and transcoders?
It helps tremendously for SAEs by very substantially reducing dead latents; see appendix C.1 in our paper.
Thanks Leo, very helpful!
The right way to frame this imo is the efficiency loss from not shuffling, which from preliminary experiments+intuition I'd guess is probably substantial.
The SAEs in your paper were trained with batch size of 131,072 tokens according to appendix A.4. Section 2.1 also says you use a context length of 64 tokens. I'd be very surprised if using 131,072/64 blocks of consecutive tokens was much less efficient than 131,072 tokens randomly sampled from a very large dataset. I also wouldn't be surprised if 131,072/2048 blocks of consecutive tokens (i.e. a full context length) had similar efficiency.
Were your preliminary experiments and intuition based on batch sizes this large or were you looking at smaller models?
I missed that appendix C.1 plot showing the dead latent drop with tied init. Nice!
I'm 80% that with optimal hyperparameters for both (you need to retune hparams when you change batch size), 131072/64 is substantially less efficient than 131072.
We find that at a batch size of 131072, when hyperparameters are tuned, then the training curves as a function of number of tokens are roughly the same as with a batch size of 4096 (see appendix A.4). So it is not the case that 131072 is in a degenerate large batch regime where efficiency is substantially degraded by batch size.
When your batch is not fully iid, this is like effectively having a smaller batch size of iid data (in the extreme, if your batch contains 64 copies of the same data, this is obviously the same as a 64x smaller batch size), but you still pay the compute cost of putting all 131072 tokens through the model.
Thanks for prediction. Perhaps I'm underestimating the amount of shared information between in-context tokens in real models. Thinking more about it, as models grow, I expect the ratio of contextual information which is shared across tokens in the same context to more token-specific things like part of speech to increase. Obviously a bigram-only model doesn't care at all about the previous context. You could probably get a decent measure of this just by comparing cosine similarities of activations within context to activations from other contexts. If true, this would mean that as models scale up, you'd get a bigger efficiency hit if you didn't shuffle when you could have (assuming fixed batch size).
Some MLPs or attention layers may implement a simple linear transformation in addition to actual computation.
@Lucius Bushnaq , why would MLPs compute linear transformations?
Because two linear transformations can be combined into one linear transformation, why wouldn't downstream MLPs/Attns that rely on this linearly transformed vector just learn the combined function?
Cross layer superposition
Had a bit of time to think about this. Ultimately because superposition as we know it is a property of the latent space rather than the neurons in the layer, it's not clear to me that this is the question to be asking. How do you imagine an experimental result would look like?
Toy example of what I would consider pretty clear-cut cross-layer superposition:
We have a residual MLP network. The network implements a single UAND gate (universal AND, calculating the pairwise ANDs of sparse boolean input features using only neurons), as described in Section 3 here.
However, instead of implementing this with a single MLP, the network does this using all the MLPs of all the layers in combination. Simple construction that achieves this:
Now we've made a network that computes boolean circuits in superposition, without the boolean gates living in any particular MLP. To read out the value of one of the circuit outputs before it shows up in the residual stream, you'll need to look at a direction that's a linear combination of neurons in all of the MLPs. And if you use an SAE to look at a single residual stream position in this network before the very final MLP layer, it'll probably show you a bunch of half-computed nonsense.
In a real network, the most convincing evidence to me would be a circuit involving sparse coded variables or operations that cannot be localized to any single MLP.
The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.
Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?
[Lucius] Identify better SAE sparsity penalties by reasoning about the distribution of feature activations
- In sparse coding, one can derive what prior over encoded variables a particular sparsity penalty corresponds to. E.g. an L1 penalty assumes a Laplacian prior over feature activations, while a log(1+a^2) would assume a Cauchy prior. Can we figure out what distribution of feature activations over the data we’d expect, and use this to derive a better sparsity penalty that improves SAE quality?
This is very interesting! What prior does log(1+|a|) correspond to? And what about using instead of ? Does this only hold if we expect feature activations to be independent (rather than, say, mutually exclusive)?
A prior that doesn't assume independence should give you a sparsity penalty that isn't a sum of independent penalties for each activation.
[Nix] Toy model of feature splitting
- There are at least two explanations for feature splitting I find plausible:
- Activations exist in higher dimensional manifolds in feature space, feature splitting is a symptom of one higher dimensional mostly-continuous feature being chunked into discrete features at different resolutions.
- There is a finite number of highly-related discrete features that activate on similar (but not identical) inputs and cause similar (but not identical) output actions. These can be summarized as a single feature with reasonable explained variance, but is better summarized as a collection of “split” features.
These do not sound like different explanations to me. In particular, the distinction between "mostly-continuous but approximated as discrete" and "discrete but very similar" seems ill-formed. All features are in fact discrete (because floating point numbers are discrete) and approximately continuous (because we posit that replacing floats with reals won't change the behavior of the network meaningfully).
As far as toy models go, I'm pretty confident that the max-of-K setup from Compact Proofs of Model Performance via Mechanistic Interpretability will be a decent toy model. If you train SAEs post-unembed (probably also pre-unembed) with width d_vocab, you should find one feature for each sequence maximum (roughly). If you train with SAE width , I expect each feature to split into roughly features corresponding to the choice of query token, largest non-max token, and the number of copies of the maximum token. (How the SAE training data is distributed will change what exact features (principal directions of variation) are important to learn.). I'm quite interested in chatting with anyone working on / interested in this, and I expect my MATS scholar will get to testing this within the next month or two.
Edit: I expect this toy model will also permit exploring:
[Lee] Is there structure in feature splitting?
- Suppose we have a trained SAE with N features. If we apply e.g. NMF or SAEs to these directions are there directions that explain the structure of the splitting? As in, suppose we have a feature for math and a feature for physics. And suppose these split into (among other things)
- 'topology in a math context'
- 'topology in a physics context'
- 'high dimensions in a math context'
- 'high dimensions in a physics context'
- Is the topology-ifying direction the same for both features? Is the high-dimensionifying direction the same for both features? And if so, why did/didn't the original SAEs find these directions?
I predict that whether or not the SAE finds the splitting directions depends on details about how much non-sparsity is penalized and how wide the SAE is. Given enough capacity, the SAE benefits (sparsity-wise) from replacing the (topology, math, physics) features with (topology-in-math, topology-in-physics), because split features activate more sparsely. Conversely, if the sparsity penalty is strong enough and there is not enough capacity to split, the loss recovered from having a topology feature at all (on top of the math/physics feature) may not outweigh the cost in sparsity.
Why we made this list:
We therefore thought it would be helpful to share our list of project ideas!
Comments and caveats:
We hope some people find this list helpful!
We would love to see people working on these! If any sound interesting to you and you'd like to chat about it, don't hesitate to reach out.
Foundational work on sparse dictionary learning for interpretability
Transcoder-related project ideas
Other
Applied interpretability
Intrinsic interpretability
Understanding features (not SDL)
Theoretical foundations for interpretability
Singular-learning-theory-related
Other
Meta-research and philosophy
Engineering
Papers from our first project here and here and from our second project here.