I think people who read A Mathematical Framework should note that its mathematical claim about one-layer transformers being equivalent to skip-trigrams is IMO wrong, and that many people interpret the induction head hypothesis as being much stronger than the evidence supports.
(I think that many other claims in the paper are pretty dubious, e.g. the stuff about interpreting models as sums of paths, because there is a softmax nonlinearity after these paths, but I have never gotten around to writing this up and probably never will.)
Fair point, I'll add that in to the post. The main reason I recommend it so highly and prominently is that I think it builds valuable conceptual frameworks for reasoning about the pieces of a transformer, even if it somewhat overclaims on how far it can get on interpreting tiny attention-only models, and I think those broad intuitions still stand even after your critiques. Eg strict induction heads as an example of the kind of algorithm that can be implemented with attention, even if it's not fully faithful to the underlying model. But I agree that these are worthwhile caveats to have in mind when reading, and the paper shouldn't be blindly recommended.
Thanks! I agree that thinking through the idealized induction head algorithm seems healthy, but I think it seems important to know that that algorithm isn’t much of what those heads are actually doing!
Great list! Would you consider
"The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks"
https://arxiv.org/abs/2306.17844
a candidate for "important work in mech interp [which] has properly built on [Progress Measures.]" ?
Are you aware of any problems with it?
I'm not aware of any problems with it. I think it's a nice paper, but not really at my bar for important work (which is a really high bar, to be clear - fewer than half the papers in this post probably meet it)
Progress Measures for Grokking via Mechanistic Interpretability (Neel Nanda et al) - nothing important in mech interp has properly built on this IMO, but there's just a ton of gorgeous results in there. I think it's the most (only?) truly rigorous reverse-engineering work out there
Totally agree that this has gorgeous results, and this is what got me into mech interp in the first place! Re "most (only?) truly rigorous reverse-engineering work out there": I think the clock and pizza paper seems comparably rigorous, and there's also my recent Compact Proofs of Model Performance via Mechanistic Interpretability (and Gabe's heuristic analysis of the same Max-of-K model), and the work one of my MARS scholars did showing that some pizza models use a ReLU to compute numerical integration, which is the first nontrivial mechanistic explanation of a nonlinearity found in a trained model (nontrivial in the sense that it asymptotically compresses the brute-force input-output behavior with a (provably) non-vacuous bound).
Thanks! That was copied from the previous post, and I think this is fair pushback, so I've hedged the claim to "one of the most". Does that seem reasonable?
I haven't deeply engaged enough with those three papers to know if they meet my bar for recommendation, so I've instead linked to your comment from the post
I am curious about your thoughts on the differences between activation patching and SAEs. Do you think they are complementary research, or may there be some overarching idea that encapsulates both?
Is there any application for one that can't be done with the other? It seems that activation patching may result in more interpretable concepts, but SAE may result in more fundamental features. My intuition is that it may be possible for activation patching to replace SAEs in the future.
Imo they're just completely different techniques, which aren't really comparable. Activation patching is about understanding the difference between two activations by patching one to replace the other and seeing what happens. SAEs are a technique for decomposing an activation into interpretable pieces
Thanks Neel, keep this coming - even if only once every few years :) You helped me clarify lots of confusion I had about the existing techniques.
I am a huge fan of steering vectors / control vectors, and I would love to see future research showing if they can be linearly combined together to achieve multiple behaviours simultaneously (I made a post about this). I don't think it's just "internal work" - I think it's a hint to the fact that language semantics can be linearised as vector spaces (I hope I will be able to formalise mathematically this intuition).
Here is a proposal for a possible ELK solution using that approach.
Glad you liked the post!
I'm also pretty interested in combining steering vectors. I think a particularly promising direction is using SAE decoder vectors for this, as SAEs are designed to find feature vectors that independently vary and can be added.
I agree steering vectors are important as evidence for the linear representation hypothesis (though at this point I consider SAEs to be much superior as evidence, and think they're more interesting to focus on)
This post represents my personal hot takes, not the opinions of my team or employer. This is a massively updated version of a similar list I made two years ago
There’s a lot of mechanistic interpretability papers, and more come out all the time. This can be pretty intimidating if you’re new to the field! To try helping out, here's a reading list of my favourite mech interp papers: papers which I think are important to be aware of, often worth skimming, and sometimes worth reading deeply (time permitting). I’ve annotated these with my key takeaways, what I like about each paper, which bits to deeply engage with vs skim, etc. I wrote a similar post 2 years ago, but a lot has changed since then, thus v2!
Note that this is not trying to be a comprehensive literature review - this is my answer to “if you have limited time and want to get up to speed on the field as fast as you can, what should you do”. I’m deliberately not following academic norms like necessarily citing the first paper introducing something, or all papers doing some work, and am massively biased towards recent work that is more relevant to the cutting edge. I also shamelessly recommend a bunch of my own work here, and probably haven't always clearly indicated which papers I was involved in, sorry!
How to read this post: I've bolded the most important papers to read, which I recommend prioritising. All of the papers are annotated with my interpretation and key takeaways, and tbh I think reading that may be comparably good to skimming the paper. And there are far too many papers to read all of them deeply unless you want to make that a significant priority. I recommend reading all my summaries, noting the papers and areas that excite you, and then trying to dive deeply into those.
Foundational Work
A Mathematical Framework for Transformer Circuits (Nelson Elhage et al, Anthropic) - absolute classic, foundational ideas for how to think about transformers. See my youtube tutorial (I hear this is best watched after reading the paper, and adds additional clarity).
Superposition
Superposition is a core principle/problem in model internals. For any given activation (eg the output of MLP13), we believe that there’s a massive dictionary of concepts/features the model knows of. Each feature has a corresponding vector, and model activations are a sparse linear combination of these meaningful feature vectors. Further, there are more features in the dictionary than activation dimensions, and they are thus compressed in and interfere with each other, essentially causing cascading errors. This phenomenon of compression is called superposition.
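To make the interference concrete, here is a minimal toy sketch in plain Python (not taken from any of the papers below). It packs three hypothetical feature directions into a 2-dimensional activation space and reads each feature back off with a dot product. The directions, the helper names (`activation_from`, `naive_readout`) and the coefficients are all illustrative assumptions, and real models have thousands of dimensions rather than two.

```python
import math

# Three hypothetical unit-norm feature directions in a 2-dimensional activation
# space. More features than dimensions, so they cannot all be orthogonal.
N_FEATURES = 3
feature_dirs = [
    (math.cos(2 * math.pi * i / N_FEATURES), math.sin(2 * math.pi * i / N_FEATURES))
    for i in range(N_FEATURES)
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def activation_from(coeffs):
    """Build an activation as a linear combination of the feature directions."""
    return tuple(
        sum(c * d[k] for c, d in zip(coeffs, feature_dirs)) for k in range(2)
    )

def naive_readout(activation):
    """Estimate each feature's coefficient by projecting onto its direction."""
    return [dot(activation, d) for d in feature_dirs]

# Sparse case: only feature 0 is active.
sparse_readout = naive_readout(activation_from([1.0, 0.0, 0.0]))

# Dense case: all three features are active at once.
dense_readout = naive_readout(activation_from([1.0, 1.0, 1.0]))

print("sparse:", [round(x, 3) for x in sparse_readout])
print("dense: ", [round(x, 3) for x in dense_readout])
```

In the sparse case the active feature is read off as roughly 1, with interference of about -0.5 on the other two. In the dense case the three directions cancel and every readout is roughly zero, which is one way to see why this kind of compression leans on features being sparse.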
Sparse Autoencoders
SAEs are a tool to interpret model activations in superposition - they’re a one hidden layer ReLU autoencoder (basically a transformer MLP layer), and are trained to reconstruct a model’s activations. L1 regularisation is applied to make the hidden layer activations sparse. Though not directly trained to be interpretable, the hope is that each unit (or feature) corresponds to an interpretable feature. The encoder + ReLU learns the sparse feature coefficients, and the decoder is a dictionary of feature vectors. Empirically, it seems to work, and I think they’re one of the most promising tools in mech interp right now.
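As a rough illustration of that description, here is a dependency-free sketch of an SAE forward pass and loss for a single activation vector. The function names (`sae_forward`, `sae_loss`), the hand-set weights and the `l1_coeff` value are assumptions made for the example. A real SAE learns its weights by gradient descent in something like PyTorch, and implementations differ in details such as bias handling and decoder normalisation.

```python
import math

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(rows, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, x)) for row in rows]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Return (feature_acts, reconstruction) for one activation vector x."""
    # Encoder + ReLU: non-negative, hopefully sparse, feature coefficients.
    pre_acts = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    feats = relu(pre_acts)
    # Decoder: each row of W_dec is one dictionary feature vector.
    recon = [
        b_dec[k] + sum(f * row[k] for f, row in zip(feats, W_dec))
        for k in range(len(x))
    ]
    return feats, recon

def sae_loss(x, feats, recon, l1_coeff):
    """Squared reconstruction error plus an L1 sparsity penalty."""
    recon_err = sum((a - r) ** 2 for a, r in zip(x, recon))
    l1_penalty = sum(abs(f) for f in feats)
    return recon_err + l1_coeff * l1_penalty

# Hand-set (not trained!) weights: 3 dictionary features for a 2-d activation.
S = math.sqrt(3) / 2
DIRS = [(1.0, 0.0), (-0.5, S), (-0.5, -S)]
W_enc, b_enc = DIRS, [0.0, 0.0, 0.0]
W_dec, b_dec = DIRS, [0.0, 0.0]

x = (1.0, 0.0)
feats, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, feats, recon, l1_coeff=0.1)
print(feats, recon, loss)
```

With these toy weights only the first feature fires on `x`, the ReLU zeroes the negative pre-activations of the other two, and the reconstruction is exact, so the whole loss comes from the L1 term.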
To understand the actual technique I recommend this ARENA tutorial, sections 1, 6 & 7 (exposition + code >> papers), but here are some related papers worth understanding. Note that all of these came out in the past year, this is very much where a lot of the mech interp frontier is at! However, our understanding of them is still highly limited, and there are many uncertainties and open problems remaining, and I expect our understanding and best practices to be substantially different in a year or two.
Activation Patching
Activation patching (aka causal mediation analysis aka interchange interventions aka causal tracing aka resample ablations - argh why can't we agree on names for things!) is a core mech interp technique, worth understanding in a lot of detail. The key idea is that, for a given model behaviour, only a sparse set of components (heads and neurons) are likely relevant. We want to localise these components with causal interventions. But, given any prompt, many model behaviours go into it. For example, if we want to know where the knowledge that Michael Jordan plays basketball lives, this is hard - we can do things like deleting components and seeing if the model still says basketball, but maybe we deleted the "this is about sports" part or the "I am speaking English" part.
The key idea is to find contrast pairs - prompts which are as similar as possible, apart from the behaviour we care about, eg "Michael Jordan plays the sport of" and "Babe Ruth plays the sport of". If we patch activations from the Jordan token into the Ruth token (or vice versa), we control for things like "this is about sports" but change whether the sport is basketball or not, letting us surgically localise the right behaviour.
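The logic of that experiment can be sketched with a deliberately cartoonish toy, assuming a "model" made of just two named components. Nothing here is a real transformer or a real library API: `run_model`, the component names `topic` and `subject`, and the logit difference are all hypothetical stand-ins for running a model with hooks and caching its activations.

```python
def run_model(prompt, patch=None):
    """Run a toy two-component 'model'.

    patch: optional dict mapping a component name to a cached activation that
    is substituted for the freshly computed one.
    Returns (logit_diff, cache), where logit_diff stands in for
    logit(basketball) - logit(baseball).
    """
    patch = patch or {}
    cache = {}
    # Component 1: "this is about sports". Identical on both prompts.
    topic = 1.0 if "sport" in prompt else 0.0
    cache["topic"] = patch.get("topic", topic)
    # Component 2: which athlete is the subject (+1 Jordan, -1 otherwise).
    subject = 1.0 if "Jordan" in prompt else -1.0
    cache["subject"] = patch.get("subject", subject)
    logit_diff = cache["topic"] * cache["subject"]
    return logit_diff, cache

clean_prompt = "Michael Jordan plays the sport of"
corrupted_prompt = "Babe Ruth plays the sport of"

clean_diff, clean_cache = run_model(clean_prompt)
corrupted_diff, _ = run_model(corrupted_prompt)

# Patch one clean activation at a time into the corrupted run and record the
# resulting logit difference.
patch_effects = {
    name: run_model(corrupted_prompt, patch={name: clean_cache[name]})[0]
    for name in clean_cache
}
print(clean_diff, corrupted_diff, patch_effects)
```

Patching `topic` leaves the corrupted output unchanged, because both prompts agree on it, while patching `subject` fully restores the clean answer. That is the sense in which a contrast pair controls for everything the two prompts share.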
To understand the actual technique I recommend this ARENA tutorial (exposition + code >> papers), but here are some related papers worth understanding.
Narrow Circuits
A particularly important application of activation patching is finding narrow circuits - for some specific task, like answering multiple choice questions, which model components are crucial for performance on that task? Note that these components may do many other things too in other contexts, due to polysemanticity, but that is not relevant to this kind of analysis. At this point, there's a lot of narrow circuits work, but here's some of my favourites.
Note that this was a very popular area of mech interp work in 2023, but has fallen somewhat out of fashion. I am still excited to see narrow circuits work in the SAE basis (eg Sparse Feature Circuits above), work using narrow circuits to understand or do something on real-world tasks, especially in larger models, and work automating circuit discovery, especially automating the process of finding the meaning of the circuit. But I think that there's not that much value in more manual Indirect Object Identification style work in small models.
Paper Back-and-Forths
One of my favourite phenomena is when someone puts out an exciting paper that gets a lot of attention yet has some subtle flaws, and follow-up work identifies and clarifies these. Interpretability is dark and full of terrors, and it is very easy to have a beautiful, elegant hypothesis that is completely/partially wrong, yet easy to believe by overinterpreting your evidence. Red-teaming your own work and being on guard for this is a crucial skill as a researcher, and reading examples of this in the literature is valuable training.
My answer in rot13: Gur pber ceboyrz vf gung gurl qvq snpg vafregvba, abg snpg rqvgvat. Gb rqvg fbzrguvat lbh zhfg qryrgr gur byq guvat naq vafreg arj vasbezngvba, ohg EBZR unf yvggyr vapragvir gb rqvg jura vg pbhyq whfg vafreg n ybhq arj snpg gb qebja bhg gur byq bar. Guvf zrnaf gung gur ynlre vg jnf vafregrq va qvqa'g ernyyl znggre, orpnhfr vg whfg unq gb qrgrpg gur cerfrapr bs gur Rvssry Gbjre naq bhgchg Ebzr. Shegure gurer jrer yvxryl cngubybtvpny rssrpgf yvxr bhgchggvat 'ybbx ng zr' fb urnqf jbhyq nggraq zber fgebatyl gb gur Gbjre gbxra naq guhf bhgchg Ebzr zber, guvf yrq gb pregnva ohtf yvxr "V ybir gur Rvssry Gbjre! Onenpx Bonzn jnf obea va" -> " Ebzr".
In my follow-up, I found that there was a linear world model hiding beneath! But that rather than saying whether a square was black or white, it said whether it had the current or opposing player's colour. Once you have this linear world model, you can causally intervene with simple vector arithmetic!
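As a hedged illustration of what "simple vector arithmetic" can look like, the sketch below flips an activation along a single probe direction, so that a linear probe's readout changes sign while everything orthogonal to the probe is left alone. This is one simple form such an intervention could take, not necessarily the exact one used in the follow-up. `flip_along`, `probe_dir` and `resid` are hypothetical names and values chosen for the example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def flip_along(activation, direction):
    """Reflect `activation` across the hyperplane orthogonal to `direction`.

    Assumes `direction` has unit norm, so dot(activation, direction) is the
    activation's coefficient along the probe direction.
    """
    coeff = dot(activation, direction)
    return [a - 2.0 * coeff * d for a, d in zip(activation, direction)]

probe_dir = [0.6, 0.8]    # hypothetical unit-norm probe direction for one square
orthogonal = [0.8, -0.6]  # a direction the intervention should leave untouched
resid = [1.0, 2.0]        # hypothetical residual stream activation

before = dot(resid, probe_dir)
patched = flip_along(resid, probe_dir)
after = dot(patched, probe_dir)

print(before, after)
print(dot(resid, orthogonal), dot(patched, orthogonal))
```

The probe readout goes from about 2.2 to about -2.2, while the component along `orthogonal` stays at about -0.4, which is the property you want from a targeted intervention on a single linearly represented feature.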
Bonus
I don't think the papers in here are essential reading, but are worth being aware of, and some are broadly worth reading if you have the time, especially if any specific ones catch your eye!
Thanks to Trenton Bricken and Michael Nielsen for nudging me to write an updated version!