Great resource, thanks for sharing! As somebody who's not too deeply familiar with either mechanistic interpretability or the academic field of interpretability, I find myself confused by the fact that AI safety folks usually dismiss the large academic field of interpretability. Most academic work on ML isn't useful for safety because safety studies different problems with different kinds of systems. But unlike focusing on worst-case robustness or inner misalignment, I would expect generating human understandable explanations of what neural networks are doing to be interesting to plenty of academics, and I would think that's what the academics are trying to do. Are they just bad at generating insights? Do they look for the wrong kinds of progress, perhaps motivated by different goals? Why is the large academic field of interpretability not particularly useful for x-risk motivated AI safety?
Honestly, I also feel fairly confused by this - mechanistic questions are just so interesting. Empirically, I've fairly rarely found academic interpretability that interesting or useful, though I haven't read that widely. (There are definitely some awesome papers from academia, as linked in the post, plus some solid academics, and many more papers that contain some moderately useful insight.)
To be clear, I am focusing on mechanistic interpretability - actually reverse engineering the underlying algorithms learned by a model - and I think there's legitimate and serious work to be done in other areas that could reasonably be called interpretability.
My take would roughly be that there's a few factors - but again, I'm fairly confused by this and "it's actually great and I'm just being a chauvinist" is also a pretty coherent explanation (and I know some alignment researchers who'd argue for the latter hypothesis):
I think ease of publishing is a pretty important point. Even if an academic doesn't personally care about publications, often their collaborators/students/supervisors might, and there are strong career incentives to care. Eg, if a PhD student wants to work on this, and it'd be much harder to publish on it, a good supervisor probably should discourage it (within reason), since part of their job is looking out for their student's career interests.
Though hopefully it's getting easier to publish on nowadays! There are a few mechanistic papers submitted to ICLR.
I mean, personally I'd say it's the only hope we have of making any of the reflection algorithms have any shot at working. you can't do formal verification unless your network is at least interpretable enough that, when formal verification fails, you can tell what it is about the dataset that led the network to make your non-neural prover run out of compute time when you ask what the lowest margin to a misbehavior is. if the network's innards aren't readable enough to get an intuitive sense of why a subpath failed the verification or what computation the network performed that timed out the verifier, it's hard to move the data around to clarify.
of course, this doesn't help that much if it turns out that a strong planner can amplify a very slight value misalignment quickly as expected by miri folks; afaict, miri is worried that the process of learning (their word is "self improving") can speed up a huge amount when the network can make use of the full self-rewriting possibility of its substrate and the network properly understands the information geometry of program updates (ie, afaict, they expect significant amounts of generalizing improvement of architecture or learning rule or such things once it's strong enough to become a strong quine as an incidental step of doing its core task.)
and so interpretability would be expected to be made useless by the ai breaking into your tensorflow to edit its own compute graph or your pytorch to edit its own matmul invocation order or something. presumably that doesn't happen at the expected level until you have an ai strong enough to significantly exceed the generalization performance of current architecture search incidentally without being aimed at that, because the ais they're imagining wouldn't have even been trained on that specifically the way eg alphatensor was narrowly aimed at matmul itself.
wow this really got me on a train of thinking, I'm going to post more rambling to my shortform.
Thanks for writing this - I've found it useful in my current attempts to survey some key mechanistic interpretability literature.
a decent survey paper on what’s up in the rest of interpretability.
I’m personally pretty meh about the majority of the academic field of interpretability
A bit confused by this. This paper's abstract and intro claim to be focusing on inner interpretability methods - which they define as learned features and internal structure. This seems to fit my idea of what mechanistic interpretability is pretty well, but you seem to classify it as 'the rest of interpretability'.
Do you see a clear distinction between mechanistic interpretability methods vs the methods reviewed in this paper? If so, what's the distinction?
This is a fair point! I honestly have only vaguely skimmed that survey, and got the impression there was a lot of stuff in there that I wasn't that interested in. But it's on my list to read properly at some point, and I can imagine updating this a bunch.
This post is out of date, see v2 here
Introduction
This is an extremely opinionated list of my favourite mechanistic interpretability papers, annotated with my key takeaways and what I like about each paper, which bits to deeply engage with vs skim (and what to focus on when skimming) vs which bits I don’t care about and recommend skipping, along with fun digressions and various hot takes.
This is aimed at people trying to get into the field of mechanistic interpretability (especially Large Language Model (LLM) interpretability). I’m writing it because I’ve benefited a lot by hearing the unfiltered and honest opinions from other researchers, especially when first learning about something, and I think it’s valuable to make this kind of thing public! On the flipside though, this post is explicitly about my personal opinions - I think some of these takes are controversial and other people in the field would disagree.
The four top level sections are priority ordered, but papers within each section are ordered arbitrarily - follow your curiosity.
Priority 1: What is Mechanistic Interpretability?
Circuits: Zoom In
Sets out the circuits research agenda, and is a whirlwind overview of progress in image circuits
This is reasonably short and conceptual (rather than technical) and in my opinion very important, so I recommend deeply engaging with all of it, rather than skimming.
The core thing to take away from it is the perspective of networks having legible(-ish) internal representations of features, and that these may be connected up into interpretable circuits. The key is that this is a mindset for thinking about networks in general, and all the discussion of image circuits is just grounding in concrete examples.
In my opinion, the circuits agenda is pretty deeply at the core of what mechanistic interpretability is. It’s built on the assumption that there is some legible, interpretable structure inside neural networks, if we can just figure out how to reverse engineer it. And the core goal of the field is to find what circuits we can, build better tools for doing so, and do the fundamental science of figuring out which of the claims about circuits are actually true, which ones break, and whether we can fix them.
Meta: The goal of reading this is to understand what the fundamental mindset and worldview being defended here is. The goal is not necessarily to leave feeling convinced that these claims are true, or that the article adequately justifies them. That’s what the rest of the papers in here are for!
A useful thing to reflect on is what the world would look like if the claims were and were not true - what evidence could you see that might convince you either way? These are definitely not obviously true claims!
A Mathematical Framework for Transformer Circuits
The point of this is to explain how to conceptually break down a transformer into individually understandable pieces.
Deeply engage with:
All the ideas in the overview section, especially:
Understanding the residual stream and why it’s fundamental.
The notion of interpreting paths between interpretable bits (eg input tokens and output logits), where the path is a composition of matrices, and how this is different from interpreting every intermediate activation
And understanding attention heads: what a QK and OV matrix is, how attention heads are independent and additive and how attention and OV are semi-independent.
Skip Trigrams & Skip Trigram bugs, esp understanding why these are a really easy thing to do with attention, and how the bugs are inherent to attention heads separating where to attend to (QK) and what to do once you attend somewhere (OV)
Induction heads, esp why this is K-Composition (and how that’s different from Q & V composition), how the circuit works mechanistically, and why this is too hard to do in a 1L model
Skim or skip:
Maybe check out my (long-ass) walkthrough of the paper, and comments on how I think about things
If you prefer video over reading I expect it to be high value
Either way it’s probably useful to check the relevant section if there’s a part of the paper that confuses you.
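To make the QK/OV split above concrete, here is a minimal NumPy sketch of a single attention head. It is a toy under stated assumptions: random weights, no biases, no layer norm, no positional information, and row-vector conventions (so the combined matrices are `W_Q @ W_K.T` and `W_V @ W_O`, which mirrors the paper's column-vector notation rather than matching it exactly). The dimensions and variable names are illustrative choices. It checks that the usual way of computing a head's output matches the "circuits" view, where the attention pattern depends only on a combined QK matrix and what gets moved depends only on a combined OV matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 4  # illustrative sizes, not from the paper

x = rng.normal(size=(seq_len, d_model))      # residual stream, one row per position
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))


def causal_softmax(scores):
    """Row-wise softmax where each position may only attend to itself and earlier positions."""
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


# Standard view: compute queries, keys and values, then mix values and project out.
q, k, v = x @ W_Q, x @ W_K, x @ W_V
pattern = causal_softmax(q @ k.T / np.sqrt(d_head))
head_out_standard = (pattern @ v) @ W_O

# Circuits view: QK decides where to attend, OV decides what to write once attended.
W_QK = W_Q @ W_K.T   # d_model x d_model, rank at most d_head
W_OV = W_V @ W_O     # d_model x d_model, rank at most d_head
pattern_from_qk = causal_softmax(x @ W_QK @ x.T / np.sqrt(d_head))
head_out_circuits = pattern_from_qk @ x @ W_OV
```

Because the head's output is just `pattern @ x @ W_OV`, the output is linear in the residual stream once the pattern is fixed, which is what lets you analyse the two halves semi-independently.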
Priority 2: Understanding Key Concepts in the field
Induction Heads
This is a study of how induction heads are ubiquitous in real transformers, and form as a sudden phase change during training.
Deeply engage with:
Key concepts + argument 1.
Argument 4: induction heads also do translation + few shot learning.
Getting a rough intuition for all the methods used in the Model Analysis Table, as a good overview of interesting interpretability techniques.
Skim or skip:
All the rigour - basically everything I didn’t mention. The paper goes way overboard on rigour and it’s not worth understanding every last detail
A particularly striking result is that induction heads form at ~the same time in all models - I think this is very cool, but somewhat overblown - from some preliminary experiments, I think it’s pretty sensitive to learning rate and positional encoding (though the fact that it doesn’t depend on scale is fascinating!)
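As a rough behavioural picture of what an induction head does, here is a small pure-Python sketch of the "[A][B] ... [A] -> [B]" rule. This is only the input-output behaviour, written as a hard lookup; a real head does this softly via attention, and the function name and example tokens are made up for illustration.

```python
def induction_prediction(tokens):
    """Predict the next token by copying whatever followed the most recent
    earlier occurrence of the current (last) token. Returns None if the
    current token has not appeared before."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan earlier positions from most recent to oldest.
    for pos in range(len(tokens) - 2, -1, -1):
        if tokens[pos] == current:
            return tokens[pos + 1]
    return None


prompt = ["Mr", "Dursley", "was", "proud", ".", "Mr"]
guess = induction_prediction(prompt)  # copies the token that followed the earlier "Mr"
```

Note that the rule needs two pieces of information at once: which earlier position matches the current token, and what came right after it. That is one way to see why the mechanism involves composing two heads across layers.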
Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases
Short-ish conceptual essay on what the point of mechanistic interpretability is and how to think about it.
This is similar in flavour to Circuits: Zoom In, but is more conceptual and less grounded in very concrete examples + progress - your mileage may vary in how much this works for you.
A Toy Model of Superposition
Building a simple toy model that contains superposition, and analysing it in detail.
Deeply engage with:
The core intuitions: what is superposition, how does it respond to feature importance and sparsity, and how does it respond to correlated and uncorrelated features.
Read the strategic picture, and sections 1 and 2 closely.
Skim or skip:
A good intro paper for concrete projects. The models are tiny, the core results should be easy to replicate (and have short training times), there’s an accompanying Colab and a list of follow-up ideas, so this is a great paper to play around with!
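If you want a feel for superposition before opening the Colab, here is a tiny NumPy sketch of the paper's toy model, x' = ReLU(WᵀWx + b). The weights below are hand-picked by me rather than trained (the paper trains them), and they squeeze two features into a single hidden dimension by pointing them in opposite directions. The point is to show why sparsity matters: each feature is recovered perfectly when it fires alone, but the two interfere when they fire together.

```python
import numpy as np

# Hand-picked (not trained) weights: 2 features stored in 1 hidden dimension.
W = np.array([[1.0, -1.0]])  # shape (n_hidden=1, n_features=2)
b = np.zeros(2)


def toy_model(x, W=W, b=b):
    """Compress features into the hidden space, then reconstruct: ReLU(W^T W x + b)."""
    h = W @ x                          # hidden activation, shape (1,)
    return np.maximum(0.0, W.T @ h + b)


only_first = toy_model(np.array([1.0, 0.0]))   # feature 0 alone: recovered as [1, 0]
only_second = toy_model(np.array([0.0, 1.0]))  # feature 1 alone: recovered as [0, 1]
both = toy_model(np.array([1.0, 1.0]))         # both active: they cancel, giving zeros
```

If the features almost never co-occur (high sparsity), the failure case is rare and this packing is nearly free, which is the core intuition for why sparse features end up in superposition.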
Curve Detectors & Curve Circuits (Image interpretability)
An extremely detailed and rigorous study of a family of neurons in Inception; a gold standard of what good interpretability can look like. Culminates in them hand-coding the weights of artificial neurons and substituting those into the circuit, and comparing performance. Note that a bunch of the techniques won’t generalise.
Deeply engage with:
Understanding what they did as a gold standard, and thinking about why what they did is deep and meaningful evidence.
Think about which techniques will and will not generalise to LLMs
Priority 3: Expanding Understanding
Language Models
Indirect Object Identification
A paper about reverse engineering a complex (28 head!) circuit in GPT-2 Small
The most detailed “we actually have a circuit, and can drill into it in detail and really get how it works” paper that I know of.
Particularly good for a vibe of “ways interpretability is hard and you can trick yourself” + “but it is actually possible and we can fix these”
SoLU
A paper on a neuron activation function that makes transformer neurons somewhat more interpretable.
Deeply engage with:
Section 3 (Background). For the core ideas, esp superposition, privileged bases and why they matter.
Section 6 (on the neurons found). For getting the vibe of what kind of features LLMs learn - I think this is the best resource I know of for getting a vibe of what kinds of things MLP layers are doing at different layers of a transformer.
Skim:
Skip:
ROME
A paper on locating and editing factual knowledge in GPT-2 - a strong contender for my favourite non Chris Olah interpretability paper
Deeply engage with:
Skim or skip:
Logit Lens
A solid early bit of work on LLM interpretability. The key insight is that we can interpret the residual stream of the transformer by multiplying by the unembedding and mapping to logits, and that we can do this to the residual stream before the final layer and see the model converging on the right answer.
Deeply Engage with:
Skim or skip:
Skim the figures about progress towards the answer through the model, focus on just getting a vibe for what this progress looks like.
Skip everything else.
The deeper insight of this technique (not really covered in the work) is that we can do this on any vector in the residual stream to interpret it in terms of the direct effect on the logits - including the output of an attn or MLP layer and even a head or neuron. And we can also do this on weights writing to the residual stream.
Analyzing Transformers in Embedding Space is a more recent paper that drills down into this insight, focusing on weights.
I’m somewhat meh on the paper as a whole, but sections 3, 4.1 and Appendix C are cool for seeing what head and neuron circuits can look like
Note that they make the (IMO) mistake of treating embedding and unembedding space as the same space - the input and output are different spaces! Even if most people make the mistake of setting the embed and unembed maps to be the same matrix :(
Note that this tends only to work for things close to the final layer, and will totally miss any indirect effect on the outputs (eg via composing with future layers, or suppressing incorrect answers)
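The "do this on any vector in the residual stream" idea can be sketched in a few lines of NumPy. This is a toy under loud assumptions: the unembedding `W_U` and the component outputs are random vectors rather than anything from a real model, the component names are made up, and the final layer norm that real models apply before unembedding is ignored. What it shows is that, because the unembedding is linear, the logit lens applied to each component's output adds up to the final logits.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab = 16, 8                     # illustrative sizes
W_U = rng.normal(size=(d_model, d_vocab))    # stand-in unembedding matrix

# Hypothetical outputs that each component writes into the residual stream
# at a single position (random here, purely for illustration).
component_outputs = {
    "embed": rng.normal(size=d_model),
    "attn_0": rng.normal(size=d_model),
    "mlp_0": rng.normal(size=d_model),
    "attn_1": rng.normal(size=d_model),
    "mlp_1": rng.normal(size=d_model),
}


def logit_lens(vector, W_U):
    """Read any residual-stream vector as a vector of logits over the vocabulary."""
    return vector @ W_U


# Logit lens on the residual stream after each component (a running sum).
running = np.zeros(d_model)
lens_after = {}
for name, out in component_outputs.items():
    running = running + out
    lens_after[name] = logit_lens(running, W_U)

final_logits = logit_lens(running, W_U)

# Direct effect of each component on the logits.
direct_effect = {name: logit_lens(out, W_U) for name, out in component_outputs.items()}
summed_effects = sum(direct_effect.values())
```

The same linearity is why this only captures direct effects: anything a component does by feeding later layers shows up in those later components' terms instead.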
An Interpretability Illusion for BERT
Good early paper on the limitations of max activating dataset examples - they took a seemingly interpretable neuron in BERT and took the max activating dataset examples on different datasets, and observed consistent patterns within a dataset, but very different examples between datasets
Within the lens of the Toy Model paper, this makes sense! Features correspond to directions in the residual stream that probably aren’t neuron aligned. Max activating dataset examples will pick up on the features most aligned with that neuron. Different datasets have different feature distributions and will give different “most aligned feature”
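Here is a deliberately tiny pure-Python cartoon of that argument. Everything in it is hypothetical: the feature names, the neuron's read-off weights and the two mini "datasets" are invented for illustration and have nothing to do with the actual BERT experiments. It just shows how a neuron that partially aligns with two unrelated features can look cleanly interpretable on each dataset, with a different story each time.

```python
# Hypothetical neuron that partially reads from two unrelated feature directions.
NEURON_READS = {"dates": 1.0, "sports": 0.8}


def neuron_activation(example_features):
    """ReLU of the weighted sum of whichever features are present in the example."""
    pre_activation = sum(
        NEURON_READS.get(name, 0.0) * strength
        for name, strength in example_features.items()
    )
    return max(0.0, pre_activation)


def top_example_features(dataset, k=2):
    """Dominant feature of each of the k max activating examples in a dataset."""
    ranked = sorted(dataset, key=lambda ex: neuron_activation(ex[1]), reverse=True)
    return [max(features, key=features.get) for _, features in ranked[:k]]


# Two made-up datasets with different feature distributions.
encyclopedia = [
    ("born in 1887", {"dates": 1.0}),
    ("the treaty of 1648", {"dates": 0.9}),
    ("a species of moth", {"biology": 1.0}),
]
news = [
    ("won the cup final", {"sports": 1.0}),
    ("scored twice", {"sports": 0.9}),
    ("shares fell sharply", {"finance": 1.0}),
]

encyclopedia_story = top_example_features(encyclopedia)  # looks like a "dates" neuron
news_story = top_example_features(news)                  # looks like a "sports" neuron
```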
Deeply engage with:
The concrete result that the same neuron can have very different max activating dataset examples
The meta-level result that a naively compelling interpretability technique can be super misleading on closer inspection
Skim or skip:
Algorithmic Tasks
A Mechanistic Interpretability Analysis of Grokking
Conflict of interest note - I was the main person working on this project!
A very detailed reverse engineering of a tiny model trained to do modular addition and interpreting it during training, plus a bunch of discussion on phase changes, an (attempted) explanation of grokking and showing grokking on other tasks.
Grokking probably isn’t that relevant to real models and the techniques don’t really generalise, but a good example of detailed reverse engineering + fully understanding a model on an algorithmic task, and of applying interpretability during training.
I also just personally think this project was super fucking cool, even if not that useful.
Deeply engage with:
The key claims and takeaways sections
Overview of the modular addition algorithm
Skim:
Reverse engineering modular addition - understanding the different types of evidence and how they fit together
Evolution of modular addition circuits during training - the flavour of what the circuits developing looks like during training, and the fact that once we understand things, we can just literally watch them develop!
The Phase Changes section - probably the most interesting bits are the explanation of grokking, and the two speculative hypotheses.
Maybe a good intro paper to replicate! It has an accompanying colab and a list of future directions at the end
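For a quick feel of the modular addition algorithm before reading the overview, here is a pure-Python sketch of it as I would summarise it: represent a and b as sines and cosines at a few frequencies, use trig identities to get cos and sin of w(a+b), and score each candidate answer c by cos(w(a+b-c)), which peaks when c = a+b mod p. This is an idealised version, not the trained model: the real network implements this approximately with learned weights, and the specific frequencies below are hypothetical picks of mine.

```python
import math

P = 113                       # the modulus
KEY_FREQUENCIES = [1, 5, 17]  # hypothetical choice of frequencies for illustration


def modular_add_logits(a, b, p=P, freqs=KEY_FREQUENCIES):
    """Score every candidate answer c with sum_k cos(w_k * (a + b - c))."""
    logits = [0.0] * p
    for k in freqs:
        w = 2 * math.pi * k / p
        cos_a, sin_a = math.cos(w * a), math.sin(w * a)
        cos_b, sin_b = math.cos(w * b), math.sin(w * b)
        # Trig identities combine the two inputs into cos/sin of w * (a + b).
        cos_ab = cos_a * cos_b - sin_a * sin_b
        sin_ab = sin_a * cos_b + cos_a * sin_b
        for c in range(p):
            # cos(w(a+b) - wc) = cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc)
            logits[c] += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
    return logits


def predict(a, b, p=P):
    """Return the candidate answer with the largest logit."""
    logits = modular_add_logits(a, b, p)
    return max(range(p), key=logits.__getitem__)
```

Each frequency on its own gives a fairly blurry peak; summing a handful of them makes the wrong answers interfere destructively while the right answer gets a contribution of 1 from every frequency.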
Image Circuits
Feature Vis (fairly short)
Multimodal Neurons in Artificial Neural Networks
An analysis of neurons in a text + image model (CLIP), finding a bunch of abstract + cool neurons. Not a high priority to deeply engage with, but very cool and worth skimming.
My key takeaways
There are so many fascinating neurons! Like, what?
The intuition that multi-modal models (or at least, models that use language) are incentivised to represent things in a conceptual way, rather than specifically tied to the input format
The detailed analysis of the Donald Trump neuron, esp that it is more than just an “activates on Donald Trump” neuron, and instead activates for many different clusters of things, roughly tracking their association with Donald Trump.
The “adversarial attacks by writing iPod on an apple” part isn’t very deep, but is hilarious
The rest of the circuits thread
A lot of really cool ideas and scattered threads! Worth skimming and digging into anything that catches your interest. Each individual article is short-ish
This thread represents, in my opinion, the first serious attempt at reverse engineering a real model (Inception)
My personal favourites:
An Overview of Early Vision Neurons - it’s just fascinating to see the weird shit that happens, super cool to see the hierarchy where simple shapes are in early layers and are built into more abstract shapes in later layers, and to see neurons being sorted into families
Visualising weights - somewhat image specific, but a fascinating exploration of the data visualisation questions underlying mechanistic interpretability - visualisations are super useful, but how can we do them in a properly principled way, and how can they mislead?
Branch Specialisation - networks spontaneously learn to be modular and the modules seem to be consistent and semantically meaningful?! WTF?
Priority 4: Bonus
Not a paper: The codebase of EasyTransformer, a transformer mechanistic interpretability library I’m writing - I think it’s worth reading for a fairly clean and conceptual-focused implementation of a transformer, specifically reading EasyTransformer.forward and components.py (a file for the various layers) (the actual codebase is pretty long!)
Everything else Chris Olah has ever written
I’m somewhat biased on this, but I think Chris is just clearly far and away the best interpretability researcher in the world.
He’s also a massive nerd for good technical communication, interactivity and good graphic design, and I find his work a joy to read.
Interpreting RL Vision
Interesting application of image circuits techniques to get some insight into an RL model - unclear how much it generalises/works
The parts about the impact of the amount of and diversity of data on interpretability feel most interesting and general to me.
Probably the best RL mechanistic interpretability paper I know of (but it’s a pretty low bar :( )
Not a paper: Playing around with OpenAI Microscope - visualizations and top dataset examples of every neuron in a ton of image models! Challenge: What’s the weirdest neuron you can find?
Visualizing and Interpreting the Geometry of BERT (+ blog post)
An early LLM interpretability paper about understanding how BERT represents language in the residual stream.
Deeply engage with:
Skim or skip:
Acquisition of Chess Knowledge in AlphaZero - analysing AlphaZero’s chess knowledge, including during training
Notable for the hilarious stunt of getting a chess grandmaster to comment on the results and co-author the paper (even if this isn’t that interpretability related)
Focuses on feature analysis rather than really mechanistic engagement, but still very cool! The main things I think are cool are that they successfully applied interpretability during training, and did so on the weird and fucky task of playing chess (and that this suggests models trained on non-image/language tasks are somewhat interpretable!).
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks - a decent survey paper on what’s up in the rest of interpretability.
I’m personally pretty meh about the majority of the academic field of interpretability (I rarely find insights from there useful in my work) and would prioritise reading the papers in the previous sections, but it’s worth skimming to get a sense for what’s out there, and digging into anything relevant to a specific project you’re pursuing!
A Primer in BERTology - a survey paper on BERTology, a subfield specifically about interpreting BERT. I feel pretty meh about this, but am not very familiar with the field.
The Building Blocks of Interpretability
Not a paper, but I find Chris Olah’s interview on the 80,000 Hours podcast super inspiring