(OLD) An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers

Neel Nanda

This post is out of date, see v2 here

Introduction

This is an extremely opinionated list of my favourite mechanistic interpretability papers, annotated with my key takeaways and what I like about each paper, which bits to deeply engage with vs skim (and what to focus on when skimming) vs which bits I don’t care about and recommend skipping, along with fun digressions and various hot takes.

This is aimed at people trying to get into the field of mechanistic interpretability (especially Large Language Model (LLM) interpretability). I’m writing it because I’ve benefited a lot by hearing the unfiltered and honest opinions from other researchers, especially when first learning about something, and I think it’s valuable to make this kind of thing public! On the flipside though, this post is explicitly about my personal opinions - I think some of these takes are controversial and other people in the field would disagree.

The four top level sections are priority ordered, but papers within each section are ordered arbitrarily - follow your curiosity!

Priority 1: What is Mechanistic Interpretability?

Circuits: Zoom In
- Sets out the circuits research agenda, and is a whirlwind overview of progress in image circuits
- This is reasonably short and conceptual (rather than technical) and in my opinion very important, so I recommend deeply engaging with all of it, rather than skimming.
- The core thing to take away from it is the perspective of networks having legible(-ish) internal representations of features, and that these may be connected up into interpretable circuits. The key is that this is a mindset for thinking about networks in general, and all the discussion of image circuits is just grounding in concrete examples.
  - On a deeper level, understanding why these are important and non-trivial claims about neural networks, and their implications.
- In my opinion, the circuits agenda is pretty deeply at the core of what mechanistic interpretability is. It’s built on the assumption that there is some legible, interpretable structure inside neural networks, if we can just figure out how to reverse engineer it. And the core goal of the field is to find what circuits we can, build better tools for doing so, and do the fundamental science of figuring out which of the claims about circuits are actually true, which ones break, and whether we can fix them.
  - An important note is that mechanistic interpretability is an extremely young field and this was written 2.5 years ago - I take the specific claims in this article as a starting point, not as the definitive grounding of what the field must believe.
- Meta: The goal of reading this is to understand what the fundamental mindset and worldview being defended here is. The goal is not necessarily to leave feeling convinced that these claims are true, or that the article adequately justifies them. That’s what the rest of the papers in here are for!
- A useful thing to reflect on is what the world would look like if the claims were and were not true - what evidence could you see that might convince you either way? These are definitely not obviously true claims!
A Mathematical Framework for Transformer Circuits
- The point of this is to explain how to conceptually break down a transformer into individually understandable pieces.
- Deeply engage with:
  - All the ideas in the overview section, especially:
    - Understanding the residual stream and why it’s fundamental.
    - The notion of interpreting paths between interpretable bits (eg input tokens and output logits) where the path is a composition of matrices and how this is different from interpreting every intermediate activation
    - And understanding attention heads: what a QK and OV matrix is, how attention heads are independent and additive and how attention and OV are semi-independent.
  - Skip Trigrams & Skip Trigram bugs, esp understanding why these are a really easy thing to do with attention, and how the bugs are inherent to attention heads separating where to attend to (QK) and what to do once you attend somewhere (OV)
  - Induction heads, esp why this is K-Composition (and how that’s different from Q & V composition), how the circuit works mechanistically, and why this is too hard to do in a 1L model
- Skim or skip:
  - Eigenvalues or tensor products. They have the worst effort per unit insight of the paper and aren’t very important.
- Maybe check out my (long-ass) walkthrough of the paper, and comments on how I think about things
  - If you prefer video over reading I expect it to be high value
  - Either way it’s probably useful to check the relevant section if there’s a part of the paper that confuses you.
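To make the QK/OV split concrete, here is a minimal pure-Python sketch of a single attention head. It is an illustration under assumptions rather than the paper's code: the row-vector convention, the merged matrices `w_qk` and `w_ov`, the helper names and the toy numbers are all made up, and the usual scaling factor and biases are left out for brevity.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions <= i."""
    pattern = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        top = max(visible)
        exps = [math.exp(s - top) for s in visible]
        total = sum(exps)
        pattern.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return pattern

def head_output(resid, w_qk, w_ov):
    """One attention head, written in terms of its two circuits.

    w_qk decides WHERE to attend: scores = x W_QK x^T.
    w_ov decides WHAT to move once attending: out = A x W_OV.
    """
    scores = matmul(matmul(resid, w_qk), transpose(resid))
    pattern = causal_softmax(scores)
    return pattern, matmul(pattern, matmul(resid, w_ov))

# Hypothetical toy numbers: 3 positions, d_model = 2.
resid = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_qk = [[2.0, 0.0], [0.0, -1.0]]
w_ov_a = [[0.0, 1.0], [0.0, 0.0]]
w_ov_b = [[0.5, 0.0], [0.0, 0.5]]

pattern, out_a = head_output(resid, w_qk, w_ov_a)
_, out_b = head_output(resid, w_qk, w_ov_b)

# Heads are additive: each one adds its own output to the residual stream.
new_resid = [[r + a + b for r, a, b in zip(*rows)]
             for rows in zip(resid, out_a, out_b)]
```

Swapping `w_ov_a` for `w_ov_b` changes what gets written but leaves `pattern` untouched, which is the sense in which attention and OV are semi-independent.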

Priority 2: Understanding Key Concepts in the field

Induction Heads
- This is a study of how induction heads are ubiquitous in real transformers, and form as a sudden phase change during training.
- Deeply engage with:
  - Key concepts + argument 1.
  - Argument 4: induction heads also do translation + few shot learning.
  - Getting a rough intuition for all the methods used in the Model Analysis Table, as a good overview of interesting interpretability techniques.
- Skim or skip:
  - All the rigour - basically everything I didn’t mention. The paper goes way overboard on rigour and it’s not worth understanding every last detail
    - The main value to get when skimming is an overview of different techniques, esp general techniques for interpreting during training.
- A particularly striking result is that induction heads form at ~the same time in all models - I think this is very cool, but somewhat overblown - from some preliminary experiments, I think it’s pretty sensitive to learning rate and positional encoding (though the fact that it doesn’t depend on scale is fascinating!)
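The behaviour of an induction circuit ("[A][B] ... [A]" is followed by "[B]") can be caricatured in a few lines of plain Python. This is a sketch of the algorithm only, not of real weights: the function name and the list standing in for the previous-token head are hypothetical, chosen to mirror the two-step K-composition story.

```python
def induction_prediction(tokens):
    """Predict the next token the way an idealised induction circuit would.

    Returns None when the current token has not appeared before.
    """
    # Step 1 (previous-token head): every position records the token before it.
    prev_token = [None] + tokens[:-1]

    # Step 2 (induction head): the query is the current token, and the keys are
    # the prev_token annotations (K-composition), so we attend to positions
    # that come right AFTER an earlier copy of the current token.
    current = tokens[-1]
    matches = [pos for pos, prev in enumerate(prev_token) if prev == current]
    if not matches:
        return None

    # Copy the token found at the most recent attended position.
    return tokens[matches[-1]]
```

For example, `induction_prediction(["A", "B", "C", "A"])` attends to position 1 (whose previous token is "A") and copies "B". Step 2 needs the output of step 1, which is one way to see why a single attention layer cannot do this.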
Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases
- Short-ish conceptual essay on what the point of mechanistic interpretability is and how to think about it.
- This is similar in flavour to Circuits: Zoom In, but is more conceptual and less grounded in very concrete examples + progress - your mileage may vary in how much this works for you.
A Toy Model of Superposition
- Building a simple toy model that contains superposition, and analysing it in detail.
- Deeply engage with:
  - The core intuitions: what is superposition, how does it respond to feature importance and sparsity, and how does it respond to correlated and uncorrelated features.
  - Read the strategic picture, and sections 1 and 2 closely.
- Skim or skip:
  - No need to deeply understand the rest, it can mostly be skimmed. It’s very cool, especially the geometry and phase transition and learning dynamics part, but a bit of a nerd snipe and doesn’t obviously generalise to real models.
- A good intro paper for concrete projects. The models are tiny, the core results should be easy to replicate (and have short training times), there’s an accompanying Colab and a list of follow-up ideas, so this is a great paper to play around with!
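As a taste of how small these models are, here is a hand-built (not trained) instance of the paper's ReLU output model, x' = ReLU(WᵀWx + b), with two features squeezed into one hidden dimension. The specific weights are an assumption picked to show the antipodal arrangement; everything is plain Python so it runs anywhere.

```python
def toy_model(x, w, b):
    """x' = ReLU(W^T W x + b), with W given as a list of hidden-dimension rows."""
    hidden = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    recon = [sum(w[h][f] * hidden[h] for h in range(len(w))) + b[f]
             for f in range(len(x))]
    return [max(0.0, r) for r in recon]

# Assumed weights: two features share ONE hidden dimension, pointing in
# opposite directions (an antipodal pair).
w = [[1.0, -1.0]]
b = [0.0, 0.0]

only_first = toy_model([1.0, 0.0], w, b)   # sparse input: recovered exactly
only_second = toy_model([0.0, 1.0], w, b)  # sparse input: recovered exactly
both_active = toy_model([1.0, 1.0], w, b)  # dense input: features cancel out
```

When only one feature is active, the ReLU clips the negative interference and the input is reconstructed perfectly; when both fire at once they cancel in the hidden dimension. That is the trade-off that makes superposition pay off only for sufficiently sparse features.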
Curve Detectors & Curve Circuits (Image interpretability)
- An extremely detailed and rigorous study of a family of neurons in Inception; a gold standard of what good interpretability can look like. Culminates in them hand-coding the weights of artificial neurons and substituting those into the circuit, and comparing performance. Note that a bunch of the techniques won’t generalise.
- Deeply engage with:
  - Understanding what they did as a gold standard, and thinking about why what they did is deep and meaningful evidence.
  - Think about which techniques will and will not generalise to LLMs

Priority 3: Expanding Understanding

Language Models

Indirect Object Identification
- A paper about reverse engineering a complex (28 head!) circuit in GPT-2 Small
  - The most detailed “we actually have a circuit, and can drill into it in detail and really get how it works” paper that I know of.
    - The circuit in question is for the task of completing “When John and Mary went to the shops, John bought a bottle of milk for” -> “ Mary” but “Mary bought a bottle of milk for” -> “ John”
- Particularly good for a vibe of “ways interpretability is hard and you can trick yourself” + “but it is actually possible and we can fix these”
SoLU
- A paper on a neuron activation function that makes transformer neurons somewhat more interpretable.
- Deeply engage with:
  - Section 3 (Background). For the core ideas, esp superposition, privileged bases and why they matter.
    - See “A Toy Model of Superposition” for much more on superposition.
  - Section 6 (on the neurons found). For getting the vibe of what kind of features LLMs learn - I think this is the best resource I know of for getting a vibe of what kinds of things MLP layers are doing at different layers of a transformer.
- Skim:
  - Section 4 (on the exact function and how it works) - the main intuition to get is why you might expect this to work (in particular, why lateral inhibition seems important)
- Skip:
  - Section 5 (showing that the model works as well as normal activation functions).
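For intuition on the lateral inhibition point, here is a minimal sketch of the activation itself, SoLU(x) = x * softmax(x), in plain Python. The example pre-activations are made up, and the extra LayerNorm that the paper applies afterwards is deliberately omitted.

```python
import math

def softmax(xs):
    top = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def solu(xs):
    """SoLU(x) = x * softmax(x), taken across the neurons of one MLP layer."""
    return [x * p for x, p in zip(xs, softmax(xs))]

# Hypothetical pre-activations for four neurons at one position.
pre_acts = [4.0, 1.0, 0.0, 2.0]
post_acts = solu(pre_acts)
```

The largest pre-activation keeps most of its value while the smaller ones are pushed towards zero, so neurons compete with each other instead of being thresholded independently.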
ROME
- A paper on locating and editing factual knowledge in GPT-2 - a strong contender for my favourite non Chris Olah interpretability paper
- Deeply engage with:
  - Causal tracing + activation patching stuff (including the appendix on it). It’s a really cool, elegant and general technique, and demonstrates that certain computation is extremely localised in the model, and uses careful counterfactuals to isolate this computation.
- Skim or skip:
  - The model editing stuff. It’s way less interesting from an interpretability point of view than the above.
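The core move in activation patching can be sketched on a toy network: run a clean input, run a corrupted input, then copy one clean activation into the corrupted run and see how much of the clean output comes back. Everything below (the tiny ReLU network, its weights and the two inputs) is a made-up illustration, not the paper's setup, which corrupts the subject token embeddings of a real language model with noise.

```python
def forward(x, w_in, w_out, patch=None):
    """Tiny one-hidden-layer ReLU network; optionally overwrite one hidden unit."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    if patch is not None:
        unit, value = patch
        hidden[unit] = value
    return hidden, sum(w * h for w, h in zip(w_out, hidden))

# Hypothetical weights: only hidden unit 1 is read by the output.
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_out = [0.0, 2.0, 0.0]

clean_x, corrupt_x = [1.0, 1.0], [1.0, 0.0]
clean_hidden, clean_out = forward(clean_x, w_in, w_out)
_, corrupt_out = forward(corrupt_x, w_in, w_out)

# Patch each hidden unit in turn and measure the fraction of output restored.
recovered = []
for unit in range(len(w_in)):
    _, patched_out = forward(corrupt_x, w_in, w_out,
                             patch=(unit, clean_hidden[unit]))
    recovered.append((patched_out - corrupt_out) / (clean_out - corrupt_out))
```

Here `recovered` comes out as 1.0 for unit 1 and 0.0 elsewhere: the counterfactual pins down which activation actually carries the information that differs between the two inputs.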
Logit Lens
- A solid early bit of work on LLM interpretability. The key insight is that we can interpret the residual stream of the transformer by multiplying by the unembedding and mapping to logits, and that we can do this to the residual stream before the final layer and see the model converging on the right answer
  - Key takeaway: Model layers iteratively update the residual stream, and the residual stream is the central object of a transformer
- Deeply Engage with:
  - The key insight of applying the unembedding early, and grokking why this is a reasonable thing to do.
- Skim or skip:
  - Skim the figures about progress towards the answer through the model, focus on just getting a vibe for what this progress looks like.
  - Skip everything else.
- The deeper insight of this technique (not really covered in the work) is that we can do this on any vector in the residual stream to interpret it in terms of the direct effect on the logits - including the output of an attn or MLP layer and even a head or neuron. And we can also do this on weights writing to the residual stream.
  - Analyzing Transformers in Embedding Space is a more recent paper that drills down into this insight, focusing on weights.
    - I’m somewhat meh on the paper as a whole, but sections 3, 4.1 and Appendix C are cool for seeing what head and neuron circuits can look like
    - Note that they make the (IMO) mistake of treating embedding and unembedding space as the same space - the input and output are different spaces! Even if most people make the mistake of setting the embed and unembed maps to be the same matrix :(
  - Note that this tends only to work for things close to the final layer, and will totally miss any indirect effect on the outputs (eg via composing with future layers, or suppressing incorrect answers)
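Both ideas above (unembedding the residual stream early, and unembedding a single component's output to get its direct effect) fit in a short plain-Python sketch. The two-dimensional residual stream, the three-token vocabulary and all the numbers are invented for illustration, and the final layer norm that real models apply before unembedding is left out.

```python
def unembed(vec, w_u):
    """Map a residual-stream vector (d_model) to logits (vocab) via W_U."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*w_u)]

# Hypothetical toy model: d_model = 2, vocab = 3, two layers.
embed = [1.0, 0.0]
layer_outputs = [[0.0, 1.5], [-1.0, 1.5]]   # what each layer ADDS to the stream
w_u = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]   # d_model x vocab

# Logit lens: unembed the residual stream after the embedding and each layer.
resid = embed[:]
lens = [unembed(resid, w_u)]
for out in layer_outputs:
    resid = [r + o for r, o in zip(resid, out)]
    lens.append(unembed(resid, w_u))
top_tokens = [logits.index(max(logits)) for logits in lens]

# Direct logit attribution: unembed each layer's output on its own.
direct = [unembed(out, w_u) for out in layer_outputs]
```

Because unembedding is linear, the final logits equal the embedding's logits plus the sum of the rows of `direct`. That additivity only captures direct effects, which is the limitation flagged in the last bullet.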
An Interpretability Illusion for BERT
- Good early paper on the limitations of max activating dataset examples - they took a seemingly interpretable neuron in BERT and took the max activating dataset examples on different datasets, and observed consistent patterns within a dataset, but very different examples between datasets
  - Within the lens of the Toy Model paper, this makes sense! Features correspond to directions in the residual stream that probably aren’t neuron aligned. Max activating dataset examples will pick up on the features most aligned with that neuron. Different datasets have different feature distributions and will give different “most aligned feature”
    - Further, models want to minimise interference and thus will superpose anti-correlated features, so they should
- Deeply engage with:
  - The concrete result that the same neuron can have very different max activating dataset examples
  - The meta-level result that a naively compelling interpretability technique can be super misleading on closer inspection
- Skim or skip:
  - Everything else - I don’t care much about the details beyond the headline result, which is presented well in the intro.
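The headline effect is easy to reproduce in miniature. The sketch below is a made-up toy, not the paper's experiment: a single "neuron" reads a direction that overlaps two features, and two hypothetical datasets each contain only one of those features.

```python
def neuron_activation(features, direction):
    """ReLU of the dot product between an example's features and the neuron."""
    return max(0.0, sum(d * f for d, f in zip(direction, features)))

def top_examples(dataset, direction, k=2):
    """Labels of the k examples that most strongly activate the neuron."""
    ranked = sorted(dataset,
                    key=lambda ex: neuron_activation(ex["features"], direction),
                    reverse=True)
    return [ex["label"] for ex in ranked[:k]]

# Assumed neuron: not aligned with either feature, it responds to both.
direction = [1.0, 0.8]

# Two hypothetical datasets with different feature distributions.
dataset_a = [
    {"label": "feature 0", "features": [0.9, 0.0]},
    {"label": "feature 0", "features": [0.7, 0.0]},
    {"label": "neither", "features": [0.1, 0.1]},
]
dataset_b = [
    {"label": "feature 1", "features": [0.0, 0.9]},
    {"label": "feature 1", "features": [0.0, 0.6]},
    {"label": "neither", "features": [0.1, 0.1]},
]

top_a = top_examples(dataset_a, direction)
top_b = top_examples(dataset_b, direction)
```

Looking only at `top_a` suggests a clean "feature 0" neuron, and looking only at `top_b` suggests a clean "feature 1" neuron. Neither story is wrong about its own dataset, but neither describes the neuron.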

Algorithmic Tasks

A Mechanistic Interpretability Analysis of Grokking
- Conflict of interest note - I was the main person working on this project!
- A very detailed reverse engineering of a tiny model trained to do modular addition and interpreting it during training, plus a bunch of discussion on phase changes, an (attempted) explanation of grokking and showing grokking on other tasks.
  - Grokking probably isn’t that relevant to real models and the techniques don’t really generalise, but a good example of detailed reverse engineering + fully understanding a model on an algorithmic task, and of applying interpretability during training.
    - Also a good example of how actually understanding a model can be really useful, and push forwards science of deep learning by explaining confusing phenomena.
  - I also just personally think this project was super fucking cool, even if not that useful.
- Deeply engage with:
  - The key claims and takeaways sections
  - Overview of the modular addition algorithm
    - The key vibe here is “holy shit, that’s a weird/unexpected algorithm”, but also, on reflection, a pretty natural thing to learn if you’re built on linear algebra - this is a core mindset for interpreting networks!
- Skim:
  - Reverse engineering modular addition - understanding the different types of evidence and how they fit together
  - Evolution of modular addition circuits during training - the flavour of what the circuits developing looks like during training, and the fact that once we understand things, we can just literally watch them develop!
    - The interactive graphics in the colab are way better than static images
  - The Phase Changes section - probably the most interesting bits are the explanation of grokking, and the two speculative hypotheses.
- Maybe a good intro paper to replicate! It has an accompanying colab and a list of future directions at the end
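For a feel of the algorithm before opening the colab, here is a plain-Python sketch of the trig-identity approach to modular addition: represent a and b as sines and cosines at a few frequencies, combine them into cos and sin of w(a+b), and score each candidate c by cos(w(a+b-c)). The function names and the particular frequencies are assumptions for illustration; a trained model picks its own handful of frequencies.

```python
import math

def modular_add_logits(a, b, p, freqs):
    """Logit for candidate c is the sum over k of cos(2*pi*k*(a + b - c) / p)."""
    logits = []
    for c in range(p):
        total = 0.0
        for k in freqs:
            w = 2 * math.pi * k / p
            # Angle-addition identities build cos/sin of w(a+b) from per-input terms.
            cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
            sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
            # cos(w(a+b-c)) = cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc)
            total += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
        logits.append(total)
    return logits

def predict(a, b, p=113, freqs=(1, 5, 17)):
    """Return the candidate c with the largest logit (frequencies are arbitrary)."""
    logits = modular_add_logits(a, b, p, freqs)
    return logits.index(max(logits))
```

Every term equals 1 exactly when c = (a + b) mod p and is strictly smaller otherwise, so the waves interfere constructively only at the right answer. Products of sines and cosines are cheap to build from multiplications and linear maps, which is why this is less alien for a network than it first looks.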

Image Circuits

Feature Vis (fairly short)
- An early paper with a really core technique for image interpretability. Doesn’t really transfer to LLMs, but worth getting the vibe, and seeing how this made image interpretability much easier and more rigorous in certain ways - the vibe that this basically automatically gives variable names to neurons.
Multimodal Neurons in Artificial Neural Networks
- An analysis of neurons in a text + image model (CLIP), finding a bunch of abstract + cool neurons. Not a high priority to deeply engage with, but very cool and worth skimming.
- My key takeaways
  - There are so many fascinating neurons! Like, what?
    - There’s a teenage neuron, a Minecraft neuron, a Hitler neuron and an incarcerated neuron?!
  - The intuition that multi-modal models (or at least, models that use language) are incentivised to represent things in a conceptual way, rather than specifically tied to the input format
  - The detailed analysis of the Donald Trump neuron, esp that it is more than just a “activates on Donald Trump” neuron, and instead activates for many different clusters of things, roughly tracking their association with Donald Trump.
    - This seems like weak evidence that neuron activations may split more into interpretable segments, rather than interpretable directions
  - The “adversarial attacks by writing iPod on an apple” part isn’t very deep, but is hilarious
The rest of the circuits thread
- A lot of really cool ideas and scattered threads! Worth skimming and digging into anything that catches your interest. Each individual article is short-ish
- This thread represents, in my opinion, the first serious attempt at reverse engineering a real model (Inception)
- My personal favourites:
  - An Overview of Early Vision Neurons - it’s just fascinating to see the weird shit that happens, super cool to see the hierarchy where simple shapes are in early layers and are built into more abstract shapes in later layers, and to see neurons being sorted into families
    - If you click on a neuron, you’ll see the weight explorer - this is a really fun tool to play around with, and practice just reading off the weights what they do!
  - Visualising weights - somewhat image specific, but a fascinating exploration of the data visualisation questions underlying mechanistic interpretability - visualisations are super useful, but how can we do them in a properly principled way, and how can they mislead?
    - I really want to see more papers like this! These meta questions are really important, but it’s rarely incentivised to publish on them
  - Branch Specialisation - networks spontaneously learn to be modular and the modules seem to be consistent and semantically meaningful?! WTF?

Priority 4: Bonus

Not a paper: The codebase of EasyTransformer, a transformer mechanistic interpretability library I’m writing - I think it’s worth reading for a fairly clean and conceptual-focused implementation of a transformer, specifically reading EasyTransformer.forward and components.py (a file for the various layers) (the actual codebase is pretty long!)
Everything else Chris Olah has ever written
- I’m somewhat biased on this, but I think Chris is just clearly far and away the best interpretability researcher in the world.
- He’s also a massive nerd for good technical communication, interactivity and good graphic design, and I find his work a joy to read.
Interpreting RL Vision
- Interesting application of image circuits techniques to get some insight into an RL model - unclear how much it generalises/works
- The parts about the impact of the amount of and diversity of data on interpretability feel most interesting and general to me.
- Probably the best RL mechanistic interpretability paper I know of (but it’s a pretty low bar :( )
Not a paper: Playing around with OpenAI Microscope - visualizations and top dataset examples of every neuron in a ton of image models! Challenge: What’s the weirdest neuron you can find?
Visualizing and Interpreting the Geometry of BERT (+ blog post)
- An early LLM interpretability paper about understanding how BERT represents language in the residual stream.
- Deeply engage with:
  - Applying t-SNE to the residual stream + getting resulting visualizations. This was really clever and cool, and understanding it is valuable.
- Skim or skip:
  - The detailed syntax tree stuff.
Acquisition of Chess Knowledge in AlphaZero - analysing AlphaZero’s chess knowledge, including during training
- Notable for the hilarious stunt of getting a chess grandmaster to comment on the results and co-author the paper (even if this isn’t that interpretability related)
- Focuses on feature analysis rather than really mechanistic engagement, but still very cool! The main things I think are cool were successfully applying interpretability during training, and on the weird and fucky task of playing chess (and that models trained on non-image/language tasks are somewhat interpretable!).
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks - a decent survey paper on what’s up in the rest of interpretability.
- I’m personally pretty meh about the majority of the academic field of interpretability (I rarely find insights from there useful in my work) and would prioritise reading the papers in the previous sections, but it’s worth skimming to get a sense for what’s out there, and digging into anything relevant to a specific project you’re pursuing!
  - Also, for sanity checking whether I’m just being overconfident/arrogant, and there’s actually a ton of useful insights in standard interpretability for mechanistic stuff! Again, this post is just a list of my personal hot takes.
- A Primer in BERTology - a survey paper on BERTology, a subfield specifically about interpreting BERT. I feel pretty meh about this, but am not very familiar with the field.
The Building Blocks of Interpretability
- A cool and fun whirlwind tour of a bunch of different tools and approaches for image interpretability. Worth skimming.
Not a paper, but I find Chris Olah’s interview on the 80,000 Hours podcast super inspiring

aogara (2y)

Great resource, thanks for sharing! As somebody who's not too deeply familiar with either mechanistic interpretability or the academic field of interpretability, I find myself confused by the fact that AI safety folks usually dismiss the large academic field of interpretability. Most academic work on ML isn't useful for safety because safety studies different problems with different kinds of systems. But unlike focusing on worst-case robustness or inner misalignment, I would expect generating human understandable explanations of what neural networks are doing to be interesting to plenty of academics, and I would think that's what the academics are trying to do. Are they just bad at generating insights? Do they look for the wrong kinds of progress, perhaps motivated by different goals? Why is the large academic field of interpretability not particularly useful for x-risk motivated AI safety?

Neel Nanda (2y)

Honestly, I also feel fairly confused by this - mechanistic questions are just so interesting. Empirically, I've fairly rarely found academic interpretability that interesting or useful, though I haven't read that widely (though there are definitely some awesome papers from academia as linked in the post, and some solid academics, and many more papers that contain some moderately useful insight).

To be clear, I am focusing on mechanistic interpretability - actually reverse engineering the underlying algorithms learned by a model - and I think there's legitimate and serious work to be done in other areas that could reasonably be called interpretability.

My take would roughly be that there's a few factors - but again, I'm fairly confused by this and "it's actually great and I'm just being a chauvinist" is also a pretty coherent explanation (and I know some alignment researchers who'd argue for the latter hypothesis):

- Doing rigorous mechanistic work is just fairly hard, and doesn't really fit the ML paradigm - it doesn't really work to frame in terms of eg benchmarks, and it's often more qualitative than quantitative. And thus it is both difficult to do and hard to publish.
- Lots of interpretability/explainability work treats the ground truth as things like "do human operators rate this explanation as helpful" or "does this explanation help human operators understand the model's output better", which feel like fairly boring metrics to me, and not very relevant to mechanistic stuff.
- Lots of work focuses too much on pretty abstractions (eg syntax trees and formal grammars) and not enough on grounding their work in what's actually going on inside the model.
- Mechanistic interpretability is pre-paradigmatic - there just isn't an agreed upon way to make progress and find truth, nor an established set of techniques. This both makes it harder to do research in, and harder to judge the quality of work in (and thus also harder to publish in!).

I think ease of publishing is a pretty important point, even if an academic doesn't personally care about publications, often their collaborators/students/supervisors might, and there's strong career incentives to care. Eg, if a PhD student wants to work on this, and it'd be much harder to publish on it, a good supervisor probably should be discouraging of it (within reason), since part of their job is looking out for their student's career interests.

Though hopefully it's getting easier to publish in nowadays! There's a few mechanistic papers submitted to ICLR.

the gears to ascension (2y)

I mean, personally I'd say it's the only hope we have of making any of the reflection algorithms have any shot at working. you can't do formal verification unless your network is at least interpretable enough that, when formal verification fails, you can know what it is about the dataset that means the network had to make your non-neural prover run out of compute time when you try to ask what the lowest margin to a misbehavior is. if the network's innards aren't readable enough to get an intuitive sense of why a subpath failed the verification or what computation the network performed that timed out the verifier, it's hard to move the data around to clarify.

of course, this doesn't help that much if it turns out that a strong planner can amplify a very slight value misalignment quickly as expected by miri folks; afaict, miri is worried that the process of learning (their word is "self improving") can speed up a huge amount when the network can make use of the full self-rewriting possibility of its substrate and the network properly understands the information geometry of program updates (ie, afaict, they expect significant amounts of generalizing improvement of architecture or learning rule or such things once it's strong enough to become a strong quine as an incidental step of doing its core task.)

and so interpretability would be expected to be made useless by the ai breaking into your tensorflow to edit its own compute graph or your pytorch to edit its own matmul invocation order or something. presumably that doesn't happen at the expected level until you have an ai strong enough to significantly exceed the generalization performance of current architecture search incidentally without being aimed at that, because the ais they're imagining wouldn't have even been trained on that specifically the way eg alphatensor was narrowly aimed at matmul itself.

wow this really got me on a train of thinking, I'm going to post more rambling to my shortform.

infinitevoid (2y)

Thanks for writing this - I've found it useful in my current attempts to survey some key mechanistic interpretability literature.

a decent survey paper on what’s up in the rest of interpretability.
I’m personally pretty meh about the majority of the academic field of interpretability

A bit confused by this. This paper's abstract and intro claim to be focusing on inner interpretability methods - which they define as learned features and internal structure. This seems to fit my idea of what mechanistic interpretability is pretty well, but you seem to classify it as 'the rest of interpretability'.

Do you see a clear distinction between mechanistic interpretability methods vs the methods reviewed in this paper? If so, what's the distinction?

Neel Nanda (2y)

This is a fair point! I honestly have only vaguely skimmed that survey, and got the impression there was a lot of stuff in there that I wasn't that interested in. But it's on my list to read properly at some point, and I can imagine updating this a bunch.
