When justifying my mechanistic interpretability research interests to others, I've occasionally found it useful to borrow a distinction from physics and distinguish between 'fundamental' and 'applied' interpretability research.
Fundamental interpretability research is the kind that investigates better ways to think about the structure of the function learned by neural networks. It lets us make new categories of hypotheses about neural networks. In the ideal case, it suggests novel interpretability methods based on new insights, but it is not the methods themselves.
Examples include:
- A Mathematical Framework for Transformer Circuits (Elhage et al., 2021)
- Toy Models of Superposition (Elhage et al., 2022)
- Polysemanticity and Capacity in Neural Networks (Scherlis et al., 2022)
- Interpreting Neural Networks through the Polytope Lens (Black et al., 2022)
- Causal Abstraction for Faithful Model Interpretation (Geiger et al., 2023)
- Research agenda: Formalizing abstractions of computations (Jenner, 2023)
- Work that looks for ways to identify modules in neural networks (see LessWrong 'Modularity' tag).
Applied interpretability research is the kind that uses existing methods to find the representations or circuits that particular neural networks have learned. It generally involves finding facts or testing hypotheses about a given network (or set of networks) based on assumptions provided by theory.
Examples include:
- Steering GPT-2-XL by adding an activation vector (Turner et al., 2023)
- Discovering Latent Knowledge in Language Models (Burns et al., 2022)
- The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable (Millidge et al., 2022)
- In-context Learning and Induction Heads (Olsson et al., 2022)
- We Found An Neuron in GPT-2 (Miller et al., 2023)
- Language models can explain neurons in language models (Bills et al., 2023)
- Acquisition of Chess Knowledge in AlphaZero (McGrath et al., 2021)
Although I've found the distinction between fundamental and applied interpretability useful, it's not always clear-cut:
- Sometimes articles are part fundamental, part applied (e.g. arguably 'A Mathematical Framework for Transformer Circuits' is mostly theoretical, but also studies particular language models using new theory).
- Sometimes articles take generally accepted 'fundamental' -- but underutilized -- assumptions and develop methods based on them (e.g. Causal Scrubbing, where the key underutilized fundamental assumption was that the structure of neural networks can be well studied using causal interventions).
- Other times the distinction is unclear because applied interpretability feeds back into fundamental interpretability, leading to fundamental insights about the structure of computation in networks (e.g. the Logit Lens lends weight to the theory that transformer language models do iterative inference).
Why I currently prioritize fundamental interpretability
Clearly both fundamental and applied interpretability research are essential. We need both in order to progress scientifically and to ensure future models are safe.
But given our current position on the tech tree, I find that I care more about fundamental interpretability.
The reason is that current interpretability methods are unsuitable for comprehensively interpreting networks on a mechanistic level. So far, our methods only seem to be able to identify particular representations that we look for or describe how particular behaviors are carried out. But they don't let us identify all representations or circuits in a network or summarize the full computational graph of a neural network (whatever that might mean). Let's call the ability to do these things 'comprehensive interpretability' [1].
We need comprehensive interpretability in order to have strong-ish confidence about whether dangerous representations or circuits exist in our model. If we don't have strong-ish confidence, then many theories of impact for interpretability are inordinately weakened:
- We're a lot less able to use interpretability as a 'force multiplier on alignment research' because we can't trust that our methods haven't missed something crucial. This is particularly true when models are plausibly optimizing against us and hiding dangerous thoughts in places we aren't looking. A similar pattern holds for theories of impact based on 'Empirical evidence for/against threat models', 'Improving human feedback', and 'Informed oversight'.
- We can't be confident about our interpretability audits. Not only does this raise the risk that we'll miss something, but it makes it much harder to justify including interpretability in regulations, since effective regulation usually requires technical clarity. It also makes it harder for clear norms around safety to form.
- We don't get the coordination/cooperation benefits resulting from some actors being able to actually trust other actors' systems.
- We definitely can't use our interpretability methods in the loss function. To be clear, we probably shouldn't do this even if we believed we had comprehensive interpretability. We'd probably want provably comprehensive interpretability (or some other reason to believe that our interpretability methods wouldn't simply be circumvented) before we could safely justify using them in the loss function.
For most of these theories of impact, the relationship feels like it might be nonlinear: a slight improvement to interpretability that nevertheless falls short of comprehensive interpretability does not lead to proportional safety gains; only when we cross a threshold to something resembling comprehensive interpretability would we get the bulk of the safety gains. And right now, even though there's a lot of valuable applied work to be done, it feels to me like progress in fundamental interpretability is the main determinant of whether we cross that threshold.
[1] Similar terms for 'comprehensive interpretability' include Anthropic's notion of 'enumerative safety', Evan Hubinger's notion of 'worst-case inspection transparency', and Erik Jenner's notion of 'quotient interpretability'.
Interesting thoughts!
It reminds me not only of my own writing on a similar theme, but also of another of these viewpoints/axes along which to carve up interpretability work, mentioned in this post by jylin04:
I don't necessarily totally agree with her phrasing but it does feel a bit like we are all gesturing at something vaguely similar (and I do agree with her that PDLT-esque work may have more insights in this direction than some people on our side of the community have appreciated).
FWIW, in a recent comment reply to Joseph Bloom, I also ended up saying a bit more about why I don't actually see myself working much more in this direction, despite it seeming very interesting, but I'm still on the fence about that. (And one last point that didn't make it into that comment is the difficulty posed by a world in which the plucky bands of interpretability researchers on the fringes increasingly don't even know what the cutting-edge architectures and training processes in the biggest labs are.)