The definitions of activation patching, causal tracing, and resample ablation here seem out of date compared to how you define them in your post on attribution patching.
Thanks for writing this. A question:
Features as neurons is the more specific hypothesis that, not only do features correspond to directions, but that each neuron corresponds to a feature, and that the neuron’s activation is the strength of that feature on that input.
Shouldn't it be "each feature corresponds to a neuron" rather than "each neuron corresponds to a feature"?
Because some could just be calculations on the way to higher-level features (part of a circuit).
Fair point, corrected.
Because some could just be calculations on the way to higher-level features (part of a circuit).
IMO, the intermediate steps should mostly be counted as features in their own right, but it'd depend on the circuit. The main reason I agree is that neurons probably still do some other stuff, eg memory management or signal boosting earlier directions in the residual stream.
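To make the distinction above concrete, here's a minimal toy sketch in numpy (the arrays, dimensions, and index are all made up for illustration, not taken from the doc): under "features as directions" a feature's strength on an input is the projection of the activations onto some direction, while "features as neurons" is the stricter special case where that direction is a standard basis vector, so the strength is just a single neuron's activation.

```python
import numpy as np

# Toy example: reading off a feature's strength under the two hypotheses.
# All numbers and names here are made up for illustration.
mlp_acts = np.array([0.1, 2.3, 0.0, -0.4])  # hypothetical post-nonlinearity MLP activations

# "Features as directions": a feature is a direction in activation space,
# and its strength on this input is the projection onto that direction.
feature_direction = np.array([0.5, 0.5, 0.5, 0.5])
strength_as_direction = float(mlp_acts @ feature_direction)

# "Features as neurons": the special case where that direction is a
# standard basis vector, so the strength is just one neuron's activation.
neuron_index = 1
strength_as_neuron = float(mlp_acts[neuron_index])

print(strength_as_direction, strength_as_neuron)
```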
This is a linkpost for a very long doc defining, explaining, and giving intuitions and conceptual frameworks for all the concepts I think you should know about when engaging with mechanistic interpretability. If you find the UI annoying, there's an HTML version here.
Why does this doc exist?
How to read this doc
Table of Contents