The definitions of activation patching, causal tracing, and resample ablation here seem out of date compared to how you define them in your post on attribution patching.
Thanks for writing this. A question:
Features as neurons is the more specific hypothesis that, not only do features correspond to directions, but that each neuron corresponds to a feature, and that the neuron’s activation is the strength of that feature on that input.
Shouldn't it be "each feature corresponds to a neuron" rather than "each neuron corresponds to a feature"?
Because some could just be calculations on the way to higher-level features (part of a circuit).
Fair point, corrected.
Because some could just be calculations on the way to higher-level features (part of a circuit).
IMO, the intermediate steps should mostly be counted as features in their own right, but it'd depend on the circuit. The main reason I agree is that neurons probably still do some other stuff, eg memory management or signal boosting earlier directions in the residual stream.
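To make the distinction above concrete, here's a minimal toy sketch in numpy (the arrays, dimensions, and index are all made up for illustration, not taken from the doc): under "features as directions" a feature's strength on an input is the projection of the activations onto some direction, while "features as neurons" is the stricter special case where that direction is a standard basis vector, so the strength is just a single neuron's activation.

```python
import numpy as np

# Toy example: reading off a feature's strength under the two hypotheses.
# All numbers and names here are made up for illustration.
mlp_acts = np.array([0.1, 2.3, 0.0, -0.4])  # hypothetical post-nonlinearity MLP activations

# "Features as directions": a feature is a direction in activation space,
# and its strength on this input is the projection onto that direction.
feature_direction = np.array([0.5, 0.5, 0.5, 0.5])
strength_as_direction = float(mlp_acts @ feature_direction)

# "Features as neurons": the special case where that direction is a
# standard basis vector, so the strength is just one neuron's activation.
neuron_index = 1
strength_as_neuron = float(mlp_acts[neuron_index])

print(strength_as_direction, strength_as_neuron)
```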
This is a linkpost for a very long doc defining, explaining, and giving intuitions and conceptual frameworks for all the concepts I think you should know about when engaging with mechanistic interpretability. If you find the UI annoying, there's an HTML version here.
Why does this doc exist?
How to read this doc
Table of Contents