I strongly upvoted this post because of the "Tips" section, which is something I've come around on only in the last ~2.5 months.
> neuron has

I was confused by the singular "neuron."
I think the point here is that if there are some neurons which have low activation but high direct logit attribution after layernorm, then this is pretty good evidence for "smuggling."
Is my understanding here basically correct?
> This happens in transformer MLP layers. Note that the hidden dimen

Is the point that transformer MLPs blow up the hidden dimension in the middle?
Thanks for the catch, I deleted "Note that the hidden dimen". Transformers do blow up the hidden dimension, but that's not very relevant here - they have many more neurons than residual stream dimensions, and they have many more features than neurons (as shown in the recent Anthropic paper).
Important Note: Since writing this, there's been a lot of exciting work on understanding superposition via training sparse autoencoders to take features out of superposition. I recommend reading up on that work, since it substantially changes the landscape of what problems matter here.
This is the fifth post in a sequence called 200 Concrete Open Problems in Mechanistic Interpretability. Start here, then read in any order. If you want to learn the basics before you think about open problems, check out my post on getting started. Look up jargon in my Mechanistic Interpretability Explainer.
Motivating papers: Toy Models of Superposition, Softmax Linear Units
Background
If you're familiar with polysemanticity and superposition, skip to Motivation or Problems.
Neural networks are very high dimensional objects, in both their parameters and their activations. One of the key challenges in Mechanistic Interpretability is to somehow resolve the curse of dimensionality, and to break them down into lower dimensional objects that can be understood (semi-)independently.
Our current best understanding of models is that, internally, they compute features: specific properties of the input, like "this token is a verb" or "this is a number that describes a group of people" or "this part of the image represents a car wheel". Early in the model there are simpler features, which are later used to compute more complex features by being connected up in a circuit (example shown above (source)). Further, our guess is that features correspond to directions in activation space. That is, for any feature that the model represents, there is some vector corresponding to it, and if we dot product the model's activations with that vector, we get out a number representing whether that feature is present (these are known as decomposable, linear representations).
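As a concrete (and purely illustrative) picture of "features as directions", here is a minimal numpy sketch. The feature direction and activation vector below are made up for illustration, not taken from any real model.

```python
import numpy as np

d_model = 8                                    # width of a toy activation vector
rng = np.random.default_rng(0)

# Pretend the model represents "this token is a verb" along this unit direction.
verb_direction = rng.normal(size=d_model)
verb_direction /= np.linalg.norm(verb_direction)

# Build an activation containing the feature (strength 3.0) plus unrelated,
# orthogonal content.
unrelated = rng.normal(size=d_model)
unrelated -= (unrelated @ verb_direction) * verb_direction
activation = 3.0 * verb_direction + unrelated

# The dot product reads off how strongly the feature is present.
print(activation @ verb_direction)             # ~3.0
```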
This is an extremely useful thing to be true about a model! An even more helpful thing to be true would be if neurons (ie the outputs of an activation function like ReLU) correspond to features. Naively, this is natural for the model to do, because a non-linearity like ReLU acts element-wise - each neuron's activation is computed independently (this is an example of a privileged basis). Concretely, if a single neuron has to represent feature A or feature B, then that neuron will fire differently for "A and NOT B" vs "A and B", meaning that the presence of B interferes with the ability to compute A. But if each feature gets its own neuron, we're fine!
If features correspond to neurons, we're playing interpretability on easy mode - we can focus on just figuring out which feature corresponds to each neuron. In theory we could even show that a feature is not present by verifying that it's not present in each neuron! However, reality is not as nice as this convenient story. A countervailing force is the phenomenon of superposition. Superposition is when a network represents more features than it has dimensions, and squashes them all into a lower dimensional space. You can think of superposition as the model simulating a larger model.
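To make "more features than dimensions" concrete, here is a small illustrative sketch that packs 100 random unit vectors into 20 dimensions and measures how much they interfere. A real model chooses its directions by optimisation rather than at random, but the basic trade-off (extra capacity at the cost of interference) is the same.

```python
import numpy as np

d, n_features = 20, 100                        # 100 "features" in 20 dimensions
rng = np.random.default_rng(0)

# Random unit vectors in a low-dimensional space are only approximately orthogonal.
directions = rng.normal(size=(n_features, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Interference: the off-diagonal dot products are non-zero.
overlaps = directions @ directions.T
off_diag = np.abs(overlaps[~np.eye(n_features, dtype=bool)])
print(f"mean interference: {off_diag.mean():.2f}, max interference: {off_diag.max():.2f}")
```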
Anthropic's Toy Models of Superposition paper is a great exploration of this. They build a toy model that learns to use superposition (notably different from a toy language model!). The model starts with a bunch of independently varying features, needs to compress these to a low dimensional space, and then is trained to recover each feature from the compressed mess. And it turns out that it does learn to use superposition!
Specifically, it makes sense to use superposition for sufficiently rare (sparse) features, if we give it non-linearities to clean up interference. Further, the use of superposition can be modelled as a trade-off between the costs of interference, and the benefits of representing more features. And digging further into their toy models, they find all kinds of fascinating motifs regarding exactly how superposition occurs, notably that the features are sometimes compressed in geometric configurations, eg 5 features being compressed into two dimensions as the vertices of a pentagon, as shown below.
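For intuition, here is a rough PyTorch sketch of the paper's ReLU output model as I understand it: n sparse features are compressed into a d_hidden < n dimensional bottleneck and reconstructed through a ReLU, with an importance-weighted squared error loss. The sizes, sparsity, importance schedule and training details below are placeholder choices, not the paper's exact settings.

```python
import torch

n_features, d_hidden = 20, 5                       # assumed sizes, not the paper's
sparsity = 0.9                                     # P(a given feature is off)
importance = 0.9 ** torch.arange(n_features)       # geometric importance decay

W = torch.nn.Parameter(torch.randn(d_hidden, n_features) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

def sample_batch(batch_size=1024):
    # Each feature is uniform [0, 1] when on, and 0 when off (i.e. sparse).
    x = torch.rand(batch_size, n_features)
    mask = (torch.rand(batch_size, n_features) > sparsity).float()
    return x * mask

for step in range(10_000):
    x = sample_batch()
    h = x @ W.T                                    # compress: n_features -> d_hidden
    x_hat = torch.relu(h @ W + b)                  # reconstruct through a ReLU
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The columns of W are the learned feature directions; with sparse features
# they can form geometric configurations like the pentagon mentioned above.
```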
Motivation
Zooming out, what does this mean for what research actually needs to be done? To me, when I imagine what real progress here might look like, I picture the following:
The direction I'm most excited about is a combination of 1 and 2, to form a rich feedback loop between toy models and real models - toy models generate hypotheses to test, and exploring real models generates confusions to study in toy models.
Resources
Tips
Problems
This spreadsheet lists each problem in the sequence. You can write down your contact details if you're working on any of them and want collaborators, see any existing work, or reach out to other people on there! (Thanks to Jay Bailey for making it.)
Notation: the ReLU output model is the main model in the Toy Models of Superposition paper, which compresses features in a linear bottleneck; the absolute value model is the model studied with a ReLU hidden layer and output layer, which uses neuron superposition.
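Since many of the problems below modify the absolute value model, here is a rough sketch of that setup as I understand it, assuming inputs that are uniform in [-1, 1] when on and an elementwise |x| target; all sizes and training details are placeholder choices.

```python
import torch

# Computing |x| exactly takes ~2 ReLU neurons per feature, so using only 10
# neurons for 20 sparse features pushes the model towards neuron superposition.
n_features, d_hidden = 20, 10
sparsity = 0.99
importance = torch.ones(n_features)

W1 = torch.nn.Parameter(torch.randn(d_hidden, n_features) * 0.1)
b1 = torch.nn.Parameter(torch.zeros(d_hidden))
W2 = torch.nn.Parameter(torch.randn(n_features, d_hidden) * 0.1)
b2 = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W1, b1, W2, b2], lr=1e-3)

def sample_batch(batch_size=1024):
    # Features are uniform in [-1, 1] when on, 0 when off (an assumption here).
    x = torch.rand(batch_size, n_features) * 2 - 1
    mask = (torch.rand(batch_size, n_features) > sparsity).float()
    return x * mask

for step in range(10_000):
    x = sample_batch()
    h = torch.relu(x @ W1.T + b1)              # ReLU hidden layer (the "neurons")
    y = torch.relu(h @ W2.T + b2)              # ReLU output layer
    loss = (importance * (y - x.abs()) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```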
- Confusions about models that I want to see studied in a toy model:
- A* 4.1 - Does dropout create a privileged basis? Put dropout on the hidden layer of the ReLU output model and study how this changes the results. Do the geometric configurations happen as before? And are the feature directions noticeably more (or less!) aligned with the hidden dimension basis? (See the sketch after this list for one possible starting point.)
- B-C* 4.2 - Replicate their absolute value model and try to study some of the variants of the ReLU output models in this context. Try out uniform vs non-uniform importance, correlated vs anti-correlated features, etc. Can you find any more motifs?
- B* 4.3 - Explore neuron superposition by training their absolute value model on a more complex function like `x -> x^2`. This should need multiple neurons per function to do well
- B* 4.4 - What happens to their ReLU output model when there's non-uniform sparsity? Eg one class of less sparse features, and another class of very sparse features.
- Explore neuron superposition by training their absolute value model on functions of multiple variables:
- A* 4.5 - Make the inputs binary (0 or 1), and look at the AND or OR of pairs of elements
- B* 4.6 - Keep the inputs as uniform reals in `[0, 1]`, and look at `max(x, y)`
- Adapt their ReLU output model to have a different range of feature values, and see how this affects things. Currently the features are uniform `[0, 1]` if on (and 0 if off):
- A* 4.7 - Make the features exactly 1 when on (ie exactly two possible values)
- B* 4.8 - Make the features discrete, eg 1, 2 or 3
- B* 4.9 - Make the features uniform `[0.5, 1]`
- A-B* 4.10 - What happens if you replace ReLUs with GELUs in their toy models? (either for the ReLU output model, or the absolute value model). Does it just act like a smoother ReLU?
- C* 4.11 - Can you find a toy model where GELU acts significantly differently from ReLU? A common intuition is that GELU is mostly a smoother ReLU, but close to the origin GELU can act more like a quadratic. Does this ever matter?
- C* 4.12 - Build a toy model of a classification problem, where the loss function is cross-entropy loss (not mean squared error loss!)
- C* 4.13 - Build a toy model of neuron superposition that has many more hidden features to compute than output features. Ideas:
- Have n input features and an output feature for each pair of input features, and train it to compute the max of each pair.
- Have discrete input data, eg if it's on, take on values in `[1.0, 2.0, 3.0, 4.0, 5.0]`, and have 5 output features per input feature, with the label being `[1,0,0,0,0], [0,1,0,0,0], ...` and mean-squared error loss.
- C* 4.14 - Build a toy model of neuron superposition that needs multiple hidden layers of ReLUs. Can computation in superposition happen across several layers? Eg `max(|x|, |y|)`
- (The remaining problems in this list survive only as fragments here; they reference induction circuits `A ... B -> A` - ie, if the current token is B, and token A occurred in the past, predict that A comes next - in `attn-only-2l` in TransformerLens, Indirect Object Identification in `gpt2-small`, and `solu-1l`.)
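For example, here is a hedged sketch of a starting point for problem 4.1 (dropout on the hidden layer of the ReLU output model). The dropout rate and sizes are arbitrary choices, and the final print is just one crude way to operationalise "aligned with the hidden dimension basis".

```python
import torch

n_features, d_hidden, p_drop = 20, 5, 0.2          # arbitrary placeholder values
sparsity = 0.9
importance = 0.9 ** torch.arange(n_features)

W = torch.nn.Parameter(torch.randn(d_hidden, n_features) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
dropout = torch.nn.Dropout(p=p_drop)
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(10_000):
    # Sparse features, uniform [0, 1] when on.
    x = torch.rand(1024, n_features) * (torch.rand(1024, n_features) > sparsity).float()
    h = dropout(x @ W.T)                           # dropout on the hidden layer
    x_hat = torch.relu(h @ W + b)
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# One crude check for a privileged basis: how concentrated each learned
# feature direction (column of W) is on a single hidden dimension.
with torch.no_grad():
    cols = W / W.norm(dim=0, keepdim=True)
    print(cols.abs().max(dim=0).values)            # 1.0 = perfectly basis-aligned
```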