If I understand correctly, this residual decomposition is equivalent to the edge / factorized view of a transformer described here.
Update: actually the residual decomposition is incorrect - see my other comment.
I agree, this seems like exactly the same thing, which is great! In hindsight it's not surprising that you / other people have already thought about this.
Do you think the 'tree-ified view' (to use your name for it) is a good abstraction for thinking about how a model works? Are individual terms in the expansion the right unit of analysis?
Just to make it explicit and check my understanding: the residual decomposition is equivalent to the edge / factorized view of the transformer in that we can express any term in the residual decomposition as a set of edges that form a path from input to output, e.g.
= input -> output
= input -> Attn 1.0 -> MLP 2 -> Attn 4.3 -> output
And it follows that the (pre final layernorm) output of a transformer is the sum of all the "paths" from input to output constructed from the factorized DAG.
@Oliver Daniels-Koch's reply to my comment made me read this post again more carefully, and now I think that your formulation of the residual expansion is incorrect.
Given it does not follow that because is a non-linear operation. It cannot be decomposed like this.
My understanding of your big summation (with representing any MLP or attention head):
again does not hold because the s are non-linear.
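A minimal numerical sketch of this point, under stated assumptions: `att`, `mlp`, and `linear_mlp` are hypothetical stand-ins (a random linear map, a one-hidden-layer ReLU network, and the same network with the ReLU removed), not components of any real model. It compares the exact one-layer composition against the four-term expansion.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden, batch = 4, 8, 16

# Hypothetical toy weights (assumptions for illustration only).
W_att = rng.normal(size=(d, d))
W_in = rng.normal(size=(d, hidden))
W_out = rng.normal(size=(hidden, d))

def att(x):
    # Linear stand-in for an attention block.
    return x @ W_att

def mlp(x):
    # Nonlinear stand-in: one hidden layer with ReLU.
    return np.maximum(x @ W_in, 0.0) @ W_out

def linear_mlp(x):
    # Same weights with the ReLU removed, so the map is linear.
    return x @ W_in @ W_out

def exact(m, x):
    # (m + Id) applied after (att + Id): what the network computes.
    h = x + att(x)
    return h + m(h)

def expanded(m, x):
    # The proposed four-term expansion: Id + att + m + m(att).
    return x + att(x) + m(x) + m(att(x))

x = rng.normal(size=(batch, d))
gap_relu = np.max(np.abs(exact(mlp, x) - expanded(mlp, x)))
gap_linear = np.max(np.abs(exact(linear_mlp, x) - expanded(linear_mlp, x)))
print("gap with ReLU MLP:  ", gap_relu)
print("gap with linear MLP:", gap_linear)
```

In this sketch the two sides agree up to floating-point error when the MLP is linear and differ once the ReLU is present, because `mlp(x + att(x))` is not `mlp(x) + mlp(att(x))` in general.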
There are two similar ideas which do hold, namely (1) the treeified / unraveled view and (2) the factorized view (both of which are illustrated in figure 1 here), but your residual expansion / big summation is not equivalent to either.
The treeified / unraveled view is the most similar. It separates each path from input to output, but the difference is that this does not claim that the output is the sum of all separate paths.
The factorized view follows from the treeified view and is just the observation that any point in the residual stream can be decomposed into the outputs of all previous components.
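The factorized view can be checked on a toy residual network. The sketch below assumes a hypothetical `component(i, resid)` (a random `tanh` layer) standing in for any attention head or MLP; it is an illustration, not a real transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_components = 4, 3

# Hypothetical nonlinear components (assumption for illustration only).
weights = [rng.normal(size=(d, d)) for _ in range(n_components)]

def component(i, resid):
    # Each component reads the full residual stream at its position.
    return np.tanh(resid @ weights[i])

x = rng.normal(size=d)
resid = x.copy()
outputs = []
for i in range(n_components):
    out = component(i, resid)
    outputs.append(out)
    resid = resid + out  # components write back additively

# The final residual stream equals the input plus every component's output.
reconstructed = x + sum(outputs)
print(np.allclose(resid, reconstructed))
```

Note that this is a sum over components, not over paths: each `out` still depends nonlinearly on everything written upstream of it.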
If I understand correctly, you're saying that my expansion is wrong, because , which I agree with.
Yes is what I'm saying.
That makes sense to me. I guess I'm dissatisfied here because the idea of an ensemble seems to be that individual components in the ensemble are independent; whereas in the unraveled view of a residual network, different paths still interact with each other (e.g. if two paths overlap, then ablating one of them could also (in principle) change the value computed by the other path). This seems to be the mechanism that explains redundancy.
I think this makes sense.
I am not sure how new this approach is (for simplified Transformers, the original AMFOTC paper has several sections called "* Path Expansion *", which seem to do something very similar for a reduced set of transformations, and their formalism of "virtual attention heads" seems also to be in that spirit).
Fair point, and I should amend the post to point out that AMFOTC also does 'path expansion'. However, I think this is still conceptually distinct from AMFOTC because:
Maybe this post is better framed as 'reconciling AMFOTC with SAE circuit analysis'.
Yes, I think this makes sense.
Here is one aspect which might be useful to keep in mind.
If we think about all this as some kind of "generalized Taylor expansion", there are some indications that the deviations from linearity might be small.
E.g. there is this rather famous post, https://www.lesswrong.com/posts/JK9nxcBhQfzEgjjqe/deep-learning-models-might-be-secretly-almost-linear.
Another indication pointing to "almost linearity" is that "model merge" works pretty well. Although, interestingly enough, people often prefer to approach "model merge" in a more subtle fashion than just linear interpolation, so, presumably, non-linearity does matter quite a bit as well, e.g. https://huggingface.co/blog/mlabonne/merge-models.
Edit: The math here has turned out to be wrong. See Joseph Miller's reply here. I will revise the main content of this post at some point to reflect this.
This is an informal note describing my current approach for thinking about transformer circuits. I've not spent a lot of time thinking deeply about this but I believe the overall claims here are correct.
Note: A lot of the high-level ideas I include here are not really original, but I haven't seen the specific framing here applied to transformers, and I would like more people to think about this / tell me whether this is obviously flawed in some way.
AMFOTC and its Limitations
A Mathematical Framework is one of my favourite mech interp papers ever, and has spawned a very successful subfield of circuit analysis. I really like it because it provides a general framework for how to think about transformers and circuits.
However, there are some notable limitations to this framework:
Also, this framework centres on the 'model basis' (attention heads, residual stream) and fails to incorporate other ideas (superposition, SAEs).
So I spent some time thinking about how we might extend this framework and here's what I came up with.
The Residual Expansion
A very old idea in machine learning, dating all the way back to ResNets, is that a sequence of residual operations can be 'expanded out' into a set of feedforward operations.
To make this concrete, let's consider a 1-layer transformer with attention and MLP blocks.
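As a hedged sketch of the one-layer case, assuming the symbols $\mathrm{Id}$, $\mathrm{Att}$ and $\mathrm{MLP}$ for the identity, attention block and MLP block (matching the summation notation used in this post) and ignoring layer norm:

```latex
\begin{align*}
T    &= (\mathrm{Id} + \mathrm{MLP}) \circ (\mathrm{Id} + \mathrm{Att}) \\
T(x) &= x + \mathrm{Att}(x) + \mathrm{MLP}\bigl(x + \mathrm{Att}(x)\bigr)
\end{align*}
```

The last term splits further into $\mathrm{MLP}(x) + \mathrm{MLP}(\mathrm{Att}(x))$ only if $\mathrm{MLP}$ is linear, which is the caveat raised in the edit note at the top of this post.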
More generally, for an n-layer transformer, we can write a big summation of terms:
$$T = (\mathrm{Id}) + \left( \sum_{i=0}^{n-1} \mathrm{MLP}_i + \sum_{i=0}^{n-1} \mathrm{Att}_i \right) + \cdots$$
What does this get us?
The residual decomposition gives us a sum of feedforward paths through the model, each of which is nonlinear.
Circuits. A circuit could possibly be represented as a sum of a small number of these feedforward paths.
AMFOTC. The residual decomposition is fully compatible with AMFOTC.
Individual terms. Generally, many other ideas in interpretability can be thought of as attempts to understand individual terms in the residual decomposition.
Other remarks
Tl;dr I think this is a nice unifying way to think about lots of circuit analysis work.
Open Questions / Ideas
Here are some ideas motivated by this line of thinking.
Conclusion
I'm very interested to hear takes on this!