Cute construction! To check, am I correct that you're adding an attention head per neuron? To me that makes this prohibitive enough to not actually be useful for real models - eg, in GPT-2 Small that'd take you from 12 heads per layer to about 3,000 per layer.
That's right, the activation function sublayer needs 1 attention head per neuron. The other sublayers can get away with fewer - the attention sublayer needs the usual amount, and the linear transformation sublayer just needs enough to spread the rank of the weight matrix across the V matrices of its attention heads. I'm most familiar with the size hyperparameters of GPT-3 (Table 2.1), but in full-size GPT-3, for each sublayer:
- heads for the attention sublayer
- heads for the weight matrix calculating into the hidden layer
- heads for the activation function
- heads for the weight matrix calculating out of the hidden layer
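For concreteness, here's a rough back-of-envelope in Python using the Table 2.1 hyperparameters (d_model = 12288, 96 heads, d_head = 128, and d_ff = 4·d_model as assumed in the post). Treating each head's V matrix as contributing at most d_head to the rank is my own simplification; the activation-function count is exact for this construction:

```python
import math

# Rough head counts per layer for full-size GPT-3 (Table 2.1 hyperparameters).
d_model, n_heads, d_head = 12288, 96, 128
d_ff = 4 * d_model  # hidden width assumed in the post

attention_sublayer = n_heads                 # 96, unchanged
into_hidden = math.ceil(d_model / d_head)    # 96 heads to cover rank(W1), assuming d_head of rank per head
activation = d_ff                            # 49,152: one head per hidden neuron
out_of_hidden = math.ceil(d_model / d_head)  # 96 heads to cover rank(W2), same assumption

print(attention_sublayer, into_hidden, activation, out_of_hidden)
```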
Aren't attention networks and MLPs both subsets of feedforward networks already? What you really mean is "Attention can implement fully-connected MLPs"?
Calling fully-connected MLPs "feedforward networks" is common (e.g. in the original transformer paper https://arxiv.org/pdf/1706.03762.pdf), so I tried to use that language here for the sake of the transformer-background people. But yes, I think "Attention can implement fully-connected MLPs" is a correct and arguably more accurate way to describe this.
Given the general contempt that MLPs are held in at present, and the extent to which people seem to regard self-attention as magic pixie dust which cannot be replicated by alternatives like CNNs or MLPs and which makes Transformers qualitatively different from anything before & solely responsible for the past ~4 years of DL progress (earlier discussion defending MLP prospects), it might be more useful to emphasize the other direction: if you can convert any self-attention to an equivalent fully-connected MLP, then that can be described as "there is a fully-connected MLP that implements your self-attention". (Incidentally, maybe I missed this in the writeup, but this post is only providing an injective self-attention → MLP construction, right? Not the other way around, so converting an arbitrary MLP layer to a self-attention layer is presumably doable - at least with enough parameters - but remains unknown.)
Unfortunate that the construction is so inefficient: 12 heads → 3,000 heads or 250x inflation is big enough to be practically irrelevant (maybe theoretically too). I wonder if you can tighten that to something much more relevant? My intuition is that MLPs are such powerful function approximators that you should be able to convert between much more similar-sized nets (and maybe smaller MLPs).
In either direction - perhaps you could just directly empirically approximate an exchange rate by training MLPs of various sizes to distill a self-attention layer? Given the sloppiness in attention patterns, it wouldn't necessarily have to be all that accurate. And you could do this for each layer to de-attend a NN, which ought to have nice performance characteristics in addition to being a PoC.
(My prediction would be that the parameter-optimal MLP equivalent would have a width vs depth scaling law such that increasingly large Transformer heads would be approximated by increasingly skinny deep MLP stacks, to allow switching/mixing by depth. And that you could probably come up with an initialization for the MLPs which makes them start off with self-attention-like activity, like you can come up with Transformer initializations that mimic CNN inductive priors. Then you could just drop the distillation entirely and create an MLPized Transformer from scratch.)
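A minimal PyTorch sketch of the distillation experiment I have in mind, with every specific a placeholder (widths, depths, the input distribution, even using a stock MultiheadAttention as the frozen teacher rather than a layer lifted from a real model):

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 64, 4, 16

# Frozen "teacher" self-attention layer (placeholder for a real model's layer).
teacher = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
for p in teacher.parameters():
    p.requires_grad_(False)

def make_student(width, depth):
    # MLP over the flattened sequence, so it can mix tokens like attention does.
    dims = [seq_len * d_model] + [width] * depth + [seq_len * d_model]
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # drop the trailing ReLU

student = make_student(width=2048, depth=3)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(1000):
    x = torch.randn(32, seq_len, d_model)  # placeholder input distribution
    with torch.no_grad():
        target, _ = teacher(x, x, x)
    pred = student(x.flatten(1)).view_as(target)
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()
```

Sweeping `width` and `depth` and recording the loss/parameter count would give the crude exchange rate I mean.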
Incidentally, maybe I missed this in the writeup, but this post is only providing an injective self-attention → MLP construction, right?
Either I'm misunderstanding you or you're misunderstanding me, but I think I've shown the opposite: any MLP layer can be converted to a self-attention layer. (Well, in this post I actually show how to convert the MLP layer to 3 self-attention layers, but in my follow-up I show how you can get it in one.) I don't claim that you can do a self-attention → MLP construction.
Converting an arbitrary MLP layer to a self-attention layer is presumably doable - at least with enough parameters - but remains unknown
This is what I think I show here! Let the unknown be known!
Unfortunate that the construction is so inefficient: 12 heads → 3,000 heads or 250x inflation is big enough to be practically irrelevant (maybe theoretically too).
Yes, this is definitely at an "interesting trivia" level of efficiency. Unfortunately, the construction is built around using 1 attention head per hidden dimension, so I don't see any obvious way to improve the number of heads. The only angle I have for this to be useful at current scale is that Anthropic (paraphrased) said "oh we can do interpretability on attention heads but not MLPs", so the conversion of the latter into the former might supplement their techniques.
Yes, you're right. My bad; I was skimming in a hurry before heading out while focused on my own hobbyhorse of 'how to make MLPs beat Transformers?'. Knew I was missing something, so glad I checked. Now that you put it that way, the intuition is a lot clearer, and shrinking it seems a lot harder: one head per hidden dim/neuron is a straightforward construction but also unclear how much you could be guaranteed to shrink it by trying to merge heads...
The empirical approach, in both directions, might be the best bet here, and has the advantage of being the sort of thing that someone junior could get interesting results on quickly with minimal hardware.
[Epistemic status: Mathematically proven, and I have running code that implements it.]
Overview: A transformer consists of two alternating sublayers: attention heads and feedforward networks (FFNs, also called MLPs). In this post I’ll show how you can implement the latter using the former, and how you can convert an existing transformer with FFNs into an attention-only transformer.
My hope is that such a conversion technique can augment mechanistic interpretability tools such as the ones described in A Mathematical Framework for Transformer Circuits, by reducing the task of interpretability from “interpret attention and FFNs” to just “interpret attention”. That publication specifically points out that “more complete understanding [of Transformers] will require progress on MLP layers”, which I hope this technique can supply.
Limitations:
Notation
Fix a transformer T (such as GPT-3) which uses attention and feedforward networks. Write $D = d_{\text{model}}$ for the internal dimension of the model, $N = n_{\text{ctx}}$ for the number of vectors in the context, and $X$ for the “residual stream”, the $N$-by-$D$ matrix storing the internal state of the model during a forward pass.
We will assume that each feedforward network in T consists of an MLP with one hidden layer of width $d_{\text{ff}} = 4\,d_{\text{model}}$, using the activation function $\alpha(x) = \mathrm{SiLU}(x) = x\,\sigma(x)$[1]. To simplify notation, we will assume that bias terms are built into the weight matrices $W_1$ and $W_2$, which are respectively of sizes $D$-by-$4D$ and $4D$-by-$D$, so that the output of the feedforward network is $\alpha(XW_1)W_2$, where $\alpha$ is applied to the matrix entry-wise.
We’ll follow this notation for attention heads, so that an attention head is characterized by its query-key matrix $Q = W_{QK}$ and its output-value matrix $V = W_{OV}$, each of size $D$-by-$D$[2]. To simplify notation, we will assume that the “$/\sqrt{d_k}$” step of attention has been folded into the $Q$ matrix. Then the output of the attention head is $\mathrm{softmax}[XQX^T]\,XV$, where the softmax operation is applied row-wise.
We assume that both the feedforward network and attention heads make use of skip connections, so that their output is added to the original residual stream. However, we ignore layer normalization.
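For concreteness, here is a minimal NumPy sketch of the two sublayer outputs under the notation above (small arbitrary shapes; bias terms and layer normalization omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 8                      # context length and model dimension
X = rng.normal(size=(N, D))      # the residual stream

# Feedforward sublayer: alpha(X W1) W2, added back via the skip connection.
W1 = rng.normal(size=(D, 4 * D))
W2 = rng.normal(size=(4 * D, D))
silu = lambda z: z / (1 + np.exp(-z))
ffn_out = X + silu(X @ W1) @ W2

# Attention head: softmax(X Q X^T) X V with row-wise softmax, plus skip connection.
Q = rng.normal(size=(D, D))      # W_QK, with the 1/sqrt(d_k) factor folded in
V = rng.normal(size=(D, D))      # W_OV
scores = X @ Q @ X.T
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
attn_out = X + weights @ X @ V
```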
Throughout, we will rely on a large number Ω whose purpose is to dwarf other numbers in the softmax operation of an attention head. In particular, we assume Ω has two properties:
In my code, $\Omega = 1000$ is sufficient for a tolerance of $\varepsilon = 10^{-10}$.
Construction Overview
We will convert the attention-and-feedforward model T into an attention-only model T’ by augmenting the residual stream, replacing the feedforward sublayers with attention sublayers, and tweaking the original attention heads to maintain their original behavior on the augmented residual stream.
We augment the residual stream of the model by:
In T, each layer consists of two sublayers:
In T’, these are replaced by:
The following sections will discuss these steps in the order (3), (2+4), (1), which is descending order of novelty to me.
Entry-wise SiLU via attention heads
One can apply SiLU to the residual stream with one attention head per dimension being SiLU’d. One uses the following Q and V matrices:
With this Q matrix, the $j$th row of $XQX^T$ will be of the form $[-x_{jk},\,-x_{jk},\,\ldots,\,-x_{jk},\,2\Omega - x_{jk},\,-x_{jk},\,\ldots,\,-x_{jk},\,2\Omega]$, where $k$ is the dimension being SiLU’d, and the $2\Omega$s are in the $j$th entry and the final entry. Then, after applying the softmax to this row, the row becomes $[0,\,0,\,\ldots,\,0,\,1-\sigma(x_{jk}),\,0,\,\ldots,\,0,\,\sigma(x_{jk})]$ (to within error). That is, every vector attends only to itself and the bias vector.
By our choice of V, the influence of a vector is the negative of its entry in the $k$th position. Thus the $jk$th entry of $\mathrm{softmax}[XQX^T]\,XV$ is $-x_{jk}(1-\sigma(x_{jk}))$, so after adding to the residual stream, one gets that the $jk$th entry of $X + \mathrm{softmax}[XQX^T]\,XV$ is $x_{jk}\sigma(x_{jk}) = \mathrm{SiLU}(x_{jk})$, as desired.
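Here is a toy numerical check of that argument for a single position $j$ and dimension $k$. The pre-softmax scores are written out by hand rather than produced by an actual Q matrix; see the linked repo for the full construction:

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    return x * sigma(x)

omega = 1000.0           # the large constant Omega
n = 8                    # number of non-bias positions in the context
j, x_jk = 3, 1.7         # query position and its k-th entry (arbitrary values)

scores = np.full(n + 1, -x_jk)   # every column picks up the -x_jk term
scores[j] += 2 * omega           # self-position score: 2*Omega - x_jk
scores[-1] = 2 * omega           # the appended bias vector's score: 2*Omega

weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Essentially all the mass lands on position j and the bias position.
assert np.isclose(weights[j], 1 - sigma(x_jk), atol=1e-10)
assert np.isclose(weights[-1], sigma(x_jk), atol=1e-10)

# The V matrix makes position j contribute -x_jk (and the bias vector 0), so
# the skip connection turns x_jk into SiLU(x_jk):
out = x_jk + weights[j] * (-x_jk)
assert np.isclose(out, silu(x_jk), atol=1e-10)
```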
Vector-Wise Linear Transformations via Attention Heads
By putting such large weights in the self-positional-encoding matrices, a vector attends entirely to itself. Thus the output of the attention head is entirely the result of the V matrix, which can contain the arbitrary linear transformation of the feedforward network. Additional comments:
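As a rough numerical illustration of this attend-entirely-to-yourself behavior (a sketch only: the pre-softmax scores get $2\Omega$ added on the diagonal directly, rather than being built from the actual self-positional-encoding matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, omega = 6, 4, 1000.0
X = rng.normal(size=(N, D))
V = rng.normal(size=(D, D))          # carries the arbitrary linear map

# Pretend the self-positional-encoding terms contribute 2*omega on the
# diagonal of the score matrix; everything else stays order-1.
scores = rng.normal(size=(N, N)) + 2 * omega * np.eye(N)

weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# Each row of weights is numerically a one-hot on itself, so the head
# output is just X V, i.e. the desired linear transformation.
assert np.allclose(weights, np.eye(N), atol=1e-12)
assert np.allclose(weights @ X @ V, X @ V, atol=1e-10)
```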
Tweaking the Original Attention Heads to Preserve Their Behavior
The addition of the new vector used for the activation function could potentially change the attention patterns of the preexisting attention heads, which would change the behavior of the network. However, we can slightly tweak the attention matrices in a normal attention head to prevent this issue:
With the attention matrix augmented in this way, the bias vector strongly avoids attending to the non-bias vectors and strongly attends to itself, and the non-bias vectors are prevented from attending to the bias vector.
Demonstration Code
I’ve put Python code implementing this technique on github. Each of the three components (SiLU, linear transformations, normal attention) is implemented both directly and with attention heads. They are tested on random matrices with $N = 20$ and $D = 30$, and the largest error entries in each matrix are on the order of $10^{-14}$. I have not tested how such errors propagate through multiple layers.
Conclusion
[1] One can also approximate ReLU with this technique, since $\mathrm{SiLU}(kx)/k \to \mathrm{ReLU}(x)$ as $k \to \infty$. AIAYN uses ReLU, but GPT-3 uses GeLU.
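(A quick numerical check of this limit, for a handful of arbitrary scales $k$:)

```python
import numpy as np

# SiLU(k*x)/k = x*sigmoid(k*x), which approaches ReLU(x) as k grows.
silu = lambda z: z / (1 + np.exp(-z))
x = np.linspace(-3, 3, 13)
for k in (1, 10, 100):
    print(k, np.max(np.abs(silu(k * x) / k - np.maximum(x, 0))))
```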
[2] For implementation purposes, these matrices are usually learned as low-rank factorizations, with $W_{QK} = W_Q W_K^T$ and a similar expression for $W_{OV}$. However, it’s easier to construct the desired properties if we treat them in their full form. We will ignore rank restrictions except in the concluding comments.
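(For example, a small sketch of the factorized form and the rank restriction it implies, with an arbitrary $d_{\text{head}}$:)

```python
import numpy as np

rng = np.random.default_rng(2)
D, d_head = 64, 8

# Low-rank factorization learned in practice...
W_Q = rng.normal(size=(D, d_head))
W_K = rng.normal(size=(D, d_head))
W_QK = W_Q @ W_K.T               # ...versus the full D-by-D form used in this post.

print(W_QK.shape, np.linalg.matrix_rank(W_QK))   # (64, 64), rank 8
```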