Overview:
By decomposing the first-layer attention heads of GPT-2 small into position-dependent and content-dependent components, I identify a striking pattern: six attention heads share a common positional structure, attending to previous tokens with an exponential decay over approximately 50 tokens. Each of these heads also exhibits consistent statistical regularities in how it processes diverse inputs. These shared characteristics, both positional and statistical, allow us to unify the output-value (OV) circuits of all six heads into what I term a "contextual circuit."
I show that this unified circuit's behaviour can be approximated as constructing a bag-of-tokens representation of the previous 50 tokens. This finding places important constraints on how first-layer contextual features must be structured: features such as sentiment, language, or writing style must emerge from distinct distributions of tokens, resulting in characteristic "bags of tokens."
To validate the bag-of-tokens model, I use it to predict contextual neurons, first-layer MLP neurons that detect specific contextual features like "Spanish text" or "emotive language", without needing to run the model or have access to a large text corpus. I demonstrate that the approximation accurately predicts these neurons' activations when tested against positive and negative examples of their respective contextual features, providing strong evidence for this mechanistic understanding.
Many contextual neurons seem monosemantic, whereas others seem more complicated. A few example contextual neurons are shown in the Contextual Neurons section below.
Decomposition of First Layer Attention Patterns:
I assume familiarity with A Mathematical Framework for Transformer Circuits.
I approximate the layer norm as linear because its scale consistently falls within 0.15–0.175. I refine this approximation in the appendix. When referring to $W_E$ and $W_{pos}$, I take these to be the post-layer-norm approximations.
I write $E$ for the token embedding $W_E$, $Q$ for the query matrix $W_Q$, $K$ for the key matrix $W_K$, and $P$ for the positional embedding $W_{pos}$.
I simplify notation by concatenating letters to denote matrix multiplication, applying transpositions when appropriate. For instance, $EQKE$ refers to $W_E W_Q W_K^T W_E^T$.
Consider a sequence of tokens $x_1, x_2, x_3, \ldots, x_n$, where $n$ represents the current destination position.
For any position $i \le n$, the $i$th attention score measures the attention weight that $x_n$ places on $x_i$.
This score combines contributions from both token embeddings (E) and positional embeddings (P) through query and key matrices:
$$\text{attn\_score}[i] = \frac{EQKE[x_n, x_i] + EQKP[x_n, i] + PQKP[n, i] + PQKE[n, x_i]}{\sqrt{d_{value}}}$$
Technically, GPT-2 has bias terms on each of the queries and keys, and layer-norm has a weight and bias as well. These bias terms make the equations harder to read but don't cause any complications, so I omit them here.
The transformer applies a softmax operation to these attention scores, exponentiating each score for $i \le n$ and normalizing the resulting vector. The exponentiated attention score decomposes into two independent components:
$p_i$: depends exclusively on the token's position, $i$.
$f_{x_i}$: depends exclusively on the token's content, $x_i$.
This decomposition can be expressed as:
$$e^{\text{attn\_score}[i]} = e^{\frac{PQKP + PQKE + EQKE + EQKP}{\sqrt{d_{value}}}} = \underbrace{e^{\frac{EQKP[x_n, i] + PQKP[n, i]}{\sqrt{d_{value}}}}}_{p_i} \cdot \underbrace{e^{\frac{PQKE[n, x_i] + EQKE[x_n, x_i]}{\sqrt{d_{value}}}}}_{f_{x_i}}$$
I define the positional pattern as:
$$pos_i = \frac{p_i}{\sum_{j=1}^{n} p_j}$$
Given the destination token $x_n$, this pattern can be computed independently of the other sequence tokens, representing the softmax of the position-dependent components of the attention score.
The positional pattern can be viewed as telling you how much each position in the sequence gets weighted, independent of the content of the token at that position. It is the attention pattern you would get if every token in the sequence were identical.
The positional pattern is slightly influenced by the destination token $x_n$. However, for all attention heads in GPT-2 small's first layer, the overall character of the positional pattern remains consistent regardless of $x_n$.
The final softmax probabilities at position n are:
$$\text{soft\_prob}[i] = \frac{p_i f_{x_i}}{\sum_{j=1}^{n} p_j f_{x_j}} = \frac{pos_i f_{x_i}}{\sum_{j=1}^{n} pos_j f_{x_j}}, \quad \text{for } i \le n.$$
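To make the decomposition concrete, here is a minimal sketch (not the accompanying Colab) of how one could compute $p_i$, $f_{x_i}$, and the positional pattern for a single first-layer head, assuming TransformerLens-style weight tensors and treating layer norm as a single linear scale; the variable names and the 0.16 scale are my own stand-ins.

```python
# Minimal sketch: split one first-layer head's exponentiated attention scores into a
# position-dependent factor p_i and a content-dependent factor f_x, for a fixed
# destination token ' the'. Layer norm is folded into one linear scale (ln_scale),
# and the query/key biases are ignored, as in the text.
import numpy as np
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
HEAD, N = 0, 400                                   # head index and destination position n
ln_scale = 0.16                                    # rough layer-norm scale (0.15-0.175)

W_E   = model.W_E.detach().cpu().numpy()           # [d_vocab, d_model]
W_pos = model.W_pos.detach().cpu().numpy()         # [n_ctx, d_model]
W_Q   = model.W_Q[0, HEAD].detach().cpu().numpy()  # [d_model, d_head]
W_K   = model.W_K[0, HEAD].detach().cpu().numpy()  # [d_model, d_head]
d_head = W_Q.shape[1]

x_n = model.to_single_token(" the")
query = ln_scale * (W_E[x_n] + W_pos[N - 1]) @ W_Q                    # query for the fixed destination

p = np.exp((ln_scale * W_pos[:N] @ W_K) @ query / np.sqrt(d_head))    # p_i for each position i <= n
f = np.exp((ln_scale * W_E @ W_K) @ query / np.sqrt(d_head))          # f_x for each vocabulary token

pos_pattern = p / p.sum()                                             # pos_i = p_i / sum_j p_j
# For a concrete token sequence `toks` of length N, the head's attention pattern is then
# approximately: soft_prob = pos_pattern * f[toks]; soft_prob /= soft_prob.sum()
```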
Positional pattern visualization:
Below is the positional pattern of each of the first-layer attention heads for $n=400$, $x_n =$ ' the', and a sample attention pattern obtained from running the model on the Bible. Each attention pattern of length 400 has been reshaped into a 20x20 grid for visualization purposes.
[Figures: positional pattern and a sample Bible attention pattern for each of Heads 0–11.]
Behavioural Classification of First Layer Attention Heads:
Based on the positional patterns and observation of the $f_{x_i}$ values, we can classify the first-layer attention heads into distinct behavioural groups:
Detokenization heads (3, 4, 7): Attend mainly to the previous ~5 tokens; used for detecting known n-grams. Their positional pattern is mostly translation invariant, so that n-grams have consistent representations regardless of position.
Contextual attention heads (0, 2, 6, 8, 9, 10): Positional pattern that exponentially decays over ~50 tokens. $f_{x_i}$ tends to be fairly consistent across different destination tokens $x_n$. Head 10 has duplicate-token behaviour, but tends to attend fairly uniformly to non-duplicate tokens.
Duplicate token heads (1, 5): Attend almost entirely to duplicate copies of the current token $x_n$. Head 5 detects duplicate tokens close to uniformly; Head 1 only detects nearby duplicate tokens, so it can do more precise relative indexing.
Miscellaneous (11): Role uncertain.
Approximating Softmax Probabilities:
Continuing from $\text{soft\_prob}[i] = \frac{pos_i f_{x_i}}{\sum_{j=1}^{n} pos_j f_{x_j}}$, where $n$ is the destination position:
This expression is difficult to work with because the denominator of $\text{soft\_prob}[i]$ depends on all tokens $x_j$ in the sequence, rather than solely on $i$ and $x_i$.
For contextual attention heads, however, the empirical value of the denominator $\sum_{j=1}^{n} pos_j f_{x_j}$ depends mostly on the destination position $n$, rather than the particular $x_j$ values. When $n$ is fixed, the denominator tends to concentrate around the same value for a variety of input sequences.
Fix $x_n =$ ' the', assuming that the destination token doesn't affect the content-dependent component too much. Even if this assumption breaks down for certain contextual attention heads, it will at least tell us what happens when the destination token is a stopword.
For this fixed $x_n$, I plotted the denominator of the softmax for various texts as a function of the destination position $n$:
The normalisation factor initially decays because the <end-of-text> token has a large content-dependent component. This reduces the impact of the contextual attention heads for the first ~50 tokens of the sequence.
Of the contextual heads, heads 6 and 8 vary the most in normalisation factor across different contexts. A lot of the variation can be attributed to differences in the percentage of keywords in the text. Head 6 emphasises keywords, whereas Head 8 emphasises stopwords. The Bible has more newline characters than other texts because it is composed of verses, so it has a below-average normalisation factor for head 6, and an above-average normalisation factor for head 8.
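As a sanity check, here is a sketch of how such curves could be reproduced from the quantities in the earlier decomposition sketch, reusing the $p$ and $f$ arrays computed there (with the caveat that $p$ was computed for a fixed $n=400$, so it is only a rough stand-in for shorter prefixes).

```python
# Sketch: empirical softmax denominator sum_j pos_j * f_{x_j} as a function of the
# destination position n, for one head, reusing `p`, `f`, and `model` from above.
import numpy as np

def denominator_curve(token_ids, p, f):
    """token_ids: 1-D array of token ids; p: positional factors; f: per-token content factors."""
    curve = []
    for n in range(1, len(token_ids) + 1):
        pos = p[:n] / p[:n].sum()                  # positional pattern truncated to the prefix
        curve.append(float(pos @ f[token_ids[:n]]))
    return np.array(curve)

# token_ids = model.to_tokens(text)[0].cpu().numpy()   # `text` would be e.g. a Bible excerpt
# curve = denominator_curve(token_ids[:400], p, f)
```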
Independent model of normalisation factors:
For a fixed text, certain heads seem to have denominators which are a smoother function of n than others. What determines this?
As a simple model, imagine that each $x_j$ is independently drawn from some distribution with $P(x_j = i) = q_i$, with $x_n$ fixed. Then we have $\mathrm{Var}\left(\sum_{j=1}^{n} pos_j f_{x_j}\right) = \left(\sum_{j=1}^{n} pos_j^2\right)\mathrm{Var}(f_{x_i})$.
Because of the constraint that $\sum_{j=1}^{n} pos_j = 1$, $\sum_{j=1}^{n} pos_j^2$ becomes a measure of how spread out the positional pattern is. If an attention head pays $\frac{1}{m}$ attention to each of $m$ tokens, then $\sum_{j=1}^{n} pos_j^2 = \frac{1}{m}$. So, the more tokens a positional pattern is spread over, the lower the variance in its normalisation factor. The contextual attention heads average over enough tokens that the normalisation factor varies smoothly.
$\mathrm{Var}(f_{x_i})$ is determined by how much the content-dependent weighting varies between tokens. Head 6 and Head 9 have very similar positional patterns, but Head 6 emphasises keywords over stopwords, whereas Head 9 attends fairly evenly to tokens. Therefore Head 6 has a higher variance than Head 9.
Heads 3, 4, and 7 (the detokenization heads) all have high variance in their content-dependent component, and a positional pattern which isn't spread over many tokens. Hence the confetti-like appearance of their sample attention patterns. Of course, this doesn't mean these heads are particularly hard to analyse; it's just not appropriate to model them as having a constant normalisation factor for fixed $n$.
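The following toy simulation (synthetic numbers only, not model weights) illustrates the claim that $\mathrm{Var}\left(\sum_j pos_j f_{x_j}\right)$ scales with $\sum_j pos_j^2$: the wider the exponential decay, the smoother the normalisation factor.

```python
# Toy check: draw i.i.d. content factors and compare the spread of the normalisation
# factor for positional patterns with different decay scales.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 400, 2000
f_vocab = np.exp(rng.normal(0.0, 1.0, size=10_000))   # stand-in content-dependent factors

def exp_decay_pattern(scale):
    dist = np.arange(n)[::-1]                          # distance from the destination position
    w = np.exp(-dist / scale)
    return w / w.sum()

for scale in [2, 10, 50]:                              # ~ number of tokens the head averages over
    pos = exp_decay_pattern(scale)
    toks = rng.integers(0, len(f_vocab), size=(trials, n))
    denom = (pos * f_vocab[toks]).sum(axis=1)
    print(f"decay scale {scale:>3}: sum(pos^2) = {pos @ pos:.3f}, std(denominator) = {denom.std():.3f}")
```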
The argument is:
1.) Contextual attention heads produce an empirical average of content-dependent components.
2.) Because contextual attention heads average over many tokens, the empirical average in a particular context will be close to the global average in that context.
3.) The model has learnt to make the global average of the content-dependent component, across a variety of different contexts, basically independent of the particular context.
Note the model doesn't have to work particularly hard to get the third property. Completely random content-dependent components would work.
Arguably a lot of the trouble in reading off algorithms from the weights is that models often learn algorithms that rely upon statistical regularities like the softmax denominator, which are invisible at the weights level. You need to inject some information about the input distribution to be able to read off algorithms, even if it's just a couple inputs to allow you to find constants of the computation.
For toy models, we can analyse contextual attention heads by partitioning inputs based on their softmax denominator values. While computing all the elements of this partition would be exponentially expensive, we can use concentration inequalities to bound the size of each part of the partition, i.e. how likely inputs are to produce each range of denominator values. This lets us prove formal performance bounds without having to enumerate all possible inputs.
Contribution to Residual Stream from Contextual Attention Heads:
Once again, fix the destination token $x_n$ to be ' the'. Define $\text{norm}_h$ as the median of the empirical normalization factors for head $h$.
The EVO circuit of head h can be approximated as:
$$\sum_{i=1}^{n} \text{soft\_prob}^h[i]\, W_E[x_i](VO)^h = \sum_{i=1}^{n} W_E[x_i](VO)^h \frac{pos^h_i f^h_{x_i}}{\sum_{j=1}^{n} pos^h_j f^h_{x_j}} \approx \sum_{i=1}^{n} W_E[x_i](VO)^h \frac{pos^h_i f^h_{x_i}}{\text{norm}_h}$$
I approximate the contextual attention heads as each having the same positional pattern $pos_i$, as their true positional patterns are quite similar. Small differences in the positional pattern are unlikely to matter much, because the EVO output of head $h$ will concentrate around the expected value of $\frac{W_E[x_i](VO)^h f^h_{x_i}}{\text{norm}_h}$, regardless of the specific positional pattern.
Thus, the approximate output of the contextual attention heads can be expressed as:
$$\sum_{i=1}^{n} pos_i \sum_{h \in \{0,2,6,8,9,10\}} \frac{W_E[x_i](VO)^h f^h_{x_i}}{\text{norm}_h}$$
Because I fixed the destination token $x_n$, and $\text{norm}_h$ is a constant, $\sum_{h \in \{0,2,6,8,9,10\}} \frac{W_E[x_i](VO)^h f^h_{x_i}}{\text{norm}_h}$ is a function of just $x_i$ and $n$. I refer to this as $\text{output}[x_i]$.
The output to the residual stream from the contextual attention heads is then approximately $\sum_{i=1}^{n} pos_i\, \text{output}[x_i]$.
$\text{output}[x_i]$ can be interpreted as an extended embedding of the token $x_i$, optimized for contextual classification. The combined output of all the contextual attention heads to the residual stream is approximately an exponentially decaying average of these extended embeddings over the previous ~50 tokens.
Technically, $\text{output}[x_i]$ depends on the destination position $n$. It would be very strange if these extended embeddings changed significantly depending on $n$, although they might implement a correction term for layer norm, discussed in the appendix.
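A sketch of how these extended embeddings could be computed from the weights, assuming TransformerLens tensors and the per-head content factors from the decomposition sketch; `f_per_head` and `norm_per_head` are hypothetical inputs the caller would supply (the latter being the median empirical normalisation factors).

```python
# Sketch: extended embedding output[x] for every vocabulary token, i.e. the summed
# contextual-head OV outputs W_E[x] (VO)^h * f^h_x / norm_h, with layer norm again
# approximated by a fixed linear scale.
import numpy as np

CONTEXTUAL_HEADS = [0, 2, 6, 8, 9, 10]

def extended_embeddings(model, f_per_head, norm_per_head, ln_scale=0.16):
    """f_per_head[h]: [d_vocab] content factors for head h; norm_per_head[h]: median denominator."""
    W_E = model.W_E.detach().cpu().numpy()                 # [d_vocab, d_model]
    out = np.zeros_like(W_E)
    for h in CONTEXTUAL_HEADS:
        W_V = model.W_V[0, h].detach().cpu().numpy()       # [d_model, d_head]
        W_O = model.W_O[0, h].detach().cpu().numpy()       # [d_head, d_model]
        VO = W_V @ W_O                                     # [d_model, d_model]
        out += (ln_scale * W_E @ VO) * (f_per_head[h] / norm_per_head[h])[:, None]
    return out                                             # [d_vocab, d_model]
```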
If these extended embeddings were one-hot encodings of tokens, and the average were uniform instead of exponentially decaying, the output would be a bag-of-words, which is often used as input to classification algorithms like Naive Bayes.
In practice, the model only has 768 dimensions in its residual stream rather than 50257, which is the number of dimensions it would need to one-hot encode each token. This shouldn't cause too many issues for Naive Bayes classification, though, since contextual features are sparse. Nonetheless, it would be useful to know precisely how much this dimensionality restriction, and the fact that the average is an exponential decay rather than uniform, affects the model's ability to learn contextual features.
Contextual Neurons:
I define contextual neurons to be first-layer MLP neurons which have a significant component due to the EVO circuit of the contextual attention heads.
You can use the extended embedding approximation to find hundreds of different interesting contextual neurons by looking at the composition of the extended embeddings with the MLP input, after accounting for the MLP layer-norm. This is demonstrated in the accompanying Google Colab.
Below I show a couple of these contextual neurons to give an idea of the contexts the model finds important to represent. These contextual neurons are just reading off directions which have been constructed by the contextual attention heads, and in general there's no reason to expect those directions to align with the neuron basis. Some of the contextual neurons seem to be monosemantic, whereas others seem complex to understand.
Each contextual neuron has an associated token contribution vector, listing the composition of each token's extended embedding with the MLP neuron's input. The way to interpret this is that the transformer adds up an exponentially decaying average of these token contributions over the sequence. Positive token contributions update the neuron towards firing, and negative token contributions update the neuron against firing. Neurons might have an initial bias so that they only fire if the positive contributions exceed a certain threshold.
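A sketch of how a neuron's token contribution vector could be read off, assuming the `extended_embeddings` output from the sketch above and treating the MLP layer norm as another fixed linear scale (an assumption).

```python
# Sketch: token contribution vector for one first-layer MLP neuron, i.e. the composition
# of each token's extended embedding with that neuron's input weights.
import numpy as np

def token_contributions(model, ext_emb, neuron, mlp_ln_scale=0.16):
    w_in = model.W_in[0, :, neuron].detach().cpu().numpy()   # [d_model] neuron input weights
    return mlp_ln_scale * ext_emb @ w_in                     # [d_vocab] contribution per token

def show_extremes(model, contrib, k=10):
    order = np.argsort(contrib)
    top = [model.tokenizer.decode([int(t)]) for t in order[-k:][::-1]]
    bottom = [model.tokenizer.decode([int(t)]) for t in order[:k]]
    print("Top positive contributions:", top)
    print("Bottom negative contributions:", bottom)

# e.g. show_extremes(model, token_contributions(model, ext_emb, neuron=300))
```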
Britain vs America (Neuron 300):
Top positive contributions:
Bottom negative contributions:
There is another neuron for detecting British contexts, neuron 704, which has an output in the opposite direction. Plausibly they cancel each other out to avoid their total output getting too large.
19-20th century conflict? (Neuron 1621):
Top positive contributions:
The years (1917, 1918, 1942, 1943, 1944) would indicate this is related to WW1/WW2, but 'Sherman' was a general during the American Civil War.
Bottom negative contributions:
These token contributions feel like a 'War' latent minus a '21st century' latent.
If this were the case, the top and bottom contributions would be misleading, and wouldn't necessarily inform you about what is going on with the median contributions.
It would be useful if there was a way to automatically decompose the token contributions vector of a neuron into separate sub-latents. Potentially some sort of SVD technique could be helpful here.
Evaluating the approximation of contextual neurons:
Now that we have examples of neurons which we think are detecting certain specialised contexts, we can use these neurons to evaluate the above approximations.
The above gives an approximation for the EVO circuit of contextual attention heads. To estimate the normalization factor, I use a fixed control text with an average number of keywords. I approximate the PVO contribution using each head's positional pattern. For the other attention heads, I approximate the EVO circuit as attending solely to the current token. I calculate the E and P circuits directly.
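Putting the pieces together, here is a rough sketch of how such a prediction could be assembled for one neuron on one prefix, using the positional pattern and token contributions from the earlier sketches; `direct_term` is a hypothetical constant standing in for the E, P, PVO, and non-contextual-head pieces described above.

```python
# Sketch: predicted pre-activation of a contextual neuron on a prefix, as the
# positional-pattern-weighted (roughly exponentially decaying) average of token
# contributions plus a constant for the remaining circuits.
import numpy as np

def predicted_preactivation(token_ids, contrib, pos_pattern, direct_term=0.0, bias=0.0):
    """token_ids: prefix of token ids, last entry is the destination token."""
    m = len(token_ids)
    pos = pos_pattern[-m:] / pos_pattern[-m:].sum()      # reuse the fixed-n pattern, renormalised
    return float(pos @ contrib[token_ids] + direct_term + bias)

# preds = [predicted_preactivation(token_ids[:m], contrib, pos_pattern)
#          for m in range(1, len(token_ids) + 1)]
```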
It's easiest to show a variety of these approximations in the Google Colab, but here is a typical example, for neuron 1710, which fires on religious texts:
The approximation does tend to capture the overall shape of the graphs, but the approximations at individual points aren't very good. There is noise of about ±0.5.
The approximation tends to agree on the decision boundaries of neurons, at least, but it wouldn't satisfy someone looking for $\ell_2$-distance bounds.
Assuming there is no error in the non-contextual heads or the second layer-norm approximation, the main source of error would come from the $x_n =$ ' the' approximation. Most alternative $x_n$ seem to be consistent in normalization factor across a wide variety of contexts, even when the context in question is semantically related to $x_n$. This would make our 'bag of words' a function of $x_n$, at which point the claim about a 'bag of words' is that it doesn't vary too much with $x_n$.
We'd need to know more about future layers to know how significant the error term is for understanding the broad behaviour of the model. For instance, future layers might use an averaging head to remove the noise from the neuron activations, at which point this approximation would be a good substitute. But if future layers make significant use of the exact neuron activations, this would require us to understand what's going on far better.
Regardless, validating that this mechanism makes more or less the same decisions as the model is exciting, because it feels like I have learnt, at a high level, what the model is doing, even if the particulars would require a lengthier explanation.
Rotary Embeddings:
Rotary embeddings can implement contextual attention heads by having queries and keys that mainly use the lower frequencies, as discussed in Round and Round We Go! What Makes Rotary Positional Encodings Useful?. At sufficiently low frequencies, the positional dependence mostly drops out, which is what you want for an attention head that summarizes information.
Contextual attention heads are exactly what rotary embeddings are designed to make easy to construct. It seems difficult to get as neat a decomposition as when the positional embeddings lie in the residual stream, though.
Further Work:
Are any of the contextual neurons shown above actually monosemantic? Is there a way to use the token contribution vector associated with a neuron to decompose its activations into sub-latents, or to find unrelated contexts? Is there a universality to the contextual neurons which models learn when trained on the same dataset?
Is there a way to extend the notion of a positional pattern to later layers? The existence of previous-token heads in later layers indicates this is at least worth looking for, and empirically many attention heads in later layers seem to have quite well-defined positional patterns. Positional information interacting with contextual information in a non-additive way could be a barrier to this, but the existence of neurons which are mostly positional is a promising sign.
Looking at large entries of the EQKE matrix of the 'de-tokenization' heads 3, 4, and 7 seems promising for finding 'known bigrams', but how large do these entries need to be for the model to distinguish a particular bigram? Do different detokenization heads specialise in distinct sorts of bigrams? You can imagine that the model might want to treat the bigram (' al', 'paca') differently from (' Barack', ' Obama').
Is it possible to refine the approximation for the normalisation factor? Is it mostly a function of keyword density, or is there more to it? Is the model doing something more sophisticated that is lost in treating the normalisation factor as a constant (almost certainly)?
The assumption that the destination token $x_n$ doesn't affect the approximation too much needs more investigation. I have ignored pretty much all the internal structure of these heads, so there is lots of important information missing from this approximation.
Is there interesting developmental interpretability that can be done just by tracking positional patterns throughout training?
Here I didn't investigate the structure of any of the VO matrices. Certain heads seem to be specialising in different things, and you'd expect corresponding structure in their VO matrices. Understanding this structure would clarify the situations in which you should be more or less worried about the $\text{norm}_h$ approximation being inaccurate. Do certain VO matrices cancel each other out? How does this interact with layer norm? It would be interesting to look at how the VO matrices for Head 2 and Head 0 get used, since these seem to be the heads which vary the most with the destination token $x_n$.
Acknowledgements:
This post benefitted significantly from discussion with and feedback from Euan Ong and Jason Gross.
Appendix:
Analysis of Attention Patterns with Layer Normalisation:
I analyse how layer normalisation affects attention patterns by decomposing the attention weights into positional and content-dependent components. For tractability, fix a destination token $x_n =$ ' the' and examine how attention to previous tokens varies.
Let $Q_{x_n}$ denote the query vector derived from the post-layer-norm embedding of $x_n$, which we can compute exactly. The attention weight to a previous token $x_i$ can be decomposed into:
$$f_{x_i} = e^{\frac{Q_{x_n} W_K^T \sqrt{d_{model}}\; W_E[x_i]}{8\,|W_E[x_i] + W_{pos}[i]|}}$$
where $W_E[x_i]$ represents the token embedding and $W_{pos}[i]$ the positional embedding, and
$$p_i = e^{\frac{Q_{x_n} W_K^T \sqrt{d_{model}}\; W_{pos}[i]}{8\,|W_E[x_i] + W_{pos}[i]|}}$$
Here $8 = \sqrt{d_{head}}$ is the attention-score scaling, and the $\sqrt{d_{model}}/|W_E[x_i] + W_{pos}[i]|$ factor comes from layer norm.
Empirically, token and positional embeddings are approximately orthogonal. This allows us to approximate:
$$|W_E[x_i] + W_{pos}[i]| \approx \sqrt{|W_E[x_i]|^2 + |W_{pos}[i]|^2}$$
$|W_{pos}[i]| \approx 3.35$ for $i > 100$, and the positional pattern only attends to roughly the previous 50 tokens, so for the positions that matter $|W_{pos}[i]| \approx |W_{pos}[n]|$. We can therefore approximate the content-dependent component $f_{x_i}$ by:
$$f_{x_i} = e^{\frac{Q_{x_n} W_K^T \sqrt{d_{model}}\; W_E[x_i]}{8\sqrt{|W_E[x_i]|^2 + |W_{pos}[n]|^2}}}$$
This approximation makes $f_{x_i}$ depend only on $x_i$ and $n$.
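A quick numerical sanity check of this norm approximation (a sketch assuming TransformerLens weights; the sample sizes are arbitrary):

```python
# Check |W_E[x] + W_pos[i]| ~= sqrt(|W_E[x]|^2 + |W_pos[i]|^2) for random tokens and
# positions with i > 100, where the positional norms have flattened out.
import numpy as np

def check_norm_approx(model, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    W_E = model.W_E.detach().cpu().numpy()
    W_pos = model.W_pos.detach().cpu().numpy()
    toks = rng.integers(0, W_E.shape[0], n_samples)
    poss = rng.integers(100, W_pos.shape[0], n_samples)
    true = np.linalg.norm(W_E[toks] + W_pos[poss], axis=1)
    approx = np.sqrt(np.linalg.norm(W_E[toks], axis=1) ** 2 + np.linalg.norm(W_pos[poss], axis=1) ** 2)
    print("median relative error:", float(np.median(np.abs(true - approx) / true)))
```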
Positional component analysis:
The positional component of attention is given by:
$$p_i = e^{\frac{Q_{x_n} W_K^T \sqrt{d_{model}}\; W_{pos}[i]}{8\,|W_E[x_i] + W_{pos}[i]|}}$$
Once again, I construct a probabilistic model of what will occur.
To analyse this, introduce a mean approximation by averaging over token embeddings according to some distribution $q$:
$$\hat{p}_i = e^{\frac{Q_{x_n} W_K^T \sqrt{d_{model}}\; W_{pos}[i]}{8\,\mathbb{E}_q[|W_E + W_{pos}[i]|]}}$$
The relationship between the actual and mean components can be written as:
$$p_i = \hat{p}_i^{\;\mathbb{E}_q[|W_E + W_{pos}[i]|]\,/\,|W_E[x_i] + W_{pos}[i]|}$$
Using a Taylor expansion around the mean component:
$$p_i = \hat{p}_i\left(1 + \ln(\hat{p}_i)\Delta_i + \frac{\ln(\hat{p}_i)^2 \Delta_i^2}{2} + \dots\right)$$
where:
$$\Delta_i = \frac{\mathbb{E}_q[|W_E + W_{pos}[i]|] - |W_E[x_i] + W_{pos}[i]|}{|W_E[x_i] + W_{pos}[i]|}$$
$\Delta_i$ represents the relative deviation of the embedding norm from its mean over tokens.
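Spelling out where this expansion comes from, using the definitions of $\hat{p}_i$ and $\Delta_i$ above:
$$p_i = \hat{p}_i^{\;\mathbb{E}_q[|W_E + W_{pos}[i]|]\,/\,|W_E[x_i] + W_{pos}[i]|} = \hat{p}_i^{\,1 + \Delta_i} = \hat{p}_i\, e^{\Delta_i \ln \hat{p}_i} = \hat{p}_i\left(1 + \ln(\hat{p}_i)\Delta_i + \frac{\ln(\hat{p}_i)^2 \Delta_i^2}{2} + \dots\right)$$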
As before, define $pos_i = \frac{p_i}{\sum_{j=1}^{n} p_j}$. Unlike before, $p_i$ now depends on $x_i$, and $pos_i$ depends on all tokens in the sequence.
To handle this, we use a similar argument to the one for the normalisation factor:
$$\sum_{j=1}^{n} p_j = \sum_{j=1}^{n} \hat{p}_j\left(1 + \ln(\hat{p}_j)\Delta_j + \frac{\ln(\hat{p}_j)^2 \Delta_j^2}{2} + \dots\right), \quad \text{where } \ln \hat{p}_j = \frac{Q_{x_n} W_K^T \sqrt{d_{model}}\; W_{pos}[j]}{8\,\mathbb{E}_q[|W_E + W_{pos}[j]|]}$$
Importantly, $\ln \hat{p}_n$ varies significantly with $n$. For $n = 500$, this term is about 0, whereas for $n = 1000$, it's about 5. This indicates the model is more sensitive to layer normalization further from the centre of the sequence; at the centre, there is very little distortion from layer norm.
In any case, $|\ln \hat{p}_i|$ is bounded given $n$, and is at most 5.
Then, to first order:
$$\frac{\sum_{j=1}^{n} p_j}{\sum_{j=1}^{n} \hat{p}_j} \approx 1 + \sum_{j=1}^{n} \widehat{pos}_j \ln(\hat{p}_j)\Delta_j, \quad \text{where } \widehat{pos}_j = \frac{\hat{p}_j}{\sum_{k=1}^{n} \hat{p}_k}$$
And $\widehat{pos}_j$ depends only on the destination token $x_n$ and the position, not on the other tokens in the sequence.
$\Delta_i$ is not very large in practice, at most ±15%.
Using the same argument as in the main text with an independent model:
$$\mathrm{Var}\left(\sum_{j=1}^{n} \widehat{pos}_j \ln(\hat{p}_j)\Delta_j\right) = \sum_{j=1}^{n} \widehat{pos}_j^2 \ln(\hat{p}_j)^2\, \mathrm{Var}(\Delta_j)$$
We can argue that this variance will be tiny for contextual attention heads, bounding $\mathrm{Var}(\Delta_j)$ using our bound on $\Delta_j$.
We can do the same for the higher-order terms of the Taylor expansion, but it's probably unnecessary here since $\Delta_j$ is quite small.
Therefore I approximate:
$$\frac{\sum_{j=1}^{n} p_j}{\sum_{j=1}^{n} \hat{p}_j} \approx C$$
This gives us:
$$pos_i = \frac{p_i}{\hat{p}_i}\cdot\frac{\hat{p}_i}{\sum_{j=1}^{n} \hat{p}_j}\cdot\frac{\sum_{j=1}^{n} \hat{p}_j}{\sum_{j=1}^{n} p_j} \approx \left(1 + \ln(\hat{p}_i)\Delta_i + \dots\right)\frac{\widehat{pos}_i}{C}$$
Let's say we restrict our sequences to ones where this approximation holds up to a factor of $(1+\epsilon)$; by the variance bounds above, it does so with high probability. Then
$$\sum_{j=1}^{n} pos_j f_{x_j} \approx \frac{1}{C}\sum_{j=1}^{n} f_{x_j}\,\widehat{pos}_j\left(1 + \ln(\hat{p}_j)\Delta_j + \dots\right)$$
where the approximation holds up to a factor of $(1+\epsilon)$.
Now we can use the same variance argument from earlier on the terms of this sum. We can calculate the variance up to some small tolerance assuming independence of the $x_j$, because the $(1+\epsilon)$ approximation holds with sufficiently high probability.
So we can argue that $\sum_{j=1}^{n} pos_j f_{x_j}$ will concentrate around a constant, or we can observe this empirically. Call this constant $\text{norm}_h$, noting that it will differ from the constant in the main text.
Then we can approximate
$$\text{soft\_prob}[i] = \frac{pos_i f_{x_i}}{\sum_{j=1}^{n} pos_j f_{x_j}} \approx \frac{\widehat{pos}_i\left(1 + \ln(\hat{p}_i)\Delta_i + \dots\right) f_{x_i}}{\text{norm}_h}$$
where we can approximate $\Delta_i$ in terms of $x_i$ alone, because $|W_{pos}[i]| \approx |W_{pos}[n]|$ for the $i$ where $\widehat{pos}_i$ is non-negligible. We can then approximate $\ln \hat{p}_i \approx \ln \hat{p}_n$, which gives a correction term when $n$ is quite large or quite small.
We have now obtained an approximation that partially takes layer norm into account while still allowing a decomposition into position-dependent and content-dependent terms.
Notably, for $n$ near 500, $\ln \hat{p}_n$ is close to 0, so the approximation given in the main text works well.
Second layer attention patterns:
These are sample attention patterns on the Bible for the second layer, with n=1022:
To a large extent, the attention heads in the second layer are far better behaved than those in the first layer. There are lots of attention heads which seem almost entirely positional, with barely any content-dependent component.
These positional attention heads likely clean up noise from the first-layer contextual attention heads arising from variation in $x_n$. They could also construct bags of bigrams by composing with the post-MLP detokenization-head outputs, or compute how repetitive a text is by composing with duplicate-token neurons. In general, primarily positional heads can be viewed as constructing bags-of-representations out of any representations constructed by previous layers.
Contextual features:
The bag-of-tokens structure constrains how models can represent contextual features in the first-layer residual stream. Excluding Head 11, which behaves similarly to the contextual attention heads, only the contextual attention heads have access to tokens occurring more than ~7 tokens away.
Whatever a contextual feature is, if we assume the bag-of-tokens approximation holds up, we must be able to understand first-layer contextual features through this lens.
For instance, 'Maths' and 'Topology' will naturally have a large overlap in their token distributions. So by default we should expect them to lie close together in activation space, because their bags of tokens will overlap, assuming models don't give topology tokens abnormally large extended embeddings. Models generally benefit from mapping similar token distributions to similar next-token distributions, so there's no immediate reason for models to want large extended embeddings.
Bags of tokens are mostly translation invariant, so contextual features will also tend to be translation invariant. However, for the first ~100 tokens, models attend disproportionately to the <end-of-text> token, so all contextual features will be partially ablated.
Arguably this ablation immediately implies the existence of weak linear representations. Assume there is a cluster of bag-of-tokens activations corresponding to 'Math'. If the model pays 10% of its attention to the <end-of-text> token, it will have 90% 'Math' in its residual stream, whereas if it pays 50% to <end-of-text>, it will have 50% 'Math'. So models will naturally want to handle a continuous range of activation strengths.