This is a write-up of some rough work I did on analysing duplicate token neurons. It's not as thorough or well-written as I would like it to be, but hopefully, the explanation is at least somewhat legible.
I examine duplicate token heads and the neurons they influence. I extract circuits the model uses to compute these neurons.
I discuss a neuron which activates on duplicate tokens, provided these duplicate tokens occur sufficiently far back in the sequence from the current token. I feel these neurons could potentially be useful in downstream induction tasks.
In a future post, I will build on the circuits discussed here to show how duplicate token heads can encode relative token information in pairs of near-identical sinusoidal positional neurons analogously to twisted pair encoding.
Duplicate token heads:
A duplicate token head is an attention head that devotes almost all of its attention to tokens in the sequence that are identical to the current token. There are three heads in the first layer of GPT-2 Small that exhibit duplicate token behaviour (a rough way of measuring this is sketched below the list):
Head 0.1
Attends to nearby tokens which are identical to the current token, with an exponential decay in its attention as repeated tokens get further away.
Head 0.5
Seems to attend roughly uniformly to identical tokens in the sequence, but with a small decay over distance. However, it behaves differently on certain common tokens, like '.', ',', and ' and'; these tokens seem to be grouped together.
Head 0.10
About 25% of its attention is dedicated to repeat-token behaviour. It mimics head 0.1 in that its attention to repeated tokens decays exponentially with distance, so it only attends to nearby repeats.
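As a minimal sketch of the kind of measurement behind these observations (not the exact code used for the post), one can run GPT-2 Small on text containing repeated tokens and check how much of each layer-0 head's attention goes to earlier copies of the current token. The example text is arbitrary, and I assume TransformerLens throughout these snippets.

```python
# Sketch: for each layer-0 head, measure the average attention paid to earlier
# copies of the current (query) token.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

text = "The cat sat on the mat. The cat sat on the mat again, said the cat."
tokens = model.to_tokens(text)                      # (1, seq_len)
_, cache = model.run_with_cache(tokens)

pattern = cache["pattern", 0][0]                    # (n_heads, query_pos, key_pos)
toks = tokens[0]
pos = torch.arange(toks.shape[0])

# Mask of (query, key) pairs where the key is an earlier copy of the query token.
earlier_dup = (toks[:, None] == toks[None, :]) & (pos[None, :] < pos[:, None])

for head in range(model.cfg.n_heads):
    dup_attn = (pattern[head] * earlier_dup).sum(-1)  # attention mass on earlier duplicates
    has_dup = earlier_dup.any(-1)                     # only score queries that have a duplicate
    print(f"head 0.{head}: {dup_attn[has_dup].mean().item():.2f}")
```

Heads 0.1, 0.5, and 0.10 should stand out on text like this, with the caveat that a single short prompt is only a rough probe.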
To analyse these heads, first fold in the layer norm, then center the means of $W_E$ and $W_{\text{pos}}$. The layer norm scale for the token $\text{tok}$ at position $i$ is:
$$\text{lnscale} = \frac{\sqrt{d_{\text{model}}}}{\left|W_E[\text{tok}] + W_{\text{pos}}[i]\right|}$$
$W_E[\text{tok}]$ and $W_{\text{pos}}[i]$ have a cosine similarity of ~0.05 on average, so we can approximate $\left|W_E[\text{tok}] + W_{\text{pos}}[i]\right|$ by $\sqrt{\left|W_E[\text{tok}]\right|^2 + \left|W_{\text{pos}}[i]\right|^2}$.
For $i > 100$, $\left|W_{\text{pos}}[i]\right|$ is about 3.35, so I use this as an approximation.
So our layer norm scale is approximately $\frac{\sqrt{d_{\text{model}}}}{\sqrt{\left|W_E[\text{tok}]\right|^2 + 3.35^2}}$.
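These approximations are easy to sanity-check against the (mean-centred) GPT-2 Small weights. The sketch below does this; the sampled token/position pairs and the position 500 used for the comparison are arbitrary choices of mine.

```python
# Sketch: check the cosine-similarity and norm approximations behind the
# layer-norm scale. W_E and W_pos are mean-centred here (harmless if
# from_pretrained has already centred them).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
W_E = model.W_E - model.W_E.mean(dim=-1, keepdim=True)        # (d_vocab, d_model)
W_pos = model.W_pos - model.W_pos.mean(dim=-1, keepdim=True)  # (n_ctx, d_model)

# Cosine similarity between token and positional embeddings (sampled pairs).
toks = torch.randint(0, model.cfg.d_vocab, (1000,))
poss = torch.randint(0, model.cfg.n_ctx, (1000,))
cos = torch.nn.functional.cosine_similarity(W_E[toks], W_pos[poss], dim=-1)
print("mean |cos sim|:", cos.abs().mean().item())              # text says ~0.05

# Norm of positional embeddings for i > 100.
print("mean |W_pos[i]|, i>100:", W_pos[101:].norm(dim=-1).mean().item())  # text says ~3.35

# Approximate vs exact layer-norm scale for one (token, position) pair.
d_model = model.cfg.d_model
tok, i = toks[0].item(), 500
exact = d_model ** 0.5 / (W_E[tok] + W_pos[i]).norm()
approx = d_model ** 0.5 / (W_E[tok].norm() ** 2 + 3.35 ** 2) ** 0.5
print(exact.item(), approx.item())
```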
Then the output from a duplicate token head with a repeated token $\text{tok}$ at positions $i_1, i_2, \dots, i_n$ is approximately $\text{lnscale}[\text{tok}] \sum_{j=1}^{n} w_j\left(W_E[\text{tok}] + W_{\text{pos}}[i_j]\right)$,
where $w_j$ is the attention paid to position $i_j$. If we assume that $\sum_{j=1}^{n} w_j = 1$, then we get an output of $\text{lnscale}[\text{tok}]\left(W_E[\text{tok}] + \sum_{j=1}^{n} w_j W_{\text{pos}}[i_j]\right)$.
So duplicate token heads like head 0.5 and 0.1 effectively act as a token embedding, together with some additional positional information.
Duplicate token neurons:
How does this positional information get used by the model? A starting point is to look at first-layer MLP neurons that use the positional information gathered by the duplicate token heads. I focus on head 5 in this post.
To understand which neurons use the positional information from the VO circuit of head 5, we first assume that the MLP layer norm is linear. This is somewhat justified because, in practice, the MLP layer-norm scale factor is pretty stable around 1.2. We then expect neurons which use this positional information to have a large average contribution from $W_{\text{pos}}\,VO_5\,\text{mlp}_{\text{in}}[:, \text{neuron}]$. I call this the PVO contribution of head 5.
Here is a graph of `torch.norm(W_pos @ VO_5 @ mlp_in, dim=0)`, where the 3072 first-layer MLP neurons have been rearranged into a 48x64 grid:
Looking at the norm alone is a crude way of evaluating the influence of head 5's VO circuit, but I just want an easy heuristic for finding interesting neurons; the subsequent analysis of those neurons will be more precise. For now, I don't mind if the list of interesting neurons is not exhaustive.
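Concretely, the heuristic looks something like the sketch below (variable names such as `VO_5` and `PVO` are my shorthand, not established notation):

```python
# Sketch of the norm heuristic: how strongly head 0.5's VO circuit can write
# positional information into each first-layer MLP neuron.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
W_pos = model.W_pos - model.W_pos.mean(dim=-1, keepdim=True)  # (n_ctx, d_model), mean-centred
VO_5 = model.W_V[0, 5] @ model.W_O[0, 5]                      # (d_model, d_model)
mlp_in = model.W_in[0]                                        # (d_model, d_mlp)

PVO = W_pos @ VO_5 @ mlp_in          # (n_ctx, d_mlp): PVO contribution of head 5, per position and neuron
norms = torch.norm(PVO, dim=0)       # one norm per neuron; norms.reshape(48, 64) gives the heatmap above

interesting = (norms > 30).nonzero().flatten().tolist()
print(interesting)                   # neuron 1168, discussed below, should be among these
```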
The neurons with a norm greater than 30 have PVO contribution graphs as follows:
These seem to be roughly linear, crossing zero at about position 500.
Suppose a token occurs at position 500 and at position 200. Then at position 500, head 5 will attend 50% to position 500 and 50% to position 200, so it will have a positional contribution of $$\text{lnscale}[\text{tok}]\cdot\text{mlplnscale}\cdot\frac{W_{\text{pos}}[500]\,VO_5\,\text{mlp}_{\text{in}}[:,\text{neuron}] + W_{\text{pos}}[200]\,VO_5\,\text{mlp}_{\text{in}}[:,\text{neuron}]}{2} = \frac{(\text{PVO})[500] + (\text{PVO})[200]}{2}$$ (here the PVO contribution absorbs the two layer-norm scale factors).
A non-duplicate token at position 500, however, will have a positional contribution of just (PVO contribution)[500].
So if (PVO contribution)[i] is decreasing as a function of i for a particular neuron, then a duplicate token will have a higher activation than a non-duplicate token at the same position.
However, this alone is insufficient to detect duplicate tokens, because a non-duplicate token at position 100 will have a higher PVO contribution than the duplicate token at position 500 with a repeat at position 200.
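Written out in code, the comparison above looks like the following, for a concrete neuron from the norm > 30 set (I use 1168, which the next section zooms in on). The layer-norm scale factors are dropped here, i.e. treated as constants.

```python
# The positions-500-and-200 example, for neuron 1168, ignoring the layer-norm
# scale factors.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
W_pos = model.W_pos - model.W_pos.mean(dim=-1, keepdim=True)
VO_5 = model.W_V[0, 5] @ model.W_O[0, 5]
pvo_curve = W_pos @ VO_5 @ model.W_in[0][:, 1168]   # (n_ctx,): (PVO contribution)[i] for neuron 1168

non_dup = pvo_curve[500]                            # non-duplicate token at position 500
dup = 0.5 * pvo_curve[500] + 0.5 * pvo_curve[200]   # 50/50 attention to positions 500 and 200
print(non_dup.item(), dup.item())                   # dup should be larger, since this curve slopes down
```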
To understand the neurons above better, we can zoom in on neuron 1168.
Here is a graph of how much each attention head contributes to the output of neuron 1168 throughout a verse from the Bible. We assume again that the MLP layer norm is linear, so that the contributions from each attention head can be considered separately.
Head 5 indeed has the downward slope we expect, but the other heads seem to cancel this downward slope out.
Head 3 devotes almost all of its attention to the previous few tokens in the sequence, so its positional contribution is mostly determined by the current position in the sequence.
In fact, most attention heads in the first layer attend just to the previous 30 tokens. The two exceptions are head 11 and head 5: head 11 attends close to uniformly to each token in the sequence, and head 5, of course, can attend to far-away tokens if they are duplicates of the current token.
So we can approximate the PVO contributions from heads other than heads 5 and 11 just by using $W_{\text{pos}}[\text{current position}]\,VO_{\text{head}}$. For head 11, using the mean PVO contribution over positions below the current index should be sufficient. We can also account for the direct circuit from $W_{\text{pos}}$.
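Here is a rough sketch of that approximation, under the same simplifications: short-range heads contribute via the current position, head 11 via the mean over the prefix, plus the direct path from $W_{\text{pos}}$; layer-norm scales and token embeddings are ignored, and the function name is mine.

```python
# Approximate the "PVO contribution from all components" at a given position,
# i.e. the positional baseline a non-duplicate token would receive. The
# duplicate-token bonus from head 5 is the extra sum discussed next.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
W_pos = model.W_pos - model.W_pos.mean(dim=-1, keepdim=True)
mlp_in = model.W_in[0]

def combined_pvo_contribution(pos: int, neuron: int) -> float:
    total = (W_pos[pos] @ mlp_in[:, neuron]).item()      # direct circuit from W_pos
    for head in range(model.cfg.n_heads):
        VO = model.W_V[0, head] @ model.W_O[0, head]
        pvo = W_pos @ VO @ mlp_in[:, neuron]             # (n_ctx,)
        if head == 11:
            total += pvo[: pos + 1].mean().item()        # ~uniform attention over the prefix
        else:
            total += pvo[pos].item()                     # short-range heads (and head 5 at i_n itself)
    return total

# The post's graph averages about -0.2 after the first ~200 tokens; exact values
# here will differ since the layer-norm scale factors are omitted.
print(combined_pvo_contribution(500, 1168))
```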
If a duplicate token head pays attention $w_j$ to duplicate tokens at positions $i_1, \dots, i_n$ when attending from position $i_n$, then the positional contribution to neuron 1168 at position $i_n$ will be approximately:
$$\sum_{j=1}^{n} w_j\,(\text{PVO contribution})_5[i_j] \;+\; (\text{PVO contribution from components other than head 5 at position } i_n)$$
Assuming that $\sum_{j=1}^{n} w_j = 1$, this is equal to:
$$\sum_{j=1}^{n} w_j\left((\text{PVO contribution})_5[i_j] - (\text{PVO contribution})_5[i_n]\right) \;+\; (\text{PVO contribution from all components at position } i_n)$$
Below is the graph of the combined PVO contributions from all the components:
You can see that after the first 200 tokens, the combined PVO contribution averages about -0.2.
This can be thought of as the positional contribution for non-duplicate tokens.
Duplicate tokens obtain an additional term of $\sum_{j=1}^{n} w_j\left((\text{PVO contribution})_5[i_j] - (\text{PVO contribution})_5[i_n]\right)$.
If we assume that $w_j = \frac{1}{n}$, as it approximately is for head 5, then the additional term corresponds to $\frac{1}{n}\sum_{j=1}^{n}\left((\text{PVO contribution})_5[i_j] - (\text{PVO contribution})_5[i_n]\right)$.
If $(\text{PVO contribution})_5$ is indeed linear, then the additional term is proportional to $\frac{1}{n}\sum_{j=1}^{n}(i_j - i_n)$.
So for neuron 1168, the further away the average duplicate of the current token is, the greater the positional contribution to the neuron.
If duplicate tokens are an average of 100 tokens away, they will lead to an increase of about 1.0 in the activation of the neuron. So neuron 1168 will only tend to activate after at least the first 100 tokens, as otherwise it can't overcome the PVO contribution barrier.
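As a rough check on that estimate, the boost can be read off the slope of head 5's PVO curve for neuron 1168, assuming the curve really is roughly linear and head 5 spreads its attention evenly over duplicates. The layer-norm scale factors are again omitted, so the exact number will differ from the ~1.0 quoted above.

```python
# Estimate the duplicate-token boost from the slope of head 5's PVO curve for
# neuron 1168 (unscaled by the layer norms).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
W_pos = model.W_pos - model.W_pos.mean(dim=-1, keepdim=True)
VO_5 = model.W_V[0, 5] @ model.W_O[0, 5]
pvo_curve = W_pos @ VO_5 @ model.W_in[0][:, 1168]            # (n_ctx,)

# Least-squares slope over positions 200..1023, the roughly linear region.
positions = torch.arange(200, model.cfg.n_ctx, dtype=torch.float32)
values = pvo_curve[200:]
slope = ((positions - positions.mean()) * (values - values.mean())).sum() \
        / ((positions - positions.mean()) ** 2).sum()

avg_distance = 100                                           # duplicates ~100 tokens back on average
boost = -slope * avg_distance                                # extra (unscaled) input from the duplicates
print(slope.item(), boost.item())
```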
Here are some of the top activations for neuron 1168 on Neuronpedia:
Common words that get repeated too often don't activate the neuron, because they occur too close to each other. It tends to activate on words which repeat more occasionally, or on key terms that might occur once per paragraph.
I haven't explored how the token embedding influences these neurons, so that could also play a part in their activations.
I also haven't looked at how these neurons get used / if they do at all.
If we did a circuits analysis on these duplicate token neurons, we would conclude that head 5, and potentially head 1, were the main contributors to them. A mean ablation which preserved the position of the final token of the input would render the positional contribution of all the other heads invisible. I don't know if this is desirable or not.
Future work:
Look at how these neurons, or the input directions associated with them, get used in downstream computations. I feel they could potentially be useful for induction.
Investigate the other duplicate token heads - I have made partial progress on this and it seems really interesting. Head 1 seems to attend to nearby duplicate tokens and encode more precise information about their relative position. Could potentially be useful for the IOI task.