I want to generally encourage this kind of experiment-and-publish-quickly project. This might require a post of its own, but as someone with a background in both hacking and entrepreneurship, this kind of quick feedback loop is, in my opinion, an incredible strength of both, and I hope it can be used to accelerate scientific progress, which is exactly what we need in alignment.
This post was written in a rush and represents a few hours of research on a thing I was curious about, and is an exercise in being less of a perfectionist. I'd love to see someone build on this work! Thanks a lot to Wes Gurnee for pairing with me on this.
Tokens are weird, man
Introduction
A particularly notable observation in Interpretability in the Wild is that part of the studied circuit moves around information about whether the indirect object of the sentence is the first or second name in the sentence. The natural guess is that heads are moving around the absolute position of the correct name. But even in prompt formats where the first and second names are in different absolute positions, they find that the information conveyed by these heads is exactly the same, and can be patched between prompt templates! (Credit to Alexandre Variengien for making this point to me.)
This raises the possibility that the model has learned what I call emergent positional embeddings - rather than representing "this is the token in position 5" it may represent "this token is the second name in the sentence" or "this token is the fourth word in the sentence" or "this is the third sentence in the paragraph" etc. Intuitively, models will often want to do things like attend to the previous word, or the corresponding word in the previous sentence, etc - there are lots of things they will plausibly want to do that are natural in some emergent coordinate scheme but unnatural in the actual token coordinate scheme.
I was curious about this, and spent an afternoon poking around with Wes Gurnee at whether I could convince myself that these emergent positional embeddings were a thing. This post is an experiment: I'm speedrunning a rough write-up of a few hours of hacky experiments, because this seemed more interesting to write up than not, and I was never going to do the high-effort version. Please take all this with a mountain of salt, and I'd love to see anyone build on my incredibly rough results - code here.
Experiments
You can see some terrible code for these experiments here. See the Appendix for technical details.
I wanted to come up with the dumbest experiment I could that could shed light on whether this was a thing. One thing that models should really care about is the ability to attend to tokens in the previous word. Words can commonly range from 1 to 3 tokens (and maybe much longer for rare or misspelt words) so this is naturally done with an emergent scheme saying which word a token is part of.
My experiment: I took prompts with a fixed prefix of 19 tokens and then seven random lowercase English words of varying token length, for example:
token|izer| help| apple| dram|at|isation| architecture| sick| al|p|aca
I ran GPT-2 Small on this, took the residual stream after layer 3 (33% of the way through the model) and then trained a logistic regression probe on the residual stream of the token at the end of each word to predict which word it was in.
This is the key plot, though it takes a bit of time to get your head around. The x axis is the absolute position of the token in the prompt and the row is the ground truth of the word index. The bar for each absolute position and row shows the distribution of guesses given on the probe validation set. The colours correspond to the seven possible indices (note that the legend is not in numerical order, sigh).
For example: take the third bar in the second row (index=1, abs_pos=22). This is mostly red (index = 1, correct!), with a bit of blue at the bottom (index = 0, incorrect) and a bit of green at the top (index = 2, incorrect). In contrast, the bar in the row below (second bar in the third row, index=2, abs_pos=22) is mostly green, showing that despite having the same absolute position, the probe can tell that it's mostly index=2, with a bit of red error (index=1) and purple error (index=3).
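To make the probe targets and the bars concrete, here is a minimal sketch of how the labels and the per-bar guess distributions could be computed. This is not the code used for the post: `word_end_labels` and `guess_distribution` are hypothetical helper names, the `|`-delimited string stands in for a real tokenizer, the second prompt is made up for illustration, and positions are assumed to be 0-indexed so that the 19-token prefix occupies positions 0 to 18.

```python
from collections import Counter, defaultdict

# Fixed prefix length from the post. Assumption: it occupies positions 0..18,
# so the first word token sits at absolute position 19.
PREFIX_LEN = 19


def word_end_labels(pipe_delimited, prefix_len=PREFIX_LEN):
    """Return (absolute_position, word_index) for the last token of each word.

    Tokens are separated by "|". A token that starts with a space begins a
    new word, which is how GPT-2 style tokenizers mark word boundaries.
    """
    tokens = pipe_delimited.split("|")
    labels = []
    word_index = -1
    for offset, token in enumerate(tokens):
        if offset == 0 or token.startswith(" "):
            word_index += 1
        # A token ends its word if it is the last token overall, or if the
        # next token starts a new word.
        ends_word = offset + 1 == len(tokens) or tokens[offset + 1].startswith(" ")
        if ends_word:
            labels.append((prefix_len + offset, word_index))
    return labels


def guess_distribution(records):
    """Group probe guesses into one distribution per (true_index, abs_pos) bar.

    `records` is an iterable of (abs_pos, true_index, predicted_index).
    """
    counts = defaultdict(Counter)
    for abs_pos, true_index, predicted_index in records:
        counts[(true_index, abs_pos)][predicted_index] += 1
    return {
        bar: {guess: n / sum(c.values()) for guess, n in c.items()}
        for bar, c in counts.items()
    }


# The example prompt from the post.
first = word_end_labels(
    "token|izer| help| apple| dram|at|isation| architecture| sick| al|p|aca"
)
# A made-up second prompt, reusing token splits from the example above.
second = word_end_labels("help| al|p|aca| sick")

# Made-up probe guesses, purely to show the shape of the grouping.
toy_records = [(22, 1, 1), (22, 1, 1), (22, 1, 0), (22, 1, 2), (22, 2, 2)]
toy_bars = guess_distribution(toy_records)
```

In `first`, absolute position 22 is the end of word index 2, while in `second` the same position is the end of word index 1. This is why the probe cannot succeed just by reading off the absolute position, and why each row of the plot has bars at several different absolute positions.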
Key observations from this plot:
Conceptual Subtleties + Commentary
Why might models care about emergent positional embeddings at all? One of the weirdnesses of transformers is that, from the perspective of attention, every previous token position looks similar regardless of how far back it is - they're just as easy to attend to! The standard way of dealing with this is various hacks to hard-code knowledge of positional info, like rotary, or absolute positional embeddings. But tokens are a pretty weird format, different things of the same conceptual "length" can get split into wildly varying numbers of tokens, eg " Alexander" -> " Alexander" while " Neel" -> " Ne" "el" (apparently Neel isn't as popular a name :'( ).
It's also plausible that moving around creative positional schemes is just much more efficient than moving around actual token values. In indirect object identification, part of the circuit tracks the position of the indirect object (two possible values, 1 bit) and the token value (hundreds to thousands of possible names!); the position just seems vastly more efficient!
Why should we care if this happens? Honestly I mostly think that this would just be cool! But it seems pretty important to understand if it does occur, since I expect this to be a sizable part of what models are doing internally - moving these around in creative ways, and computing more complex emergent positional schemes. If we don't understand the features inside the model or the common motifs, it seems much harder to understand what's actually going on. And it's plausible to me that quite a lot of sophisticated attention head circuitry looks like creative forms of passing around emergent positional embeddings. Also, just, this was not a hypothesis I think I would have easily naturally thought of on my own, and it's useful to know what you're looking for when doing weird alien neuroscience.
Models are probably bad at counting: One observation is that my probe performance gets much worse as we get to later words. I'm not confident in why, but my weak intuition is that counting in this organic, emergent way is just pretty hard! In particular, I'd guess that heads need an "anchor" nearby like a full stop or newline or comma such that they count from there onwards. Eg they have attn score 3 to the full stop and then 1 to each token beginning with a space, -inf to everything else. And the OV just accumulates things beginning with a space. This creates a big difference for early words but washes out later on.
This hypothesis predicts that models do not do anything like tracking "I am word 98" etc, but rather "I am the third word in the fifth sentence" etc. Since I imagine models mostly care about local attention to recent words/sentences/etc this kind of nearby counting seems maybe sufficient.
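The "washes out later on" claim can be checked with a toy calculation. The sketch below takes the attention scores from the anchor example above (3 to the full stop, 1 to each token beginning with a space, -inf elsewhere) and computes how much softmax attention lands on word-start tokens after n words. It assumes a single head whose OV circuit writes the same vector for every word-start token, so the size of the head's output is proportional to that total attention. This is a simplification for illustration, not something measured in GPT-2 Small, and `word_start_attention` is a hypothetical name.

```python
import math

ANCHOR_SCORE = 3.0      # attention score to the anchor (e.g. a full stop)
WORD_START_SCORE = 1.0  # attention score to each token beginning with a space


def word_start_attention(n_words):
    """Total softmax attention on word-start tokens after n_words words.

    Every other token has score -inf, so it contributes exp(-inf) = 0 to the
    softmax denominator and can be ignored.
    """
    anchor = math.exp(ANCHOR_SCORE)
    starts = n_words * math.exp(WORD_START_SCORE)
    return starts / (anchor + starts)


# The "counting signal" for words 1..7, and how much it changes per extra word.
signal = [word_start_attention(n) for n in range(1, 8)]
gaps = [later - earlier for earlier, later in zip(signal, signal[1:])]

for n, value in enumerate(signal, start=1):
    print(f"{n} words: {value:.3f}")
```

Under these assumptions the signal simplifies to n / (n + e^2), so going from 1 to 2 words moves it by roughly 0.09, while going from 6 to 7 words moves it by roughly 0.04. Later word indices are packed closer together, which would make them harder for a linear probe to separate.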
What are the limitations of my experiment?
Next Steps
Natural next experiments to run
Finding a circuit!
Appendix: Technical details of the experiment
Meta - I was optimising for moving fast and getting some results, which is why the below are extremely hacky. See my terrible code for more.