Really nice summarisation of the confusion. Re: your point 3, this makes "induction heads" as a class of things feel a lot less coherent :( I also had not considered that the behaviour on random sequences might only show induction as a fallback. Do you think there may be induction-y heads that simply don't activate on random sequences because those sequences are out of distribution?
I'll just preregister that I bet these weird tokens have very large norms in the embedding space.
Cool that you figured that out, that easily explains the high cosine similarity! It does seem to me that a large constant offset to all the embeddings is interesting, since that means GPT-Neo's later layers have to do their computation taking that into account, which doesn't seem efficient at all. I will def poke around more.
Interesting on MLP0 (I swear I use zero indexing lol just got momentarily confused)! Does that hold across the different GPT sizes?
I'm pretty sure! I don't think I messed up anywhere in my code (just a nested for loop lol). An interesting consequence of this is that for GPT-2, applying logit lens to the embedding matrix (i.e. ) gives us a near-perfect autoencoder (the top output is the token that was fed in), but for GPT-Neo it always gives us the token whose embedding has the largest magnitude, since the cosine similarity term in the dot product is nearly constant across tokens and so carries no information.
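A rough way to see the effect described above, on toy data rather than real model weights. This is only a sketch under some assumptions: the unembedding is taken to be tied to the embedding matrix, the final layer norm and positional embeddings are ignored, and the names `tied_logit_lens_top1`, `spread` and `clustered` are made up for illustration.

```python
import numpy as np

def tied_logit_lens_top1(emb: np.ndarray) -> np.ndarray:
    """For each token i, return argmax_j <e_i, e_j> (unembedding assumed tied to emb)."""
    logits = emb @ emb.T  # logits[i, j] = dot product of embeddings i and j
    return logits.argmax(axis=1)

rng = np.random.default_rng(0)
vocab, d_model = 500, 64  # toy sizes, not the real vocab or width

# Case 1: directions spread out, norms roughly equal (GPT-2-like in spirit).
spread = rng.normal(size=(vocab, d_model))

# Case 2: every embedding shares a large constant offset, so all pairwise
# cosine similarities are close to 1, while the norms vary a lot.
scales = rng.uniform(0.5, 2.0, size=(vocab, 1))
clustered = scales * (10.0 + rng.normal(size=(vocab, d_model)))

ids = np.arange(vocab)
spread_self_rate = float(np.mean(tied_logit_lens_top1(spread) == ids))
clustered_pred = tied_logit_lens_top1(clustered)
clustered_self_rate = float(np.mean(clustered_pred == ids))
norms = np.linalg.norm(clustered, axis=1)

print("self-retrieval rate, spread embeddings:   ", spread_self_rate)
print("self-retrieval rate, clustered embeddings:", clustered_self_rate)
print("distinct tokens predicted (clustered):    ", np.unique(clustered_pred).size)
print("smallest norm among predicted tokens vs max norm:",
      norms[clustered_pred].min(), norms.max())
```

In the spread case almost every token retrieves itself, while in the clustered case the predictions collapse onto a handful of the largest-norm tokens, which matches the intuition that the dot product is dominated by the norm once the cosine term is nearly constant.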
What do you mean about MLP0 being basically part of the embed btw? There is no MLP before the first attention layer right?
Huh interesting about the backup heads in GPT-Neo! I would not expect a dropout-less model to have that--some ideas to consider:
Re: GPT-Neo being weird, one of the colabs in the original logit lens post shows that logit lens is pretty decent for standard GPT-2 of varying sizes but basically useless for GPT-Neo, i.e. it outputs some extremely unlikely tokens for every layer before the last one. The bigger GPT-Neos are a bit better (some layers are kinda interpretable with logit lens) but still bad. Basically, the residual stream is just in a totally wacky basis until the last layer's computations, unlike GPT-2, which shows more stability (the whole reason logit lens works).
One weird thing I noticed with GPT-Neo 125M's embedding matrix is that the input static embeddings are super concentrated in vector space: the avg. pairwise cosine similarity is 0.960, compared to GPT-2 small's 0.225.
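For anyone who wants to reproduce this kind of number, here is a minimal sketch of the average pairwise cosine similarity. It is not the code used for the figures above; `mean_pairwise_cosine` is a hypothetical helper, and the matrices are random stand-ins for a real `vocab x d_model` embedding matrix. It avoids building the full `vocab x vocab` similarity matrix by using the fact that the sum of all pairwise dot products equals the squared norm of the summed unit vectors.

```python
import numpy as np

def mean_pairwise_cosine(emb: np.ndarray) -> float:
    """Mean cosine similarity over all pairs of distinct rows of emb."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    n = unit.shape[0]
    # sum_{i,j} <u_i, u_j> = ||sum_i u_i||^2, including the n self-pairs.
    total = float(np.sum(unit.sum(axis=0) ** 2))
    # Drop the n self-similarities (each equal to 1), average over ordered pairs.
    return (total - n) / (n * (n - 1))

rng = np.random.default_rng(0)
toy = rng.normal(size=(1000, 64))            # stand-in embedding matrix
shifted = toy + 10.0                         # same vectors plus a large constant offset
centered = shifted - shifted.mean(axis=0)    # offset removed again

print("toy:     ", mean_pairwise_cosine(toy))
print("shifted: ", mean_pairwise_cosine(shifted))
print("centered:", mean_pairwise_cosine(centered))
```

On real models the matrix would come from the model itself, e.g. with Hugging Face transformers something like `model.get_input_embeddings().weight` converted to a NumPy array. The toy run also illustrates how a shared offset alone pushes the average similarity towards 1, and how subtracting the mean embedding removes it.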
On the later layers not doing much, I saw some discussion on the EleutherAI discord that probes can recover really good logit distributions from the middle layers of the big GPT-Neo models. I haven't looked into this more myself so I don't know how it compares to GPT-2. Just seems to be an overall profoundly strange model.
Understand IOI in GPT-Neo: it's a same-size model but does IOI via composition of MLPs
GPT-Neo might be weird because it was trained without dropout iirc. In general, it seems to be a very unusual model compared to others of its size; e.g. logit lens totally fails on it, and probing experiments find most of its later layers add very little information to its logit predictions. Relatedly, I would think dropout is responsible for backup heads existing and taking over if other heads are knocked out.
cf. https://arxiv.org/abs/2407.14662