These remarks are basically me just wanting to get my thoughts down after a Twitter exchange on this subject. I've not spent much time on this post and it's certainly plausible that I've gotten things wrong.
In the 'Key Takeaways' section of the Modular Addition part of the well-known post 'A Mechanistic Interpretability Analysis of Grokking', Nanda and Lieberum write:
This algorithm operates via using trig identities and Discrete Fourier Transforms to map $x, y \to \cos(w(x+y)), \sin(w(x+y))$, and then extracting $x + y \pmod{p}$
And
The model is trained to map $x, y \to x + y \pmod{113}$ (henceforth 113 is referred to as $p$)
But the casual reader should use caution! It is in fact the case that "Inputs are given as one-hot encoded vectors in $\mathbb{R}^{p+2}$". This point is of course emphasized more in the full notebook (it has to be, that's where the code is), and the arXiv paper that followed is also much clearer about this point. However, when giving brief takeaways from the work, especially when it comes to discussing how 'natural' the learned algorithm is, I would go as far as saying that it is actually misleading to suggest that the network is literally given $x$ and $y$ as inputs. It is not trained to 'act' on the numbers themselves.
When thinking seriously about why the network is doing the particular thing that it is doing at the mechanistic level, I would want to emphasize that one-hotting is already a significant transformation. You have moved away from having the number be represented by its own magnitude. You instead have a situation in which $x$ and $y$ now really live 'in the domain' (it's almost like a dual point of view: the number is not the size of the signal, but the position at which the input signal is non-zero).
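To make this concrete, here is a minimal numpy sketch of what one-hotting does to an input number (the code and names are mine, not from the original notebook):

```python
import numpy as np

p = 113  # the modulus from the original setup

def one_hot(x: int, n: int) -> np.ndarray:
    """Return the indicator vector with a 1 at position x and 0s elsewhere."""
    v = np.zeros(n)
    v[x] = 1.0
    return v

# The number 5 is no longer represented by its own magnitude; it is
# the *position* at which the input signal is non-zero.
print(one_hot(5, p)[:8])  # [0. 0. 0. 0. 0. 1. 0. 0.]
```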
So, while I of course fully admit that I too am looking at it through my own subjective lens, one might say that (before the embedding happens) it is more mathematically natural to think that what the network is 'seeing' as input is something like the indicator functions $\mathbb{1}_{t=x}$ and $\mathbb{1}_{t=y}$. Here, $t$ is something like the 'token variable' in the sense that these are functions on the vocabulary. And if we essentially ignore the additional tokens for | and =, we can think that these are functions on the group $\mathbb{Z}/p\mathbb{Z}$ and that we would like the network to learn to produce the function $\mathbb{1}_{t=x+y}$ at its output neurons.
In particular, this point of view further (and perhaps almost completely) demystifies the use of the Fourier basis.
Notice that the operation you want to learn is manifestly a convolution operation, i.e.

$$\mathbb{1}_{t=x} * \mathbb{1}_{t=y} = \mathbb{1}_{t=x+y},$$

where $(f * g)(t) := \sum_{s \in \mathbb{Z}/p\mathbb{Z}} f(s)\, g(t-s)$.
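One can check this identity numerically; the following is a quick sketch (again my own illustrative code, using the naive $O(p^2)$ definition of cyclic convolution):

```python
import numpy as np

p = 113
x, y = 37, 99  # arbitrary inputs

def one_hot(k: int, n: int) -> np.ndarray:
    v = np.zeros(n)
    v[k % n] = 1.0
    return v

def cyclic_conv(f: np.ndarray, g: np.ndarray) -> np.ndarray:
    """(f * g)(t) = sum over s of f(s) g(t - s), indices taken mod p."""
    n = len(f)
    return np.array([sum(f[s] * g[(t - s) % n] for s in range(n))
                     for t in range(n)])

out = cyclic_conv(one_hot(x, p), one_hot(y, p))
# The convolution of the two input indicators is exactly the target indicator.
assert np.argmax(out) == (x + y) % p
```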
And (as I distinctly remember being made to practically chant in an 'Analysis of Boolean Functions' class given by Tom Sanders) the Fourier Transform is the (essentially unique) change of basis that simultaneously diagonalizes all convolution operations. This is coming close to saying something like: There is one special basis that makes the operation you want to learn uniquely easy to do using matrix multiplications, and that basis is the Fourier basis.
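Concretely (using np.fft purely as a stand-in for the Discrete Fourier Transform on $\mathbb{Z}/p\mathbb{Z}$), the convolution theorem says that convolution becomes pointwise multiplication, i.e. a diagonal operation, in the Fourier basis:

```python
import numpy as np

p = 113
rng = np.random.default_rng(0)
f, g = rng.normal(size=p), rng.normal(size=p)  # two arbitrary functions on Z/pZ

# Cyclic convolution computed directly from the definition...
conv = np.array([sum(f[s] * g[(t - s) % p] for s in range(p))
                 for t in range(p)])

# ...is pointwise multiplication after the change of basis:
# DFT(f * g) = DFT(f) . DFT(g). Every convolution operator "f * -"
# is therefore diagonal in this one shared basis.
assert np.allclose(np.fft.fft(conv), np.fft.fft(f) * np.fft.fft(g))
```

In other words, a network whose weights effect multiplication in the Fourier basis gets the convolution essentially 'for free'.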
Hi Garrett,
OK so just being completely honest: I don't know if it's just me, but I'm getting a slightly weird or snarky vibe from this comment? I guess I will assume there is a good-faith underlying point being made to which I can reply. So just to be clear:
In another one of my posts I discuss at more length the kind of thing you bring up in the last sentence of your comment, e.g.
And e.g. I talk about how this sort of thing has been the case in areas like mathematical physics for a long time. Part of the point is that (in my opinion, at least) there isn't any neat shortcut to the kind of abstract thinking that lets you make the sort of predictions you are referring to. It is very typical that you have to begin by reacting to existing empirical phenomena and using them as scaffolding. But to me, it has come across as though you are being somewhat dismissive of this fact? As if, when B might well follow from A and someone actually starts to do A, you say "I would be far more impressed if B" instead of "maybe that's progress towards B"?
(Also FWIW, Neel claims here that regarding the algorithm itself, another researcher he knows "roughly predicted this".)