In particular, this point of view further (and perhaps almost completely) demystifies the use of the Fourier basis.
I disagree at least with the "almost completely" version of this claim:
Notice that the operation you want to learn is manifestly a convolution operation, i.e.
This also applies to the non-modular addition operation, but I think it's pretty plausible that if you train on non-modular addition (to the point of ~perfect generalization), the network would learn an embedding that converts the "tokenized" representation back into the "magnitude" representation, and then simply adds them as normal.
Some evidence for this:
It seems like if you believe "when the operation you want to learn is a convolution operation, then you will learn the Fourier basis", you should also believe that you'll get a Fourier basis for non-modular addition on one-hot-encoded numbers, and currently my guess is that that's false.
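To make the hypothesized "magnitude" representation a bit more concrete (this is just my shorthand, not notation from the paper): writing $e_n$ for the one-hot vector of the number $n$ and $W_E$ for the embedding, the guess is roughly that

$$W_E\, e_n \approx n \cdot v \ \text{ for some fixed direction } v, \qquad\text{so that}\qquad W_E\, e_x + W_E\, e_y \approx (x+y)\cdot v,$$

after which the rest of the network only has to read off a single scalar coefficient, with no Fourier components required.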
Fwiw, I agree that the algorithm is quite "mathematically natural" (indeed, one person came up with the algorithm independently, prompted by "how would you solve this problem if you were a Transformer"), though I feel like the "modular" part is pretty crucial for me (and the story I'd tell would be the one in Daniel's comment).
Thanks for the comment Rohin, that's interesting (though I haven't looked at the paper you linked).
I'll just record some confusion I had after reading your comment that stopped me replying initially: I was confused by the distinction between modular and non-modular because I kept thinking: If I add a bunch of numbers and don't do any modding, then it is equivalent to doing modular addition modulo some large number (i.e. at least as large as the largest sum you get). And on the other hand, if I tell you I'm doing 'addition modulo 113', but I only ever use inputs that add up to 112 or less, then you never see the fact that I was secretly intending to do modular addition. And these thoughts sort of stopped me having anything more interesting to add.
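A concrete instance of that equivalence (numbers chosen purely for illustration): if the inputs are restricted to $x, y \in \{0, \dots, 56\}$, then

$$x + y \;\le\; 112 \;<\; 113 \quad\Longrightarrow\quad (x + y) \bmod 113 = x + y \ \text{ on every example},$$

so from the training data alone the two tasks are indistinguishable.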
I agree -- the point is that if you train on addition examples without any modular wraparound (whether you think of that as regular addition or modular addition with a large prime, doesn't super matter), then there is at least some evidence that you get a different representation than the one Nanda et al found.
This is a neat observation, but I'm reminded of a story I was told about a math professor:
One day, while in the middle of a long proof of an arcane theorem, the professor was stopped and questioned about a particular step by a student, who wondered what made that step true. The professor said "It's trivial!", then thought a bit more about the step, mumbled to himself "Wait, is it trivial?", and excused himself to step out of the hall and think. Ten minutes later, he came back into the hall, declared that the step was indeed trivial, and proceeded with the proof.
This feels similar to me. Neel and Tom figured out this algorithm 9.5 months ago, and now mathematicians have just realized that the algorithm is indeed obvious and simple, and indeed the only way to do the operation described when matrix multiplications come easy.
Not to say the insight is wrong, but I would be far more impressed if you were able to predict the algorithm a network uses in advance through similar reasoning, rather than supply a justification 9.5 months after the fact.
Hi Garrett,
OK so just being completely honest, I don't know if it's just me but I'm getting a slightly weird or snarky vibe from this comment? I guess I will assume there is a good faith underlying point being made to which I can reply. So just to be clear:
In another one of my posts I discuss at more length the kind of thing you bring up in the last sentence of your comment, e.g.
it can feel like the role that serious mathematics has to play in interpretability is primarily reactive, i.e. consists mostly of activities like 'adding' rigour after the fact or building narrow models to explain specific already-observed phenomena.
....[but]... one of the most lauded aspects of mathematics is a certain inevitability with which our abstractions take on a life of their own and reward us later with insight, generalization, and the provision of predictions. Moreover - remarkably - often those abstractions are found in relatively mysterious, intuitive ways: i.e. not as the result of us just directly asking "What kind of thing seems most useful for understanding this object and making predictions?" but, at least in part, as a result of aesthetic judgement and a sense of mathematical taste.
And e.g. I talk about how this sort of thing has been the case in areas like mathematical physics for a long time. Part of the point is that (in my opinion, at least) there isn't any neat shortcut to the kind of abstract thinking that lets you make the sort of predictions you are referring to. It is very typical that you have to begin by reacting to existing empirical phenomena and using them as scaffolding. But, to me, it has come across as though you are being somewhat dismissive of this fact? As if, when B might well follow from A and someone actually starts to do A, you say "I would be far more impressed if B" instead of "maybe that's progress towards B"?
(Also FWIW, Neel claims here that regarding the algorithm itself, another researcher he knows "roughly predicted this".)
I don't know if it's just me but I'm getting a slightly weird or snarky vibe from this comment?
Sorry about that. On a re-read, I can see how the comment could be seen as snarky, but I was going more for critical via illustrative analogy. Oh the perils of the lack of inflection and facial expressions.
I think your criticisms of my thought in the above comment are right-on, and you've changed my mind on how useful your post was. I do think that lots of progress can be made in understanding stuff by just finding the right frame by which the result seems natural, and your post is doing this. Thanks!
My submission: when we teach modular arithmetic to people, we do it using the metaphor of clock arithmetic. Well, if you ignore the multiple frequencies and argmax weirdness, clock arithmetic is exactly what this network is doing! Find the coordinates of rotating the hour hand (on a 113-hour clock) x hours, then y hours, use trig identities to work out what it would be if you rotated x+y hours, then count how many steps back you have to rotate to get to 0 to tell where you ended up. In fairness, the final step is a little bit different than the usual imagined rule of "look at the hour mark where the hand ends up", but not so different that clock arithmetic counts as a bad prediction IMO.
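To spell the analogy out for a single hand speed $\omega = 2\pi/113$ (and with the multiple frequencies and the argmax step set aside, as above), the "rotate then combine" step is just the angle-addition identities

$$\cos\big(\omega(x+y)\big) = \cos(\omega x)\cos(\omega y) - \sin(\omega x)\sin(\omega y), \qquad \sin\big(\omega(x+y)\big) = \sin(\omega x)\cos(\omega y) + \cos(\omega x)\sin(\omega y),$$

and the final read-off picks the answer $c$ whose hand position matches, i.e. the $c$ maximizing $\cos\big(\omega(x+y-c)\big)$.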
Is this really an accurate analogy? I feel like clock arithmetic would be more like representing it as a rotation matrix, not a Fourier basis.
I agree a rotation matrix story would fit better, but I do think it's a fair analogy: the numbers stored are just coses and sines, aka the x and y coordinates of the hour hand.
Like, the only reason we're calling it a "Fourier basis" is that we're looking at a few different speeds of rotation, in order to scramble the second-place answers that almost get you a cos of 1 at the end, while preserving the actual answer.
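Concretely (suppressing the learned coefficients, and with only a handful of frequencies $\omega_k = 2\pi k/113$ actually used), the logit for a candidate answer $c$ looks roughly like

$$\mathrm{logit}(c) \;\propto\; \sum_{k} \cos\big(\omega_k\,(x+y-c)\big),$$

where every term peaks simultaneously exactly at $c \equiv x+y \pmod{113}$, while at other values of $c$ the different speeds are out of phase and the near-misses partially cancel.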
Here's another way of looking at it which could be said to make it more trivial:
We can transform addition into multiplication by taking the exponential, i.e. x+y=z is equivalent to 10^x * 10^y = 10^z.
But if we unfold the digits into separate axes rather than as a single number, then 10^n is just a one-hot encoding of the integer n.
Taking the Fourier transform of the digits to do convolutions is a well-known fast multiplication algorithm.
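A minimal numpy sketch of that chain of identifications (my own illustrative code, not from any of the posts): multiplying $10^x$ by $10^y$ is convolving their digit sequences, and when the "digit" vectors are one-hot, the convolution lands exactly on position $x+y$; the FFT route is the standard fast way to do that convolution.

```python
import numpy as np

def one_hot(n, size):
    """The 'digits' of 10**n, i.e. a vector with a single 1 in position n."""
    v = np.zeros(size)
    v[n] = 1.0
    return v

x, y, size = 3, 5, 16                    # choose size > x + y so nothing wraps around
a, b = one_hot(x, size), one_hot(y, size)

# Multiplying 10**x by 10**y is convolving their digit vectors.
direct = np.convolve(a, b)[:size]

# The well-known fast-multiplication route: convolve by multiplying Fourier transforms.
# (This is circular convolution, but with size > x + y there is no wrap-around.)
via_fft = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

assert np.allclose(direct, one_hot(x + y, size))
assert np.allclose(via_fft, one_hot(x + y, size))
```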
Thanks for putting this so succinctly! To add another subjective data point, I had very similar thoughts immediately after I first saw this work (and the more conceptual follow-up by Chughtai et al) a few months ago.
About "one-hotting being a significant transformation": I have a somewhat opposite intuition here and would say that this is also quite natural.
Maybe at first glance one would find it more intuitive to represent the inputs as a subset of the real numbers (or floats, I guess) and think of modular addition as some garbled version of the usual addition. But the group structure on a finite cyclic group and the vector space structure on the real number line are not really compatible, so I'm not sure that this actually makes that much sense, bearing in mind that the model has to work mostly with linear transformations.
On the other hand, if such a representation were useful, the model could in principle learn an embedding which takes all the one-hot embedded inputs to the same one-dimensional space but with different lengths.
In fact, one-hotting is in a precise sense the most general way of embedding a given set in a vector space because it does not impose any additional linear relations (in mathematical jargon, it's the free vector space generated by the set, and is characterized by the universal property of turning arbitrary maps on the generating set into unique linear maps on the vector space). In this sense I'd view using a one-hot embedding as the natural way of designing the architecture if I don't want to create a bias towards a particular linear representation.
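For concreteness, the universal property being referred to (standard, just restated in this setting): writing $\mathbb{R}^S$ for the free vector space on a finite set $S$, with basis $\{e_s\}_{s \in S}$ given by the one-hot vectors,

$$\text{for every map } f : S \to W \text{ into a vector space } W, \ \text{there is a unique linear map } F : \mathbb{R}^S \to W \text{ with } F(e_s) = f(s) \text{ for all } s \in S.$$

In particular, any other linear representation the model might prefer (including the "magnitude" embedding $e_n \mapsto n$) is reachable from the one-hot one by a single linear layer.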
As a side remark, although it's in a sense completely trivial, the "one-hotting construction" is also used as an important ingredient in many areas of mathematics. One example would be homology theory in algebraic topology, where one turns geometric/combinatorial objects into a vector space in this way and then does linear algebra on that space rather than working with the objects directly. Another example, closer to the problem discussed here, is turning a group into the corresponding group algebra in representation theory.
These remarks are basically me just wanting to get my thoughts down after a Twitter exchange on this subject. I've not spent much time on this post and it's certainly plausible that I've gotten things wrong.
In the 'Key Takeaways' section of the Modular Addition part of the well-known post 'A Mechanistic Interpretability Analysis of Grokking', Nanda and Lieberum write:
And
But the casual reader should use caution! It is in fact the case that "Inputs x,y are given as one-hot encoded vectors in $\mathbb{R}^p$". This point is of course emphasized more in the full notebook (it has to be, that's where the code is), and the arXiv paper that followed is also much clearer about this point. However, when giving brief takeaways from the work, especially when it comes to discussing how 'natural' the learned algorithm is, I would go as far as saying that it is actually misleading to suggest that the network is literally given x and y as inputs. It is not trained to 'act' on the numbers x, y themselves.
When thinking seriously about why the network is doing the particular thing that it is doing at the mechanistic level, I would want to emphasize that one-hotting is already a significant transformation. You have moved away from having the number x be represented by its own magnitude. You instead have a situation in which x and y now really live 'in the domain' (it's almost like a dual point of view: The number x is not the size of the signal, but the position at which the input signal is non-zero).
So, while I of course fully admit that I too am looking at it through my own subjective lens, one might say that (before the embedding happens) it is more mathematically natural to think that what the network is 'seeing' as input is something like the indicator functions $t \mapsto \mathbb{1}_x(t)$ and $t \mapsto \mathbb{1}_y(t)$. Here, $t$ is something like the 'token variable' in the sense that these are functions on the vocabulary. And if we essentially ignore the additional tokens for | and =, we can think that these are functions on the group $\mathbb{Z}/p\mathbb{Z}$ and that we would like the network to learn to produce the function $t \mapsto \mathbb{1}_{x+y}(t)$ at its output neurons.
In particular, this point of view further (and perhaps almost completely) demystifies the use of the Fourier basis.
Notice that the operation you want to learn is manifestly a convolution operation, i.e.

$$(\mathbb{1}_x * \mathbb{1}_y)(t) = \sum_s \mathbb{1}_x(t-s)\,\mathbb{1}_y(s) = \mathbb{1}_{x+y}(t).$$
And (as I distinctly remember being made to practically chant in an 'Analysis of Boolean Functions' class given by Tom Sanders) the Fourier Transform is the (essentially unique) change of basis that simultaneously diagonalizes all convolution operations. This is coming close to saying something like: There is one special basis that makes the operation you want to learn uniquely easy to do using matrix multiplications, and that basis is the Fourier basis.
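A small numpy check of both of these statements together (my own illustrative code, with $p = 113$ as in the original setup): addition mod $p$ on indicator functions is exactly circular convolution, and the discrete Fourier transform turns that convolution into pointwise, i.e. diagonal, multiplication.

```python
import numpy as np

p = 113
x, y = 47, 98                            # (x + y) % p == 32

def one_hot(n, p):
    """Indicator function 1_n on Z/pZ, as a length-p vector."""
    v = np.zeros(p)
    v[n % p] = 1.0
    return v

def circ_conv(a, b):
    """(a * b)(t) = sum_s a(t - s) b(s), with indices taken mod p."""
    n = len(a)
    return np.array([sum(a[(t - s) % n] * b[s] for s in range(n)) for t in range(n)])

a, b = one_hot(x, p), one_hot(y, p)

# Convolving the two indicators gives the indicator of x + y (mod p) ...
out = circ_conv(a, b)
assert np.allclose(out, one_hot(x + y, p))

# ... and in the Fourier basis the convolution acts diagonally:
# the DFT of the convolution is the pointwise product of the DFTs.
assert np.allclose(np.fft.fft(out), np.fft.fft(a) * np.fft.fft(b))
```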