This is a well-executed paper that indeed shakes some of my faith in ChatGPT/LLMs/transformers with its negative results.
I'm most intrigued by their negative result for GPT-4 prompted with a scratchpad. (B.5, figure 25.) This is something I would have definitely predicted would work. GPT-4 shows enough intelligence in general that I would expect it to be able to follow and mimic the step-by-step calculation abilities shown in the scratchpad, even if it were unable to figure out the result one- or few-shot (B.2, figure 15).
But, what does this failure mean? I'm not sure I understand the authors' conclusions: they state (3.2.3) this "suggests that models are able to correctly perform single-step reasoning, potentially due to memorizing such single-step operations during training, but fail to plan and compose several of these steps for an overall correct reasoning." I don't see any evidence of that in the paper!
In particular, 3.2.3 and figure 7's categorization of errors, as well as the theoretical results they discuss in section 4, give me the opposite impression. Basically they say that if you make a local error, it'll propagate and screw you up. You can see, e.g., in figure 7's five-shot GPT-4 example, how a single local error at graph layer 1 causes propagation error to start growing immediately. Later, more local errors kick in, but to me this is sort of understandable: once the calculation starts going off the rails, the model might not be in a good place to do even local reasoning.
I don't see what any of this has to do with planning and composing! In particular I don't see any measurement of something like "set up a totally wrong plan for multiplying numbers" or "fail to compose all the individual digit computation-steps into the final answer-concatenation step". Such errors might exist, but the paper doesn't give examples or any measurements of them. Its categorization of error types seems to assume that the model always produces a computation graph, which to me is pretty strong evidence of planning and composing abilities!
Stated another way: I suspect that if you eliminated all the local errors, accuracy would be good! So the question is: why is GPT-4 failing to multiply single-digit numbers sometimes, in the middle of these steps?
(It's possible the answer lies in tokenization difficulties, but it seems unlikely.)
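To put rough numbers on why local errors matter so much, here is a minimal sketch. It assumes every local step succeeds independently with the same probability and that any single local error ruins the final answer. That is my simplification for illustration, not the paper's exact model, and the function name is made up.

```python
def chance_all_steps_correct(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that a chain of independent local steps contains no error.

    Assumption (mine, not the paper's): steps fail independently, each with
    the same accuracy, and one bad step makes the final answer wrong.
    """
    return per_step_accuracy ** num_steps


if __name__ == "__main__":
    # Compare a few per-step accuracies over short and long chains.
    for accuracy in (0.9, 0.99, 0.999):
        for steps in (10, 100):
            chance = chance_all_steps_correct(accuracy, steps)
            print(f"accuracy={accuracy}, steps={steps}: {chance:.3f}")
```

Under these assumptions, even a per-step accuracy of 0.99 leaves well under a 50% chance of getting through 100 steps cleanly, which fits the intuition that removing local errors would fix most of the final accuracy.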
OK, now let's look at it from another angle: how different is this from humans? What's impressive to me about this result is that it is quite different. I was expecting to say something like, "oh, not every human will be able to get 100% accuracy on following the multiplication algorithm for 5-digit-by-5-digit numbers; it's OK to expect some mistakes". But, GPT-4 fails to multiply 5x5 digit numbers every time!! Even with a scratchpad! Most educated humans would get better than zero accuracy.
So my overall takeaway is that local errors are still too prevalent in these sorts of tasks. Humans don't always make >=1 mistake on { one-digit multiplication, sum, mod 10, carry over, concatenation } when performing 5-digit by 5-digit multiplication, whereas GPT-4 supposedly does.
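For concreteness, here is a sketch of schoolbook multiplication broken into those local operations. This is my own reconstruction of what a scratchpad has to walk through, not the authors' prompt or code. The function name and the way I count "steps" are assumptions.

```python
def scratchpad_multiply(a: int, b: int) -> tuple[int, int]:
    """Schoolbook multiplication of two non-negative ints.

    Returns (product, rough count of local steps). The step count is a
    coarse, hypothetical tally of the local operations named above.
    """
    a_digits = [int(d) for d in str(a)][::-1]  # least-significant digit first
    b_digits = [int(d) for d in str(b)][::-1]
    steps = 0
    partials = []
    for shift, digit_b in enumerate(b_digits):
        carry = 0
        out = []
        for digit_a in a_digits:
            product = digit_a * digit_b   # one-digit multiplication
            total = product + carry       # sum
            out.append(total % 10)        # mod 10
            carry = total // 10           # carry over
            steps += 4
        if carry:
            out.append(carry)
        # concatenation: join the digits into one shifted partial product
        partial = int("".join(str(d) for d in reversed(out))) * 10 ** shift
        steps += 1
        partials.append(partial)
    steps += len(partials) - 1  # adding up the partial products, counted coarsely
    return sum(partials), steps


if __name__ == "__main__":
    product, steps = scratchpad_multiply(12345, 67890)
    print(product == 12345 * 67890, steps)
```

With this way of counting, a 5-digit by 5-digit product takes on the order of a hundred local steps, so "at least one mistake every time" is a claim about error rates over roughly that many operations.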
Am I understanding this correctly? Well, it'd be nice to reproduce their results to confirm. If they're to be believed, I should be able to ask GPT-4 to do one of these multiplication tasks with a scratchpad, and always find an error in the middle. But when trying to reproduce their results, I ran into an issue of under-documented methodology (how did they compose the prompts?) and non-published data (what inaccurate things did the models actually say?). Filed on GitHub; we'll see if they get back to me.
Regarding grokking, they attempt to test whether GPT-3 finetuned on these sorts of problems will exhibit grokking. However, I'm skeptical of this attempt: they trained for 60 epochs for zero-shot and 40 epochs with scratchpads, whereas the original grokking paper used between 3,571 and 50,000 epochs.
(I think epochs is probably a relevant measure here, instead of training steps. This paper does 420K and 30K steps whereas the original grokking paper does 100K steps, so if we were comparing steps it seems reasonable. But "number of times you saw the whole data set" seems more relevant for grokking, in my uninformed opinion!)
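The arithmetic behind that comparison, as a quick sketch. I am assuming the step counts pair up with the epoch counts in the order given (420K steps with the 60-epoch zero-shot run, 30K steps with the 40-epoch scratchpad run). The helper names are made up.

```python
def steps_per_epoch(epochs: int, steps: int) -> float:
    """Optimizer steps taken per pass over the training set."""
    return steps / epochs


def epoch_shortfall(grokking_epochs: int, epochs: int) -> float:
    """How many times more epochs the grokking paper used than this run."""
    return grokking_epochs / epochs


# Assumed pairing of the numbers quoted above (order as listed in the text).
runs = {
    "zero-shot": {"epochs": 60, "steps": 420_000},
    "scratchpad": {"epochs": 40, "steps": 30_000},
}

if __name__ == "__main__":
    for name, run in runs.items():
        per_epoch = steps_per_epoch(run["epochs"], run["steps"])
        low = epoch_shortfall(3_571, run["epochs"])
        high = epoch_shortfall(50_000, run["epochs"])
        print(f"{name}: {per_epoch:.0f} steps/epoch, "
              f"grokking paper used {low:.0f}x to {high:.0f}x more epochs")
```

So in steps the runs are within an order of magnitude of the grokking paper, but in epochs they fall short by a factor of tens to over a thousand.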
Has anyone actually seen LLMs (not just transformers) exhibit grokking? A quick search says no.
I think the big implication for now is that the scaling hypothesis for LLMs, at least if we require the scaling to be bounded, is probably false for far more scaling effort than we realized, and this extends AI timelines by quite a bit.
Am I missing something or is GPT-4 able to do Length 20 Dynamic Programming using a solution it described itself very easily?
https://chat.openai.com/share/8d0d38c0-e8de-49f3-8326-6ab06884df90
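For reference, here is a minimal sketch of the kind of solution I mean, assuming the task is the paper's dynamic programming problem as I understand it: pick a subsequence with the highest sum such that no two chosen numbers are adjacent in the original sequence. This is my own code, not what GPT-4 wrote in the linked chat, and the function name is made up.

```python
def max_nonadjacent_sum(nums: list[int]) -> tuple[int, list[int]]:
    """Return (best sum, chosen indices) with no two chosen indices adjacent.

    Assumption: choosing nothing is allowed, so the best sum is never negative.
    """
    n = len(nums)
    # best[i] = best achievable sum using only nums[i:]; two zeros of padding
    best = [0] * (n + 2)
    for i in range(n - 1, -1, -1):
        best[i] = max(nums[i] + best[i + 2], best[i + 1])

    # Walk forward to recover one optimal set of indices.
    chosen = []
    i = 0
    while i < n:
        if nums[i] + best[i + 2] > best[i + 1]:
            chosen.append(i)  # taking nums[i] is strictly better than skipping
            i += 2
        else:
            i += 1
    return best[0], chosen


if __name__ == "__main__":
    total, indices = max_nonadjacent_sum([3, 2, 7, 10])
    print(total, indices)
```

The whole thing is one backward pass plus one forward pass, so a length-20 input only needs a few dozen local steps.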
We have 100k-context models and several OOMs more FLOPs to throw at models. I couldn't see a reason why autoregressive models would be limited in a substantial way, given the evidence in the paper.
To a non-trivial extent, it vindicates the LLM skeptics of recent fame, like Gary Marcus and Yann LeCun, and generally suggests that LLMs are much more constrained in capabilities than we used to believe.
This is both good and bad:
The biggest good thing about this, combined with the Twitter talk on LLMs, is that it makes timelines quite a bit longer. In particular, Daniel Kokotajlo's model becomes very difficult to sustain without truly ludicrous progress and switching to other types of AI.
The biggest potentially bad thing is that algorithmic progress, and to a lesser extent a change of paradigms, become more important. This complicates AI governance, because any adversarial pressure on LLMs is yet another force on AI progress. I don't subscribe to standard views on what will happen as a result of that, but it is still a complication.
I thought it'd be especially interesting to get critiques/discussion from the LW crowd, because the claims here seem antithetical to a lot of the beliefs people here have, mostly around just how capable and cognizant transformers are/can be.
The authors show that transformers are guaranteed to suffer from compounding errors when performing any computation with long reasoning chains.
From the abstract, "In an attempt to demystify Transformers, we investigate the limits of these models across three representative compositional tasks—multi-digit multiplication, logic grid puzzles, and a classic dynamic programming problem. These tasks require breaking problems down into sub-steps and synthesizing these steps into a precise answer. We formulate compositional tasks as computation graphs to systematically quantify the level of complexity, and break down reasoning steps into intermediate sub-procedures. Our empirical findings suggest that Transformers solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching, without necessarily developing systematic problem solving skills"