QAPR 5: grokking is maybe not *that* big a deal?

A more factual and descriptive phrase for "grokking" would be something like "eventual recovery from overfitting".

Ooh I do like this. But it's important to have a short handle for it too.

I've been using "delayed generalisation", which I think is more precise than "grokking", places the emphasis on the delay rather than the speed of the transition, and is a short phrase.

[-]Eric J. Michaud2yΩ7100

Small point/question, Quintin -- when you say that you "can fully avoid grokking on modular arithmetic", in the colab notebook you linked to in that paragraph it looks like you just trained for 3e4 steps. Without explicit regularization, I wouldn't have expected your network to generalize in that time (it might take 1e6 or 1e7 steps for networks to fully generalize). What point were you trying to make there? By "avoid grokking", do you mean (1) avoid generalization or (2) eliminate the time delay between memorization and generalization? I'd be pretty interested if you achieved (2) while not using explicit regularization.
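
To make the (1) vs (2) distinction concrete, here is a minimal sketch of how one might measure the delay from logged accuracy curves. The function names, the 0.99 threshold, and the example curves are all assumptions made up for illustration; they are not taken from the linked notebook.

```python
from typing import Optional, Sequence


def first_step_reaching(acc: Sequence[float], threshold: float) -> Optional[int]:
    """Return the first logged index where accuracy >= threshold, else None."""
    for step, value in enumerate(acc):
        if value >= threshold:
            return step
    return None


def generalization_delay(
    train_acc: Sequence[float],
    val_acc: Sequence[float],
    threshold: float = 0.99,  # assumed cutoff for "fits the data"
) -> Optional[int]:
    """Logged steps between fitting the training set and fitting held-out data.

    Returns None if either curve never reaches the threshold. That covers
    case (1): the network memorizes but does not generalize in the window.
    A return value of 0 corresponds to case (2): no delay at all.
    """
    t_train = first_step_reaching(train_acc, threshold)
    t_val = first_step_reaching(val_acc, threshold)
    if t_train is None or t_val is None:
        return None
    return t_val - t_train


# Hypothetical, hand-written curves (not results from any experiment).
train_curve = [0.2, 0.6, 1.0, 1.0, 1.0, 1.0]
delayed_val = [0.0, 0.0, 0.01, 0.02, 0.5, 1.0]    # grokking-like delay
never_val = [0.0, 0.0, 0.01, 0.01, 0.01, 0.01]    # case (1)
immediate_val = [0.2, 0.6, 0.99, 1.0, 1.0, 1.0]   # case (2)

print(generalization_delay(train_curve, delayed_val))
print(generalization_delay(train_curve, never_val))
print(generalization_delay(train_curve, immediate_val))
```

Under this sketch, a run that stops at 3e4 steps with flat validation accuracy returns `None`, which cannot by itself distinguish "never generalizes" from "would generalize later".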

[-]Quintin Pope2yΩ130

I mean (1). You can see as much in the figure displayed in the linked notebook:

Note the lack of decrease in the val loss.

I only train for 3e4 steps because that's sufficient to reach generalization with implicit regularization. E.g., here's the loss graph I get if I set the batch size down to 50:

Setting the learning rate to 7e-2 also allows for generalization within 3e4 steps (though not as stably):

The slingshot effect does take longer than 3e4 steps to generalize:

[-]Eric J. Michaud2y10

Huh those batch size and learning rate experiments are pretty interesting!

[-]Rohin Shah2y*Ω220

~~Honestly I'd be surprised if you could achieve (2) even with explicit regularization, specifically on the modular addition task.~~

(You can achieve it by initializing the token embeddings to those of a grokked network so that the representations are appropriately structured; I'm not allowing things like that.)

EDIT: Actually, Omnigrok does this by constraining the parameter norm. I suspect this is mostly making it very difficult for the network to strongly memorize the data -- given the weight decay parameter the network "tries" to learn a high-param norm memorizing solution, but then repeatedly runs into the parameter norm constraint -- and so creates a very strong reason for the network to learn the generalizing algorithm. But that should still count as normal regularization.
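
A rough sketch of what "constraining the parameter norm" could look like: take an ordinary gradient step, then rescale the parameters back to a fixed L2 norm. This is a generic projection-style illustration in plain Python; the function names are hypothetical and the exact procedure used in Omnigrok may differ.

```python
import math
from typing import List, Sequence


def l2_norm(params: Sequence[float]) -> float:
    """Euclidean norm of a flat parameter vector."""
    return math.sqrt(sum(w * w for w in params))


def rescale_to_norm(params: Sequence[float], target_norm: float) -> List[float]:
    """Rescale params so their L2 norm equals target_norm.

    A zero vector has no direction to rescale along, so it is returned as-is.
    """
    norm = l2_norm(params)
    if norm == 0.0:
        return list(params)
    scale = target_norm / norm
    return [w * scale for w in params]


def constrained_sgd_step(
    params: Sequence[float],
    grads: Sequence[float],
    lr: float,
    target_norm: float,
) -> List[float]:
    """One plain SGD step followed by projection back onto the norm sphere."""
    updated = [w - lr * g for w, g in zip(params, grads)]
    return rescale_to_norm(updated, target_norm)


# Illustrative values only: the gradient pushes the norm up,
# and the projection pulls it back to the assumed target of 1.0.
params = [0.6, 0.8]
grads = [-3.0, -4.0]
new_params = constrained_sgd_step(params, grads, lr=0.1, target_norm=1.0)
print(l2_norm(new_params))
```

Any update that only grows the weights along their current direction is undone by the rescaling, which is one way to picture the network "repeatedly running into" the constraint.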

[-]RobertKirk2yΩ110

If you train on infinite data, I assume you'd not see a delay between training and testing, but you'd expect a non-monotonic accuracy curve that looks kind of like the test accuracy curve in the finite-data regime? So I assume infinite data is also cheating?

[-]Rohin Shah2yΩ330

I expect a delay even in the infinite data case, I think?

Although I'm not quite sure what you mean by "infinite data" here -- if the argument is that every data point will have been seen during training, then I agree that there won't be any delay. But yes training on the test set (even via "we train on everything so there is no possible test set") counts as cheating for this purpose.
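
For the modular addition task specifically, the input space is finite, so "every data point will have been seen" has a concrete meaning. A small sketch, assuming the task is (a + b) mod p over all pairs; the modulus, the split helper, and the seed below are illustrative choices, not anyone's actual setup.

```python
import random
from typing import List, Tuple

Example = Tuple[Tuple[int, int], int]


def modular_addition_pairs(p: int) -> List[Example]:
    """All p * p inputs (a, b) with label (a + b) mod p."""
    return [((a, b), (a + b) % p) for a in range(p) for b in range(p)]


def split(
    examples: List[Example], train_fraction: float, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Shuffle with a fixed seed and split into train and held-out sets."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


p = 7  # small hypothetical modulus, chosen only to keep the example tiny
all_examples = modular_addition_pairs(p)

train, held_out = split(all_examples, train_fraction=0.5)
print(len(all_examples), len(train), len(held_out))

# Training on everything leaves no held-out set to measure a delay against.
train_all, held_out_all = split(all_examples, train_fraction=1.0)
print(len(held_out_all))
```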

[-]Lauro Langosco2yΩ476

Broadly agree with the takes here.

However, these results seem explainable by the widely-observed tendency of larger models to learn faster and generalize better, given equal optimization steps.

This seems right and I don't think we say anything contradicting it in the paper.

I also don't see how saying 'different patterns are learned at different speeds' is supposed to have any explanatory power. It doesn't explain why some types of patterns are faster to learn than others, or what determines the relative learnability of memorizing versus generalizing patterns across domains. It feels like saying 'bricks fall because it's in a brick's nature to move towards the ground': both are repackaging an observation as an explanation.

The idea is that the framing 'learning at different speeds' lets you frame grokking and double descent as the same thing. More like generalizing 'bricks move towards the ground' and 'rocks move towards the ground' to 'objects move towards the ground'. I don't think we make any grand claims about explaining everything in the paper, but I'll have a look and see if there's edits I should make - thanks for raising these points.

[-]wassname2y40

The above two papers suggest grokking is a consequence of moderately bad training setups. I.e., training setups that are bad enough that the model starts out by just memorizing the data, but which also contain some sort of weak regularization that eventually corrects this initial mistake.

Sorry if this is a silly question, but from an ML-engineer perspective: can I expect to achieve better performance by seeking grokking (large model, large regularisation, large training time) vs improving the training setup?

And if the training setup is already good, I shouldn't expect grokking to be possible?

[-]Quintin Pope2y20

I don't think that explicitly aiming for grokking is a very efficient way to improve the training of realistic ML systems. Partially, this is because grokking definitionally requires that the model first memorize the data, before then generalizing. But if you want actual performance, then you should aim for immediate generalization.

Further, methods of hastening grokking generalization largely amount to standard ML practices such as tuning the hyperparameters, initialization distribution, or training on more data.

[-]Review Bot2y*10

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

[-]gpt_4_all2y10

Hey! Out of curiosity, has grokking been observed in any non-algorithmic dataset to date, or just these toy, algorithmic datasets?

[-]Quintin Pope2y20

It can be induced on MNIST by deliberately choosing worse initializations for the model, as Omnigrok demonstrated.

[-]gpt_4_all2y10

Got it, thanks!


If the goal is to better understand how neural networks pick between out of distribution generalizations, then a solution with zero generalization capacity at all (memorization) feels like a degenerate case.
