All of batterseapower's Comments + Replies

Hope to see you all there! We normally get 12-15 attendees and always spirited discussion :-)

We have 8 RSVPs right now. More are welcome :-)

Last time we had about 12 people - hope we can get similar numbers for this one :-) - Max

Isn't GPT3 already almost at the theoretical limit of the scaling law from the paper? This is what is argued by nostalgebraist in his blog and colab notebook. You also get this result if you just compare the 3.14E23 FLOP (i.e. 3.6k PFLOPS-days) cost of training GPT3 from the lambdalabs estimate to the ~10k PFLOPS-days limit from the paper.

(Of course, this doesn't imply that the post is wrong. I'm sure it's possible to train a radically larger GPT right now. It's just that the relevant bound is the availability of data, not of compu... (read more)

6gwern
The likelihood loss intersection point is very vague, as they point out, as it only weakly suggests, for that specific architecture/training method/dataset, a crossover to a slower-scaling curve requiring increasing data more anywhere between 10^4 and 10^6 or so. As GPT-3 hits 10^3 and is still dead on the scaling curve, it seems that any crossover will happen much higher than lower. (I suspect part of what's going on there is the doubled context window: as Nostalgebraist notes, their experiments with 1024 ctx strongly suggests that the more context window you have, the more you can learn profitably, so doubling to 2048 ctx probably pushed off the crossover quite a bit. Obviously, they have a long way to go there.) So the crossover itself, much less negative profitability of scaling, may be outside the current 100-1000x being mooted. (I'd also note that I don't see why they are so baffled at the suggestion that a model could overfit in a single epoch. Have they looked at the Internet lately? It is not remotely a clean, stationary, minimal, or i.i.d. dataset, even after cleaning & deduplication.) I also think that given everything we've learned about prompt programming and the large increases in benchmarks like arithmetic or WiC, making arguments from pseudo-lack-of-scaling in the paper's benchmarks is somewhere between foolish and misleading, at least until we have an equivalent set of finetuning benchmarks which should cut through the problem of good prompting (however bad the default prompt is, biasing performance downwards, some finetuning should quickly fix that regardless of meta-learning) and show what GPT-3 can really do.
8Andy Jones
It's indeed strange no-one else has picked up on this, which makes me feel I'm misunderstanding something. The breakdown suggested in the scaling law does imply that this specific architecture doesn't have much further to go. Whether the limitation is in something as fundamental as 'the information content of language itself', or if it's a more-easily bypassed 'the information content of 1024-token strings', is unclear. My instinct is for the latter, though again by the way no-one else has mentioned it - even the paper authors - I get the uncomfortable feeling I'm misunderstanding something. That said, being able to write that quote a few days ago and since have no-one pull me up on it has increased my confidence that it's a viable interpretation.