Since you wrote about OpenAI's "AI and compute", you should take a look at https://www.lesswrong.com/posts/wfpdejMWog4vEDLDg/ai-and-compute-trend-isn-t-predictive-of-what-is-happeningfor.
You should probably also be tracking the kind of parameter. I see you have Switch and GShard in there, but, as you can see from how visibly they are outliers, MoEs (and embeddings) use much weaker 'parameters', as it were, than dense models like GPT-3 or Turing-NLG. Plotting by FLOPS would help correct for this - perhaps we need graphs like training-FLOPS per parameter? That would also help correct for comparisons across methods, like to older architectures such as SVMs. (Unfortunately, this still obscures the fact that the key thing about Transformers is better scaling laws than RNNs or n-grams etc., where the high FLOPS-per-parameter translates into better curves...)
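For concreteness, here's the kind of back-of-the-envelope calculation I have in mind (a rough sketch only: the model figures below are made-up placeholders, and training FLOPs are estimated with the usual 6 × active-parameters × tokens approximation):

```python
# Back-of-the-envelope sketch: all numbers are illustrative placeholders,
# not the real GPT-3 / Switch training figures. Training FLOPs are
# approximated as 6 * (parameters active per token) * (training tokens).

models = {
    # name: (total_params, active_params_per_token, training_tokens)
    "GPT-3":  (175e9, 175e9, 300e9),   # dense: every parameter participates
    "Switch": (1.6e12, 20e9, 500e9),   # MoE: only the routed experts run
}

for name, (total, active, tokens) in models.items():
    train_flops = 6 * active * tokens       # forward + backward approximation
    flops_per_param = train_flops / total   # compute each stored parameter sees
    print(f"{name:>6}: {train_flops:.2e} training FLOPs, "
          f"{flops_per_param:.2e} training-FLOPs per parameter")
```

Plotted that way, the MoE and embedding outliers would get pulled back toward the dense models instead of dominating a raw parameter-count chart.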
Thank you for the feedback; I think what you say makes sense.
I'd be interested in pinning down exactly in what sense Switch parameters are "weaker". Is it because of lower numerical precision? Model sparsity (is Switch sparse in its parameters, or just sparsely activated)?
What typology of parameters do you think would make sense / be useful to include?
It's not the numerical precision but the model architecture being sparse, such that you only activate a few experts at runtime and only a small fraction of the model runs for each input. It may be 1.3t parameters or whatever, but then at runtime only, I dunno, 20b parameters actually compute anything. This cheapness of forward passes/inferencing is the big selling point of MoE for training and deployment: you don't actually ever run 1.3t parameters. But it's hard for parameters which don't run to contribute anything to the final result, whereas in GPT-3, pretty much all of those 175b parameters can participate in each input. It's much clearer if you compare them in terms of FLOPS at runtime rather than static parameter counts: GShard/Switch is just doing a lot less.
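To put numbers on the active-vs-total distinction (a toy sketch with made-up layer sizes, not the actual GShard/Switch configuration):

```python
# Toy illustration, not the real GShard/Switch code: with top-1 routing only
# one expert's FFN weights run per token, so the "active" parameter count
# (and hence the FLOPs per token) is a small fraction of the headline total.

d_model, d_ff, n_experts = 1024, 4096, 64
expert_params = 2 * d_model * d_ff      # one FFN expert: two weight matrices
total_params  = n_experts * expert_params
active_params = 1 * expert_params       # top-1 routing: a single expert per token

# ~2 FLOPs per weight per token for a matrix multiply
flops_if_dense      = 2 * total_params   # if every expert ran, like a dense layer
flops_moe_per_token = 2 * active_params  # what actually runs per input

print(f"total expert parameters: {total_params:,}")
print(f"active per token:        {active_params:,} "
      f"({active_params / total_params:.1%} of the total)")
print(f"FLOPs/token if dense:    {flops_if_dense:,}")
print(f"FLOPs/token with top-1:  {flops_moe_per_token:,}")
```

With 64 experts and top-1 routing, only ~1.6% of the expert parameters touch any given token, which is the whole point of the architecture.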
(I also think that the scaling curves and comparisons hint at Switch learning qualitatively worse things, and the modularity encouraging more redundancy and memorization-heavy approaches, which impedes any deeper abstractions or meta-learning-like capabilities that a deep dense model might learn. But this point is much more speculative, and not necessarily something that, say, translation researchers would care too much about.)
This point about runtime also holds for those chonky embeddings people sometimes bring up as examples of 'models with billions of parameters': sure, you may have a text or category embedding which has billions of 'parameters', but for any specific input, only a handful of those parameters actually do anything.
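Same toy arithmetic for the embedding case (hypothetical sizes, just to illustrate the ratio):

```python
# Toy arithmetic, hypothetical sizes: a giant embedding table counts billions
# of "parameters", but a single input only ever touches the rows it looks up.

vocab_size, dim = 1_000_000_000, 64     # a made-up 64-billion-parameter embedding
table_params = vocab_size * dim

lookups_in_input = 3                    # one input hitting three categories/tokens
params_touched = lookups_in_input * dim

print(f"table parameters:      {table_params:,}")
print(f"touched by this input: {params_touched:,} "
      f"({params_touched / table_params:.2e} of the total)")
```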
Pablo Villalobos and I have been working to compile a rough dataset of parameter counts for some notable ML systems through history.
This is hardly the most important metric about these systems (other interesting metrics we would like to understand better are training and inference compute, and dataset size), but it is nonetheless an important one, and one that is particularly easy to estimate.
So far we have compiled what is (to our knowledge) the biggest dataset of parameter counts to date, with over 100 entries.
But we could use some help to advance the project. If you want to help, reach out at
jaimesevillamolina at gmail dot com
or leave a comment in the spreadsheet.

Thank you to Girish Sastry and Max Daniel for help and discussion so far!