The two small models are not really significantly different from each other (p=0.04).
This suggests the tasks at hand are too hard for both small models: neither can learn them well, so the noise ends up being larger than the slope.
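To make the "noise larger than the slope" point concrete, here is a minimal sketch with made-up numbers (not the original analysis): if the true gap between the two small models is small relative to per-task evaluation noise, a paired test across tasks will usually come back inconclusive.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tasks = 20
true_gap = 0.005           # hypothetical 0.5-point accuracy gap between the models
noise_sd = 0.03            # hypothetical per-task evaluation noise

base    = rng.uniform(0.30, 0.60, n_tasks)             # tasks vary in difficulty
model_a = base + rng.normal(0, noise_sd, n_tasks)
model_b = base + true_gap + rng.normal(0, noise_sd, n_tasks)

# Paired t-test across tasks: the noise term dominates the tiny true gap,
# so the test usually fails to separate the two models.
t, p = stats.ttest_rel(model_a, model_b)
print(f"t = {t:.2f}, p = {p:.3f}")
```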
As others have noted, we are looking at roughly sigmoidal curves, a different one for each task. Performance will plateau once it approaches the lowest achievable error rate (the Bayes error rate, or the limit of the model paradigm). It is also known that performance often increases sharply with model size at some point (once ...
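A toy fit, with invented numbers, of what that sigmoid-per-task picture looks like: error falls roughly sigmoidally in log(parameter count) and flattens at an irreducible floor (Bayes error, or the limit of the paradigm).

```python
import numpy as np
from scipy.optimize import curve_fit

def task_error(log_n, floor, ceiling, midpoint, slope):
    """Error rate as a sigmoid in log10(parameter count), with an irreducible floor."""
    return floor + (ceiling - floor) / (1.0 + np.exp(slope * (log_n - midpoint)))

log_n = np.log10([1e8, 3e8, 1e9, 3e9, 1e10, 3e10, 1e11])       # model sizes
err   = np.array([0.62, 0.60, 0.55, 0.43, 0.28, 0.21, 0.19])   # invented error rates

params, _ = curve_fit(task_error, log_n, err, p0=[0.15, 0.65, 9.5, 2.0])
print(f"estimated irreducible error for this (fake) task: {params[0]:.2f}")
```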
Yeah, I don't find a linear regression on pairs of models to be all that informative:
- the parameterization as % is misleading, squashing differences
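One way around the %-squashing, sketched below with made-up accuracies (my suggestion, not anything from your analysis): map accuracies to log-odds before comparing or regressing, so gaps near 0% or 100% aren't compressed.

```python
import numpy as np

def logit(p):
    """Map accuracy in (0, 1) to log-odds, clipping to avoid +/- infinity."""
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

acc_small = np.array([0.52, 0.61, 0.70, 0.88, 0.95])    # made-up per-task accuracies
acc_large = np.array([0.55, 0.66, 0.78, 0.94, 0.985])

# In raw %, the last task shows one of the smallest gaps (3.5 points);
# in log-odds it is the largest, because % compresses differences near 100%.
print(acc_large - acc_small)
print(logit(acc_large) - logit(acc_small))
```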
I cannot access your wandb, btw. It seems to be private.
If 4 is not simply a bad default, maybe they were accounting for data with a high inferential distance (foreign languages, non-natural or formal languages), which may require more epochs?
You can get an idea of a pre-trained GPT-3's sample efficiency from the GPT-3 fine-tuning API docs. The epoch parameter defaults to 4, and further up in the documentation they recommend fine-tuning on at least 500 examples for 1-2 epochs in the conditional setting (e.g. chatbots). Although training data is often repetitive (implying maybe 2-10x as many effective epochs?), the model learns from seeing the data only a few times. You can see more evidence of sample efficiency going up with scale in Figure 4.1 of this paper. Sample efficiency also goes up with the amoun...
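For concreteness, overriding that default looked roughly like this with the (now-legacy) fine-tunes endpoint of the openai Python library; the parameter names are from memory, so treat them as an assumption rather than the current API.

```python
import openai

openai.api_key = "sk-..."  # placeholder

# Overriding the default of 4 epochs, per the docs' suggestion of 1-2 epochs
# for conditional generation with at least 500 examples. "file-abc123" is a
# hypothetical ID of an already-uploaded training JSONL file.
job = openai.FineTune.create(
    training_file="file-abc123",
    model="curie",
    n_epochs=2,
)
print(job["id"])
```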
It's perfectly aligned with Microsoft's viral marketing scheme.