Aren’t we leaving performance on the table? Yes! We are. But I think that’s fine! There’s always a tradeoff here. E.g. quantization. It’s strictly worse to use lower precision! But we do it to optimize the TCO of the system.
But we can use $INSERT_TECHNIQUE to make models cheaper! Yes, but the same reasoning should scale across all of these (distillation, quantization, etc.). So we should be using all of these techniques to make our models easier to serve, and also training them longer.
If you're training an LLM with the goal of deploying it to users, you should prefer training a smaller model well into the diminishing-returns part of the loss curve.
To reiterate my points from Twitter: Timbers is answering a question that is irrelevant and not the one he claims to be answering (and I am annoyed that some throwaway comments at the end are still all the acknowledgement he gives to the fact that the whole post is irrelevant). No one cares about the size of the raw trained model if they care about inference; they care about the size of the best model they can obtain which fits in their inference envelope, which may be obtainable from a much larger raw model, since those models will reach much lower loss (by definition) and thus can be a big win after some modest damage from the $INSERT_TECHNIQUE. (If you can train the compute-optimal model to 10% lower loss than the overtrained small model, and then lose 1-2% to quantization, well, that's still a big win of 8% for free by ignoring his advice.)
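To make that envelope arithmetic concrete, here is a minimal sketch: the 10% / ~2% figures are the illustrative numbers from the comment above, while the model sizes and byte widths are my own hypothetical choices, not anything from the post being discussed.

```python
# Sketch of the "inference envelope" point: a larger compute-optimal model,
# quantized, can fit the same serving budget as a small overtrained model.
# The 10% / ~2% figures are the comment's illustrative numbers; the model
# sizes and byte widths below are my own hypothetical example.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(params_billion: float, dtype: str) -> float:
    """Rough weight-only memory footprint in GB (ignores KV cache, activations)."""
    return params_billion * BYTES_PER_PARAM[dtype]

# Two ways to fill roughly the same ~35 GB serving envelope:
small_overtrained = weight_gb(17, "fp16")  # small model trained far past compute-optimal
large_quantized   = weight_gb(70, "int4")  # larger compute-optimal model, quantized to serve
print(f"17B fp16: {small_overtrained:.0f} GB | 70B int4: {large_quantized:.0f} GB")

# Loss bookkeeping from the comment: the larger model reaches 10% lower loss,
# then gives back ~2% to quantization damage -- still ~8% lower in the end.
loss_small           = 1.00          # normalize the overtrained small model's loss to 1
loss_large_quantized = 0.90 * 1.02   # 10% lower, then +2% from quantization
print(f"relative loss in the same envelope: {loss_small:.2f} vs {loss_large_quantized:.2f}")
```

The point is that the quantized compute-optimal model occupies roughly the same serving footprint as the overtrained small model while keeping most of its loss advantage.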
Timbers continues to ignore the many scaling papers on sparsification, distillation, and quantization, which are not hard to find and have been much discussed among those who care about TCO. So his conclusion is at best unproven: it is not obvious that you should prefer training a small model far beyond compute-optimal instead of training a large compute-optimal model and then applying (often extremely easy) optimizations like quantization. If he were doing calculations on that, and even going beyond that to consider questions about continual learning per jcannell, or how well the model tolerates downstream finetuning or RLHF training (larger = better, presumably, so that has to be considered too), that would be interesting. But he's not.
This point is semi-correct now, but mostly incorrect for future systems. A larger model learns faster per data point, which is increasingly important as we move towards AGI. If you want a system which has mostly memorized the internet, then sure, overtraining a small model now makes sense. If you want a system that can rapidly and continuously transfer-learn from minimal amounts of new data to compete with smart humans, then you probably want something far larger than even the naive[1] Chinchilla optimum.
Naive in the sense that it only considers total compute cost of training, without considering future downstream data efficiency. ↩︎
Fwiw, in the conversations I’m in (in the alignment scene in the Bay Area), this point is widely understood.
This is a big reason why GPT-4 is likely not that big, but instead trained on much more data :)
Finbarr Timbers makes a point, obvious in retrospect, which many people, including people forecasting AI timelines, seem to miss: since training cost is amortized over inference, the optimal training regime depends on the expected amount of inference. The scaling laws from both OpenAI and DeepMind assume zero (or negligible) inference, which is obviously incorrect. Any forecasting that uses these scaling laws is similarly suspect and should be revised.
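A minimal sketch of that amortization point, assuming the parametric loss fit and constants reported in the Chinchilla paper (Hoffmann et al. 2022) and the usual ~6ND training / ~2N-per-token inference FLOP approximations; the loss target and inference volumes are arbitrary illustrative choices of mine, not numbers from the thread.

```python
# Sketch: how expected inference volume shifts the cheapest model size.
# Assumptions (mine, not from the thread): the Chinchilla parametric fit
#   L(N, D) ~= E + A / N**alpha + B / D**beta
# with the constants reported by Hoffmann et al. (2022), plus the usual
# ~6*N*D training FLOPs and ~2*N FLOPs per inference token.
import numpy as np

E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def tokens_needed(N, target_loss):
    """Training tokens D needed for an N-parameter model to reach target_loss."""
    gap = (target_loss - E) - A / N**alpha
    if gap <= 0:
        return np.inf  # this model size can't reach the target under the fit
    return (B / gap) ** (1 / beta)

def lifetime_flops(N, target_loss, inference_tokens):
    """Training plus deployment compute for a model trained to target_loss."""
    D = tokens_needed(N, target_loss)
    return 6 * N * D + 2 * N * inference_tokens

target = 2.0                          # arbitrary illustrative loss target
sizes = np.logspace(9.5, 11.5, 400)   # ~3B to ~300B parameters

for inference_tokens in [0.0, 1e12, 1e13]:
    costs = [lifetime_flops(N, target, inference_tokens) for N in sizes]
    best = sizes[int(np.argmin(costs))]
    print(f"{inference_tokens:.0e} inference tokens -> cheapest model ~ {best/1e9:.0f}B params")
```

With zero inference tokens this just recovers the compute-optimal size for the target loss; as the expected inference volume grows, the minimizer shifts toward smaller, longer-trained models.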