How much of the progress in ML is driven by algorithmic improvements, as opposed to scaling compute or scaling the relevant datasets, is relatively poorly understood. In our paper, we make progress on this question by investigating algorithmic progress in image classification on ImageNet, perhaps the most well-known test bed for computer vision.
Using a dataset of a hundred computer vision models, we estimate a model (informed by neural scaling laws) that enables us to analyse the rate and nature of algorithmic advances. We use Shapley values to decompose the drivers of progress in computer vision and estimate the relative importance of algorithms, compute, and data.
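For concreteness, here is a minimal sketch of how a Shapley-value attribution over three factors works. The value function `v(S)` below is a hypothetical stand-in: it should return the performance gain obtained when only the factors in `S` are advanced from their early-period to their late-period levels (in the paper these values come from the fitted scaling model); the numbers here are placeholders chosen purely for illustration.

```python
from itertools import combinations
from math import factorial

FACTORS = ("compute", "data", "algorithms")

# Placeholder gains for each coalition of factors; in practice these would be
# computed from the fitted performance model, not hard-coded.
GAINS = {
    frozenset(): 0.00,
    frozenset({"compute"}): 0.30,
    frozenset({"data"}): 0.05,
    frozenset({"algorithms"}): 0.30,
    frozenset({"compute", "data"}): 0.40,
    frozenset({"compute", "algorithms"}): 0.75,
    frozenset({"data", "algorithms"}): 0.40,
    frozenset({"compute", "data", "algorithms"}): 0.90,
}

def v(subset):
    """Hypothetical performance gain from advancing only the factors in `subset`."""
    return GAINS[frozenset(subset)]

def shapley_value(player, factors, value_fn):
    """Average marginal contribution of `player` across all coalitions of the other factors."""
    n = len(factors)
    others = [f for f in factors if f != player]
    total = 0.0
    for k in range(n):
        for subset in combinations(others, k):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            total += weight * (value_fn(subset + (player,)) - value_fn(subset))
    return total

total_gain = v(FACTORS)
for f in FACTORS:
    print(f"{f}: {shapley_value(f, FACTORS, v) / total_gain:.0%} of the total gain")
```

Each factor's share is its average marginal contribution over all orderings in which the factors could be "switched on", which is what makes the attribution order-independent.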
Our main results include:
- Every nine months, the introduction of better algorithms contributes the equivalent of a doubling of compute budgets. This is much faster than the gains from Moore's law; that said, the estimate is uncertain (our 95% confidence interval spans 4 to 25 months)

- Roughly, progress in image classification has been ~45% due to the scaling of compute, ~45% due to better algorithms, and ~10% due to the scaling of data

- The majority (>75%) of algorithmic progress is compute-augmenting (i.e. it enables researchers to use compute more effectively), while a minority is data-augmenting (enabling more effective use of data); see the stylized formulation below

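One stylized way to read the compute-augmenting vs. data-augmenting distinction (a simplified sketch, not necessarily the exact specification used in the paper) is a scaling law in "effective" compute and data, where algorithmic progress multiplies the physical budgets:

$$\text{Error}(C, D, t) \;\approx\; \left(C\, e^{\gamma_C t}\right)^{-a} + \left(D\, e^{\gamma_D t}\right)^{-b} + E_0$$

Here $\gamma_C$ and $\gamma_D$ are the rates at which algorithmic progress stretches compute and data budgets respectively, and $E_0$ is irreducible error; the claim above is that the compute-augmenting rate $\gamma_C$ accounts for the bulk of measured algorithmic progress.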
In our work, we revisit a question previously investigated by Hernandez and Brown (2020), which had been discussed on LessWrong by Gwern and Rohin Shah. Hernandez and Brown (2020) re-implement 15 popular open-source models and find a 44-fold reduction in the compute required to reach the same level of performance as AlexNet, indicating that algorithmic progress outpaces the original Moore's law rate of improvement in hardware efficiency, with effective compute doubling every 16 months.
A problem with their approach is that it is sensitive to the exact benchmark-and-threshold pair one chooses. Choosing easier-to-achieve thresholds makes algorithmic improvements look less significant, since the scaling of compute alone easily brings early models within reach of such a threshold. By contrast, selecting harder-to-achieve thresholds makes algorithmic improvements explain almost all of the performance gain, because early models might need arbitrary amounts of compute to achieve the performance of today's state-of-the-art models. We show that estimates of the pace of algorithmic progress obtained with this approach can vary by around a factor of ten, depending on whether an easy or a difficult threshold is chosen. [1]
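To see why the threshold matters, consider a toy scaling curve (purely illustrative, not the model from either paper) in which an older algorithm has a higher irreducible error floor than a newer one. The Hernandez-and-Brown-style compute-reduction factor then depends strongly on which error threshold is chosen:

```python
import math

# Toy illustration: error = floor + scale * C**(-exponent), where better
# algorithms lower the floor. The "compute reduction factor" is the ratio of
# compute the old vs. the new algorithm needs to hit a given error threshold.
def compute_to_reach(threshold, floor, scale=5.0, exponent=0.3):
    """Compute needed so that floor + scale * C**(-exponent) <= threshold."""
    if threshold <= floor:
        return math.inf  # this algorithm never reaches the threshold
    return (scale / (threshold - floor)) ** (1.0 / exponent)

old_floor, new_floor = 0.20, 0.05  # hypothetical error floors

for threshold in (0.40, 0.25, 0.10):  # easy -> hard error thresholds
    ratio = compute_to_reach(threshold, old_floor) / compute_to_reach(threshold, new_floor)
    print(f"threshold {threshold:.2f}: compute reduction factor ~ {ratio:.0f}")
```

An easy threshold is within reach of the old algorithm by scaling compute alone, so the apparent algorithmic gain is modest; a hard threshold below the old algorithm's floor is unreachable at any compute budget, so algorithmic improvements appear to explain essentially all of the gain.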
Our work sheds new light on how algorithmic progress occurs, namely that it primarily operates by relaxing compute bottlenecks rather than data bottlenecks. It further offers insight into how observational (rather than experimental) data can be used to advance our understanding of algorithmic progress in ML.
[1] That said, our estimate is consistent with Hernandez and Brown (2020)'s estimate that algorithmic progress doubles the amount of effective compute every 16 months, as our 95% confidence interval ranges from 4 to 25 months.
I think this question is interesting but difficult to answer based on the data we have, because the dataset is poor when it comes to the unusual examples that would really let us answer it with confidence. Our model assumes that compute and algorithms are substitutes, but that's not based on anything we infer from the data.
Our model is certainly not exactly correct, in the sense that there should be some complementarity between compute and algorithms, but the complementarity probably only becomes noticeable for extreme ratios between the two contributions. One way to think about this is that we can approximate a CES production function
$$Y(C,A) = \left(\alpha C^{\rho} + (1-\alpha) A^{\rho}\right)^{1/\rho}$$

in training compute $C$ and algorithmic efficiency $A$ when $C/A \approx 1$ by writing it as

$$Y(C,A) = A\, Y\!\left(e^{\log(C/A)},\, 1\right) = A\left(\alpha e^{\rho \log(C/A)} + (1-\alpha)\right)^{1/\rho} \approx A\left(1 + \alpha\rho \log(C/A)\right)^{1/\rho} \approx C^{\alpha} A^{1-\alpha},$$
which means the first-order behavior of the function around $C/A \approx 1$ doesn't depend on $\rho$, the parameter that controls complementarity versus substitutability (the approximations used are $e^{x} \approx 1 + x$ and $(1 + \rho y)^{1/\rho} \approx e^{y}$, valid to first order in $\log(C/A)$). Since people empirically seem to train models in the regime where $C/A$ is close to 1, this makes it difficult to identify $\rho$ from the data we have, and approximating by a Cobb-Douglas (which is what we do) does about as well as anything else. For this reason, I would caution against using our model to predict the performance of models that have an unusual combination of dataset size, training compute, and algorithmic efficiency.
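A quick numerical check of this point, with toy parameter values rather than fitted ones: near $C/A \approx 1$ the CES function stays close to its Cobb-Douglas approximation across a wide range of $\rho$, while the two diverge at extreme ratios.

```python
# Toy numerical check (parameter values chosen for illustration, not fitted):
# near C/A = 1 the CES function is close to the Cobb-Douglas approximation for
# a wide range of rho; the two diverge only at extreme C/A ratios.
def ces(C, A, alpha, rho):
    return (alpha * C**rho + (1 - alpha) * A**rho) ** (1 / rho)

def cobb_douglas(C, A, alpha):
    return C**alpha * A**(1 - alpha)

alpha = 0.5
for ratio in (1.1, 2.0, 100.0):        # C/A close to 1 vs. extreme
    for rho in (-2.0, -0.5, 0.5):      # strong complements ... substitutes
        C, A = ratio, 1.0
        gap = abs(ces(C, A, alpha, rho) - cobb_douglas(C, A, alpha)) / cobb_douglas(C, A, alpha)
        print(f"C/A = {ratio:>5}, rho = {rho:>4}: relative gap {gap:.1%}")
```

With these toy values the relative gap is well under 1% at $C/A = 1.1$ for every $\rho$ tried, but grows to tens or hundreds of percent at $C/A = 100$.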
In general, a more diverse dataset containing models trained with unusual values of compute and data for the year in which they were trained would help our analysis substantially. There are some problems with doing this experiment ourselves: for instance, techniques used to train larger models often perform worse than older methods when scaled down. So there isn't much incentive to make algorithms run really well on small compute and data budgets, and that's going to bias us towards thinking we're more bottlenecked by compute and data than we actually are.