This recent post by OpenAI tries to shed some light on this question: https://openai.com/blog/ai-and-efficiency/
Until 2017, the best-performing language models were LSTMs, which have been around since 1997. However, LSTMs in their late era of dominance differed from early LSTMs in that they experimented with attention and a few other mechanisms, though it's unclear to me how much this boosted their performance.
The paper that unseated LSTMs for language models reported an additional 2.0 BLEU (on a scale from 0 to 100) gained by switching to the new model, though this is likely an underestimate of the gain from switching to Transformers, given that the old state-of-the-art models had been tweaked very carefully.
My guess is that the 2000 model using 2020 compute would easily beat the 2020 model using 2000 compute, though I would love to see someone do a deeper dive into this question.
I think they would both suck honestly. Many things have changed in 20 years. Datasets, metrics, and architectures have all changed significantly.
I think the relationship between algorithms and compute looks something like this.
For instance, look at language models. In 2000, LSTMs had been introduced only 3 years prior; people mainly used n-gram Markov models for language modelling. N-grams don't really scale, and training a transformer with as many resources as you need to train an n-gram model would probably not work at all. In fact, I don't think you even really "train" an n-gram model.
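To make that concrete, here's a minimal sketch of what "training" a count-based bigram model amounts to (the toy corpus and add-alpha smoothing are just illustrative choices): it's counting, not optimization, so there's no obvious way to pour more compute into it.

```python
from collections import Counter

# "Training" a bigram model is just counting co-occurrences in the corpus;
# there is no gradient descent, and no obvious knob for spending more compute.
corpus = "the cat sat on the mat the cat ate".split()  # toy corpus, purely illustrative

bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def bigram_prob(prev_word, word, alpha=1.0):
    """P(word | prev_word) with add-alpha smoothing (alpha is an arbitrary choice here)."""
    vocab_size = len(unigram_counts)
    return (bigram_counts[(prev_word, word)] + alpha) / (unigram_counts[prev_word] + alpha * vocab_size)

print(bigram_prob("the", "cat"))  # seen pair: relatively high probability
print(bigram_prob("the", "sat"))  # unseen pair: falls back to smoothing mass
```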
The same goes for computer vision. SVMs using the kernel trick have terrible scaling properties (O(N^3) in the number of datapoints), but until compute increased, they worked better than neural nets. See the last slide here.
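As a rough back-of-envelope illustration of that crossover (the constants below are made up; only the growth rates matter): a method with small overhead but cubic cost in the number of datapoints loses to a method with large overhead but roughly linear cost once the dataset gets big enough.

```python
# Back-of-envelope: made-up constants, only the growth rates matter.
# Method A: small overhead, cubic in dataset size N (kernel-SVM style).
# Method B: large overhead, roughly linear in N (SGD-trained net style).
def cost_cubic(n, c=1e-6):
    return c * n**3

def cost_linear(n, c=1e2):
    return c * n

for n in [10**3, 10**4, 10**5, 10**6]:
    a, b = cost_cubic(n), cost_linear(n)
    print(f"N={n:>8}: cubic={a:.2e}  linear={b:.2e}  cheaper: {'cubic' if a < b else 'linear'}")

# At small N the cubic method is cheaper; past some crossover the linear one wins,
# which is the sense in which "the best algorithm changes as compute/data grow".
```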
You often hear the complaint that the algorithms we use were invented 50 years ago, and many NN techniques fall in and out of fashion.
I think this is all because of the interactions between algorithms and compute/data. The best algorithm for the job changes as a function of compute, so as compute grows, new methods that previously weren't competitive suddenly start to outperform older methods.
I think this is a general trend in much of CS. Look at matrix multiplication. The naive algorithm has a small constant overhead, but N^3 scaling. You can use group theory to come up with algorithms that have better scaling, but have a larger overhead. As compute grows, the best matmul algorithm changes.
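As a toy illustration of that tradeoff, here is Strassen's divide-and-conquer algorithm (a simpler example than the group-theoretic constructions, but the same idea: a better exponent bought with more overhead). Whether it beats the schoolbook n^3 method depends on how big n is, i.e. on how much compute you're throwing at the problem.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's algorithm: ~n^2.81 scalar multiplications, but more additions
    and bookkeeping than the schoolbook n^3 method, so it only pays off for
    large n (the cutoff below which we fall back to plain matmul is a tuning knob)."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B  # schoolbook-style multiply: small overhead, n^3 scaling
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven recursive products instead of eight -- the source of the better exponent.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty((n, n))
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

# Works for sizes that are powers of two; the extra additions are the "overhead"
# you pay for the better asymptotic exponent.
A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)
```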
Weak evidence for compute: apparently the original TD-Gammon code from 1992 performed quite well when run with a modern amount of compute (source).
My view: although I think it is a neat thought experiment, my intuition is that separating compute from algorithms is a false dichotomy. A domain that needs multiple ingredients to evolve optimally is path-dependent on all of them, and narrowing that down to an "either/or" question usually leads to deadlocks that can feel paradoxical (not all deadlocks have to remain paradoxical; pre-emption/non-blocking synchronization is a way out), like the one above.
My answer: not much difference, because a twenty-year timescale doesn't seem very significant to me, and because there has been neither a fundamental revolution in the semiconductor/compute-manufacturing industry that has benefitted us in ways other than cost, nor any revolutionary algorithm found that couldn't be run on old hardware scaled to today's standards. (But in complex systems, which ML is, interactions matter more than anything else, so I might be way off here.)
The algorithms that are used nowadays are basically the same as the algorithms that were known then, just with a bunch of tricks like dropout.
Suppose that you have 100 ideas that seem like they might work. You test them, and one of them does work. You then find a mathematical reason why it works. Is this insight or compute?
Even if most of the improvement is in compute, there could be much better algorithms that we just aren't finding. I would be unsurprised if there exists an algorithm that would be really scary on vacuum tubes.
Which of the two would perform better?
Will the experiment be run?
What is the experiment? What is the question?
I take a vision or language model which was cutting edge in 2000, and run it with a similar amount of compute/data to what's typically used today.
Guess A. Is the difference (between 2000 and today) modern compute?
I take a modern vision or language model, calculate how much money it costs to train, estimate the amount of compute I could have bought for that much money in 2000, then train it with that much compute.
Guess B. Is the difference (between 2000 and today) modern compute costs?
But the experiment doesn't seem to be about A or B. More likely it's about both:
Which is more important (to modern ML performance (in what domain?*)):
(Minor technical note: if you're comparing results from the past to results today, then while it might be impossible to go back in time and test these things for a control group, 'things weren't as good back then' shouldn't be taken for granted; it should be tested as well, for comparison. (Replicate earlier results.**))
This does admit other hypotheses.
For example, 'the difference between 2020 and 2000 is that training took a long time, and if people set things up wrong, they didn't get feedback for a long time. Perhaps modern compute enables researchers to set ML programs up correctly despite the code not being written right the first time.')
A and B can be rephrased as:
*This might be intended as a more general question, but the post asks about:
vision or language model[s.]
**The most extreme version would be getting/recreating old machines and then re-running old ML stuff on them.
The underlying question I want to answer is: ML performance is limited by both available algorithms and available compute. Both of those have (presumably) improved over time. Relatively speaking, how taut are those two constraints? Has progress come primarily from better algorithms, or from more/cheaper compute?
Two scenarios:
Scenario 1: an algorithm/model which was cutting-edge in 2000, scaled up and run with a modern amount of compute/data.
Scenario 2: a modern algorithm/model, trained with only as much compute as its current training cost could have bought in 2000.
In both cases, assume that the number of parameters is scaled to available compute as needed (if possible), and we generally adjust the code to reflect scalability requirements (while keeping the algorithm itself the same).
Which of the two would perform better?
CLARIFICATION: my goal here is to compare the relative importance of insights vs compute. "More compute is actually really important" is itself an insight, which is why the modern-algorithm scenario talks about compute cost, rather than amount of compute actually used in 2000. Likewise, for the 2000-algorithm scenario, it's important that the model only leverage insights which were already known in 2000.
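To make the compute-cost framing concrete, here's a rough back-of-envelope sketch of the modern-algorithm scenario; every number in it is a loose, order-of-magnitude assumption for illustration, not a sourced figure.

```python
# Back-of-envelope for the "modern algorithm, 2000 compute budget" scenario.
# Every number below is a loose, illustrative assumption, not a sourced figure.

modern_training_cost_usd = 1e6     # assumed dollar cost of a modern training run
modern_flops_per_dollar = 1e17     # assumed FLOPs purchasable per dollar today (amortized hardware)
price_improvement_since_2000 = 1e4 # assumed: FLOPs-per-dollar roughly 10,000x worse in 2000

flops_per_dollar_2000 = modern_flops_per_dollar / price_improvement_since_2000
modern_budget_flops = modern_training_cost_usd * modern_flops_per_dollar
budget_in_2000_flops = modern_training_cost_usd * flops_per_dollar_2000

print(f"modern run:           {modern_budget_flops:.1e} FLOPs")
print(f"same dollars in 2000: {budget_in_2000_flops:.1e} FLOPs")
# Under these assumptions, the same dollars buy ~4 orders of magnitude less compute
# in 2000, which is why the two scenarios can come apart so sharply.
```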