Two scenarios:
- I take a vision or language model that was cutting-edge in 2000 and run it with an amount of compute/data similar to what's typically used today.
- I take a modern vision or language model, calculate how much money it costs to train, estimate the amount of compute I could have bought for that much money in 2000, and then train it with that much compute.
In both cases, assume that the number of parameters is scaled to the available compute as needed (if possible), and that the code is generally adjusted to meet scalability requirements (while keeping the algorithm itself the same).
Which of the two would perform better?
CLARIFICATION: my goal here is to compare the relative importance of insights vs compute. "More compute is actually really important" is itself an insight, which is why the modern-algorithm scenario talks about compute cost rather than the amount of compute actually used in 2000. Likewise, for the 2000-algorithm scenario, it's important that the model only leverage insights that were already known in 2000.
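To make the compute-cost scenario concrete, here is a rough back-of-envelope sketch. All the numbers (training cost, price-performance, doubling time) are placeholder assumptions for illustration, not measurements; the only thing the ratio depends on is the assumed doubling time for compute-per-dollar.

```python
# Back-of-envelope sketch for scenario 2: how much compute the training budget
# of a modern model would have bought in 2000. All numbers below are
# illustrative placeholder assumptions, not measurements.

training_cost_usd = 10_000_000      # assumed dollar cost to train a modern model
compute_per_dollar_2020 = 1.0       # normalized price-performance in 2020 (arbitrary units)
doubling_time_years = 2.0           # assumed doubling time for compute per dollar
years_elapsed = 20                  # 2000 -> 2020

# Extrapolate price-performance backwards under the assumed doubling time.
compute_per_dollar_2000 = compute_per_dollar_2020 / 2 ** (years_elapsed / doubling_time_years)

compute_2020 = training_cost_usd * compute_per_dollar_2020
compute_2000 = training_cost_usd * compute_per_dollar_2000

print(f"Compute bought in 2020 (normalized units): {compute_2020:.3e}")
print(f"Compute bought in 2000 (normalized units): {compute_2000:.3e}")
print(f"Ratio: {compute_2020 / compute_2000:.0f}x")  # ~1000x under these assumptions
```

Under these assumptions, the 2020 model in scenario 2 gets roughly three orders of magnitude less compute than its modern counterpart, which is the gap the question is probing.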
Until 2017, the best-performing language models were LSTMs, which have been around since 1997. However, LSTMs in their late era of dominance were distinguished from early LSTMs by the addition of attention and a few other mechanisms, though it's unclear to me how much these boosted their performance.
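For reference, the kind of model already available in 2000 looks roughly like the word-level LSTM language model below. This is a minimal modern sketch (written in PyTorch, with arbitrary placeholder hyperparameters), not a reproduction of any particular 2000-era system, and it does not include the later attention variants mentioned above.

```python
# Minimal word-level LSTM language model, roughly the algorithm available in 2000
# (the LSTM itself dates to 1997). Illustrative sketch only; hyperparameters are
# arbitrary placeholders.
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 256,
                 hidden_dim: int = 512, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        # tokens: (batch, seq_len) integer word ids
        x = self.embed(tokens)
        out, hidden = self.lstm(x, hidden)
        # logits over the next word at each position
        return self.proj(out), hidden

# Usage: next-word logits for a dummy batch of token ids.
model = LSTMLanguageModel(vocab_size=10_000)
dummy = torch.randint(0, 10_000, (4, 32))
logits, _ = model(dummy)
print(logits.shape)  # torch.Size([4, 32, 10000])
```

Scaling this architecture to a modern compute budget would mostly mean widening the layers, adding more of them, and training on far more data, which is the spirit of the "2000 algorithm, 2020 compute" scenario.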
The paper that unseated LSTMs ("Attention Is All You Need", 2017) reported a gain of about 2 BLEU points (BLEU ranges from 0 to 100) on machine translation from switching to the Transformer, though this is likely an underestimate of the gain from switching, given that the old state-of-the-art models had been tweaked very carefully.
My guess is that the 2000 model using 2020 compute would easily beat the 2020 model using 2000 compute, though I would love to see someone do a deeper dive into this question.