Two scenarios:
- I take a vision or language model that was cutting-edge in 2000, and train it with an amount of compute and data similar to what's typically used today.
- I take a modern vision or language model, calculate how much money it costs to train, estimate how much compute that money could have bought in 2000, and then train the model with that much compute.
In both cases, assume that the number of parameters is scaled to the available compute as needed (where possible), and that the code is adjusted to meet scalability requirements while keeping the algorithm itself the same.
Which of the two would perform better?
CLARIFICATION: my goal here is to compare the relative importance of insights vs. compute. "More compute is actually really important" is itself an insight, which is why the modern-algorithm scenario is framed in terms of compute cost rather than the amount of compute actually used in 2000. Likewise, in the 2000-algorithm scenario it's important that the model only leverage insights that were already known in 2000.
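To make the second scenario concrete, here is a rough back-of-envelope sketch. Every number in it (the training budget, today's price-performance, the doubling time, the years elapsed) is an assumption chosen purely for illustration, not a measurement; the point is only the shape of the calculation.

```python
# Back-of-envelope: how much compute would a modern training budget have
# bought in 2000? All numbers below are assumptions for illustration only.

ASSUMED_TRAINING_COST_USD = 10_000_000   # hypothetical modern training budget
ASSUMED_FLOPS_PER_DOLLAR_TODAY = 1e17    # hypothetical price-performance today
ASSUMED_DOUBLING_TIME_YEARS = 1.5        # assumed price-performance doubling time
YEARS_ELAPSED = 24                       # 2000 -> today; adjust as appropriate

# Price-performance in 2000 under the assumed exponential trend.
flops_per_dollar_2000 = ASSUMED_FLOPS_PER_DOLLAR_TODAY / 2 ** (
    YEARS_ELAPSED / ASSUMED_DOUBLING_TIME_YEARS
)

compute_today = ASSUMED_TRAINING_COST_USD * ASSUMED_FLOPS_PER_DOLLAR_TODAY
compute_2000 = ASSUMED_TRAINING_COST_USD * flops_per_dollar_2000

print(f"Compute bought today:   {compute_today:.2e} FLOPs")
print(f"Compute bought in 2000: {compute_2000:.2e} FLOPs")
print(f"Ratio: {compute_today / compute_2000:.0f}x")
```

Under these made-up assumptions the same budget buys tens of thousands of times more compute today, which is the gap the second scenario is meant to hold fixed.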
The algorithms used nowadays are basically the same as the ones that were known back then, just with a bunch of tricks like dropout layered on top.
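For concreteness, dropout is the kind of trick being referred to: randomly zeroing activations during training and rescaling the survivors, as a regularizer. A minimal sketch (inverted dropout in NumPy, not any particular library's implementation):

```python
import numpy as np

def dropout(activations: np.ndarray, p: float = 0.5, training: bool = True) -> np.ndarray:
    """Inverted dropout: randomly zero activations during training and
    rescale the survivors so the expected value is unchanged at test time."""
    if not training or p == 0.0:
        return activations
    mask = np.random.rand(*activations.shape) >= p
    return activations * mask / (1.0 - p)

# Example: drop roughly half the units of a toy activation vector.
x = np.ones(8)
print(dropout(x, p=0.5))
```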
Suppose that you have 100 ideas that seem like they might work. You test them, and one of them does work. You then find a mathematical reason why it works. Is this insight or compute?
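To illustrate the generate-and-test loop in this thought experiment, here is a toy simulation; the "ideas", their effect sizes, and the noise model are all made up for illustration. Compute does the filtering, and the mathematical story about the winner comes only afterward.

```python
import random

random.seed(0)

# Hypothetical setup: 100 candidate "ideas" (say, random tweaks to a model),
# each with an unknown true effect on validation accuracy.
ideas = [{"id": i, "true_effect": random.gauss(0.0, 0.01)} for i in range(100)]

def noisy_eval(idea):
    # One training run: true effect plus evaluation noise.
    return idea["true_effect"] + random.gauss(0.0, 0.005)

# Compute does the selection; the explanation of why the winner works comes later.
results = [(noisy_eval(idea), idea["id"]) for idea in ideas]
best_score, best_id = max(results)
print(f"Idea {best_id} 'works': measured +{best_score:.3f} accuracy")
```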
Even if most of the improvement comes from compute, there could be much better algorithms that we just aren't finding. I would be unsurprised if there exists an algorithm that would be really scary even on vacuum tubes.