Two scenarios:
- I take a vision or language model which was cutting edge in 2000, and run it with a similar amount of compute/data to what's typically used today.
- I take a modern vision or language model, calculate how much money it costs to train, estimate the amount of compute I could have bought for that much money in 2000, then train it with that much compute.
In both cases, assume that the number of parameters is scaled to the available compute as needed (if possible), and that we generally adjust the code to reflect scalability requirements (while keeping the algorithm itself the same).
Which of the two would perform better?
CLARIFICATION: my goal here is to compare the relative importance of insights vs. compute. "More compute is actually really important" is itself an insight, which is why the modern-algorithm scenario talks about compute cost rather than the amount of compute actually used in 2000. Likewise, for the 2000-algorithm scenario, it's important that the model only leverage insights which were already known in 2000.
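To make the cost-equivalence framing of the second scenario concrete, here is a minimal back-of-envelope sketch in Python. All of the numbers (the budget, the FLOPs-per-dollar figure, the yearly price-performance improvement, the twenty-year gap) are placeholders assumed purely for illustration, not sourced estimates; the point is only the shape of the calculation.

```python
# Rough sketch for scenario 2: given a modern training budget, how much raw
# compute could the same money have bought in 2000?
# Every number below is an illustrative assumption, not a measured figure.

def compute_budget_in_2000(budget_usd: float,
                           flops_per_dollar_today: float,
                           annual_improvement: float,
                           years_elapsed: int) -> float:
    """Total FLOPs purchasable in 2000 for `budget_usd`, assuming
    price-performance improved by `annual_improvement`x per year since then."""
    flops_per_dollar_2000 = flops_per_dollar_today / (annual_improvement ** years_elapsed)
    return budget_usd * flops_per_dollar_2000


if __name__ == "__main__":
    budget = 10e6                    # assumed training budget in dollars
    flops_per_dollar_today = 1e17    # assumed present-day price-performance
    improvement = 1.5                # assumed yearly FLOPs-per-dollar multiplier
    years = 20                       # 2000 to "today" in the thought experiment

    flops_today = budget * flops_per_dollar_today
    flops_2000 = compute_budget_in_2000(budget, flops_per_dollar_today,
                                        improvement, years)
    print(f"Compute today:   {flops_today:.2e} FLOPs")
    print(f"Compute in 2000: {flops_2000:.2e} FLOPs "
          f"(~{flops_today / flops_2000:.0f}x less under these assumptions)")
```

Under these made-up numbers the 2000-era budget buys a few thousand times less raw compute; that multiplier is exactly the gap the thought experiment asks the modern algorithm to absorb.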
My view: Although it is a neat thought experiment, my intuition is that separating compute from algorithm is a false dichotomy. Narrowing the path dependence of a domain that needs multiple ingredients in order to evolve optimally down to an "either/or" choice usually leads to deadlocks that can seem paradoxical, like the one above (not all deadlocks have to remain paradoxical; pre-emption and non-blocking synchronization offer a way out).
My answer: Not much difference, because a twenty-year timescale doesn't seem very significant to me, and because there has been neither a fundamental revolution in the semiconductor/compute-manufacturing industry that has benefited us in ways other than cost, nor any revolutionary algorithm that couldn't be run on old hardware scaled up to today's standards. (But in complex systems, which ML is, interactions matter more than anything else, so I might be way off here.)