I sometimes notice that people in my community (myself included) assume that the first "generally human-level" model will lead to a transformative takeoff scenario almost immediately. The assumption seems to be that training is expensive but inference is cheap, so once you're done training you can deploy an essentially unlimited number of cheap copies of the model. I think this is far from obvious.
[edit: This post should be read as "inference costs may turn out to be a bottleneck. Don't forget about them. But we don't know how inference costs will develop in the future. Additionally, it may take a while before we can run lots of copies of an extremely large model, because we'd need to build new computers first."]
Inference refers to deploying a trained model on new inputs. According to OpenAI's report from 2018, most compute used for deep learning is spent not on training but on inference. It is true that a single inference step is much cheaper than a training run consisting of many training steps, but many inference steps together can make up the bulk of compute.
To gain some intuition, consider that writing 750 words with GPT-3 costs 6 cents. If we made a model with 1000x more parameters, similar to the difference between GPT-1 and GPT-3, and the price scaled proportionally, the 750 words would cost $60, comparable to the cost of a good human writer. But to start an immediate economic transformation, I expect we'd need something significantly cheaper (or smarter) than humans.
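As a rough sketch of that arithmetic (assuming, as above, that the price scales linearly with parameter count; the only inputs are the numbers quoted in this post):

```python
# Back-of-the-envelope sketch: scale GPT-3's quoted price per 750 words
# linearly with parameter count. The linear scaling is an assumption.

GPT3_PARAMS = 175e9              # ~175 billion parameters
GPT3_COST_PER_750_WORDS = 0.06   # dollars, as quoted above

def cost_per_750_words(n_params):
    """Estimated price of 750 words for a model with n_params parameters."""
    return GPT3_COST_PER_750_WORDS * (n_params / GPT3_PARAMS)

print(cost_per_750_words(GPT3_PARAMS))         # ~$0.06 for GPT-3 itself
print(cost_per_750_words(1000 * GPT3_PARAMS))  # ~$60 for a 1000x larger model
```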
Of course, the future will bring efficiency improvements, but it will also bring increases in cost. For example, future models may look at a context window longer than 2048 tokens, and I've assumed greedy sampling here, which is cheap but suboptimal (it's like typing without getting to revise). I'm unsure how these factors balance out.
As a heuristic, to have a transformative impact the number of copies of our human-level model should probably exceed the human population (~8 billion). But to run billions of copies, we'd need to dramatically increase the world's number of supercomputers. You can't just repurpose all consumer GPUs for inference, let alone run GPT-3 on your smartphone. GPT-3 needs hundreds of GPUs just to fit the model into GPU memory.[1] These GPUs must then be linked through a web of fast interconnects professionally fitted in a data center. And if we're talking about a 1000x larger model, today's supercomputers may not be ready to store even a single copy of it.[2]
This is not to say that a generally human-level model wouldn't have some drastic impacts, or be closely followed by generally super-human models; it just makes me pause before assuming that the first human-level model is the end of the world as we know it. In order to run enough copies of the model, depending on its exact size, we'd first need to make it more efficient and build many, many new supercomputers.
You can theoretically run a model on fewer GPUs by loading just the first layer into GPU memory, running a forward pass through it, then deleting it and loading the second layer from RAM, and so forth (see ZeRO-Infinity; a rough sketch of the idea follows these footnotes). But this comes with high latency, which rules out many applications. ↩︎
I'm told that the largest clusters these days have tens of thousands of GPUs. ↩︎
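For concreteness, here is a minimal sketch of the layer-streaming idea from footnote [1]. The model, layer sizes, and synchronous loading are illustrative assumptions, not ZeRO-Infinity's actual implementation:

```python
# Illustrative only: keep all weights in CPU RAM and stream one layer at a time
# into GPU memory for the forward pass, trading GPU memory for latency.
import torch
import torch.nn as nn

# Hypothetical model: a deep stack of large linear layers, stored on the CPU.
cpu_layers = [nn.Linear(4096, 4096) for _ in range(48)]

@torch.no_grad()
def streamed_forward(x):
    """Forward pass that only ever holds one layer's weights on the GPU."""
    x = x.cuda()
    for layer in cpu_layers:
        layer.cuda()              # load this layer's weights into GPU memory
        x = torch.relu(layer(x))  # compute this layer's output
        layer.cpu()               # evict the weights back to CPU RAM
    return x

# out = streamed_forward(torch.randn(1, 4096))  # requires a CUDA device
```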
Thanks for elaborating, I think I know what you mean now. I missed this:
My original claim was that ZeRO-Infinity has higher latency than pipelining across many GPUs, where each GPU holds some of the layers so that you don't have to repeatedly load weights from RAM. But as you pointed out, ZeRO-Infinity may avoid the additional latency by loading the next layer's weights from RAM at the same time as computing the previous layer's output. This helps IF loading the weights is at least as fast as computing the outputs. If this works, we may be able to deploy massive future neural nets on clusters no bigger than the ones we have today.
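As a toy check of that condition (all hardware numbers below are assumptions for illustration, not measurements):

```python
# Prefetching the next layer's weights only hides the loading time if loading
# is at least as fast as computing. Hardware numbers are assumed, not measured.

def layer_times(params_per_layer, batch_tokens,
                load_bandwidth=25e9,   # assumed CPU-to-GPU bytes/second
                gpu_flops=100e12,      # assumed usable GPU FLOP/s
                bytes_per_param=2):
    load_time = params_per_layer * bytes_per_param / load_bandwidth
    # A dense forward pass costs roughly 2 FLOPs per parameter per token.
    compute_time = 2 * params_per_layer * batch_tokens / gpu_flops
    return load_time, compute_time

load, compute = layer_times(params_per_layer=2e9, batch_tokens=1)
print(load, compute)  # with a small batch, loading dominates; large batches help
```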
My original claim was therefore misconceived. I'll revise it to a different claim: bigger neural nets ought to have higher inference latency in general, regardless of whether we use ZeRO-Infinity or not. As I think we both agree, pipelining, in the sense of using different GPUs to compute different layers, doesn't reduce latency. However, adding more layers increases latency, and it's hard to compensate with other forms of parallelism. (Width-wise parallelism could help, but its communication cost scales unfavorably: it grows as we grow the NN's width, and then again when we try to reduce latency by reducing the number of neurons per GPU [edit: it's not quadratic, I was thinking of the parameter count].) Does that seem right to you?
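A toy latency model for that claim (the per-layer time is an assumed constant, purely for illustration):

```python
# Per-token latency grows with depth: each token must pass through every layer
# in sequence, no matter how many GPUs the layers are pipelined across.

def per_token_latency(num_layers, time_per_layer=1e-3):
    """Assumed fixed time per layer; pipelining raises throughput, not latency."""
    return num_layers * time_per_layer

for depth in (96, 960):
    print(depth, "layers:", per_token_latency(depth), "seconds per token")
```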
The consequence then would be that inference latency (if not inference cost) becomes a constraint as we grow NNs, at least for applications where latency matters.