I sometimes notice that people in my community (myself included) assume that the first "generally human-level" model will lead to a transformative takeoff scenario almost immediately. The assumption seems to be that training is expensive but inference is cheap, so once you're done training you can deploy an essentially unlimited number of cheap copies of the model. I think this is far from obvious.
[Edit: This post should be read as "inference costs may turn out to be a bottleneck. Don't forget about them. But we don't know how inference costs will develop in the future. Additionally, it may take a while before we can run lots of copies of an extremely large model, because we'd need to build new computers first."]
Inference refers to the deployment of a trained model on a new input. According to OpenAI's report from 2018, most compute used for deep learning is spent not on training but on inference. It is true that one inference step is much cheaper than a training run consisting of many training steps. But many inference steps together can make up the bulk of compute.
To gain some intuition, consider that writing 750 words with GPT-3 costs about 6 cents. If we made a model with 1000x more parameters, similar to the difference between GPT-1 and GPT-3, and inference cost scaled roughly linearly with parameter count, those 750 words would cost $60, comparable to the cost of a good human writer. But to start an immediate economic transformation, I expect we'd need something significantly cheaper (or smarter) than humans.
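As a back-of-the-envelope check, here is the same arithmetic as a short Python sketch. The linear scaling of cost with parameter count is my assumption for illustration, not an established pricing rule.

```python
# Back-of-the-envelope: how the cost of generating 750 words might scale
# with model size, assuming cost scales linearly with parameter count.
gpt3_cost_per_750_words = 0.06   # dollars, the figure quoted above
scale_factor = 1000              # a hypothetical model 1000x larger than GPT-3

scaled_cost = gpt3_cost_per_750_words * scale_factor
print(f"~${scaled_cost:.2f} per 750 words at {scale_factor}x scale")  # ~$60.00
```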
Of course, the future will bring efficiency improvements, but it will also bring increases in cost. For example, future models may look at a context window longer than 2048 tokens, and I've assumed greedy sampling here, which is cheap but suboptimal (it's like typing without getting to revise). I'm unsure how these factors balance out.
To have a transformative impact, as a heuristic, the number of copies of our human-level model should probably exceed the human population (~8 billion). But to run billions of copies, we'd need to dramatically increase the world's number of supercomputers. You can't just repurpose all consumer GPUs for inference, let alone run GPT-3 on your smartphone. GPT-3 needs dozens of GPUs just to fit the model into GPU memory.[1] These GPUs must then be linked through a web of fast interconnects, professionally fitted in a data center. And if we're talking about a 1000x larger model, today's supercomputers may not be ready to store even a single copy of it.[2]
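For a rough sense of the memory arithmetic, here is a minimal sketch. It assumes 16-bit weights and a hypothetical 40 GB of usable memory per GPU, and it only counts the weights; activations, redundant replicas for serving many users, and other overheads push the real GPU count higher.

```python
import math

# Rough estimate of how many GPUs are needed just to hold a model's weights.
# Assumptions: 2-byte (16-bit) weights, no activation or serving overhead,
# and a hypothetical 40 GB of usable memory per GPU.
def gpus_to_hold_weights(n_params: float, bytes_per_param: int = 2,
                         gpu_memory_gb: float = 40.0) -> int:
    weight_gb = n_params * bytes_per_param / 1e9
    return math.ceil(weight_gb / gpu_memory_gb)

print(gpus_to_hold_weights(175e9))    # GPT-3-sized: ~9 GPUs for weights alone
print(gpus_to_hold_weights(175e12))   # a 1000x larger model: ~8750 GPUs
```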
This is not to say that a generally human-level model wouldn't have some drastic impacts, or be closely followed by generally superhuman models; it just makes me pause before assuming that the first human-level model is the end of the world as we know it. In order to run enough copies of the model, depending on its exact size, we'd first need to make it more efficient and build many, many new supercomputers.
You can theoretically run a model on fewer GPUs by putting just the first layer into GPU memory, forward passing on it, then deleting it and loading the second layer from RAM, and so forth (see ZeRO-Infinity). But this comes with high latency, which rules out many applications. ↩︎
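A minimal PyTorch-flavored sketch of this kind of layer streaming (my own illustration, not the actual ZeRO-Infinity implementation, which also overlaps transfers with compute and can offload to NVMe):

```python
import torch

def streamed_forward(layers_on_cpu, x, device="cuda"):
    """Forward pass that keeps only one layer's weights on the GPU at a time.

    `layers_on_cpu` is a list of nn.Module layers whose weights live in CPU RAM.
    This trades GPU memory for the latency of repeatedly copying weights over,
    which is the cost mentioned in the footnote above.
    """
    x = x.to(device)
    for layer in layers_on_cpu:
        layer.to(device)          # copy this layer's weights into GPU memory
        with torch.no_grad():
            x = layer(x)          # forward pass through just this layer
        layer.to("cpu")           # free GPU memory before loading the next layer
    return x
```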
I'm told that the largest clusters these days have tens of thousands of GPUs. ↩︎
To give a concrete example:
Say each layer takes 10ms to process. The NN has 100 layers. It takes 40ms to round-trip weight data from the host (say it's on spinning rust or something). You can fit 5 layers' worth of weights on a GPU, in addition to activation data, etc.
On a GPU with a "sufficiently large" amount of memory, such that you can fit everything on-GPU, this will have 1.04s latency overall. 40ms to grab all of the weights into the GPU, then 1s to process.
On a GPU, with no pipelining, loading five layers at a time then processing them, this will take 1.8 seconds latency overall. 40ms to load from disk, then 50 ms to process, for each group of 5 layers.
On a GPU, with pipelining, this will take... 1.04s overall latency. t=0ms, start loading layer 1 weights. t=10ms, start loading layer 2 weights. ... t=40ms, start loading layer 5 weights & compute layer 1, t=50ms, start loading layer 6 weights & compute layer 2, etc. (Note that this has a max of 5 'active' sets of weights at once, like in the no-pipelining case.)
(A better example would split this into request latency and bandwidth.)
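Here is a small Python sketch that replays those three cases with the toy numbers above (latency only, ignoring bandwidth as noted):

```python
# Replays the toy numbers above: 100 layers, 10 ms of compute per layer,
# 40 ms round-trip to fetch a layer's weights from the host, and room for
# 5 layers' worth of weights on the GPU at once.
LAYERS = 100
COMPUTE_MS = 10   # per-layer compute time
FETCH_MS = 40     # host round-trip for one weight request
GROUP = 5         # layers' worth of weights that fit on the GPU

# Case 1: everything fits on the GPU. One fetch, then compute every layer.
all_on_gpu = FETCH_MS + LAYERS * COMPUTE_MS

# Case 2: no pipelining. Fetch a group of 5 layers, compute them, repeat.
no_pipelining = (LAYERS // GROUP) * (FETCH_MS + GROUP * COMPUTE_MS)

# Case 3: pipelined weight loading. Fetches are staggered 10 ms apart (so at
# most GROUP sets of weights are resident), and each layer is computed as soon
# as its weights have arrived and the previous layer is done.
finish = 0
for i in range(LAYERS):
    weights_ready = i * COMPUTE_MS + FETCH_MS
    finish = max(finish, weights_ready) + COMPUTE_MS
pipelined = finish

print(all_on_gpu, no_pipelining, pipelined)   # 1040 1800 1040 (milliseconds)
```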
> Every stage of your pipeline processes e.g. one of the NN layers. Then stage N has to wait for the earlier stages to be completed before it can compute the output of layer N. That's why the latency to compute f(x) is high.
To be clear: I am talking about pipelining the loading of the NN weights into the GPU, which is not dependent on the result of the previous layer's computation.
I can be loading the NN weights for layer N+1 while I'm working on layer N. There's no dependency on the activations of the previous layer.
> pipelining doesn't help with latency
Let me give an example of an (incorrect) exchange that hopefully illustrates the issue.
"You can never stream video from a remote server, because your server roundtrip is 100ms and you only have 20ms per frame".
"You can pipeline requests"
"...but I thought pipelining doesn't help with latency?"
(This example is oversimplified. Video streaming is not done on a per-frame basis, for one.)
The key is: pipelining doesn't help with the latency of individual requests. But that's not what we care about here. What we care about is the latency from starting request 1 to finishing request N - which pipelining absolutely does help with. (Assuming that you don't have pipeline hazards, at least - which we don't.)
*****
All of the above being said, this only helps with the "my weights don't fit in my GPU's RAM" portion of things (which is what my original comment was responding to). If running an inference takes a billion floating-point operations and your GPU runs at a gigaFLOP/s, you're never going to be able to run it in under a second on a single GPU. (Ditto, if your weights are 16GB and your GPU interface is 16GB/s, you're never going to be able to run it in under a second on a single GPU... assuming you're not doing something fancy like decompressing on-GPU, at least.)
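A quick sketch of those two lower bounds, using the illustrative numbers above rather than real hardware specs:

```python
# Two lower bounds on single-GPU inference time, independent of any pipelining:
# you can't compute faster than your FLOP/s, and (if weights must be streamed
# in each pass) you can't load them faster than your interface bandwidth.
def compute_bound_s(flop_per_inference: float, gpu_flop_per_s: float) -> float:
    return flop_per_inference / gpu_flop_per_s

def bandwidth_bound_s(weight_bytes: float, interface_bytes_per_s: float) -> float:
    return weight_bytes / interface_bytes_per_s

print(compute_bound_s(1e9, 1e9))      # 1.0 s: a billion FLOPs on a gigaFLOP/s GPU
print(bandwidth_bound_s(16e9, 16e9))  # 1.0 s: 16 GB of weights over a 16 GB/s link
```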