I sometimes notice that people in my community (myself included) assume that the first "generally human-level" model will lead to a transformative takeoff scenario almost immediately. The assumption seems to be that training is expensive but inference is cheap, so once you're done training you can deploy an essentially unlimited number of cheap copies of the model. I think this is far from obvious.
[edit: This post should be read as "inference costs may turn out to be a bottleneck. Don't forget about them. But we don't know how inference costs will develop in the future. Additionally, it may take a while before we can run lots of copies of an extremely large model because we'd need to build new computers first."]
Inference refers to the deployment of a trained model on a new input. According to OpenAI's report from 2018, most compute used for deep learning is spent not on training but on inference. It is true that one inference step is much cheaper than a training run consisting of many training steps. But many inference steps together can make up the bulk of compute.
To gain some intuition, consider that writing 750 words with GPT-3 costs 6 cents. If we made a model with 1000x more parameters, similar to the difference between GPT-1 and GPT-3, the 750 words would cost $60, comparable to the cost of a good human writer. But to start an immediate economic transformation, I expect we need something significantly cheaper (or smarter) than humans.
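The arithmetic behind that estimate can be sketched as follows. This is only a back-of-envelope illustration: it assumes inference cost grows linearly with parameter count, which is an assumption of the sketch rather than a measured scaling law, and the function name `scaled_cost` is made up for the example.

```python
# Back-of-envelope sketch. ASSUMPTION: inference cost scales linearly
# with the number of parameters.

GPT3_COST_PER_750_WORDS_USD = 0.06  # figure quoted in the post
PARAM_MULTIPLIER = 1000             # hypothetical model with 1000x more parameters


def scaled_cost(base_cost_usd: float, param_multiplier: float) -> float:
    """Cost of the same output if cost grows in proportion to parameter count."""
    return base_cost_usd * param_multiplier


cost = scaled_cost(GPT3_COST_PER_750_WORDS_USD, PARAM_MULTIPLIER)
print(f"Estimated cost for 750 words: ${cost:.2f}")
```

If cost per token grew more slowly than parameter count (for example through sparsity or better hardware), the estimate would drop accordingly.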
Of course, the future will bring efficiency improvements, but also increases in cost. For example, future models may look at a context window longer than 2048 tokens, and I've assumed greedy sampling here, which is cheap but suboptimal (it's like typing without getting to revise). I'm unsure how these factors balance out.
To have a transformative impact, as a heuristic, the number of copies of our human-level model should probably exceed the human population (~8 billion). But to run billions of copies, we'd need to dramatically increase the world's number of supercomputers. You can't just repurpose all consumer GPUs for inferencing, let alone run GPT-3 on your smartphone. GPT-3 needs hundreds of GPUs just to fit the model into GPU memory.[1] These GPUs must then be linked through a web of fast interconnects professionally fitted in a data center. And if we're talking about a 1000x larger model, today's supercomputers may not be ready to store even a single copy of it.[2]
This is not to say that a generally human-level model wouldn't have some drastic impacts, or be closely followed by generally super-human models; it just makes me pause before assuming that the first human-level model is the end of the world as we know it. In order to run enough copies of the model, depending on its exact size, we'd first need to make it more efficient and build many, many new supercomputers.
[1] You can theoretically run a model on fewer GPUs by putting just the first layer into GPU memory, forward passing on it, then deleting it and loading the second layer from RAM, and so forth (see ZeRO-Infinity). But this comes with high latency which rules out many applications.
[2] I'm told that the largest clusters these days have tens of thousands of GPUs.
I am glad we were able to work out the matter!
> If this works, we may be able to deploy massive future neural nets on clusters no bigger than the ones we have today.
Beware bandwidth bottlenecks, as I mentioned in my original post. If you have a 1TB model, you need to have it somewhere with >=1TB/s effective bandwidth between storage and the compute endpoint to achieve 1 second of latency when doing an inference. And storage capacity (not to mention model size) keeps rising faster than bandwidth does...
(There are tricks here to an extent - such as compressing the model and decompressing it on-target - but they seldom save much. (And if they do, that just means your model is inefficient...))
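As a rough illustration of the bandwidth bound described above: if every byte of the model has to travel from storage to the compute endpoint once per inference, the transfer time alone is at least model size divided by effective bandwidth. The function names below are hypothetical, and the sketch ignores compute time, caching, and any overlap between transfer and compute.

```python
# Lower bound on inference latency from weight transfer alone.
# ASSUMPTION: the whole model is streamed once per inference and
# nothing is already resident at the compute endpoint.


def min_transfer_latency_s(model_size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time needed just to move the weights, in seconds."""
    return model_size_gb / bandwidth_gb_per_s


def required_bandwidth_gb_per_s(model_size_gb: float, target_latency_s: float) -> float:
    """Effective bandwidth needed to hit a target latency, in GB/s."""
    return model_size_gb / target_latency_s


# The 1TB example from the comment: 1000 GB over a 1000 GB/s link.
print(min_transfer_latency_s(1000.0, 1000.0))       # seconds
print(required_bandwidth_gb_per_s(1000.0, 1.0))     # GB/s
```

Compression helps only by shrinking `model_size_gb`, which is why it seldom changes the picture much.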
According to a random guy on the internet, GPT-3 is ~300GB compressed. PCIe gen4 x16 is ~31.5GB/s. If you have 1s of latency, that means that you can only stream in ~31.5GB per card. (In addition to what's already stored in the card's own memory.)
That being said, as far as I can tell it is - in theory - possible to run a GPT-3 inference on a single Threadripper Pro platform (or something else with 128 lanes of gen4 PCIe), with 8x 6GB graphics cards in 1 second, if you have 300GB of DRAM lying around. (Or 4x 12GB graphics cards in 2 seconds, with the other half of the PCIe lanes filled with gen4 SSDs.)
(In practice I strongly suspect you'll hit some unknown limit in the PCIe root complex or thereabouts. This is shuffling something silly like 250GB/s of data through that one poor root complex.)
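A quick sanity check of the two configurations above, as a sketch only. It assumes each card keeps its memory full of weights, every x16 link is saturated for the whole latency window, and activations, compute time, and root-complex limits are ignored. The helper `reachable_model_gb` is a made-up name for this example.

```python
# How much model can be touched within a latency budget, counting weights
# resident in GPU memory plus weights streamed over each card's PCIe link.
# ASSUMPTIONS: links fully saturated, GPU memory holds only weights,
# no compute or activation overhead.

PCIE_GEN4_X16_GB_PER_S = 31.5  # figure quoted in the comment
MODEL_SIZE_GB = 300.0          # compressed GPT-3 size quoted in the comment


def reachable_model_gb(num_gpus: int, vram_gb_per_gpu: float, latency_s: float,
                       link_gb_per_s: float = PCIE_GEN4_X16_GB_PER_S) -> float:
    """Resident weights plus weights streamed in during the latency window."""
    resident = num_gpus * vram_gb_per_gpu
    streamed = num_gpus * link_gb_per_s * latency_s
    return resident + streamed


config_a = reachable_model_gb(num_gpus=8, vram_gb_per_gpu=6.0, latency_s=1.0)
config_b = reachable_model_gb(num_gpus=4, vram_gb_per_gpu=12.0, latency_s=2.0)
aggregate_bandwidth = 8 * PCIE_GEN4_X16_GB_PER_S  # load on the root complex

print(config_a, config_b, aggregate_bandwidth)
```

Both configurations come out at 48GB resident plus 252GB streamed, which just covers the ~300GB model, and the 8-card case pushes roughly 252GB/s through the root complex.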
(It's a pity that there's no good way to ask a GPU to pull data directly from an SSD. ICMB could help, but it requires GPU-side software support. Most of this data stream could go directly from SSD to PCIe switch to graphics card without having to be bounced through the root port...)
(Yes, 8x gpu->gpu communications will hurt overall latency... but not by all that much I don't think. 1 second is an eternity.)
> As I think we both agree, pipelining, in the sense of using different GPUs to compute different layers, doesn't reduce latency.
Indeed. If anything, it increases latency, as you're adding GPU->GPU trips to the critical path.