I sometimes notice that people in my community (myself included) assume that the first "generally human-level" model will lead to a transformative takeoff scenario almost immediately. The assumption seems to be that training is expensive but inference is cheap, so once you're done training you can deploy an essentially unlimited number of cheap copies of the model. I think this is far from obvious.
[edit: This post should be read as "inference costs may turn out to be a bottleneck. Don't forget about them. But we don't know how inference costs will develop in the future. Additionally, it may take a while before we can run lots of copies of an extremely large model, because we'd need to build new computers first."]
Inference refers to the deployment of a trained model on a new input. According to OpenAI's report from 2018, most compute used for deep learning is spent not on training but on inference. It is true that one inference step is much cheaper than a training run consisting of many training steps. But many inference steps together can make up the bulk of compute.
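As a rough sketch of that point (my own back-of-the-envelope numbers, not from the post): using the common approximations of ~2 FLOPs per parameter per token for a forward pass and ~6 per token for a training step, total inference compute overtakes training compute once the deployed model has served a few times as many tokens as it was trained on.

```python
# When does total inference compute overtake training compute?
# Assumes the standard approximations of ~2*N FLOPs per token for a forward
# pass and ~6*N FLOPs per token for a training step (N = parameter count).

N = 175e9              # GPT-3-sized model
TRAIN_TOKENS = 300e9   # GPT-3's reported training set size

training_flops = 6 * N * TRAIN_TOKENS
flops_per_served_token = 2 * N

breakeven = training_flops / flops_per_served_token
print(f"training compute:        {training_flops:.2e} FLOPs")
print(f"inference matches it at: {breakeven:.2e} tokens served "
      f"(~{breakeven / TRAIN_TOKENS:.0f}x the training set)")
```

A widely deployed model can serve that many tokens quickly, which is how inference ends up dominating total compute.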
To gain some intuition, consider that writing 750 words with GPT-3 costs 6 cents. If we made a model with 1000x more parameters, similar to the difference between GPT-1 and GPT-3, and if the price scaled linearly with parameter count, those 750 words would cost $60, comparable to the cost of a good human writer. But to start an immediate economic transformation, I expect we need something significantly cheaper (or smarter) than humans.
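To make the extrapolation explicit, here is a minimal sketch, assuming per-query price scales linearly with parameter count (the $0.06 figure is the rough original GPT-3 API price for ~750 words, i.e. ~1000 tokens):

```python
# Extrapolated inference cost, assuming price per query scales linearly with
# parameter count. $0.06 per ~750 words (~1000 tokens) is the rough GPT-3 figure.

GPT3_PARAMS = 175e9
GPT3_COST = 0.06  # USD per 750-word completion

def cost_per_750_words(params: float) -> float:
    return GPT3_COST * params / GPT3_PARAMS

for scale in (1, 10, 100, 1000):
    params = GPT3_PARAMS * scale
    print(f"{scale:>4}x GPT-3 ({params:.2e} params): ${cost_per_750_words(params):.2f}")

# 1000x -> $60.00 per 750 words, roughly what a skilled human writer charges.
```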
Of course, the future will bring efficiency improvements, but also increases in cost. For example, future models may use a context window longer than 2048 tokens, and I've assumed greedy sampling here, which is cheap but suboptimal (it's like typing without getting to revise). I'm unsure how these factors balance out.
To have a transformative impact, as a heuristic, the number of copies of our human-level model should probably exceed the human population (~8 billion). But to run billions of copies, we'd need to dramatically increase the world's number of supercomputers. You can't just repurpose all consumer GPUs for inference, let alone run GPT-3 on your smartphone. GPT-3 needs hundreds of GPUs just to fit the model into GPU memory.[1] These GPUs must then be linked through a web of fast interconnects, professionally fitted in a data center. And if we're talking about a 1000x larger model, today's supercomputers may not be ready to store even a single copy of it.[2]
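A quick back-of-the-envelope calculation shows why even one copy needs a cluster. This sketch counts weights only, assuming fp16 and 16 GB cards; activations, KV caches, redundancy, and throughput requirements push real deployments well beyond this lower bound.

```python
import math

# Lower bound on GPU count: how many cards to merely hold the weights?
# Assumes fp16 (2 bytes/param) and 16 GB of memory per GPU. Activations and
# serving overhead are ignored, so real deployments need more than this.

def gpus_to_hold_weights(params: float, bytes_per_param: int = 2,
                         gpu_mem_gb: int = 16) -> int:
    return math.ceil(params * bytes_per_param / (gpu_mem_gb * 1e9))

print(gpus_to_hold_weights(175e9))   # GPT-3-sized: ~22 GPUs for the weights alone
print(gpus_to_hold_weights(175e12))  # 1000x larger: ~21,875 GPUs, a supercomputer in itself
```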
This is not to say that a generally human-level model wouldn't have some drastic impacts, or be closely followed by generally super-human models; it just makes me pause before assuming that the first human-level model is the end of the world as we know it. In order to run enough copies of the model, depending on its exact size, we'd first need to make it more efficient and build many, many new supercomputers.
You can theoretically run a model on fewer GPUs by putting just the first layer into GPU memory, forward passing on it, then deleting it and loading the second layer from RAM, and so forth (see ZeRO-Infinity). But this comes with high latency, which rules out many applications. ↩︎
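As a toy illustration of that idea (not ZeRO-Infinity's actual implementation, which overlaps transfers with compute and can spill to NVMe), assuming PyTorch and a CUDA device:

```python
import torch
import torch.nn as nn

# Toy layer-streaming: keep all layers in CPU RAM and move only the active
# layer onto the GPU. GPU memory stays at ~one layer's worth of weights, but
# every layer now costs an extra host-to-device copy -- hence the high latency.

device = "cuda" if torch.cuda.is_available() else "cpu"
layers = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(48)])  # lives on CPU

def streamed_forward(x: torch.Tensor) -> torch.Tensor:
    x = x.to(device)
    for layer in layers:
        layer.to(device)   # load this layer's weights
        x = layer(x)
        layer.to("cpu")    # evict before loading the next one
    return x

out = streamed_forward(torch.randn(1, 4096))
```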
I'm told that the largest clusters these days have tens of thousands of GPUs. ↩︎
Incidentally, the latency cost of width vs depth is something I've thought might explain why the brain/body allometric scaling laws are so unfavorable and what all that expensive brain matter does given that our tiny puny little ANNs seem capable of so much: everything with a meaningful biological brain, from ants to elephants, suffers from hard (fatal) latency requirements. You are simply not allowed by Nature or Darwin to take 5 seconds to compute how to move your legs.* (Why was Gato 1 so small and so unimpressive in many ways? Well, they kept it small because they wanted it to run in realtime for a real robot. A much wider Transformer could've still met the deadline... but cost a lot more parameters and training than usual by going off the optimal scaling curves.) It does not matter how many watts or neurons you save by using a deep skinny network, if after 10 layers have fired with another 100 to go to compute the next action to take, you've been eaten by a stupider but faster-thinking predator.
So a biological brain might be forced deep into an unfavorable point on the width-vs-depth tradeoff - which might be extremely expensive - in order to meet its subset of robotics-related deadlines, as it were.
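A crude way to put numbers on the deadline argument (illustrative figures of my own, not gwern's): if each serial layer of neural processing takes on the order of 10 ms, depth sets a hard floor on reaction time that no amount of width can buy back.

```python
# Reaction-time floor from sequential depth. Assumes ~10 ms per serial "layer"
# of neural processing (an illustrative figure). Width adds parameters but runs
# in parallel; depth adds latency directly.

MS_PER_LAYER = 10

for depth in (10, 30, 110):
    print(f"depth {depth:>3}: reaction time >= {depth * MS_PER_LAYER} ms")

# depth  10: >= 100 ms  -- survivable
# depth 110: >= 1100 ms -- eaten by the faster-thinking predator
```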
* With a striking counterexample, in both tininess of brain and largeness of latency, being Portia. What is particularly striking to me is not that it is so intelligent while being so tiny, but that this seems to be directly due to its particular ecological niche: there are very few creatures out there who need extremely flexible intelligent behavior but are also allowed minutes or hours to plan many of their actions... but Portia is one of them, as it is a stealthy predator attacking static prey. The prey also don't generally have much memory, nor can they just leave their web, so a Portia can try again if the first trick didn't work. So Portia spiders are allowed to do things like spend hours circumnavigating a web to strike their prey spider from the right direction, or gradually test out mimicry until they find the right cue to trick their prey spider. So it's fascinating to see that in this highly unusual niche, it is possible to have a tiny biological brain execute extremely slow but intelligent strategies. It suggests that if latency were not a problem, biological brains could be far more intelligent, we would not need to see such architecturally-huge biological brains to reach human-level performance, and we would no longer have any paradox of why highly-optimized human brains seem to need so many parameters to do the same thing as tiny ANNs.