Over on Developmental Stages of GPTs, orthonormal mentions that it "at least reduces the chance of a hardware overhang."
An overhang is when you have had the ability to build transformative AI for quite some time, but you haven't because no-one's realised it's possible. Then someone does and surprise! It's a lot more capable than everyone expected.
I am worried we're in an overhang right now. I think we have the ability right now to build an orders-of-magnitude more powerful system than we already have, and I think GPT-3 is the trigger for 100x larger projects at Google, Facebook and the like, with timelines measured in months.
Investment Bounds
GPT-3 is the first AI system that has obvious, immediate, transformative economic value. While much hay has been made about how much more expensive it is than a typical AI research project, in the wider context of megacorp investment, its costs are insignificant.
GPT-3 has been estimated to cost $5m in compute to train, and - looking at the author list and OpenAI's overall size - maybe another $10m in labour.
Google, Amazon and Microsoft each spend about $20bn/year on R&D and another $20bn each on capital expenditure. Very roughly, it totals to $100bn/year. Against this budget, dropping $1bn or more on scaling GPT up by another factor of 100x is entirely plausible right now. All that's necessary is that tech executives stop thinking of natural language processing as cutesy blue-sky research and start thinking in terms of quarters-till-profitability.
A concrete example is Waymo, which is raising $2bn investment rounds - and that's for a technology with a much longer road to market.
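To make "insignificant" concrete, here is the rough arithmetic behind those figures, using only the approximate numbers quoted above (a back-of-envelope sketch, not precise accounting):

```python
# Rough arithmetic behind the investment-bounds claim. All figures are the
# approximate ones quoted above, not precise accounting.

gpt3_compute = 5e6        # ~$5m of compute
gpt3_labour = 10e6        # ~$10m of labour
rnd_each = 20e9           # ~$20bn/year R&D for each of Google, Amazon, Microsoft
capex_each = 20e9         # ~$20bn/year capital expenditure each
big_tech_budget = 3 * (rnd_each + capex_each)   # ~$120bn/year, "very roughly $100bn"

gpt3_total = gpt3_compute + gpt3_labour
scale_up_100x = 100 * gpt3_total                # a ~100x GPT-3 programme, ~$1.5bn

print(f"GPT-3 as a share of annual big-tech spend: {gpt3_total / big_tech_budget:.3%}")   # ~0.01%
print(f"A 100x scale-up as a share of that spend:  {scale_up_100x / big_tech_budget:.1%}")  # ~1%
```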
Compute Cost
The other side of the equation is compute cost. The $5m GPT-3 training cost estimate comes from using V100s at $10k/unit and 30 TFLOPS, which is their throughput without tensor cores. Amortized over a year, this gives you about $1000/PFLOPS-day.
However, that cost is driven up an order of magnitude by NVIDIA's monopolistic cloud contracts, while performance will be higher once tensor cores are taken into account. The current hardware floor is nearer to the RTX 2080 Ti's $1k/unit for 125 tensor-core TFLOPS, which gives you $25/PFLOPS-day. This roughly aligns with AI Impacts' current estimates, and offers another >10x reduction in cost over the estimate above.
I strongly suspect other bottlenecks stop you from hitting that kind of efficiency - otherwise GPT-3 would've happened much sooner - but I still think $25/PFLOPS-day is a useful lower bound.
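As a sanity check on those $/PFLOPS-day figures, here is the amortization arithmetic, assuming each card runs flat-out for a year and ignoring electricity, networking and utilization losses:

```python
# Amortized hardware cost per PFLOPS-day, ignoring power, networking and utilization.

def dollars_per_pflops_day(unit_cost_usd, tflops, amortization_days=365):
    """Cost of one PFLOPS-day if the card runs flat-out for the amortization period."""
    pflops = tflops / 1000                         # 1 PFLOPS = 1000 TFLOPS
    total_pflops_days = pflops * amortization_days
    return unit_cost_usd / total_pflops_days

# V100 at ~$10k/unit, non-tensor-core throughput.
print(f"V100:        ${dollars_per_pflops_day(10_000, 30):,.0f}/PFLOPS-day")   # ~$900, i.e. roughly the $1000 figure
# RTX 2080 Ti at ~$1k/unit, tensor-core throughput.
print(f"RTX 2080 Ti: ${dollars_per_pflops_day(1_000, 125):,.0f}/PFLOPS-day")   # ~$22, i.e. roughly the $25 floor
```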
Other Constraints
I've focused on money so far because most of the current 3.5-month doubling times come from increasing investment. But money aside, there are a few other things that could prove to be the binding constraint.
- Scaling law breakdown. The GPT series' scaling is expected to break down around 10k PFLOPS-days (§6.3), which is a long way short of the amount of cash on the table (see the back-of-envelope sketch after this list).
- This could be because the scaling analysis was done on 1024-token sequences. Maybe longer sequences can go further. More likely I'm misunderstanding something.
- Sequence length. GPT-3 uses 2048 tokens at a time, and that's with an efficient encoding that cripples it on many tasks. With the naive architecture, increasing the sequence length is quadratically expensive (see the sketch after this list), and getting up to novel-length sequences is not very likely.
- But there are a lot of plausible ways to fix that, and complexity is no bar to AI. This constraint might plausibly not be resolved on a timescale of months, however.
- Data availability. From the same paper as the previous point, dataset size rises with the square root of compute; a 1000x larger GPT-3 would want 10 trillion tokens of training data (see the sketch after this list).
- It’s hard to find a good estimate on total-words-ever-written, but our library of 130m books alone would exceed 10tn words. Considering books are a small fraction of our textual output nowadays, it shouldn't be difficult to gather sufficient data into one spot once you've decided it's a useful thing. So I'd be surprised if this was binding.
- Bandwidth and latency. Networking 500 V100s together is one challenge, but networking 500k V100s is another entirely.
- I don't know enough about distributed training to say whether this is a very sensible constraint or a very dumb one. I think it has a chance of being a serious problem, but it's also the kind of thing you can design algorithms around. Validating such algorithms might take longer than a few months, however.
- Hardware availability. From the estimates above there are about 500 GPU-years in GPT-3, or - based on a one-year training window - $5m worth of V100s at $10k/piece. This is about 1% of NVIDIA's quarterly datacenter sales. A 100x scale-up by multiple companies could saturate this supply.
- This constraint can obviously be loosened by increasing production, but that would be hard to do on a timescale of months.
- Commoditization. If many companies go for huge NLP models, the profit each company can extract is driven towards zero. Unlike with other capex-heavy research - like pharma - there's no IP protection for trained models. If you expect profit to be marginal, you're less likely to drop $1bn on your own training program.
- I am skeptical of this being an important factor while there are lots of legacy, human-driven systems to replace. Replacing those systems should be more than enough incentive to fund many companies’ research programs. Longer term, the effects of commoditization might become more important.
- Inference costs. The GPT-3 paper (§6.3) gives 0.4 kWh per 100 pages of output, which works out to about 500 pages per dollar if you eyeball hardware cost as 5x electricity (see the sketch after this list). Scale up 1000x and you're at $2/page, which is cheap compared to humans but no longer quite as easy to experiment with.
- I'm skeptical of this being a binding constraint. $2/page is still very cheap.
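For the scaling-law and data-availability bullets, the back-of-envelope looks like this; the ~300bn tokens GPT-3 trained on is taken from the paper, everything else is a figure already quoted above:

```python
from math import sqrt

# Back-of-envelope for the scaling-law and data-availability bullets.

# Compute: how many PFLOPS-days does $1bn buy at the two prices from earlier?
budget = 1e9
for price in (1000, 25):          # cloud-ish V100 price vs. the hardware floor
    print(f"$1bn at ${price}/PFLOPS-day buys {budget / price:,.0f} PFLOPS-days")
# Both are far beyond the ~10,000 PFLOPS-days where scaling is expected to
# break down, so money isn't the limiting factor for that bullet.

# Data: dataset size rising with the square root of compute means a 1000x
# larger model wants sqrt(1000) ~ 32x more data than GPT-3's ~300bn tokens.
gpt3_tokens = 300e9
print(f"Tokens wanted at 1000x: {gpt3_tokens * sqrt(1000):.1e}")   # ~1e13, i.e. ~10tn
```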
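For the sequence-length bullet, a minimal sketch of why naive attention gets expensive; the d_model value and the "novel-length" context size are illustrative assumptions, and only the ratio matters:

```python
# Rough scaling of the quadratic attention term with sequence length.
# d_model and the novel-length context are illustrative assumptions; the ratio
# grows as (n_long / n_short)^2 regardless.

def attention_cost(seq_len, d_model=12288):
    """The quadratic term in self-attention scales as n^2 * d (constants omitted)."""
    return seq_len ** 2 * d_model

short = attention_cost(2048)        # GPT-3's context window
long = attention_cost(200_000)      # a rough novel-length context
print(f"Naive attention cost ratio vs. 2048 tokens: {long / short:,.0f}x")   # ~9,500x
```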
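And for the inference-cost bullet: the ~$0.10/kWh electricity price is my assumption, while the 0.4 kWh per 100 pages and the 5x hardware-over-electricity multiplier come from the text above:

```python
# Inference-cost arithmetic for the last bullet. The electricity price is my
# assumption (~$0.10/kWh); the 0.4 kWh/100 pages and the 5x hardware-over-
# electricity multiplier are the figures quoted above.

kwh_per_100_pages = 0.4
electricity_price = 0.10        # $/kWh, assumed
hardware_multiplier = 5         # total cost eyeballed as ~5x the electricity bill

cost_per_page = (kwh_per_100_pages / 100) * electricity_price * hardware_multiplier
print(f"GPT-3 today:    ~{1 / cost_per_page:,.0f} pages per dollar")   # ~500
print(f"At 1000x scale: ~${cost_per_page * 1000:.2f} per page")        # ~$2
```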
Beyond 1000x
Here we go from just pointing at big numbers to straight-up theorycrafting.
In all, tech investment as it is today plausibly supports another 100x-1000x scale-up in the very near term. If we get to 1000x - 1 ZFLOPS-day per model, $1bn per model - then there are a few paths open.
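For concreteness, a quick unit check on "1 ZFLOPS-day per model, $1bn per model", priced at the earlier V100-based ~$1000/PFLOPS-day (the $25 hardware floor is shown for contrast):

```python
# Unit check on "1 ZFLOPS-day, $1bn per model", using the earlier V100-based
# ~$1000/PFLOPS-day price. The $25/PFLOPS-day floor would be far cheaper.

pflops_days_per_zflops_day = 1e21 / 1e15    # 1 ZFLOPS = 1,000,000 PFLOPS

for price in (1000, 25):
    cost = pflops_days_per_zflops_day * price
    print(f"1 ZFLOPS-day at ${price}/PFLOPS-day: ${cost:,.0f}")
# -> $1,000,000,000 at the V100 price; $25,000,000 at the hardware floor.
```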
I think the key question is whether, by 1000x, a GPT successor is obviously superior to humans over a wide range of economic activities. If it is - and I think it's plausible that it will be - then further investment will arrive through the usual market mechanisms, until the largest models are being allocated a substantial fraction of global GDP.
On paper that leaves room for another 1000x scale-up, taking spending up towards $1tn, though current market mechanisms aren't really capable of that scale of investment. Left to the market as-is, I think commoditization would kick in as the binding constraint.
That's from the perspective of the market today, though. Transformative AI might enable $100tn-market-cap companies, or nation-states could pick up the torch. The Apollo Program consumed a share of GDP worth roughly $1tn in today's terms, so this degree of public investment is possible in principle.
The even more extreme path is if by 1000x you've got something that can design better algorithms and better hardware. Then I think we're in Christiano's slow-takeoff world of four-year GDP doublings.
That's all assuming performance continues to improve, though. If by 1000x the model is not obviously a challenger to human supremacy, then things will hopefully slow down to ye olde fashioned 2010s-Moore's-Law rates of progress and we can rest safe in the arms of something that's merely HyperGoogle.
As far as I can tell, this is what is going on: Google Brain (GB) and DeepMind (DM) do not have any such GPT-3-scale project because they do not believe in the scaling hypothesis the way that Sutskever, Amodei and others at OpenAI (OA) do.
GB is entirely too practical and short-term focused to dabble in such esoteric & expensive speculation, although Quoc's group occasionally surprises you. They'll try something like GShard, but mostly because they expect to be able to deploy it, or something like it, to production in Google Translate.
DM (particularly Hassabis, I'm not sure about Legg's current views) believes that AGI will require effectively replicating the human brain module by module, and that while these modules will be extremely large and expensive by contemporary standards, they still need to be invented and finetuned piece by piece, with little risk or surprise until the final assembly. That is how you get DM contraptions like Agent57 which are throwing the kitchen sink at the wall to see what sticks, and why they place such emphasis on neuroscience as inspiration and cross-fertilization for reverse-engineering the brain. When someone seems to have come up with a scalable architecture for a problem, like AlphaZero or AlphaStar, they are willing to pour on the gas to make it scale, but otherwise, incremental refinement on ALE and then DMLab is the game plan. They have been biting off and chewing pieces of the brain for a decade, and it'll probably take another decade or two of steady chewing if all goes well. Because they have locked up so much talent and have so much proprietary code and believe all of that is a major moat to any competitor trying to replicate the complicated brain, they are fairly easygoing. You will not see DM 'bet the company' on any moonshot; Google's cashflow isn't going anywhere, and slow and steady wins the race.
OA, lacking anything like DM's long-term funding from Google or its enormous headcount, is making a startup-like bet that they know an important truth which is a secret: "the scaling hypothesis is true" and so simple DRL algorithms like PPO on top of large simple architectures like RNNs or Transformers can emerge and meta-learn their way to powerful capabilities, enabling further funding for still more compute & scaling, in a virtuous cycle. And if OA is wrong to trust in the God of Straight Lines On Graphs, well, they never could compete with DM directly using DM's favored approach, and were always going to be an also-ran footnote.
While all of this hypothetically can be replicated relatively easily (never underestimate the amount of tweaking and special sauce it takes) by competitors if they wished (the necessary compute budgets are still trivial in terms of Big Science or other investments like AlphaGo or AlphaStar or Waymo, after all), said competitors lack the single most important thing, which no amount of money or GPUs can ever cure: the courage of their convictions. They are too hidebound and deeply philosophically wrong to ever admit fault and try to overtake OA until it's too late. This might seem absurd, but look at the repeated criticism of OA every time they release a new example of the scaling hypothesis, from GPT-1 to Dactyl to OA5 to GPT-2 to iGPT to GPT-3... (When faced with the choice between having to admit all their fancy hard work is a dead end, swallow the bitter lesson, and start budgeting tens of millions for compute, or instead writing a tweet explaining how, "actually, GPT-3 shows that scaling is a dead end and it's just imitation intelligence" - most people will get busy on the tweet!)
What I'll be watching for is whether orgs beyond 'the usual suspects' (MS ZeRO, Nvidia, Salesforce, Allen, DM/GB, Connor/EleutherAI, FAIR) start participating, or whether they continue to dismiss scaling.