Distributed training seems close enough to being a solved problem that a project costing north of a billion dollars might get it working on schedule. It's easier to stay within a single datacenter, and so far it hasn't been necessary to do more than that, so the fact that distributed training is not yet routinely used is hardly evidence that it's very hard to implement.
There's also this snippet in the Gemini report:
Training Gemini Ultra used a large fleet of TPUv4 accelerators owned by Google across multiple datacenters. [...] we combine SuperPods in multiple datacenters using Google’s intra-cluster and inter-cluster network. Google’s network latencies and bandwidths are sufficient to support the commonly used synchronous training paradigm, exploiting model parallelism within superpods and data-parallelism across superpods.
I think the crux for the feasibility of further scaling (beyond $10-$50 billion) is whether systems with currently reasonable cost keep getting sufficiently more useful, for example by enabling economically valuable agentic behavior such as preparing pull requests based on a feature/bug discussion on an issue tracker, or fixing failing builds. Meaningful help with research is a crux for reaching TAI and ASI, but it doesn't seem necessary for enabling the existence of a $2 trillion AI company.
Thanks for the great comment!
Do we know whether distributed training is expected to scale well to GPT-6-size models (100 trillion parameters) trained across something like 20 data centers? How does the communication cost scale with the size of the model and the number of data centers? Linearly in both?
After spending 3 minutes reading this:
"Google Cloud demonstrates the world’s largest distributed training job for large language models across 50000+ TPU v5e chips" (Google, November 2023).
It seems that scaling works efficiently at least up to 50k chips (GPT-6 would be like 2.5M GPUs). There is also a surprising linear increase in start-up time with the number of chips: 13 min for 32k chips. What is the SOTA?
The title is clearly an overstatement. It expresses that I updated in that direction more than that I am confident in it.
Also, since learning from other comments that decentralized training is likely solved, I am now even less confident in the claim: I put only about a 15% chance on it happening in the strong form stated in the post.
Maybe I should edit the post to make it even more clear that the claim is retracted.
Amazon recently bought a 960MW nuclear-powered datacenter.
I think this doesn't contradict your claim that "The largest seems to consume 150 MW" because the 960MW datacenter hasn't been built (or there is already a datacenter there but it doesn't consume that much energy for now)?
My credence: 33% confidence in the claim that the growth in the number of GPUs used for training SOTA AI will slow down significantly directly after GPT-5. It is not higher because (1) decentralized training is possible, (2) GPT-5 may be able to increase hardware efficiency significantly, (3) GPT-5 may be smaller than assumed in this post, and (4) race dynamics may push investment up.
TLDR: Because of a bottleneck in energy access to data centers and the need to build OOM larger data centers.
Update: See Vladimir_Nesov's comment below for why this claim is likely wrong, since decentralized training seems to be solved. As a consequence, I updated my credence in the claim made in this post from 33% to 15%.
The reasoning behind the claim:
Unrelated to the claim:
How big is that effect going to be?
Using values from: https://epochai.org/blog/the-longest-training-run, we have estimates that in a year, the effective compute is increased by:
Let's assume GPT-5 is using 10 times more GPUs than GPT-4 for training. 250k GPUs would mean around 250MW needed for training. This is already larger than the largest data center reported in this article... Then, moving to GPT-6 with 2.5M GPUs would require 2.5 GW.
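The power figures above amount to a rule of thumb of roughly 1 kW per GPU, which is what "250k GPUs, around 250MW" implies. Here is a minimal Python sketch of that arithmetic. The 1 kW per GPU constant and the 25k-GPU baseline for GPT-4 are assumptions inferred from the numbers in this post, not measured values, and the function name is only illustrative.

```python
# Assumption: ~1 kW per GPU all-in, implied by the post's
# "250k GPUs ~ 250 MW" figure; not a measured value.
WATTS_PER_GPU = 1_000

def training_power_mw(num_gpus: int, watts_per_gpu: int = WATTS_PER_GPU) -> float:
    """Return the estimated power draw of a training cluster in megawatts."""
    return num_gpus * watts_per_gpu / 1e6

# Assumption: 25k GPUs for GPT-4, implied by "10 times more GPUs" giving 250k.
GPT4_GPUS = 25_000

for generation, name in enumerate(["GPT-4", "GPT-5", "GPT-6"]):
    gpus = GPT4_GPUS * 10 ** generation  # one order of magnitude per generation
    print(f"{name}: {gpus:,} GPUs, {training_power_mw(gpus):,.0f} MW")
```

Changing `WATTS_PER_GPU` shows how sensitive the conclusion is to that single assumption.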
Building the infrastructure for GPT-6 may require a few years (e.g., using existing power plants and building a 2.5M GPU data center). For reference, OpenAI and Microsoft seem to have a $100B data center project going until 2028 (4 years); that’s worth around 3M B200 GPUs (at $30k per unit).
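The "around 3M B200 GPUs" figure is a straight division of the budget by the unit price. A quick sketch, assuming the whole $100B goes to GPUs at $30k each (so it ignores buildings, power, and networking, and is best read as an upper bound); the function name is illustrative:

```python
def gpus_for_budget(budget_usd: float, price_per_gpu_usd: float) -> float:
    """Return how many GPUs a budget buys if all of it is spent on GPUs."""
    return budget_usd / price_per_gpu_usd

# Figures from the post: $100B project, $30k per B200.
num_gpus = gpus_for_budget(100e9, 30e3)
print(f"{num_gpus / 1e6:.1f}M GPUs")
```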
Building the infrastructure for GPT-7 may require even more time (e.g., building 25 power plant units).
If the infrastructure for GPT-6 takes 4 years to be assembled, then the increase in GPUs is limited to 1 OOM in 4 years (~ x1.8/year).
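The "~ x1.8/year" figure is the fourth root of 10: a tenfold increase spread evenly over 4 years. A small sketch of that conversion, with an illustrative function name:

```python
def annual_growth_factor(total_factor: float, years: float) -> float:
    """Convert a total growth factor over `years` into a per-year factor."""
    return total_factor ** (1 / years)

# 1 OOM (x10) in GPU count over 4 years.
print(round(annual_growth_factor(10, 4), 2))  # 1.78
```

If the build-out took 2 years instead, the same function gives about x3.2/year.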
The total growth rate between GPT-4 and GPT-5 is x22/year or x6.2/year when using investment growth values from before ChatGPT.
Taking into account the decrease in the growth of investment in training runs, the total growth rate between GPT-5 and GPT-6 would then be x4/year. That is, the growth rate would be divided by 5.5, or by 1.55 when using values from before ChatGPT.
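The two slowdown factors are just ratios of the yearly growth rates quoted above. A sketch using only the post's own numbers (x22/year, x6.2/year, and x4/year); the variable and function names are illustrative:

```python
def slowdown_factor(rate_before: float, rate_after: float) -> float:
    """Return by how much the yearly growth factor is divided."""
    return rate_before / rate_after

RATE_GPT4_TO_GPT5 = 22.0             # x/year, with recent investment growth
RATE_GPT4_TO_GPT5_PRE_CHATGPT = 6.2  # x/year, with pre-ChatGPT investment growth
RATE_GPT5_TO_GPT6 = 4.0              # x/year, with GPU growth capped at ~x1.8/year

print(slowdown_factor(RATE_GPT4_TO_GPT5, RATE_GPT5_TO_GPT6))
print(round(slowdown_factor(RATE_GPT4_TO_GPT5_PRE_CHATGPT, RATE_GPT5_TO_GPT6), 2))
```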
These estimates assume no efficient decentralized training.
Impact of GPT-5
One could assume that the growth rates of software and hardware efficiency will increase by something like 100% because of the increased productivity from GPT-5 (vs. before ChatGPT).
In that case, the growth rate of effective compute after GPT-5 would be significantly above the growth rate before ChatGPT (~ x8.8/year vs. ~ x6/year before ChatGPT).