Deploying enough of a new chip/system to run inference for a large userbase might take 12-18 months after the first cloud instances become publicly available, which in turn comes 6-9 months after first deliveries of the system. Until there is enough of it, the new hardware has to serve smaller flagship models intended for the older hardware (or train newer models that are barely deployable).
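As a back-of-envelope illustration of that lag, here is a minimal sketch in Python (the function name and the example delivery date are purely illustrative; the only real inputs are the 6-9 and 12-18 month figures above):

```python
from datetime import date

def deployment_window(first_delivery: date) -> tuple[date, date]:
    """Rough window for when enough of a new system exists to serve a large
    userbase: 6-9 months to first public cloud instances, then another
    12-18 months of buildout, i.e. 18-27 months after first deliveries."""
    def add_months(d: date, months: int) -> date:
        years, month_index = divmod(d.month - 1 + months, 12)
        return d.replace(year=d.year + years, month=month_index + 1)
    return add_months(first_delivery, 6 + 12), add_months(first_delivery, 9 + 18)

# Hypothetical example: first deliveries in mid-2024 would put large-scale
# serving capacity somewhere in 2026.
lo, hi = deployment_window(date(2024, 7, 1))
print(lo, hi)  # 2026-01-01 2026-10-01
```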
The current buildout of GB200/GB300 NVL72 (14/20 TB of HBM per scale-up world) is a large step up from H100/H200/B200 (0.6/1.1/1.4 TB), and the first announcements of meaningful cloud access came around spring 2025. This is a rack that was on Nvidia's roadmap for release in 2024, and there were some sightings of it even then. Currently there is the well-known 200 MW phase 1 of OpenAI's Abilene system, and the 300 MW Azure Fairwater datacenters. Reliability is still a serious issue. Only in 2026 (probably by ~summer) will Abilene's phase 2 be completed, bringing it to 1 GW, with more of Azure's Fairwater buildings also coming online then.
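The per-scale-up-world HBM figures above follow from per-accelerator HBM times the size of the scale-up domain; a minimal sketch of that arithmetic, assuming nominal per-GPU capacities and 8-GPU NVLink nodes for the pre-NVL72 systems (both are my assumptions, used only to reproduce the round numbers above):

```python
# HBM per scale-up world = HBM per accelerator x accelerators per NVLink domain.
hbm_gb = {"H100": 80, "H200": 141, "B200": 180, "GB200": 192, "GB300": 288}
domain_size = {"H100": 8, "H200": 8, "B200": 8, "GB200": 72, "GB300": 72}

for chip, per_gpu in hbm_gb.items():
    total_tb = per_gpu * domain_size[chip] / 1000
    print(f"{chip}: {total_tb:.1f} TB per scale-up world")
# H100 0.6, H200 1.1, B200 1.4, GB200 ~13.8 (≈14), GB300 ~20.7 (≈20)
```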
SemiAnalysis estimates that OpenAI needed about 1.4 GW of inference capacity as of Sep 2025, which would cost about $13-17bn per year (if half of total compute goes to inference and the other half to training/experiments, this means about $30bn per year in total). Thus there isn't enough GB200/GB300 NVL72 capacity available to serve their entire userbase, and GPT-5, for example, must be a smaller model intended for deployment on H100/H200/B200 (maybe even A100). The unwieldy GPT-4.5 might run fine on GB200 NVL72 (quickly and at a reasonable cost), but not on the older hardware they actually have enough of. This will only change in 2026, possibly late 2026. So a chip/system that's "available in 2024" only starts enabling better flagship models to be deployed in 2026, 2 years later.
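A rough reconstruction of that arithmetic (the per-GW-year rate is just backed out from the numbers above, not an independent estimate):

```python
inference_gw = 1.4                    # SemiAnalysis estimate for OpenAI, Sep 2025
inference_cost_usd = (13e9, 17e9)     # annual cost range for that inference fleet

# Implied cost rate per GW-year of capacity.
rate_per_gw = tuple(c / inference_gw for c in inference_cost_usd)  # ~$9-12bn per GW-year

# If inference is about half of total compute (the rest being training and
# experiments), total annual spend roughly doubles.
total_cost_usd = tuple(2 * c for c in inference_cost_usd)          # ~$26-34bn, i.e. ~$30bn/year
print(rate_per_gw, total_cost_usd)
```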
Extrapolating this to Google's TPUs, Trillium (TPUv6e, 8 TB of HBM per scale-up world) was made generally available in Dec 2024, which might mean that only now is there enough of it to serve a bigger flagship model to a large userbase.
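Applying the same 12-18 month buildout lag to Trillium, counted from general availability (which already corresponds to the cloud-access milestone in the sketch above), gives roughly Dec 2025 to mid-2026:

```python
from datetime import date

def add_months(d: date, months: int) -> date:
    years, month_index = divmod(d.month - 1 + months, 12)
    return d.replace(year=d.year + years, month=month_index + 1)

trillium_ga = date(2024, 12, 1)  # Trillium (TPUv6e) general availability
print(add_months(trillium_ga, 12), add_months(trillium_ga, 18))  # 2025-12-01 2026-06-01
```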