I think Blackwell will change the sentiment by late 2025 compared to 2024, with a lot of apparent progress in capabilities and reduced prices (which the public will have a hard time attributing to Blackwell specifically). In 2026 there will be some Blackwell-trained models, using 2x-4x more compute than what we see today (or what we'll see more of in a few weeks to months once the long reasoning option is added, such as GPT-4.5 with reasoning).
But then the possibilities for 2027 branch on whether there are reliable agents, which doesn't seem knowable either way right now. If this doesn't work out, in particular because R1-like RL training doesn't scale or generalize, then by 2027 nothing substantially new will happen, and the 2024-style slowdown sentiment will return. A 3x-5x increase in training compute is not a game-changing amount (unless there is a nearby threshold to be reached), and Blackwell is a one-time thing that essentially fixes a bug in the Ampere/Hopper design (in efficiency for LLM inference) and can't be repeated even with Rubin Ultra NVL576. At that point individual training systems will cost on the order of $100bn, and so won't have much further to scale other than at the slower pace of chip improvement (within the assumption of absence of reliable agents). The Chinese AI companies will be more than 10x but less than 100x behind in training compute (mostly because AI fails to become a priority), which can occasionally but not reliably be surmounted with brilliant engineering innovations.
The announcement post says the following on the scale of Behemoth:
we focus on efficient model training by using FP8 precision, without sacrificing quality and ensuring high model FLOPs utilization—while pre-training our Llama 4 Behemoth model using FP8 and 32K GPUs, we achieved 390 TFLOPs/GPU. The overall data mixture for training consisted of more than 30 trillion tokens
This puts Llama 4 Behemoth at 5e25 FLOPs (30% more than Llama-3-405B), trained on 32K H100s (only 2x more than Llama-3-405B) instead of the 128K H100s (or in any case, 100K+) that they could have used. They are training in FP8 (which gets 2x more FLOP/s per chip than the easier-to-work-with BF16), but at 20% compute utilization (2x lower than for dense Llama-3-405B; training MoE is harder).
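These figures can be checked with the standard 6ND approximation for training FLOPs. The 288B active parameters and 30T tokens are from the announcement; the H100 dense FP8 peak of 1979 TFLOP/s is from Nvidia's spec sheet:

```python
# Sanity check of the Behemoth estimates, using the 6*N*D approximation
# for training FLOPs (N = active params, D = training tokens).
n_active = 288e9   # active parameters (announcement)
tokens = 30e12     # training tokens (announcement)
train_flops = 6 * n_active * tokens
print(f"{train_flops:.1e}")                 # ~5.2e+25 FLOPs

h100_fp8_peak = 1979e12   # dense FP8 FLOP/s per H100 (spec sheet)
achieved = 390e12         # FLOP/s per GPU from the announcement
print(f"{achieved / h100_fp8_peak:.0%}")    # 20% compute utilization

gpus = 32_768
days = train_flops / (gpus * achieved) / 86400
print(f"{days:.0f} days")                   # 47 days, i.e. ~1.5 months
```

The ~1.5 months of pure compute time here is also what the duration estimate below relies on.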
At 1:8 sparsity (2T total parameters, ~250B in active experts), it should have 3x lower data efficiency than a dense model (and 3x as much effective compute, so it has 4x the effective compute of Llama-3-405B even at merely 1.3x the raw compute). Anchoring to Llama-3-405B, which is dense and compute optimal at 38 tokens per parameter with their dataset, we get about 120 tokens per active parameter optimal for a model with Behemoth's shape, which for 288B active parameters gives 35T tokens. This fits their 30T tokens very well, so it's indeed a compute optimal model (and not a middle-sized overtrained model that inherited the title of "Behemoth" from a failed 128K H100s run).
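Spelling out the token-budget arithmetic (the 3x multipliers for 1:8 sparsity are the estimates from the text above, not published constants):

```python
# Compute-optimal token budget for Behemoth's shape,
# anchored to dense Llama-3-405B (38 tokens/param optimal).
tokens_per_active = 38 * 3          # ~114, rounded to ~120 in the text
print(tokens_per_active)            # 114

optimal_tokens = 120 * 288e9        # ~120 tokens/param at 288B active params
print(f"{optimal_tokens:.1e}")      # ~3.5e+13, i.e. ~35T tokens

# Effective compute vs Llama-3-405B: 1.3x raw compute times 3x sparsity.
print(f"{1.3 * 3:.1f}x")            # ~4x effective compute
```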
In any case, for some reason they didn't do a training run as large as their hardware in principle enables, and even then the run lasted only about 2 months (1.5 months from total compute and utilization, plus a bit longer at the start while the critical batch size grows large enough to occupy the whole training system). (Running out of data shouldn't be a reason to give up on 128K H100s: a compute optimal 1:8 sparsity model would've needed only 90T tokens at 750B active parameters, if trained in FP8 at 20% compute utilization for 3 months. Which could just be the same 30T tokens repeated 3 times.)
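The hypothetical 128K-H100 run in the parenthetical checks out under the same 6ND approximation and the same ~120 tokens per active parameter, again taking the H100 dense FP8 peak from the spec sheet:

```python
# Hypothetical compute-optimal 1:8 sparsity run on 128K H100s.
n_active = 750e9
tokens = 120 * n_active            # same ~120 tokens/active-param optimum
print(f"{tokens:.0e}")             # 9e+13, i.e. 90T tokens

gpus = 128_000
flops_per_s = gpus * 1979e12 * 0.20   # dense FP8 peak per H100, 20% utilization
months = 6 * n_active * tokens / flops_per_s / (86400 * 30)
print(f"{months:.1f} months")      # ~3.1 months
```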
For me a specific crux is the scaling laws of R1-like training: what happens when you try to do much more of it, which inputs to the process become important constraints, and how much they matter. That this works out has been extensively signaled but not yet described quantitatively; all the public reproductions of long reasoning training involve only one RL iteration on top of some pretrained model, and even o3 isn't currently known to be based on the same pretrained model as o1.
The AI 2027 story heavily leans into RL training taking off promptly, and it's possible they are drawing on insider rumors grounded in reality, but from my point of view it's too early to tell. I guess in a few months to a year there should be enough public data to say something, but then again a quantitative model of scaling for MoE (compared to dense) was only published in Jan 2025, even though MoE was already key to the original GPT-4 trained in 2022.
Non-Google models of late 2027 use Nvidia Rubin, but not yet Rubin Ultra. Rubin NVL144 racks have the same number of compute dies and chips as Blackwell NVL72 racks (the change in the name is purely a marketing thing: they now count dies instead of chips). The compute dies are already almost reticle sized and can't get bigger, but Rubin uses 3nm (~180M Tr/mm2) while Blackwell uses 4nm (~130M Tr/mm2). So the number of transistors per rack goes up with the transistor density between 4nm and 3nm, by 1.4x, and better energy efficiency enables higher clock speeds, maybe another 1.4x, for a total of ~2x in performance. The GTC 2025 announcement claimed a 3.3x improvement for dense FP8, but based on the above argument it should still be only about 2x for the more transistor-hungry BF16 (comparing Blackwell and Rubin racks).
The Abilene site of Stargate[1] will probably have 400K-500K Blackwell chips in 2026, about 1 GW. Nvidia's roadmap puts Rubin (VR200 NVL144) 1.5-2 years after Blackwell (GB200 NVL72), which is not yet in widespread use but will get there soon. So the first models will start being trained on Rubin no earlier than late 2026, much more likely only in 2027, possibly even in the second half of 2027. Before that, it's all Blackwell, and if 2026 brings only 1 GW Blackwell training systems[2] for a given AI company, shortly before the 2x better Rubin comes out, then that's the scale where Blackwell stops, awaiting Rubin and 2027. Rubin in turn will only be built at scale a bit later still, similarly to how 2025 brings only 100K chips in GB200 NVL72 racks for what might be intended to be a single training system, not yet 500K chips.
This predicts at most 1e28 BF16 FLOPs (2e28 FP8 FLOPs) models in late 2026 (trained on 2 GW of GB200/GB300 NVL72), and very unlikely more than 1e28-4e28 BF16 FLOPs models in late 2027 (1-4 GW Rubin datacenters in late 2026 to early 2027), though that's alternatively 3e28-1e29 FP8 FLOPs given the FP8/BF16 performance ratio change I'm expecting with Rubin. Rubin Ultra is another big step ~1 year after Rubin, with 2x more compute dies per chip and 2x more chips per rack, so it's a reason to pace the scaling a bit rather than rush it in 2026-2027. Such plans will make rushing harder if there is suddenly a reason to do so, and 4 GW with non-Ultra Rubin seems a bit sudden.
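One way to reconstruct the 1e28 BF16 FLOPs figure for a 2 GW Blackwell system; every input here is my own assumption (~500K chips per GW from the Abilene numbers above, ~2.5e15 dense BF16 FLOP/s per Blackwell chip, 40% utilization, a ~4-month run), so treat it as a sketch rather than the exact derivation:

```python
# Back-of-envelope for a hypothetical 2 GW GB200/GB300 NVL72 training system.
chips_per_gw = 500_000     # assumed: ~1 GW per 400-500K chips, all-in power
bf16_per_chip = 2.5e15     # assumed dense BF16 FLOP/s per Blackwell chip
utilization = 0.40         # assumed MFU for dense BF16 training
seconds = 4 * 30 * 86400   # ~4-month run

flops = 2 * chips_per_gw * bf16_per_chip * utilization * seconds
print(f"{flops:.0e}")      # ~1e+28 BF16 FLOPs
```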
So pretty similar to Agent 2 and Agent 4 at some points, keeping to the highest estimates, but with less compute than the plot suggests for months while the next generation of datacenters is being constructed (during the late 2026 to early 2027 Blackwell-Rubin gap).
Beliefs held by others are a real phenomenon, so tracking them doesn't give them unearned weight in attention, as long as they are not confused with your own beliefs. You can even learn things specifically for the purpose of changing their simulated mind rather than your own (in whatever direction the winds of evidence happen to blow).
The scale of training and R&D spending by AI companies can be reduced on short notice, while global inference buildout costs much more and needs years of use to pay for itself. So an AI slowdown mostly hurts clouds and makes compute cheap due to oversupply, which might be a wash for AI companies. Confusingly, major AI companies are closely tied to cloud providers, but OpenAI is distancing itself from Microsoft, and Meta and xAI are not cloud providers, so they wouldn't suffer as much. In any case the tech giants will survive; it's losing their favor that seems more likely to damage AI companies, making them no longer able to invest as much in R&D.
https://slatestarcodex.com/2014/07/30/meditations-on-moloch/
It's "mainstream" here, described well many times before.
if we didn't have a capitalist system, then the entire point about profit motives, pride, and race dynamics wouldn't apply
Presence of many nations without a central authority still contributes to race dynamics.
The loss goes down; whether that helps in some more legible way that also happens to be impactful is much harder to figure out. The experiments in the May 2023 paper show that training on some dataset and training on a random quarter of that dataset repeated 4 times results in approximately the same loss. Even 15 repetitions remain useful, though at that point somewhat less useful than 15 times more unique data.
This strongly suggests that repeating merely 3 times will robustly be about as useful as having 3 times more data from the same distribution. I don't know of comparably strong clues that would change this expectation.