This is a special post for quick takes by Vladimir_Nesov. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Recursive self-improvement in AI probably comes before AGI. Evolution doesn't need to understand human minds to build them, and a parent doesn't need to be an AI researcher to make a child. The bitter lesson and the practice of recent years suggest that building increasingly capable AIs doesn't depend on understanding how they think.

Thus the least capable AI that can build superintelligence without human input only needs to be a competent engineer that can scale and refine a sufficiently efficient AI design, in an empirically driven mundane way that doesn't depend on matching capabilities of Grothendieck for conceptual invention. This makes the threshold of AGI less relevant for timelines of recursive self-improvement than I previously expected. With o1 and what straightforwardly follows, we plausibly already have all it takes to get recursive self-improvement, if the current designs get there with the next few years of scaling, and the resulting AIs are merely competent engineers that fail to match humans at less legible technical skills.

7TsviBT
The bitter lesson says that there are many things you don't need to understand, but it doesn't say you don't need to understand anything. I think you're doing a "we just need X" with recursive self-improvement. The improvement may be iterable and self-applicable... but is it general? Is it on a bounded trajectory or an unbounded trajectory? Very different outcomes.
2Nathan Helm-Burger
Yeah, although I am bullish on the general direction of RSI, I also think that in the details it factors into many dimensions of improvement. Some of which are likely fast-but-bounded and will quickly plateau, others which are slow-but-not-near-term-bounded... The fact that there are many different dimensions over which RSI might operate makes it hard to predict precisely, but does give some general predictions.  For instance, we might expect it not to be completely blocked (since there will be many independent dimensions along which to apply optimization pressure, so blocking one won't block them all).  Another prediction we might make is that seeing some rapid progress doesn't guarantee that either a complete wall will be hit soon or that progress will continue just as fast or faster. Things might just be messy, with a jagged inconsistent line proceeding up and to the right. Zoom out enough, and it may look smooth, but for our very-relevant-to-us near-term dynamics, it could just be quite noisy. 
6faul_sname
Technically this probably isn't recursive self-improvement, but rather automated AI progress. This is relevant mostly because:
1. It implies that, at least through the early parts of the takeoff, there will be a lot of individual AI agents doing locally-useful compute-efficiency and improvement-on-relevant-benchmarks things, rather than one single coherent agent following a global plan for configuring the matter in the universe in a way that maximizes some particular internally-represented utility function.
2. It means that multi-agent dynamics will be very relevant in how things happen.
If your threat model is "no group of humans manages to gain control of the future before human irrelevance", none of this probably matters.

No group of AIs needs to gain control before human irrelevance either. Like a runaway algal bloom, AIs might be able to bootstrap superintelligence, without crossing the threshold of AGI being useful in helping them gain control over this process any more than humans maintain such control at the outset. So it's not even multi-agent dynamics shaping the outcome; capitalism might just serve as the nutrients until a much higher threshold of capability where a superintelligence can finally take control of this process.

4cubefox
Cutting edge AI research is one of the most difficult tasks humans are currently working on, so the intelligence requirement to replace human researchers is quite high. It is likely that most ordinary software development, being easier, will be automated before AI research is automated. I'm unsure whether LLMs with long chains of thought (o1-like models) can reach this level of intelligence before human researchers invent a more general AI architecture.

Humans are capable of solving conceptually difficult problems, so they do. An easier path might be possible that doesn't depend on such capabilities, and doesn't stall for their lack, like evolution doesn't stall for lack of any mind at all. If there is more potential for making models smarter alien tigers by scaling RL in o1-like post-training, and the scaling proceeds to 1 gigawatt and then 35 gigawatt training systems, it might well be sufficient to get an engineer AI that can improve such systems further, at 400x and then 10,000x the compute of GPT-4.

Before o1, there was a significant gap, the mysterious absence of System 2 capabilities, with only a vague expectation that they might emerge or become easier to elicit from scaled up base models. This uncertainty no longer gates engineering capabilities of AIs. I'm still unsure that scaling directly can make AIs capable of novel conceptual thought, but AIs becoming able to experimentally iterate on AI designs seems likely, and that in turn seems sufficient to eventually mutate these designs towards the remaining missing capabilities.

(It's useful to frame most ideas as exploratory engineering rather than forecasting. The question of whe... (read more)

Cutting edge AI research seems remarkably and surprisingly easy compared to other forms of cutting edge science. Most things work on the first try, clever insights aren't required, it's mostly an engineering task of scaling compute. 

3bohaska
This seems like the sort of R&D that China is good at: research that doesn't need superstar researchers and that is mostly made of incremental improvements. And yet they don't seem to be producing top LLMs. Why is that?
7Alexander Gietelink Oldenziel
China is producing research in a number of areas right now that is surpassing the West and arguably more impressive scientifically than producing top LLMs. A big reason China is lagging a little bit might be political interference at major tech companies. Xi Jinping instigated a major crackdown recently. There is also significantly less Chinese text data. I am not a China or tech expert so these are just guesses. In any case, I wouldn't assign it too much significance. The AI space is just moving so quickly that even a minor year delay can seem like lightyears. But that doesn't mean that Chinese companies can't do it, or that a country-continent with 1.4 billion people and a history of many technological firsts can't scale up a transformer.
2Tomás B.
@gwern

The speed of scaling pretraining will go down ~3x in 2027-2029, reducing probability of crossing transformative capability thresholds per unit of time after that point, if they'd not been crossed yet by then.

GPT-4 was trained in 2022 at ~2e25 FLOPs, Grok-3 and GPT-4.5 were trained in 2024 at ~3e26 FLOPs (or twice that in FP8) using ~100K H100s training systems (which cost ~$4-5bn to build). In 2026, Abilene site of Crusoe/Stargate/OpenAI will have 400K-500K Blackwell chips in NVL72 racks (which cost ~$22-35bn to build), enough to train a ~4e27 FLOPs model. Thus recently there is a 2-year ~6x increase in cost for a frontier training system and a 2-year ~14x increase in compute. But for 2028 this would mean a $150bn training system (which is a lot, so only borderline plausible), and then $900bn in 2030. At that point AI companies would need to either somehow figure out how to pool resources, or pretraining will stop scaling before 2030 (assuming AI still doesn't hit a transformative commercial success).

If funding stops increasing, what we are left with is the increase in price performance of ~2.2x every 2 years, which is ~3.3x slower than the 2-year ~14x at the current pace. (I'm estimating price performance for a whole datacenter or at least a rack, rather than only for chips.)
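For concreteness, here's a minimal BOTEC sketch of the numbers above, using the quoted compute and cost figures as inputs (the 2026 cost is taken as a midpoint of the ~$22-35bn range; everything else is arithmetic, not new data):

```python
# Frontier training-system BOTEC using the figures quoted above (no new data).
compute = {2022: 2e25, 2024: 3e26, 2026: 4e27}   # GPT-4, Grok-3/GPT-4.5-class, Abilene-class
cost = {2024: 4.5e9, 2026: 28e9}                 # ~$4-5bn and ~$22-35bn (midpoints)

compute_per_2y = compute[2026] / compute[2024]   # ~13x, i.e. the ~14x above
cost_per_2y = cost[2026] / cost[2024]            # ~6x
price_perf_per_2y = compute_per_2y / cost_per_2y # ~2.1-2.2x per 2 years
print(f"compute ~{compute_per_2y:.0f}x, cost ~{cost_per_2y:.1f}x, "
      f"price-performance ~{price_perf_per_2y:.1f}x per 2 years")

# Extrapolating cost at this pace lands in the ballpark of the $150bn (2028)
# and $900bn (2030) figures above.
for year in (2028, 2030):
    cost[year] = cost[year - 2] * cost_per_2y
    print(year, f"~${cost[year] / 1e9:.0f}bn")
```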

4ryan_greenblatt
We also hit limits on fab capacity without constructing a bunch more fabs around a similar time. ---------------------------------------- Price performance of 2.2x per year feels aggressive to me. The chip-only trend is more like 1.35x / year from my understanding. Do you think the ML chip trend is much faster than this? I don't see how you could have a 2.2x price drop per year longer term without chip price performance following, as eventually chips will be the bottleneck even if other costs (e.g., interconnect, building datacenters) are dropping. Edit: this was 2.2x every 2 years, I was just confused.
6Vladimir_Nesov
If I'm reading the relevant post correctly, it's 1.35x FP32 FLOP/s per GPU per year (2x in 2.3 years), which is not price-performance[1]. The latter is estimated to be 1.4x FP32 FLOP/s per inflation-adjusted dollar (2x in 2.1 years). It's 2.2x per 2 years, which is 1.5x per year, though that's still more than 1.4x per year. I'm guessing packaging is part of this, and also Nvidia is still charging a giant margin for the chips, so the chip manufacturing cost is far from dominating the all-in datacenter cost. This might be enough to sustain 1.5x per year a bit beyond 2030 (the discrepancy of 1.5/1.4 only reaches 2x after 10 years). But even if we do get back to 1.4x/year, that only turns the 3.3x reduction in speed of pretraining scaling into 3.9x reduction in speed, so the point stands. ---------------------------------------- 1. Incidentally, the word "GPU" has recently lost all meaning, since Nvidia started variably referring to either packages with multiple compute dies in them as GPUs (in Blackwell), or to individual compute dies (in Rubin). Packaging will be breaking trends for FLOP/s per package, but also FLOP/s per compute die, for example Rubin seems to derive significant advantage per compute die from introducing separate smaller I/O dies, so that the reticle sized compute dies become more specialized and their performance when considered in isolation might improve above trend. ↩︎
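A quick check of the compounding claims in the comment above (pure arithmetic, no new assumptions):

```python
per_year = 2.2 ** 0.5                      # 2.2x per 2 years
print(f"{per_year:.2f}x per year")         # ~1.48x, i.e. roughly 1.5x/yr
gap_after_10y = (1.5 / 1.4) ** 10
print(f"1.5x vs 1.4x compounds to a {gap_after_10y:.2f}x gap over 10 years")  # ~2x
```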
3ryan_greenblatt
Oh oops, I just misread you, didn't realize you said 2.2x every 2 years, nvm.

A surprising report by Bloomberg claims 16K GB200[1] by summer 2025 at Abilene site (pilot campus of Stargate) and merely 64K GB200 by end of 2026. This is way too little to be a training system, Colossus already has more compute (200K H100/H200) than the projected 64K GB200 at end of 2026.

If this is correct, OpenAI will be training with Azure rather than Stargate in 2025, so a raw compute GPT-5 (2e27 FLOPs, 100x GPT-4) probably won't be out in 2025 and officially "GPT-5" will mean something else (since it's due "in months" in any case according to Altman). Also, a datacenter with 16K Blackwells only costs about $1bn, and they have more money than this, which suggests Blackwell ramp-up trouble that might delay everyone else as well, though as a lower bound Nvidia reported $11bn in Blackwell sales for Nov 2024 - Jan 2025 (it's "Q4 2025" since their FY 2025 runs to end of Jan 2025).


  1. In principle "16K GB200" might mean more Blackwell chips than 16K, a compute tray has more than one chip, with variants marketed as named products like GB200 NVL4 "superchip", but even at 4 chips per tray/board we still get below 200K H100s in compute. And an NVL72 system has 72 chips (which brings the numbe

... (read more)

I think 'GB200' refers to this column (2 Blackwell GPUs + 1 Grace CPU), so 16K GB200s ~= 32K B200s ~= 80K H100s. Agree that it is still very low. 

My guess is that Bloomberg's phrasing is just misleading or the reporting is incomplete. For example, maybe they are only reporting the chips Oracle is contributing or something like that. I'd be very surprised if OpenAI don't have access to >200K GB200s ~= 1M H100s by the end of 2025. For reference, that is only ~$20B capex (assuming $100k total cost of ownership per GB200) or roughly 1/4 of what Microsoft alone plan to invest this year.

Once they have just 100K GB200s, that should train 2e27 FLOP in 4 months.[1]

  1. ^

    There's a nice correspondence between H100s and FLOP/month (assuming 40% utilisation and 16-bit precision) of 1e21 FLOP/month/H100. So since 100K GB200s = 500K H100s, that's 5e26 FLOP/month.
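A minimal sketch of the footnoted rule of thumb, assuming 1 GB200 superchip = 2 Blackwell chips ≈ 5 H100s of 16-bit compute and ~1e21 FLOP per H100-month at 40% utilization (both conversions are the ones used in this thread):

```python
H100_FLOP_PER_MONTH = 1e21       # at ~40% utilization, 16-bit precision
H100_EQUIV_PER_GB200 = 5         # 1 GB200 superchip = 2 Blackwell chips ~= 5 H100s

def train_flops(n_gb200: int, months: float) -> float:
    """FLOP produced by n GB200 superchips over the given number of months."""
    return n_gb200 * H100_EQUIV_PER_GB200 * H100_FLOP_PER_MONTH * months

print(f"{train_flops(100_000, 4):.1e}")   # ~2e27 FLOP in 4 months, as claimed above
```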

The marketing terminology is inconvenient, a "superchip" can mean 2-GPU or 4-GPU boards and even a 72-GPU system (1 or possibly 2 racks). So it's better to talk in terms of chips (that are not "superchips"), which I think are all B200 run at slightly different clock speeds (not to be confused with B200A/B102/B20 that have 2 times less compute). In GB200, the chips are 2.5x faster than H100/H200 (not 5x faster; so a 200K chip GB200 system has the same compute as a 500K chip H100 system, not a 1M chip H100 system). Power requirements are often a good clue that helps disambiguate, compute doesn't consistently help because it tends to get reported at randomly chosen precision and sparsity[1].

Large scale-up worlds (or good chips) are not necessarily very important in pretraining, especially in the later steps of the optimizer when the critical batch size gets high enough, so it's not completely obvious that a training system will prefer to wait for NVL72 even if other packagings of Blackwell are more available earlier. Inference does benefit from NVL72 a lot, but for pretraining it's just cheaper per FLOP than H100 and faster in wall clock time during the first ~3T tokens when the whole... (read more)

3romeo
That's indeed inconvenient. I was aware of NVL2, NVL4, NVL36, NVL72, but I was under the impression that 'GB200' mentioned on its own always means 2 Blackwells, 1 Grace (unless you add on a 'NVL__'). Are there counterexamples to this? I scanned the links you mentioned and only saw 'GB200 NVL2,' 'GB200 NVL4,' 'GB200 NVL72' respectively.  I was operating on this pretty confidently but unsure where else I saw this described (apart from the column I linked above). On a quick search of 'GB200 vs B200' the first link I found seemed to corroborate GB200 = 2xB200s + 1xGrace CPU. Edit: second link also says: "the Grace-Blackwell GB200 Superchip. This is a module that has two B200 GPUs wired to an NVIDIA Grace CPU..."
5Vladimir_Nesov
"GB200 superchip" seems to be unambiguously Grace+2xB200. The issue is "100K GB200 GPUs" or "100K GB200 cluster", and to some extent "100K GPU GB200 NVL72 cluster". Also, people will abbreviate various clearer forms to just "GB200". I think "100K chip GB200 NVL72 training system" less ambiguously refers to the number of B200s, but someone unfamiliar with this terminological nightmare might abbreviate it to "100K GB200 system".
5romeo
Good point, thanks. Previously I would have pretty confidently read "100K GB200 GPUs," or "100K GB200 cluster" as 200K B200s (~= 500K H100s) but I can see how it's easily ambiguous. Now that I think of it, I remembered this Tom's Hardware article where B200 and GB200 are mistakenly used interchangeably (compare the subtitle vs. the end of the first paragraph)...

Abilene site of Stargate will host 100K-128K chips in GB200 NVL72 racks by this summer, and a total of 400K-512K chips in 2026, based on a new post by Crusoe and a reinterpretation of the recent Bloomberg post in light of the Crusoe post. For 2025, it's less than 200K chips[1], but more than the surprising 16K-32K chips[2] that the Bloomberg post suggested. It can be a training system after all, but training a raw compute "GPT-5" (2e27 FLOPs) by the end of 2025 would require using FP8[3].

The Crusoe post says "initial phase, comprising two buildings at ... 200+ megawatts" and "each building is designed to operate up to 50,000 NVIDIA GB200 NVL72s". Dylan Patel's estimate (at 1:24:42) for all-in power per Blackwell GPU as a fraction of the datacenter was 2.0 kW (meaning per chip, or else it's way too much). At GTC 2025, Jensen Huang showed a slide (at 1:20:52) where the estimate is 2.3 kW per chip (100 MW per 85K dies, which is 42.5K chips).

So the "50K GB200 NVL72s" per building from the Mar 2025 Crusoe post can only mean the number of chips (not dies or superchips), and the "100K GPUs" per building from the Jul 2024 Crusoe post must've meant 100K compute dies (which is 50K chips). It... (read more)

A MoE transformer can reach the same loss as a compute optimal dense model using 3x-6x less compute, but will need the same amount of data to do it. So compute optimal MoEs don't improve data efficiency, don't contribute to mitigating data scarcity.

A new Jan 2025 paper offers straightforward compute multiplier results comparing dense transformers to MoE at various levels of sparsity, with isoFLOPs for various tokens/parameter ratios, using experiments of up to 1e21 FLOPs per datapoint. Compute multiplier results are in Figure 11, with about 3x compute multiplier for 87% (1:8) sparse MoE over dense, and about 6x-7x compute multiplier for 97% (1:32) sparse MoE (same sparsity as DeepSeek-V3).

But there's a catch. Greater sparsity makes it compute optimal to use fewer active parameters, and therefore more data (training with the same compute). This can be seen on isoFLOP plots in Figure 12, left. As sparsity goes from 0% (dense) to 95% (1:20), compute optimal number of active parameters for their 1e21 FLOPs experiments goes from 2.9B to 1.3B. For 97% (1:32) sparsity, interpolating from experiments on the other compute budgets, the ratio of the number of active parameters seems to be abo... (read more)
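A small sketch of the "fewer active parameters, therefore more data at fixed compute" point, using the standard C ≈ 6·N·D approximation; the 2.9B and 1.3B compute-optimal active-parameter counts are the Figure 12 numbers quoted above, the rest is arithmetic:

```python
def tokens_at_compute(flops: float, active_params: float) -> float:
    """Training tokens implied by C ~= 6 * N_active * D."""
    return flops / (6 * active_params)

C = 1e21                                   # per-datapoint budget in the paper's isoFLOPs
dense_opt, sparse95_opt = 2.9e9, 1.3e9     # compute-optimal active params (Figure 12)

print(f"dense:      ~{tokens_at_compute(C, dense_opt) / 1e9:.0f}B tokens")    # ~57B
print(f"95% sparse: ~{tokens_at_compute(C, sparse95_opt) / 1e9:.0f}B tokens") # ~128B
# Same compute, ~2.2x fewer active params, hence ~2.2x more data: sparsity trades
# parameters for data rather than reducing the data requirement.
```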

5ryan_greenblatt
I agree compute optimal MoEs don't improve data utilization. But, naively you might expect that MoEs can be used to reduce issues with data scarcity at a fixed level of compute by training a much bigger model on a fixed amount of data. As in, because there are returns to both more data and bigger models, you can use MoE to effectively use a much bigger model at the same compute. Like, maybe you would have trained llama-3-405B on 15T tokens. You could instead train an 8 trillion parameter model with 400B active params on 15T tokens and a priori this could perform much better on that same amount of data. (In practice an MoE with X active params is more expensive to train than a dense model with X active params, so you might need to reduce active params somewhat.)
3Vladimir_Nesov
Chinchilla scaling shows that tokens/params ratio for compute optimal models only changes slowly with compute, making it a good anchor to frame other things in terms of. The experiments from this MoE scaling paper show that under fixed data, varying sparsity in MoEs that are compute optimal at that amount of data preserves perplexity. This also seems like a nice principle for framing the way compute optimal models sit in the space of hyperparameters. With infinite data, isoFLOPs for loss depending on number of active params are parabolas with some minimum point. But with finite data you need to repeat it to train with fewer active params, which damages loss. This moves the minima of isoFLOPs to the right if the minima already required 5x repetition or more. So under data scarcity, compute optimal models have more active params than under infinite data, and the effect gets worse with more compute. This way we maintain the framing of search for compute optimal hyperparameters rather than undertraining. Now consider the 1e20 FLOPs plot in Figure 12, left. If there's only 2B tokens of training data and no more, all minima already ask for 12-31 epochs, so the distortion that increases loss will move the minima to the right (and up), and move the high sparsity minima further than lower sparsity minima compared to their original (infinite data) locations. The way the isoFLOPs are shaped suggests that 90-95% sparsity might turn out to be optimal here, that is you can only get worse loss with 98+% sparsity at 1e20 FLOPs, however you vary the number of epochs and active params! This seems counterintuitive, as in an infinite data regime more sparsity only makes things better (if we ignore practical difficulties). But sure, 90% sparsity will still be better than dense, at least until we use even more compute and sparser minima start asking for even more epochs.
2ryan_greenblatt
I'm currently skeptical and more minimally, I don't understand the argument you're making. Probably not worth getting into. I do think there will be a limit to how sparse you want to go, even in the very-high-compute-relative-to-data regime, for various reasons (computational if nothing else). I don't see how these graphs support 90-95% sparsity, but I had a hard time understanding your argument. Regardless, I don't think this argues against my claim, not sure if you were trying to argue against the claim I was saying or add context. (Insofar as your argument is true, it does limit the returns from MoE in the regime with little data.)
4Vladimir_Nesov
With 90% sparsity you do get better loss than dense, this is sufficient to broadly carry your argument. But with 98% sparsity (your llama-3-405B variant example has 95% sparsity) you might get worse loss than with 90% when data is scarce, though it'll still be better than dense. The principle about MoE damaging data efficiency (optimal tokens/param ratio) hints that this might be the case even before looking at the experiments.
1Archimedes
Even if it’s the same cost to train, wouldn’t it still be a win if inference is a significant part of your compute budget?

Chatbot Arena results for DeepSeek-V3 are in. It placed 7th in Overall w/ Style Control, tied with Claude-3.5.Oct-Sonnet, and 3rd in Hard Prompts w/ Style Control, tied with Gemini-2.0-Flash and behind only Claude-3.5.Oct-Sonnet, mysterious Gemini-Exp-1206, o1, and Gemini-2.0-Flash-Thinking.

It's a MoE model with 37B active parameters trained for about 5e24 FLOPs, 10x less compute than Llama-3-405B, 20x less than what could plausibly be extracted from 30K H100s in BF16. The pretraining data is about 15T tokens, so at 400 tokens per active parameter it's very overtrained, that is, not even compute optimal.

It has 256 routed experts per layer, 8 of which get activated per token. These results give some weight to the Feb 2024 paper that predicts that using more granular experts and activating a lot of them per token can give shocking compute multipliers[1], up to 20x-30x, much more than for MoE transformers that only activate 1-2 routed experts per token (Figure 1b). The paper itself only does experiments of up to about 5e19 FLOPs, in particular directly demonstrating a compute multiplier of 2x from using 8 experts per token instead of 2, with the numbers of total and active parameters k... (read more)
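Quick checks of the headline numbers, as a sketch; the parameter, token, and FLOP figures are the ones quoted above, the per-H100 rule of thumb is the one used elsewhere in this thread, and the ~3 month run length is my assumption:

```python
active_params = 37e9
pretrain_tokens = 15e12
print(f"tokens per active param: ~{pretrain_tokens / active_params:.0f}")   # ~400

# "20x less than what could plausibly be extracted from 30K H100s in BF16",
# using ~1e21 FLOP/month per H100 at 40% utilization and an assumed ~3 month run.
plausible_flops = 30_000 * 1e21 * 3
print(f"30K H100s, ~3 months: ~{plausible_flops:.0e} FLOP, "
      f"~{plausible_flops / 5e24:.0f}x the ~5e24 actually used")   # ~18-20x
```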

New AWS Trainium 2 cluster offers compute equivalent to 250K H100s[1], and under this assumption Anthropic implied[2] their previous compute was 50K H100s (possibly what was used to train Claude 3.5 Opus).

So their current or imminent models are probably 1e26-2e26 FLOPs (2-4 months on 50K H100s at 40% compute utilization in BF16)[3], and the upcoming models in mid to late 2025 will be 5e26-1e27 FLOPs, ahead of what 100K H100s clusters of other players (possibly except Google) can deliver by that time.


  1. SemiAnalysis gives an estimate of 24-27 kilowatts per 32 Trainium 2 chips, so 200K Trn2s need 150 megawatts. The 7 datacenter buildings in the northern part of the New Carlisle AWS site are 65 megawatts each according to SemiAnalysis. That's enough for 600K Trn2s, so the figure of 400K Trn2s probably refers to those buildings alone, rather than also to the second phase of the project scheduled for next year. At 0.65e15 dense BF16 FLOP/s each, 400K Trn2s produce as much compute as 250K H100s. ↩︎

  2. Anthropic's post: "This cluster will deliver more than five times the computing power used to train our current generation of leading AI models." ↩︎

  3. At 4 months, with $2/hour, this takes $3

... (read more)
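For readability, the footnoted arithmetic as a small sketch; the kW and FLOP/s figures are the estimates quoted above, and the H100 comparisons use ~1e15 dense BF16 FLOP/s per H100 and the ~1e21 FLOP/month rule of thumb from elsewhere in this thread:

```python
kw_per_32_trn2 = (24, 27)                     # SemiAnalysis estimate
kw_per_trn2 = [x / 32 for x in kw_per_32_trn2]          # ~0.75-0.84 kW per chip
print(f"200K Trn2: ~{200_000 * min(kw_per_trn2) / 1e3:.0f}-"
      f"{200_000 * max(kw_per_trn2) / 1e3:.0f} MW")     # ~150-170 MW

site_mw = 7 * 65                              # 7 buildings at 65 MW each
print(f"site supports ~{site_mw * 1e3 / max(kw_per_trn2) / 1e3:.0f}K-"
      f"{site_mw * 1e3 / min(kw_per_trn2) / 1e3:.0f}K Trn2")   # ~540K-610K chips

trn2_flops, h100_flops = 0.65e15, 1e15        # dense BF16 FLOP/s
print(f"400K Trn2 ~= {400_000 * trn2_flops / h100_flops / 1e3:.0f}K H100s")  # ~250K-260K

# Implied prior compute: 50K H100s for 2-4 months at ~1e21 FLOP/month/H100.
print(f"{50_000 * 1e21 * 2:.0e} - {50_000 * 1e21 * 4:.0e} FLOP")   # 1e26-2e26
```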

Are you saying Anthropic actually has more compute (in the relevant sense) than OpenAI right now? That feels like a surprising claim, big if true.

For OpenAI, there are currently 3 datacenter buildings[1] near Phoenix Goodyear Airport that Dylan Patel is claiming are 48 megawatts each and filled with H100s, for about 100K H100s. This probably got online around May 2024, the reason for the announcement and the referent of Kevin Scott's blue whale slide.

There are claims about a future cluster of 300K B200s and a geographically distributed training system of 500K-700K B200s, but deliveries of B200s in high volume to any given customer might only start in early to mid 2025, so these systems will probably get online only towards end of 2025. In the meantime, Anthropic might have a lead in having the largest cluster, even if they spend less on compute for smaller experiments overall. It might take a while to get it working, but there might be a few months there. And given how good Claude 3.5 Sonnet is, together with the above musings on how it's plausibly merely 4e25 FLOPs based on Dario Amodei's (somewhat oblique) claim about cost, additionally getting a compute advantage in training a frontier model could carry them quite far.


  1. There are 4.5 buildings now at that site, but you can see with Google Street View from Litchfield Rd

... (read more)
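A power cross-check of the Goodyear claim as a sketch; the 3 buildings at 48 MW each are as quoted above, while the ~1.4 kW all-in per H100 figure is my assumption for a typical H100 datacenter, not something from the post:

```python
buildings, mw_each = 3, 48
total_mw = buildings * mw_each                 # 144 MW
kw_per_h100_all_in = 1.4                       # assumed all-in datacenter power per H100
print(f"~{total_mw * 1e3 / kw_per_h100_all_in / 1e3:.0f}K H100s")
# ~100K H100s, consistent with the claim above and with "100K H100s at 150 MW" below.
```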
3romeo
Thanks Vladimir, this is really interesting! Re: OpenAI's compute, I inferred from this NYT article that their $8.7B costs this year were likely to include about $6B in compute costs, which implies an average use of ~274k H100s throughout the year[1] (assuming $2.50/hr average H100 rental price). Assuming this was their annual average, I would've guessed they'd be on track to be using around 400k H100s by now.  So the 150k H100s campus in Phoenix might be only a small fraction of the total compute they have access to? Does this sound plausible? The co-location of the Trainium2 cluster might give Anthropic a short-term advantage, though I think it's actually quite unclear if their networking and topology will fully enable this advantage. Perhaps the OpenAI Phoenix campus is well-connected enough to another OpenAI campus to be doing a 2-campus asynchronous training run effectively. 1. ^ $6e9 / 365.25d / 24h / $2.5/hr = 274k
4Vladimir_Nesov
Training as it's currently done needs to happen within a single cluster (though this might change soon). The size of the cluster constrains how good a model can be trained within a few months. Everything that isn't training of a frontier model can happen using many smaller clusters, something like 16 to 4096 accelerators each. You can use a lot of these smaller clusters, but they can be sourced from anywhere and built piecemeal at multiple sites with smaller power allocations, while the big training cluster needs to be a single purposefully built system. So I expect the big expenses are inference and many training experiments with smaller models. What I'm discussing here is the big cluster for training frontier models rather than the aggregate of the small clusters for other purposes. See also this comment. Patel's claim is 100K H100s at 150 megawatts.
5Aaron_Scher
I think that's probably wrong, or at least effectively wrong. Gemini 1.0, trained a year ago, has the following info in the technical report:  As you note, public distributed training methods have advanced beyond basic data parallelism (though they have not been publicly shown at large model scales because nobody has really tried yet). 
5Vladimir_Nesov
This might require bandwidth of about 300 Tbps for 500K B200s systems (connecting their geographically distributed parts), based on the below estimate. It gets worse with scale. The "cluster" label applied in this context might be a bit of a stretch, for example the Llama 3 24K H100s cluster is organized in pods of 3072 GPUs, and the pods themselves are unambiguously clusters, but at the top level they are connected with 1:7 oversubscription (Section 3.3.1). Only averaged gradients need to be exchanged at the top level, once at each optimizer step (minibatch). Llama 3 405B has about 1M minibatches with about 6 seconds per step[1], which means latency doesn't matter, only bandwidth. I'm not sure what precision is appropriate for averaging gradients, but at 4 bytes per weight that's 1.6TB of data to be sent each way in much less than 6 seconds, say in 1 second. This is bandwidth of 12 Tbps, which fits in what a single fiber of a fiber optic cable can transmit. Overland cables are laid with hundreds of fibers, so datacenters within the US can probably get at least one fiber of bandwidth between them. Overly large minibatches are bad for quality of training, and with H100s in a standard setup only 8 GPUs are within NVLink scaleup domains that enable tensor parallelism. If each token sequence is processed on 8 GPUs (at a given stage of pipeline parallelism), that makes it necessary to process 2K sequences at once (Llama 3 only uses 16K GPUs in its training), and with 8K tokens per sequence that's our 16M tokens per minibatch, for 1M minibatches[2]. But if scaleup domains were larger and enabled more tensor parallelism (for an appropriately large model), there would be fewer sequences processed simultaneously for smaller minibatches, so the time between optimizer steps would decrease, from Llama 3 405B's 6 seconds down to less than that, making the necessary gradient communication bandwidth higher. Some B200s come as NVL72 machines with 72 GPUs per scaleup domain. And
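A sketch of the bandwidth estimate above, assuming gradients are exchanged once per optimizer step at 4 bytes per parameter and must fit into ~1 second (the step time, GPU count, and sequence length are the Llama 3 405B figures quoted above):

```python
params = 405e9
bytes_per_grad = 4                      # assumed precision for gradient averaging
grad_bytes = params * bytes_per_grad    # ~1.6 TB per exchange
seconds_budget = 1                      # well under the ~6 s optimizer step
tbps = grad_bytes * 8 / seconds_budget / 1e12
print(f"~{tbps:.0f} Tbps")              # ~13 Tbps, in line with the ~12 Tbps above

# Minibatch size: 16K GPUs / 8-GPU scaleup domains = 2K sequences of 8K tokens.
tokens_per_minibatch = (16_000 // 8) * 8_000
total_tokens = 15e12
print(f"{tokens_per_minibatch / 1e6:.0f}M tokens/minibatch, "
      f"~{total_tokens / tokens_per_minibatch / 1e6:.1f}M optimizer steps")  # ~1M steps
```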

And in a way, they ought to be rolling in even more compute than it looks because they are so much more focused: Anthropic isn't doing image generation, it isn't doing voice synthesis, it isn't doing video generation... (As far as we know they aren't researching those, and definitely not serving it to customers like OA or Google.) It does text LLMs. That's it.

But nevertheless, an hour ago, working on a little literary project, I hit Anthropic switching my Claude to 'concise' responses to save compute. (Ironically, I think that may have made the outputs better, not worse, for that project, because Claude tends to 'overwrite', especially in what I was working on.)

5Daniel Kokotajlo
I'd guess that the amount spent on image and voice is negligible for this BOTEC?  I do think that the amount spent on inference for customers should be a big deal though. My understanding is that OpenAI has a much bigger userbase than Anthropic. Shouldn't that mean that, all else equal, Anthropic has more compute to spare for training & experiments? Such that if Anthropic has about as much compute total, they in effect have a big compute advantage?

Long reasoning training might fail to surpass pass@50-pass@400 capabilities of the base/instruct model. A new paper measured pass@k[1] performance for models before and after RL training on verifiable tasks, and it turns out that the effect of training is to lift pass@k performance at low k, but also to lower it at high k!

Location of the crossover point varies, but it gets lower with more training (Figure 7, bottom), suggesting that no amount of RL training of this kind lets a model surpass the pass@k performance of the base/instruct model at the crossover point reached with a small amount of RL training. (Would be interesting to know how the pass@k plots depend on the number of reasoning tokens, for models that allow control over the reasoning budget.)


  1. A task is solved at pass@k if an oracle verifier claims at least one of k sampled solutions to be correct. See Figure 3, left in this Jul 2024 paper for how pass@k affects performance, depending on the model. ↩︎
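For concreteness, the standard unbiased pass@k estimator (from the original Codex paper, Chen et al. 2021); this is an added illustration of the metric, not necessarily the exact procedure used in the linked papers: given n samples of which c are verified correct, it estimates the probability that at least one of k samples is correct.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples with c verified correct."""
    if n - c < k:
        return 1.0          # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example (hypothetical numbers): 400 samples, 12 verified correct.
print(pass_at_k(400, 12, 1), pass_at_k(400, 12, 50))
```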

5Thane Ruthenis
Huh. This is roughly what I'd expected, but even I didn't expect it to be so underwhelming.[1] I weakly predict that the situation isn't quite as bad for capabilities as this makes it look. But I do think something-like-this is likely the case. 1. ^ Of course, moving a pass@400 capability to pass@1 isn't nothing, but it's clearly astronomically short of a Singularity-enabling technique that RL-on-CoTs is touted as.
5ryan_greenblatt
This seems relatively clearly false in the case of competition programming problems. Concretely, o3 with 50 submissions beats o1 with 10k submissions. (And o1 is presumably much better than the underlying instruct model.) I'd guess this paper doesn't have the actual optimal methods.

o3 has a different base model (presumably). 

All of the figures in the paper hold the base model fixed, comparing RL and non-RL variants of the same model.

I would expect "this paper doesn't have the actual optimal methods" to be true; this is specifically a test of PPO for in-distribution actions. Concretely, there is a potential story here where PPO reinforces traces that hit in self-play; consequently, there is a sense in which we would expect it to only select previously on-policy actions.

But if one has enough money, one can finetune GPT models and test that.

Also note that 10k submissions is about 2 OOM out of distribution for the charts in the paper. 

Pass at inf k includes every path with nonzero probability (if there is a policy of discarding exact repeat paths). 

We know that RL decreases model entropy, so the first k passes will be more different for a high variance model. 

Pass at k is take-best, and for a normal distribution the expected best of k samples is roughly mean + SD·sqrt(2·ln k). 

At very large K, we would expect variance to matter more than mean. 

9Ivan Vendrov
this isn’t evidence against OP? if it’s true that RL lowers pass@k performance for sufficiently large k, we’d certainly expect o1 with 10k submissions to be weaker than base/instruct with 10k submissions.
6Vladimir_Nesov
It's evidence to the extent that the mere fact of publishing Figure 7 (hopefully) suggests that the authors (likely knowing relevant OpenAI internal research) didn't expect that their pass@10K result for the reasoning model is much worse than the language monkey pass@10K result for the underlying non-reasoning model. So maybe it's not actually worse.
7faul_sname
If I'm interpreting the paper correctly, the k at which base models start beating RL'd models is a per-task number, and k can be arbitrarily high for a given task, and the 50-400 range was specifically for tasks of the type the authors chose within a narrow difficulty band. Let's say you have a base model which performs at 35% on 5 digit addition, and an RL'd model which performs at 99.98%. Even if the failures of the RL'd model are perfectly correlated, you'd need k=20 for base@20 to exceed the performance of fine-tuned@20. And the failures of the RL model won't be perfectly correlated - but this paper claims that the failures of the RL model will be more correlated than the failures of the base model, and so the lines will cross eventually, and "eventually" was @50 to @400 in the tasks they tested. But you could define a task where you pass in 10 pairs of 5 digit numbers and the model must correctly find the sum of each pair. The base model will probably succeed at this task at somewhere on the order of 0.35^10 or about 0.003% of the time, while the RL'd model should succeed about 99.8% of the time. So for this task we'd expect k in the range of k=220,000 assuming perfectly-correlated failures in the RL model, and higher otherwise. Also I suspect that there is some astronomically high k such that monkeys at a keyboard (i.e. "output random tokens") will outperform base models for some tasks by the pass@k metric.
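A quick check of the 10-pair-addition arithmetic in the comment above, assuming independent samples for the base model and perfectly correlated failures for the RL'd model, as stated:

```python
from math import log

p_base_single = 0.35 ** 10          # base model gets all 10 additions right
p_rl = 0.998                        # RL model pass@k, flat if failures are correlated
print(f"base pass@1: {p_base_single:.2e}")            # ~2.8e-5, i.e. ~0.003%

# Smallest k where base pass@k = 1 - (1 - p)^k exceeds the RL model's 0.998:
k_cross = log(1 - p_rl) / log(1 - p_base_single)
print(f"crossover at k ~ {k_cross:,.0f}")             # ~225,000, i.e. the ~220,000 above
```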
6gwern
It would be an extreme bias-variance tradeoff, yes.
3Vladimir_Nesov
The interesting concept in the paper is the location of the crossover point, which seems remarkably stable (for a given task) across specific RL techniques and amount of RL training. It can be measured experimentally for a task by doing a little bit of RL training, and RL@1 performance won't get better than that with more training, so you're unlikely to get the RL model to succeed 99.8% of the time (at pass@1) ever unless the level of performance of the base model at the crossover point with a weak RL model was already higher than 99.8%. Probably the crossover point for a task depends on things that can be changed (such as strength of the pretrained model, or size/relevance of the verifiable task dataset, or possibly the inference time reasoning budget). The issue isn't for example as straightforward as losing entropy in RL policy (as a formulation of reduced exploration), since DAPO specifically addresses this issue (otherwise present in vanilla GRPO), but the pass@k plot for DAPO (Figure 7, top) barely moves (compared to other methods), in their experiment it's even slightly worse at the crossover point. So in the context of this paper it remains unclear how to move the plot to reach ever higher base@k performance using RL@1, higher than the ceiling of where base@k already was at the crossover point when comparing with some method at only 100-500 RL steps.
3Thane Ruthenis
Intuitively, this shouldn't matter much. They use some RL-on-CoTs method that works, and I expect its effects are not fundamentally different from optimal methods'. Thus, optimal methods might yield better quantitative results, but similar qualitative results: maybe they'd let elicit pass@800 capabilities instead of "just" pass@400, but it'd still be just pass@k elicitation for not-astronomical k. Not strongly convinced of that, though.
3Vladimir_Nesov
In the hypothetical where the paper's results hold, reasoning model performance at pass@k will match non-reasoning model performance with the number of samples closer to the crossover point between reasoning and non-reasoning pass@k plots. If those points for o1 and o3 are somewhere between 50 and 10K (say, at ~200), then pass@10K for o1 might be equivalent to ~pass@400 for o1's base model (looking at Figure 2), while pass@50 for o3 might be equivalent to ~pass@100 for its base model (which is probably different from o1's base model). So the difference of 200x (10K vs. 50) in the number of samples becomes much smaller when comparing performance of the base models. For GPT-4o vs. GPT-4.1, a difference of ~4x in the number of samples doesn't seem too strange. There's also the possibility of distillation from a reasoning variant of GPT-4.5, which could have an even larger effect on pass@k performance at low k (Figure 6, right).
1mrtreasure
If true, would this imply you want a base model to generate lots of solutions and a reasoning model to identify the promising ones and train on those?

Google might start 2026 with the largest training system among the big labs, by a factor of about 2x, at about 1 GW.

OpenAI/Microsoft Stargate schism suggests that compute being built this year by Microsoft is unlikely to form part of a geographically distributed training system that also includes compute being built at Abilene site. Seems like OpenAI will be building its own training systems (through Stargate), while Microsoft will be serving inference (and possibly generation for RL training, but it remains unclear if it can be an important fraction of pretraining budget in 2025-2026). Thus only 400-600 MW of GB200s by end of 2025 for an OpenAI training system, not 1 GW.

Meta announced a 2 GW datacenter at Richland Parish site, but 1 GW for 2025 seems to be across all datacenters, not for a single training system. So the training system will be smaller by end of 2025.

5anaguma
How does Anthropic and XAi’s compute compare over this period?

What actually happens with xAI and Anthropic compute by end of 2025 is less clear. For xAI, a 300K B200s figure was mentioned in June 2024. For Anthropic, Amodei said in a recent interview that

I would not be surprised if in 2026 we have more than a million of some kind of chip.

Meanwhile, xAI will have a 200K H100/H200 system, and Anthropic a 400K Trn2 system, which is about 250K H100s worth of FLOP/s (ready by a few months into 2025). The 400-600 MW at Abilene site for OpenAI are 200K-300K B200s, which is about 500K-750K H100s worth of FLOP/s.

2Lorenzo
For context, average US electricity consumption in 2022 was ~500GW. So these would be ~1% of all US electricity consumption (as an order of magnitude)

GPT-5 should be released late 2025 at the earliest if OpenAI follows the usual naming convention of roughly 100x in raw compute. With GPT-4 at 2e25 FLOPs, GPT-4.5 should have about 2e26 FLOPs and GPT-5 about 2e27 FLOPs. A 100K H100 training system, like the one in Goodyear (or Musk's Memphis datacenter as it was late 2024), can train a 3e26 FLOPs model, which fits the name of GPT-4.5, but it can't train a 2e27 FLOPs model.

The new Stargate site in Abilene might be preparing to host 200K-300K chips in GB200 NVL72 racks. These chips produce 2.5x more compute than H100s, so 200K would be sufficient to get 2e27 FLOPs and train a GPT-5. If there's already enough power (about 400 MW all-in for 200K chips), shipments of GB200 in bulk start in early 2025, get installed at xAI's pace, and go into pretraining for 4 months, then with 1 more month of post-training it's already November.
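The naming-convention arithmetic as a sketch; the compute figures are as quoted, and the ~1e21 FLOP per H100-month and 2.5x H100-per-Blackwell-chip conversions are the ones used elsewhere in this thread:

```python
H100_FLOP_PER_MONTH = 1e21          # at ~40% utilization, 16-bit precision
gpt4 = 2e25
ladder = {"GPT-4": gpt4, "GPT-4.5": gpt4 * 10, "GPT-5": gpt4 * 100}
print(ladder)                        # 2e25, 2e26, 2e27

# A 100K H100 system over ~3-4 months tops out around 3e26-4e26 FLOP:
print(f"{100_000 * H100_FLOP_PER_MONTH * 3:.0e} - {100_000 * H100_FLOP_PER_MONTH * 4:.0e}")

# 200K GB200 NVL72 chips ~= 500K H100s, enough for 2e27 FLOPs in 4 months:
print(f"{200_000 * 2.5 * H100_FLOP_PER_MONTH * 4:.0e}")
```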

So the rumors about GPT-5 in late May 2025 either represent change in the naming convention, or correspond to some intermediate milestone in training GPT-5, likely the training system being in principle ready to start pretraining.

So the rumors about GPT-5 in late May 2025 either represent change in the naming convention

Per Altman:

In both ChatGPT and our API, we will release GPT-5 as a system that integrates a lot of our technology, including o3. We will no longer ship o3 as a standalone model.

I think he's pretty plainly saying that this "GPT-5" will be a completely different thing from a 100x'd GPT-4.

3Vladimir_Nesov
This is perfectly consistent with GPT-5 being 100x GPT-4 compute. Announcing specific features that will go into it suggests they have a prototype, in this case I'm guessing the LLM will itself be trained to decide whether to go into the reasoning mode, triggering it when needed and affordable, like any other tool.
3Thane Ruthenis
I don't see it. He says that GPT-5 will be a system that "integrates o3". This isn't his sloppy way of saying "integrates the reasoning techniques": when he wants to express that idea, he talks about "unifying o-series models and GPT-series models". The wording regarding GPT-5 is consistent with him literally saying that the model o3 will be part of GPT-5. Furthermore, I take "as" in "GPT-5 as a system that integrates a lot of our technology" to mean "GPT-5 is defined as {a system that integrates a lot of our technology, including o3}". Not "GPT-5 will be trained to automatically switch between a standard mode, a reasoning mode, a Deep Research mode, etc.", not even "GPT-5 will be trained to recognize when to fall back to o3, a lesser model", but literally "we're slapping the GPT-5 label on a glorified wrapper over all our current models".
5Vladimir_Nesov
The "glorified wrapper" could still be a 2e27 FLOPs model, it could even be using literal o3 as one of its tools (in addition to all the other tools, with native GPT-5 long reasoning mostly reserved for premium tier). This is in line with the "agents" agenda where better reliability in taking irreversible actions unlocks new use cases, in this case whether to make use of expensive reasoning calls. Since "GPT-4.5" will actually be released rather than skipped, it's less plausible for "GPT-5" to come out shortly after. If it's announced in ~Dec 2025 (the way o3 was), it's still "within months", and then it can actually get released in ~Feb 2026.
2Thane Ruthenis
Hm, fair enough. Seems like a stretch, though, especially given the need to interpret his "ETA in months" as "will be officially announced in months and released in a year".
5Vladimir_Nesov
There was also Murati in Jun 2024 predicting PhD level AI in 18 months. If they succeed in achieving parity with xAI in terms of safety procedures, they might even release a preview checkpoint in Dec 2025 for Pro users. So actual release in a year is not strictly necessary for this hypothesis, it's just closer to what they've done in the past.

if OpenAI follows the usual naming convention of roughly 100x in raw compute.

I doubt this is a real convention. I think OpenAI wanted to call Orion GPT-5 if they thought it was good enough to deserve the name.

4Vladimir_Nesov
I'm merely referring to the historical precedent, whether there are informal commitments in the minds of the leadership is not something I can speak to. This pattern might continue or it might break. What I'm guessing about training system buildout from vague clues seems to be consistent with it continuing, so the naming pattern can be used as another clue to make a point estimate prediction that's more concrete.

Stargate is evidence towards slower training system scaling. The rumored reason for starting the project is that Microsoft isn't building giant frontier training systems fast enough, probably because they aren't seeing the case for doing that faster. In which case other hyperscalers might think similarly, and they are the most well-positioned to build these systems, so this attitude might be indicative of how frontier training systems get built overall, which is notably slower than technically feasible.

The $80bn Microsoft capex is not relevant to this if it goes to many smaller systems[1], which is only natural as there are millions of datacenter GPUs but only a few 100K GPU frontier training systems, a tiny fraction of inference and smaller/research training compute. The $500bn figure is not relevant as for now it's only a vague plan. But Microsoft not agreeing to build training systems on OpenAI's schedule is some evidence.

OpenAI would want to get from under Microsoft's thumb anyway[2], and this gets ever more difficult over time, since frontier training systems get ever more expensive, so the sooner they try the more likely they are to succeed. But even this consideration is som... (read more)

When people are skeptical about the concept of AGI being meaningful or having clear boundaries, it could sometimes be downstream of skepticism about very fast and impactful R&D done by AIs, such as software-only singularity or things like macroscopic biotech where compute buildout happens at a speed impossible for human industry. Such events are needed to serve as landmarks, anchoring a clear concept of AGI, otherwise the definition remains contentious.

So AI company CEOs who complain about AGI being too nebulous to define might already be expecting a scaling slowdown, with their strategy being primarily about the fight for the soul of the 2028-2030 market. When scaling is slow, it'll become too difficult to gain a significant quality advantage sufficient to defeat the incumbents. So the decisive battle is happening now, with the rhetoric making it more palatable to push through the decisions to build the $140bn training systems of 2028.

This behavior doesn't need to be at all related to expecting superintelligence, it makes sense as a consequence of not expecting superintelligence in the near future.

2Noosphere89
As someone who thinks superintelligence could come in the near future, I basically agree with @snewman's view that AIs have to automate the entire economy, or automate a sector that could then automate everything else very fast, but unfortunately for us this basically gives us no good fire alarms for AGI unless @Ege Erdil and @Matthew Barnett et al are right that takeoff is slow enough that most value comes from broad automation, and external use dominates internal use: https://amistrongeryet.substack.com/p/defining-agi
1LWLW
I think short timelines just don’t square with the way intelligence agencies are behaving. The NSA took Y2K more seriously than it currently seems to be taking near-term AGI. You can make the argument that intelligence agencies are less competent than they used to be, but I don’t buy that they aren’t at least extremely paranoid and moderately competent: that seems like their job.
7Thane Ruthenis
Researchers at AGI labs seem to genuinely believe the hype they're selling, a significant fraction of non-affiliated top-of-the-line DL researchers is inclined to believe them as well, and basically all competent well-informed people agree that the short-timelines position is not unreasonable to hold. Dismissing short timelines based on NSA's behavior requires assuming that they're much more competent in the field of AI than everyone in the above list. After all, that'd require them to be strongly (and correctly) confident that all these superstar researchers above are incorrect. While that's not impossible, it seems highly unlikely to me. Much more likely that they're significantly less competent, and accordingly dismissive.

Yi-Lightning (01 AI) Chatbot Arena results are surprisingly strong for its price, which puts it at about 10B active parameters[1]. It's above Claude 3.5 Sonnet and GPT-4o in Math, above Gemini 1.5 Pro 002 in English and Hard Prompts (English). It's above all non-frontier models in Coding and Hard Prompts (both with Style Control), including Qwen-2.5-72B (trained on 18T tokens). Interesting if this is mostly a better methodology or compute scaling getting taken more seriously for a tiny model.


  1. The developer's site says it's a MoE model. Developer's API docs list it at ¥0.99/1M tokens. The currency must be Renminbi, so that's about $0.14. Together serves Llama-3-8B for $0.10-0.18 (per million tokens), Qwen-2.5-7B for $0.30, all MoE models up to 56B total (not active) parameters for $0.60. (The prices for open weights models won't have significant margins, and model size is known, unlike with lightweight closed models.) ↩︎

4Vladimir_Nesov
Kai-Fu Lee, CEO of 01 AI, posted on LinkedIn: Assuming it's trained in BF16 with 40% compute utilization, that's a 2e24 FLOPs model (Llama-3-70B is about 6e24 FLOPs, but it's not MoE, so the FLOPs are not used as well). Assuming from per token price that it has 10-20B active parameters, it's trained on 15-30T tokens. So not an exercise in extreme compute scaling, just excellent execution.
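A sketch of the inversion being done here, using the C ≈ 6·N·D approximation; the ~2e24 FLOPs figure and the 10-20B active-parameter guess are the ones in the comment above:

```python
C = 2e24                       # ~2e24 FLOPs, as estimated above
for n_active in (10e9, 20e9):  # the 10-20B active-parameter guess from the token price
    tokens = C / (6 * n_active)          # invert C ~= 6 * N_active * D
    print(f"{n_active / 1e9:.0f}B active -> ~{tokens / 1e12:.0f}T tokens")
# ~17T-33T tokens, on the order of the 15-30T quoted above.
```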

Cultural/moral maturity (in a civilization) has never been observed before, similarly to technological maturity. Scalable production of a new kind of thing brings its abundance in sight, which fails to be a concern earlier, while it couldn't be scaled. A moderate level of AI alignment or of cultural change is not an equilibrium if these things are anchored to scalable resources (effective cognition and coordination, fast subjective serial time). Instead they reach extremes of the kind never observed before those resources become scalable.

2Mateusz Bagiński
Are you trying to say that for any X, instead of X-maturity, we should instead expect X-foom until the marginal returns get too low?
2Vladimir_Nesov
A pre-abundance precedent about X offers poor framing for thinking about the consequences of discovering a scalable process of producing X. Before abundance, it's artisanal and quirky and path-dependent, the extremes are rare and dysfunctional, so people don't worry about it too much. There is security in it looking like an equilibrium, but not being truly settled, so that people can influence things. Abundance brings maturity, changes the character of the equilibrium. So not foom necessarily, just a promise of maturity at some point, which wouldn't have been as salient before there is a scalable process of production. And there is an excuse of ignoring the possibility even longer, because of the total lack of historical precedent (of the associated problems).
1Kaarel
i’d be interested in hearing why you think that cultural/moral/technological/mathematical maturity is even possible or eventually likely (as opposed to one just being immature forever[1]) (assuming you indeed do think that) ---------------------------------------- 1. which seems more likely to me ↩︎
2Vladimir_Nesov
I mean "maturity" merely compared to how we view what can currently be happening, such as a baseline level of competence in civilization-level governance, or what the individual people are capable of. Maturity compared to that baseline washes away all the currently relevant fiddly things, replacing them by settled processes. These new processes are truly settled, so whatever new concerns become important then, the new baseline won't be overturned. The analogy with technological maturity is that the laws of physics and ways of getting things done within them is a fixed problem statement, so new baselines of effectiveness get locked in.

Economics studies the scaling laws of systems of human industry. LLMs and multicellular organisms and tokamaks have their own scaling laws, the constraints ensuring optimality of their scaling don't transfer between these very different machines. A better design doesn't just choose more optimal hyperparameters or introduce scaling multipliers, it can occasionally create a new thing acting on different inputs and outputs, scaling in its own way, barely noticing what holds back the other things.

A reflectively stable agent prefers to preserve some property of itself. This doesn't in general prevent it from being able to self-improve, in the same way that unchanging laws of physics don't prevent presence of self-improving agents in the world.

The content of the world keeps changing under the unchanging laws of how it changes, and similarly a reflectively stable agent (against safety properties) has content (such as beliefs) that keeps changing, in principle enabling unfettered self-improvement. Mesa-agents existing in the form of the content of the ... (read more)

1CstineSublime
Are there pivotal ways this is different to the theories of Enactivism? (" Its authors define cognition as enaction, which they in turn characterize as the ‘bringing forth’ of domains of significance through organismic activity that has been itself conditioned by a history of interactions between an organism and its environment." which at first blush I'd say is a reflectively stable agent modifying or updating beliefs by means of enaction. Enactivism also rejects mind-body duality in favour of a more 'embodied' cognition approach together with a "deep continuity of the principles of self-organization from the simplest living things to more complex cognitive beings"), particularly autopoiesis.   An autopoietic system can be contrasted to an allopoietic system which creates objects different to itself, like a factory. Most living beings are autopoietic in that they either produce themselves, or things like them, which seems to be similar to a reflectively stable agent, particularly when we describe the more complicated cognitive beings in autopoietic terms. Luhmann argued that social systems too are self-organizing, self-reproducing systems, which brought the concepts of enactivism from biology and cognitive science into the social sciences.