This is a special post for quick takes by Vladimir_Nesov. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Recursive self-improvement in AI probably comes before AGI. Evolution doesn't need to understand human minds to build them, and a parent doesn't need to be an AI researcher to make a child. The bitter lesson and the practice of recent years suggest that building increasingly capable AIs doesn't depend on understanding how they think.

Thus the least capable AI that can build superintelligence without human input only needs to be a competent engineer that can scale and refine a sufficiently efficient AI design, in an empirically driven mundane way that doesn't depend on matching capabilities of Grothendieck for conceptual invention. This makes the threshold of AGI less relevant for timelines of recursive self-improvement than I previously expected. With o1 and what straightforwardly follows, we plausibly already have all it takes to get recursive self-improvement, if the current designs get there with the next few years of scaling, and the resulting AIs are merely competent engineers that fail to match humans at less legible technical skills.

7TsviBT
The bitter lesson says that there are many things you don't need to understand, but it doesn't say you don't need to understand anything. I think you're doing a "we just need X" with recursive self-improvement. The improvement may be iterable and self-applicable... but is it general? Is it on a bounded trajectory or an unbounded trajectory? Very different outcomes.
2Nathan Helm-Burger
Yeah, although I am bullish on the general direction of RSI, I also think that in the details it factors into many dimensions of improvement. Some of which are likely fast-but-bounded and will quickly plateau, others which are slow-but-not-near-term-bounded... The fact that there are many different dimensions over which RSI might operate makes it hard to predict precisely, but does give some general predictions.  For instance, we might expect it not to be completely blocked (since there will be many independent dimensions along which to apply optimization pressure, so blocking one won't block them all).  Another prediction we might make is that seeing some rapid progress doesn't guarantee that either a complete wall will be hit soon or that progress will continue just as fast or faster. Things might just be messy, with a jagged inconsistent line proceeding up and to the right. Zoom out enough, and it may look smooth, but for our very-relevant-to-us near-term dynamics, it could just be quite noisy. 
6faul_sname
Technically this probably isn't recursive self-improvement, but rather automated AI progress. This is relevant mostly because:
1. It implies that, at least through the early parts of the takeoff, there will be a lot of individual AI agents doing locally-useful compute-efficiency and improvement-on-relevant-benchmarks things, rather than one single coherent agent following a global plan for configuring the matter in the universe in a way that maximizes some particular internally-represented utility function.
2. It means that multi-agent dynamics will be very relevant in how things happen.
If your threat model is "no group of humans manages to gain control of the future before human irrelevance", none of this probably matters.

No group of AIs needs to gain control before human irrelevance either. Like a runaway algal bloom, AIs might be able to bootstrap superintelligence, without crossing the threshold of AGI being useful in helping them gain control over this process any more than humans maintain such control at the outset. So it's not even multi-agent dynamics shaping the outcome; capitalism might just serve as the nutrients until a much higher threshold of capability where a superintelligence can finally take control of this process.

4cubefox
Cutting edge AI research is one of the most difficult tasks humans are currently working on, so the intelligence requirement to replace human researchers is quite high. It is likely that most ordinary software development, being easier, will be automated before AI research is automated. I'm unsure whether LLMs with long chains of thought (o1-like models) can reach this level of intelligence before human researchers invent a more general AI architecture.

Humans are capable of solving conceptually difficult problems, so they do. An easier path might be possible that doesn't depend on such capabilities, and doesn't stall for their lack, like evolution doesn't stall for lack of any mind at all. If there is more potential for making models smarter alien tigers by scaling RL in o1-like post-training, and the scaling proceeds to 1 gigawatt and then 35 gigawatt training systems, it might well be sufficient to get an engineer AI that can improve such systems further, at 400x and then 10,000x the compute of GPT-4.

Before o1, there was a significant gap, the mysterious absence of System 2 capabilities, with only a vague expectation that they might emerge or become easier to elicit from scaled up base models. This uncertainty no longer gates engineering capabilities of AIs. I'm still unsure that scaling directly can make AIs capable of novel conceptual thought, but AIs becoming able to experimentally iterate on AI designs seems likely, and that in turn seems sufficient to eventually mutate these designs towards the remaining missing capabilities.

(It's useful to frame most ideas as exploratory engineering rather than forecasting. The question of whe... (read more)

Cutting edge AI research seems remarkably and surprisingly easy compared to other forms of cutting edge science. Most things work on the first try, clever insights aren't required, it's mostly an engineering task of scaling compute. 

3bohaska
This seems like the sort of R&D that China is good at: research that doesn't need superstar researchers and that is mostly made of incremental improvements. And yet they don't seem to be producing top LLMs. Why is that?
7Alexander Gietelink Oldenziel
China is producing research in a number of areas right now that is surpassing the West and arguably more impressive scientifically than producing top LLMs. A big reason China is lagging a little bit might be political interference at major tech companies. Xi Jinping instigated a major crackdown recently. There is also significantly less Chinese text data. I am not a China or tech expert so these are just guesses. In any case, I wouldn't assign it too much significance. The AI space is just moving so quickly that even a minor year delay can seem like lightyears. But that doesn't mean that Chinese companies can't do it, or that a country-continent with 1.4 billion people and a history of many technological firsts can't scale up a transformer.
2Tomás B.
@gwern

The speed of scaling pretraining will go down ~3x in 2027-2029, reducing probability of crossing transformative capability thresholds per unit of time after that point, if they'd not been crossed yet by then.

GPT-4 was trained in 2022 at ~2e25 FLOPs, Grok-3 and GPT-4.5 were trained in 2024 at ~3e26 FLOPs (or twice that in FP8) using ~100K H100s training systems (which cost ~$4-5bn to build). In 2026, Abilene site of Crusoe/Stargate/OpenAI will have 400K-500K Blackwell chips in NVL72 racks (which cost ~$22-35bn to build), enough to train a ~4e27 FLOPs model. Thus recently there is a 2-year ~6x increase in cost for a frontier training system and a 2-year ~14x increase in compute. But for 2028 this would mean a $150bn training system (which is a lot, so only borderline plausible), and then $900bn in 2030. At that point AI companies would need to either somehow figure out how to pool resources, or pretraining will stop scaling before 2030 (assuming AI still doesn't hit a transformative commercial success).

If funding stops increasing, what we are left with is the increase in price performance of ~2.2x every 2 years, which is ~3.3x slower than the 2-year ~14x at the current pace. (I'm estimating price performance for a whole datacenter or at least a rack, rather than only for chips.)
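For concreteness, here's a minimal BOTEC sketch of the numbers above, using the quoted compute and cost figures as inputs (the 2026 cost is taken as a midpoint of the ~$22-35bn range; everything else is arithmetic, not new data):

```python
# Frontier training-system BOTEC using the figures quoted above (no new data).
compute = {2022: 2e25, 2024: 3e26, 2026: 4e27}   # GPT-4, Grok-3/GPT-4.5-class, Abilene-class
cost = {2024: 4.5e9, 2026: 28e9}                 # ~$4-5bn and ~$22-35bn (midpoints)

compute_per_2y = compute[2026] / compute[2024]   # ~13x, i.e. the ~14x above
cost_per_2y = cost[2026] / cost[2024]            # ~6x
price_perf_per_2y = compute_per_2y / cost_per_2y # ~2.1-2.2x per 2 years
print(f"compute ~{compute_per_2y:.0f}x, cost ~{cost_per_2y:.1f}x, "
      f"price-performance ~{price_perf_per_2y:.1f}x per 2 years")

# Extrapolating cost at this pace lands in the ballpark of the $150bn (2028)
# and $900bn (2030) figures above.
for year in (2028, 2030):
    cost[year] = cost[year - 2] * cost_per_2y
    print(year, f"~${cost[year] / 1e9:.0f}bn")
```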

4ryan_greenblatt
We also hit limits on fab capacity without constructing a bunch more fabs around a similar time. ---------------------------------------- Price performance of 2.2x per year feels aggressive to me. The chip-only trend is more like 1.35x / year from my understanding. Do you think the ML chip trend is much faster than this? I don't see how you could have a 2.2x price drop per year longer term without chip price performance following, as eventually chips will be the bottleneck even if other costs (e.g., interconnect, building datacenters) are dropping. Edit: this was 2.2x every 2 years, I was just confused.
6Vladimir_Nesov
If I'm reading the relevant post correctly, it's 1.35x FP32 FLOP/s per GPU per year (2x in 2.3 years), which is not price-performance[1]. The latter is estimated to be 1.4x FP32 FLOP/s per inflation-adjusted dollar (2x in 2.1 years). It's 2.2x per 2 years, which is 1.5x per year, though that's still more than 1.4x per year. I'm guessing packaging is part of this, and also Nvidia is still charging a giant margin for the chips, so the chip manufacturing cost is far from dominating the all-in datacenter cost. This might be enough to sustain 1.5x per year a bit beyond 2030 (the discrepancy of 1.5/1.4 only reaches 2x after 10 years). But even if we do get back to 1.4x/year, that only turns the 3.3x reduction in speed of pretraining scaling into 3.9x reduction in speed, so the point stands. ---------------------------------------- 1. Incidentally, the word "GPU" has recently lost all meaning, since Nvidia started variably referring to either packages with multiple compute dies in them as GPUs (in Blackwell), or to individual compute dies (in Rubin). Packaging will be breaking trends for FLOP/s per package, but also FLOP/s per compute die, for example Rubin seems to derive significant advantage per compute die from introducing separate smaller I/O dies, so that the reticle sized compute dies become more specialized and their performance when considered in isolation might improve above trend. ↩︎
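A quick check of the compounding claims in the comment above (pure arithmetic, no new assumptions):

```python
per_year = 2.2 ** 0.5                      # 2.2x per 2 years
print(f"{per_year:.2f}x per year")         # ~1.48x, i.e. roughly 1.5x/yr
gap_after_10y = (1.5 / 1.4) ** 10
print(f"1.5x vs 1.4x compounds to a {gap_after_10y:.2f}x gap over 10 years")  # ~2x
```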
3ryan_greenblatt
Oh oops, I just misread you, didn't realize you said 2.2x every 2 years, nvm.

A surprising report by Bloomberg claims 16K GB200[1] by summer 2025 at Abilene site (pilot campus of Stargate) and merely 64K GB200 by end of 2026. This is way too little to be a training system, Colossus already has more compute (200K H100/H200) than the projected 64K GB200 at end of 2026.

If this is correct, OpenAI will be training with Azure rather than Stargate in 2025, so a raw compute GPT-5 (2e27 FLOPs, 100x GPT-4) probably won't be out in 2025 and officially "GPT-5" will mean something else (since it's due "in months" in any case according to Altman). Also, a datacenter with 16K Blackwells only costs about $1bn, and they have more money than this, which suggests Blackwell ramp-up trouble that might delay everyone else as well, though as a lower bound Nvidia reported $11bn in Blackwell sales for Nov 2024 - Jan 2025 (it's "Q4 2025" since their FY 2025 runs to end of Jan 2025).


  1. In principle "16K GB200" might mean more Blackwell chips than 16K, a compute tray has more than one chip, with variants marketed as named products like GB200 NVL4 "superchip", but even at 4 chips per tray/board we still get below 200K H100s in compute. And an NVL72 system has 72 chips (which brings the numbe

... (read more)

I think 'GB200' refers to this column (2 Blackwell GPUs + 1 Grace CPU), so 16K GB200s ~= 32K B200s ~= 80K H100s. Agree that it is still very low. 

My guess is that Bloomberg's phrasing is just misleading or the reporting is incomplete. For example, maybe they are only reporting the chips Oracle is contributing or something like that. I'd be very surprised if OpenAI don't have access to >200K GB200s ~= 1M H100s by the end of 2025. For reference, that is only ~$20B capex (assuming $100k total cost of ownership per GB200) or roughly 1/4 of what Microsoft alone plan to invest this year.

Once they have just 100K GB200s, that should train 2e27 FLOP in 4 months.[1]

  1. ^

    There's a nice correspondence between H100s and FLOP/month (assuming 40% utilisation and 16-bit precision) of 1e21 FLOP/month/H100. So since 100K GB200s = 500K H100s, that's 5e26 FLOP/month.
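A minimal sketch of the footnoted rule of thumb, assuming 1 GB200 superchip = 2 Blackwell chips ≈ 5 H100s of 16-bit compute and ~1e21 FLOP per H100-month at 40% utilization (both conversions are the ones used in this thread):

```python
H100_FLOP_PER_MONTH = 1e21       # at ~40% utilization, 16-bit precision
H100_EQUIV_PER_GB200 = 5         # 1 GB200 superchip = 2 Blackwell chips ~= 5 H100s

def train_flops(n_gb200: int, months: float) -> float:
    """FLOP produced by n GB200 superchips over the given number of months."""
    return n_gb200 * H100_EQUIV_PER_GB200 * H100_FLOP_PER_MONTH * months

print(f"{train_flops(100_000, 4):.1e}")   # ~2e27 FLOP in 4 months, as claimed above
```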

The marketing terminology is inconvenient, a "superchip" can mean 2-GPU or 4-GPU boards and even a 72-GPU system (1 or possibly 2 racks). So it's better to talk in terms of chips (that are not "superchips"), which I think are all B200 run at slightly different clock speeds (not to be confused with B200A/B102/B20 that have 2 times less compute). In GB200, the chips are 2.5x faster than H100/H200 (not 5x faster; so a 200K chip GB200 system has the same compute as a 500K chip H100 system, not a 1M chip H100 system). Power requirements are often a good clue that helps disambiguate, compute doesn't consistently help because it tends to get reported at randomly chosen precision and sparsity[1].

Large scale-up worlds (or good chips) are not necessarily very important in pretraining, especially in the later steps of the optimizer when the critical batch size gets high enough, so it's not completely obvious that a training system will prefer to wait for NVL72 even if other packagings of Blackwell are more available earlier. Inference does benefit from NVL72 a lot, but for pretraining it's just cheaper per FLOP than H100 and faster in wall clock time during the first ~3T tokens when the whole... (read more)

3romeo
That's indeed inconvenient. I was aware of NVL2, NVL4, NVL36, NVL72, but I was under the impression that 'GB200' mentioned on its own always means 2 Blackwells, 1 Grace (unless you add on a 'NVL__'). Are there counterexamples to this? I scanned the links you mentioned and only saw 'GB200 NVL2,' 'GB200 NVL4,' 'GB200 NVL72' respectively.  I was operating on this pretty confidently but unsure where else I saw this described (apart from the column I linked above). On a quick search of 'GB200 vs B200' the first link I found seemed to corroborate GB200 = 2xB200s + 1xGrace CPU. Edit: second link also says: "the Grace-Blackwell GB200 Superchip. This is a module that has two B200 GPUs wired to an NVIDIA Grace CPU..."
5Vladimir_Nesov
"GB200 superchip" seems to be unambiguously Grace+2xB200. The issue is "100K GB200 GPUs" or "100K GB200 cluster", and to some extent "100K GPU GB200 NVL72 cluster". Also, people will abbreviate various clearer forms to just "GB200". I think "100K chip GB200 NVL72 training system" less ambiguously refers to the number of B200s, but someone unfamiliar with this terminological nightmare might abbreviate it to "100K GB200 system".
5romeo
Good point, thanks. Previously I would have pretty confidently read "100K GB200 GPUs," or "100K GB200 cluster" as 200K B200s (~= 500K H100s) but I can see how it's easily ambiguous. Now that I think of it, I remembered this Tom's Hardware article where B200 and GB200 are mistakenly used interchangeably (compare the subtitle vs. the end of the first paragraph)...

Abilene site of Stargate will host 100K-128K chips in GB200 NVL72 racks by this summer, and a total of 400K-512K chips in 2026, based on a new post by Crusoe and a reinterpretation of the recent Bloomberg post in light of the Crusoe post. For 2025, it's less than 200K chips[1], but more than the surprising 16K-32K chips[2] that the Bloomberg post suggested. It can be a training system after all, but training a raw compute "GPT-5" (2e27 FLOPs) by the end of 2025 would require using FP8[3].

The Crusoe post says "initial phase, comprising two buildings at ... 200+ megawatts" and "each building is designed to operate up to 50,000 NVIDIA GB200 NVL72s". Dylan Patel's estimate (at 1:24:42) for all-in power per Blackwell GPU as a fraction of the datacenter was 2.0 kW (meaning per chip, or else it's way too much). At GTC 2025, Jensen Huang showed a slide (at 1:20:52) where the estimate is 2.3 kW per chip (100 MW per 85K dies, which is 42.5K chips).

So the "50K GB200 NVL72s" per building from the Mar 2025 Crusoe post can only mean the number of chips (not dies or superchips), and the "100K GPUs" per building from the Jul 2024 Crusoe post must've meant 100K compute dies (which is 50K chips). It... (read more)

A MoE transformer can reach the same loss as a compute optimal dense model using 3x-6x less compute, but will need the same amount of data to do it. So compute optimal MoEs don't improve data efficiency, don't contribute to mitigating data scarcity.

A new Jan 2025 paper offers straightforward compute multiplier results comparing dense transformers to MoE at various levels of sparsity, with isoFLOPs for various tokens/parameter ratios, using experiments of up to 1e21 FLOPs per datapoint. Compute multiplier results are in Figure 11, with about 3x compute multiplier for 87% (1:8) sparse MoE over dense, and about 6x-7x compute multiplier for 97% (1:32) sparse MoE (same sparsity as DeepSeek-V3).

But there's a catch. Greater sparsity makes it compute optimal to use fewer active parameters, and therefore more data (training with the same compute). This can be seen on isoFLOP plots in Figure 12, left. As sparsity goes from 0% (dense) to 95% (1:20), compute optimal number of active parameters for their 1e21 FLOPs experiments goes from 2.9B to 1.3B. For 97% (1:32) sparsity, interpolating from experiments on the other compute budgets, the ratio of the number of active parameters seems to be abo... (read more)
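A small sketch of the "fewer active parameters, therefore more data at fixed compute" point, using the standard C ≈ 6·N·D approximation; the 2.9B and 1.3B compute-optimal active-parameter counts are the Figure 12 numbers quoted above, the rest is arithmetic:

```python
def tokens_at_compute(flops: float, active_params: float) -> float:
    """Training tokens implied by C ~= 6 * N_active * D."""
    return flops / (6 * active_params)

C = 1e21                                   # per-datapoint budget in the paper's isoFLOPs
dense_opt, sparse95_opt = 2.9e9, 1.3e9     # compute-optimal active params (Figure 12)

print(f"dense:      ~{tokens_at_compute(C, dense_opt) / 1e9:.0f}B tokens")    # ~57B
print(f"95% sparse: ~{tokens_at_compute(C, sparse95_opt) / 1e9:.0f}B tokens") # ~128B
# Same compute, ~2.2x fewer active params, hence ~2.2x more data: sparsity trades
# parameters for data rather than reducing the data requirement.
```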

5ryan_greenblatt
I agree compute optimal MoEs don't improve data utilization. But, naively you might expect that MoEs can be used to reduce issues with data scarcity at a fixed level of compute by training a much bigger model on a fixed amount of data. As in, because there are returns to both more data and bigger models, you can use MoE to effectively use a much bigger model at the same compute. Like, maybe you would have trained llama-3-405B on 15T tokens. You could instead train an 8 trillion parameter model with 400B active params on 15T tokens and a priori this could perform much better on that same amount of data. (In practice an MoE with X active params is more expensive to train than a dense model with X active params, so you might need to reduce active params somewhat.)
3Vladimir_Nesov
Chinchilla scaling shows that tokens/params ratio for compute optimal models only changes slowly with compute, making it a good anchor to frame other things in terms of. The experiments from this MoE scaling paper show that under fixed data, varying sparsity in MoEs that are compute optimal at that amount of data preserves perplexity. This also seems like a nice principle for framing the way compute optimal models sit in the space of hyperparameters. With infinite data, isoFLOPs for loss depending on number of active params are parabolas with some minimum point. But with finite data you need to repeat it to train with fewer active params, which damages loss. This moves the minima of isoFLOPs to the right if the minima already required 5x repetition or more. So under data scarcity, compute optimal models have more active params than under infinite data, and the effect gets worse with more compute. This way we maintain the framing of search for compute optimal hyperparameters rather than undertraining. Now consider the 1e20 FLOPs plot in Figure 12, left. If there's only 2B tokens of training data and no more, all minima already ask for 12-31 epochs, so the distortion that increases loss will move the minima to the right (and up), and move the high sparsity minima further than lower sparsity minima compared to their original (infinite data) locations. The way the isoFLOPs are shaped suggests that 90-95% sparsity might turn out to be optimal here, that is you can only get worse loss with 98+% sparsity at 1e20 FLOPs, however you vary the number of epochs and active params! This seems counterintuitive, as in an infinite data regime more sparsity only makes things better (if we ignore practical difficulties). But sure, 90% sparsity will still be better than dense, at least until we use even more compute and sparser minima start asking for even more epochs.
2ryan_greenblatt
I'm currently skeptical and more minimally, I don't understand the argument you're making. Probably not worth getting into. I do think there will be a limit to how sparse you want to go, even in the very-high-compute-relative-to-data regime, for various reasons (computational if nothing else). I don't see how these graphs support 90-95% sparsity, but I had a hard time understanding your argument. Regardless, I don't think this argues against my claim, not sure if you were trying to argue against the claim I was saying or add context. (Insofar as your argument is true, it does limit the returns from MoE in the regime with little data.)
4Vladimir_Nesov
With 90% sparsity you do get better loss than dense, this is sufficient to broadly carry your argument. But with 98% sparsity (your llama-3-405B variant example has 95% sparsity) you might get worse loss than with 90% when data is scarce, though it'll still be better than dense. The principle about MoE damaging data efficiency (optimal tokens/param ratio) hints that this might be the case even before looking at the experiments.
1Archimedes
Even if it’s the same cost to train, wouldn’t it still be a win if inference is a significant part of your compute budget?

Chatbot Arena results for DeepSeek-V3 are in. It placed 7th in Overall w/ Style Control, tied with Claude-3.5.Oct-Sonnet, and 3rd in Hard Prompts w/ Style Control, tied with Gemini-2.0-Flash and behind only Claude-3.5.Oct-Sonnet, mysterious Gemini-Exp-1206, o1, and Gemini-2.0-Flash-Thinking.

It's a MoE model with 37B active parameters trained for about 5e24 FLOPs, 10x less compute than Llama-3-405B, 20x less than what could plausibly be extracted from 30K H100s in BF16. The pretraining data is about 15T tokens, so at 400 tokens per active parameter it's very overtrained, that is, not even compute optimal.

It has 256 routed experts per layer, 8 of which get activated per token. These results give some weight to the Feb 2024 paper that predicts that using more granular experts and activating a lot of them per token can give shocking compute multipliers[1], up to 20x-30x, much more than for MoE transformers that only activate 1-2 routed experts per token (Figure 1b). The paper itself only does experiments of up to about 5e19 FLOPs, in particular directly demonstrating a compute multiplier of 2x from using 8 experts per token instead of 2, with the numbers of total and active parameters k... (read more)
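Quick checks of the headline numbers, as a sketch; the parameter, token, and FLOP figures are the ones quoted above, the per-H100 rule of thumb is the one used elsewhere in this thread, and the ~3 month run length is my assumption:

```python
active_params = 37e9
pretrain_tokens = 15e12
print(f"tokens per active param: ~{pretrain_tokens / active_params:.0f}")   # ~400

# "20x less than what could plausibly be extracted from 30K H100s in BF16",
# using ~1e21 FLOP/month per H100 at 40% utilization and an assumed ~3 month run.
plausible_flops = 30_000 * 1e21 * 3
print(f"30K H100s, ~3 months: ~{plausible_flops:.0e} FLOP, "
      f"~{plausible_flops / 5e24:.0f}x the ~5e24 actually used")   # ~18-20x
```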

New AWS Trainium 2 cluster offers compute equivalent to 250K H100s[1], and under this assumption Anthropic implied[2] their previous compute was 50K H100s (possibly what was used to train Claude 3.5 Opus).

So their current or imminent models are probably 1e26-2e26 FLOPs (2-4 months on 50K H100s at 40% compute utilization in BF16)[3], and the upcoming models in mid to late 2025 will be 5e26-1e27 FLOPs, ahead of what 100K H100s clusters of other players (possibly except Google) can deliver by that time.


  1. SemiAnalysis gives an estimate of 24-27 kilowatts per 32 Trainium 2 chips, so 200K Trn2s need 150 megawatts. The 7 datacenter buildings in the northern part of the New Carlisle AWS site are 65 megawatts each according to SemiAnalysis. That's enough for 600K Trn2s, so the figure of 400K Trn2s probably refers to those buildings alone, rather than also to the second phase of the project scheduled for next year. At 0.65e15 dense BF16 FLOP/s each, 400K Trn2s produce as much compute as 250K H100s. ↩︎

  2. Anthropic's post: "This cluster will deliver more than five times the computing power used to train our current generation of leading AI models." ↩︎

  3. At 4 months, with $2/hour, this takes $3

... (read more)
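For readability, the footnoted arithmetic as a small sketch; the kW and FLOP/s figures are the estimates quoted above, and the H100 comparisons use ~1e15 dense BF16 FLOP/s per H100 and the ~1e21 FLOP/month rule of thumb from elsewhere in this thread:

```python
kw_per_32_trn2 = (24, 27)                     # SemiAnalysis estimate
kw_per_trn2 = [x / 32 for x in kw_per_32_trn2]          # ~0.75-0.84 kW per chip
print(f"200K Trn2: ~{200_000 * min(kw_per_trn2) / 1e3:.0f}-"
      f"{200_000 * max(kw_per_trn2) / 1e3:.0f} MW")     # ~150-170 MW

site_mw = 7 * 65                              # 7 buildings at 65 MW each
print(f"site supports ~{site_mw * 1e3 / max(kw_per_trn2) / 1e3:.0f}K-"
      f"{site_mw * 1e3 / min(kw_per_trn2) / 1e3:.0f}K Trn2")   # ~540K-610K chips

trn2_flops, h100_flops = 0.65e15, 1e15        # dense BF16 FLOP/s
print(f"400K Trn2 ~= {400_000 * trn2_flops / h100_flops / 1e3:.0f}K H100s")  # ~250K-260K

# Implied prior compute: 50K H100s for 2-4 months at ~1e21 FLOP/month/H100.
print(f"{50_000 * 1e21 * 2:.0e} - {50_000 * 1e21 * 4:.0e} FLOP")   # 1e26-2e26
```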

Are you saying Anthropic actually has more compute (in the relevant sense) than OpenAI right now? That feels like a surprising claim, big if true.

For OpenAI, there are currently 3 datacenter buildings[1] near Phoenix Goodyear Airport that Dylan Patel is claiming are 48 megawatts each and filled with H100s, for about 100K H100s. This probably got online around May 2024, the reason for the announcement and the referent of Kevin Scott's blue whale slide.

There are claims about a future cluster of 300K B200s and a geographically distributed training system of 500K-700K B200s, but deliveries of B200s in high volume to any given customer might only start in early to mid 2025, so these systems will probably get online only towards end of 2025. In the meantime, Anthropic might have a lead in having the largest cluster, even if they spend less on compute for smaller experiments overall. It might take a while to get it working, but there might be a few months there. And given how good Claude 3.5 Sonnet is, together with the above musings on how it's plausibly merely 4e25 FLOPs based on Dario Amodei's (somewhat oblique) claim about cost, additionally getting a compute advantage in training a frontier model could carry them quite far.


  1. There are 4.5 buildings now at that site, but you can see with Google Street View from Litchfield Rd

... (read more)
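A power cross-check of the Goodyear claim as a sketch; the 3 buildings at 48 MW each are as quoted above, while the ~1.4 kW all-in per H100 figure is my assumption for a typical H100 datacenter, not something from the post:

```python
buildings, mw_each = 3, 48
total_mw = buildings * mw_each                 # 144 MW
kw_per_h100_all_in = 1.4                       # assumed all-in datacenter power per H100
print(f"~{total_mw * 1e3 / kw_per_h100_all_in / 1e3:.0f}K H100s")
# ~100K H100s, consistent with the claim above and with "100K H100s at 150 MW" below.
```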
3romeo
Thanks Vladimir, this is really interesting! Re: OpenAI's compute, I inferred from this NYT article that their $8.7B costs this year were likely to include about $6B in compute costs, which implies an average use of ~274k H100s throughout the year[1] (assuming $2.50/hr average H100 rental price). Assuming this was their annual average, I would've guessed they'd be on track to be using around 400k H100s by now.  So the 150k H100s campus in Phoenix might be only a small fraction of the total compute they have access to? Does this sound plausible? The co-location of the Trainium2 cluster might give Anthropic a short-term advantage, though I think it's actually quite unclear if their networking and topology will fully enable this advantage. Perhaps the OpenAI Phoenix campus is well-connected enough to another OpenAI campus to be doing a 2-campus asynchronous training run effectively. 1. ^ $6e9 / 365.25d / 24h / $2.5/hr = 274k
4Vladimir_Nesov
Training as it's currently done needs to happen within a single cluster (though this might change soon). The size of the cluster constrains how good a model can be trained within a few months. Everything that isn't training of a frontier model can happen using many smaller clusters, something like 16 to 4096 accelerators each. You can use a lot of these smaller clusters, but they can be sourced from anywhere and built piecemeal at multiple sites with smaller power allocations, while the big training cluster needs to be a single purposefully built system. So I expect the big expenses are inference and many training experiments with smaller models. What I'm discussing here is the big cluster for training frontier models rather than the aggregate of the small clusters for other purposes. See also this comment. Patel's claim is 100K H100s at 150 megawatts.
5Aaron_Scher
I think that's probably wrong, or at least effectively wrong. Gemini 1.0, trained a year ago, has the following info in the technical report:  As you note, public distributed training methods have advanced beyond basic data parallelism (though they have not been publicly shown at large model scales because nobody has really tried yet). 
5Vladimir_Nesov
This might require bandwidth of about 300 Tbps for 500K B200s systems (connecting their geographically distributed parts), based on the below estimate. It gets worse with scale. The "cluster" label applied in this context might be a bit of a stretch, for example the Llama 3 24K H100s cluster is organized in pods of 3072 GPUs, and the pods themselves are unambiguously clusters, but at the top level they are connected with 1:7 oversubscription (Section 3.3.1). Only averaged gradients need to be exchanged at the top level, once at each optimizer step (minibatch). Llama 3 405B has about 1M minibatches with about 6 seconds per step[1], which means latency doesn't matter, only bandwidth. I'm not sure what precision is appropriate for averaging gradients, but at 4 bytes per weight that's 1.6TB of data to be sent each way in much less than 6 seconds, say in 1 second. This is bandwidth of 12 Tbps, which fits in what a single fiber of a fiber optic cable can transmit. Overland cables are laid with hundreds of fibers, so datacenters within the US can probably get at least one fiber of bandwidth between them. Overly large minibatches are bad for quality of training, and with H100s in a standard setup only 8 GPUs are within NVLink scaleup domains that enable tensor parallelism. If each token sequence is processed on 8 GPUs (at a given stage of pipeline parallelism), that makes it necessary to process 2K sequences at once (Llama 3 only uses 16K GPUs in its training), and with 8K tokens per sequence that's our 16M tokens per minibatch, for 1M minibatches[2]. But if scaleup domains were larger and enabled more tensor parallelism (for an appropriately large model), there would be fewer sequences processed simultaneously for smaller minibatches, so the time between optimizer steps would decrease, from Llama 3 405B's 6 seconds down to less than that, making the necessary gradient communication bandwidth higher. Some B200s come as NVL72 machines with 72 GPUs per scaleup domain. And
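A sketch of the bandwidth estimate above, assuming gradients are exchanged once per optimizer step at 4 bytes per parameter and must fit into ~1 second (the step time, GPU count, and sequence length are the Llama 3 405B figures quoted above):

```python
params = 405e9
bytes_per_grad = 4                      # assumed precision for gradient averaging
grad_bytes = params * bytes_per_grad    # ~1.6 TB per exchange
seconds_budget = 1                      # well under the ~6 s optimizer step
tbps = grad_bytes * 8 / seconds_budget / 1e12
print(f"~{tbps:.0f} Tbps")              # ~13 Tbps, in line with the ~12 Tbps above

# Minibatch size: 16K GPUs / 8-GPU scaleup domains = 2K sequences of 8K tokens.
tokens_per_minibatch = (16_000 // 8) * 8_000
total_tokens = 15e12
print(f"{tokens_per_minibatch / 1e6:.0f}M tokens/minibatch, "
      f"~{total_tokens / tokens_per_minibatch / 1e6:.1f}M optimizer steps")  # ~1M steps
```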

And in a way, they ought to be rolling in even more compute than it looks because they are so much more focused: Anthropic isn't doing image generation, it isn't doing voice synthesis, it isn't doing video generation... (As far as we know they aren't researching those, and definitely not serving it to customers like OA or Google.) It does text LLMs. That's it.

But nevertheless, an hour ago, working on a little literary project, I hit Anthropic switching my Claude to 'concise' responses to save compute. (Ironically, I think that may have made the outputs better, not worse, for that project, because Claude tends to 'overwrite', especially in what I was working on.)

5Daniel Kokotajlo
I'd guess that the amount spent on image and voice is negligible for this BOTEC?  I do think that the amount spent on inference for customers should be a big deal though. My understanding is that OpenAI has a much bigger userbase than Anthropic. Shouldn't that mean that, all else equal, Anthropic has more compute to spare for training & experiments? Such that if Anthropic has about as much compute total, they in effect have a big compute advantage?

Long reasoning training might fail to surpass pass@50-pass@400 capabilities of the base/instruct model. A new paper measured pass@k[1] performance for models before and after RL training on verifiable tasks, and it turns out that the effect of training is to lift pass@k performance at low k, but also to lower it at high k!

Location of the crossover point varies, but it gets lower with more training (Figure 7, bottom), suggesting that no amount of RL training of this kind lets a model surpass the pass@k performance of the base/instruct model at the crossover point reached with a small amount of RL training. (Would be interesting to know how the pass@k plots depend on the number of reasoning tokens, for models that allow control over the reasoning budget.)


  1. A task is solved at pass@k if an oracle verifier claims at least one of k sampled solutions to be correct. See Figure 3, left in this Jul 2024 paper for how pass@k affects performance, depending on the model. ↩︎
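For concreteness, the standard unbiased pass@k estimator (from the original Codex paper, Chen et al. 2021); this is an added illustration of the metric, not necessarily the exact procedure used in the linked papers: given n samples of which c are verified correct, it estimates the probability that at least one of k samples is correct.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples with c verified correct."""
    if n - c < k:
        return 1.0          # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example (hypothetical numbers): 400 samples, 12 verified correct.
print(pass_at_k(400, 12, 1), pass_at_k(400, 12, 50))
```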

5Thane Ruthenis
Huh. This is roughly what I'd expected, but even I didn't expect it to be so underwhelming.[1] I weakly predict that the situation isn't quite as bad for capabilities as this makes it look. But I do think something-like-this is likely the case. 1. ^ Of course, moving a pass@400 capability to pass@1 isn't nothing, but it's clearly astronomically short of a Singularity-enabling technique that RL-on-CoTs is touted as.
5ryan_greenblatt
This seems relatively clearly false in the case of competition programming problems. Concretely, o3 with 50 submissions beats o1 with 10k submissions. (And o1 is presumably much better than the underlying instruct model.) I'd guess this paper doesn't have the actual optimal methods.

o3 has a different base model (presumably). 

All of the figures in the paper hold the base model fixed, comparing RL and non-RL variants of the same model.

I would expect "this paper doesn't have the actual optimal methods" to be true; this is specifically a test of PPO for in-distribution actions. Concretely, there is a potential story here where PPO reinforces traces that hit in self-play; consequently, there is a sense in which we would expect it to only select previously on-policy actions.

But if one has enough money, one can finetune GPT models and test that.

Also note that 10k submissions is about 2 OOM out of distribution for the charts in the paper. 

Pass at inf k includes every path with nonzero probability (if there is a policy of discarding exact repeat paths). 

We know that RL decreases model entropy, so the first k passes will be more different for a high variance model. 

Pass at k is take-best, and for a normal distribution the expected best of k samples is roughly mean + SD·sqrt(2·ln k). 

At very large K, we would expect variance to matter more than mean. 

9Ivan Vendrov
this isn’t evidence against OP? if it’s true that RL lowers pass@k performance for sufficiently large k, we’d certainly expect o1 with 10k submissions to be weaker than base/instruct with 10k submissions.
6Vladimir_Nesov
It's evidence to the extent that the mere fact of publishing Figure 7 (hopefully) suggests that the authors (likely knowing relevant OpenAI internal research) didn't expect that their pass@10K result for the reasoning model is much worse than the language monkey pass@10K result for the underlying non-reasoning model. So maybe it's not actually worse.
7faul_sname
If I'm interpreting the paper correctly, the k at which base models start beating RL'd models is a per-task number, and k can be arbitrarily high for a given task, and the 50-400 range was specifically for tasks of the type the authors chose within a narrow difficulty band. Let's say you have a base model which performs at 35% on 5 digit addition, and an RL'd model which performs at 99.98%. Even if the failures of the RL'd model are perfectly correlated, you'd need k=20 for base@20 to exceed the performance of fine-tuned@20. And the failures of the RL model won't be perfectly correlated - but this paper claims that the failures of the RL model will be more correlated than the failures of the base model, and so the lines will cross eventually, and "eventually" was @50 to @400 in the tasks they tested. But you could define a task where you pass in 10 pairs of 5 digit numbers and the model must correctly find the sum of each pair. The base model will probably succeed at this task at somewhere on the order of 0.35^10 or about 0.003% of the time, while the RL'd model should succeed about 99.8% of the time. So for this task we'd expect k in the range of k=220,000 assuming perfectly-correlated failures in the RL model, and higher otherwise. Also I suspect that there is some astronomically high k such that monkeys at a keyboard (i.e. "output random tokens") will outperform base models for some tasks by the pass@k metric.
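A quick check of the 10-pair-addition arithmetic in the comment above, assuming independent samples for the base model and perfectly correlated failures for the RL'd model, as stated:

```python
from math import log

p_base_single = 0.35 ** 10          # base model gets all 10 additions right
p_rl = 0.998                        # RL model pass@k, flat if failures are correlated
print(f"base pass@1: {p_base_single:.2e}")            # ~2.8e-5, i.e. ~0.003%

# Smallest k where base pass@k = 1 - (1 - p)^k exceeds the RL model's 0.998:
k_cross = log(1 - p_rl) / log(1 - p_base_single)
print(f"crossover at k ~ {k_cross:,.0f}")             # ~225,000, i.e. the ~220,000 above
```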
6gwern
It would be an extreme bias-variance tradeoff, yes.
3Vladimir_Nesov
The interesting concept in the paper is the location of the crossover point, which seems remarkably stable (for a given task) across specific RL techniques and amount of RL training. It can be measured experimentally for a task by doing a little bit of RL training, and RL@1 performance won't get better than that with more training, so you're unlikely to get the RL model to succeed 99.8% of the time (at pass@1) ever unless the level of performance of the base model at the crossover point with a weak RL model was already higher than 99.8%. Probably the crossover point for a task depends on things that can be changed (such as strength of the pretrained model, or size/relevance of the verifiable task dataset, or possibly the inference time reasoning budget). The issue isn't for example as straightforward as losing entropy in RL policy (as a formulation of reduced exploration), since DAPO specifically addresses this issue (otherwise present in vanilla GRPO), but the pass@k plot for DAPO (Figure 7, top) barely moves (compared to other methods), in their experiment it's even slightly worse at the crossover point. So in the context of this paper it remains unclear how to move the plot to reach ever higher base@k performance using RL@1, higher than the ceiling of where base@k already was at the crossover point when comparing with some method at only 100-500 RL steps.
3Thane Ruthenis
Intuitively, this shouldn't matter much. They use some RL-on-CoTs method that works, and I expect its effects are not fundamentally different from optimal methods'. Thus, optimal methods might yield better quantitative results, but similar qualitative results: maybe they'd let elicit pass@800 capabilities instead of "just" pass@400, but it'd still be just pass@k elicitation for not-astronomical k. Not strongly convinced of that, though.
3Vladimir_Nesov
In the hypothetical where the paper's results hold, reasoning model performance at pass@k will match non-reasoning model performance with the number of samples closer to the crossover point between reasoning and non-reasoning pass@k plots. If those points for o1 and o3 are somewhere between 50 and 10K (say, at ~200), then pass@10K for o1 might be equivalent to ~pass@400 for o1's base model (looking at Figure 2), while pass@50 for o3 might be equivalent to ~pass@100 for its base model (which is probably different from o1's base model). So the difference of 200x (10K vs. 50) in the number of samples becomes much smaller when comparing performance of the base models. For GPT-4o vs. GPT-4.1, a difference of ~4x in the number of samples doesn't seem too strange. There's also the possibility of distillation from a reasoning variant of GPT-4.5, which could have an even larger effect on pass@k performance at low k (Figure 6, right).
1mrtreasure
If true, would this imply you want a base model to generate lots of solutions and a reasoning model to identify the promising ones and train on those?

Google might start 2026 with the largest training system among the big labs, by a factor of about 2x, at about 1 GW.

OpenAI/Microsoft Stargate schism suggests that compute being built this year by Microsoft is unlikely to form part of a geographically distributed training system that also includes compute being built at Abilene site. Seems like OpenAI will be building its own training systems (through Stargate), while Microsoft will be serving inference (and possibly generation for RL training, but it remains unclear if it can be an important fraction of pretraining budget in 2025-2026). Thus only 400-600 MW of GB200s by end of 2025 for an OpenAI training system, not 1 GW.

Meta announced a 2 GW datacenter at Richland Parish site, but 1 GW for 2025 seems to be across all datacenters, not for a single training system. So the training system will be smaller by end of 2025.

5anaguma
How does Anthropic and XAi’s compute compare over this period?

What actually happens with xAI and Anthropic compute by end of 2025 is less clear. For xAI, a 300K B200s figure was mentioned in June 2024. For Anthropic, Amodei said in a recent interview that

I would not be surprised if in 2026 we have more than a million of some kind of chip.

Meanwhile, xAI will have a 200K H100/H200 system, and Anthropic a 400K Trn2 system, which is about 250K H100s worth of FLOP/s (ready by a few months into 2025). The 400-600 MW at Abilene site for OpenAI are 200K-300K B200s, which is about 500K-750K H100s worth of FLOP/s.

2Lorenzo
For context, average US electricity consumption in 2022 was ~500GW. So these would be ~1% of all US electricity consumption (as an order of magnitude)

GPT-5 should be released late 2025 at the earliest if OpenAI follows the usual naming convention of roughly 100x in raw compute. With GPT-4 at 2e25 FLOPs, GPT-4.5 should have about 2e26 FLOPs and GPT-5 about 2e27 FLOPs. A 100K H100 training system, like the one in Goodyear (or Musk's Memphis datacenter as it was late 2024), can train a 3e26 FLOPs model, which fits the name of GPT-4.5, but it can't train a 2e27 FLOPs model.

The new Stargate site in Abilene might be preparing to host 200K-300K chips in GB200 NVL72 racks. These chips produce 2.5x more compute than H100s, so 200K would be sufficient to get 2e27 FLOPs and train a GPT-5. If there's already enough power (about 400 MW all-in for 200K chips), shipments of GB200 in bulk start in early 2025, get installed at xAI's pace, and go into pretraining for 4 months, then with 1 more month of post-training it's already November.
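The naming-convention arithmetic as a sketch; the compute figures are as quoted, and the ~1e21 FLOP per H100-month and 2.5x H100-per-Blackwell-chip conversions are the ones used elsewhere in this thread:

```python
H100_FLOP_PER_MONTH = 1e21          # at ~40% utilization, 16-bit precision
gpt4 = 2e25
ladder = {"GPT-4": gpt4, "GPT-4.5": gpt4 * 10, "GPT-5": gpt4 * 100}
print(ladder)                        # 2e25, 2e26, 2e27

# A 100K H100 system over ~3-4 months tops out around 3e26-4e26 FLOP:
print(f"{100_000 * H100_FLOP_PER_MONTH * 3:.0e} - {100_000 * H100_FLOP_PER_MONTH * 4:.0e}")

# 200K GB200 NVL72 chips ~= 500K H100s, enough for 2e27 FLOPs in 4 months:
print(f"{200_000 * 2.5 * H100_FLOP_PER_MONTH * 4:.0e}")
```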

So the rumors about GPT-5 in late May 2025 either represent change in the naming convention, or correspond to some intermediate milestone in training GPT-5, likely the training system being in principle ready to start pretraining.

So the rumors about GPT-5 in late May 2025 either represent change in the naming convention

Per Altman:

In both ChatGPT and our API, we will release GPT-5 as a system that integrates a lot of our technology, including o3. We will no longer ship o3 as a standalone model.

I think he's pretty plainly saying that this "GPT-5" will be a completely different thing from a 100x'd GPT-4.

3Vladimir_Nesov
This is perfectly consistent with GPT-5 being 100x GPT-4 compute. Announcing specific features that will go into it suggests they have a prototype, in this case I'm guessing the LLM will itself be trained to decide whether to go into the reasoning mode, triggering it when needed and affordable, like any other tool.
3Thane Ruthenis
I don't see it. He says that GPT-5 will be a system that "integrates o3". This isn't his sloppy way of saying "integrates the reasoning techniques": when he wants to express that idea, he talks about "unifying o-series models and GPT-series models". The wording regarding GPT-5 is consistent with him literally saying that the model o3 will be part of GPT-5. Furthermore, I take "as" in "GPT-5 as a system that integrates a lot of our technology" to mean "GPT-5 is defined as {a system that integrates a lot of our technology, including o3}". Not "GPT-5 will be trained to automatically switch between a standard mode, a reasoning mode, a Deep Research mode, etc.", not even "GPT-5 will be trained to recognize when to fall back to o3, a lesser model", but literally "we're slapping the GPT-5 label on a glorified wrapper over all our current models".
5Vladimir_Nesov
The "glorified wrapper" could still be a 2e27 FLOPs model, it could even be using literal o3 as one of its tools (in addition to all the other tools, with native GPT-5 long reasoning mostly reserved for premium tier). This is in line with the "agents" agenda where better reliability in taking irreversible actions unlocks new use cases, in this case whether to make use of expensive reasoning calls. Since "GPT-4.5" will actually be released rather than skipped, it's less plausible for "GPT-5" to come out shortly after. If it's announced in ~Dec 2025 (the way o3 was), it's still "within months", and then it can actually get released in ~Feb 2026.
2Thane Ruthenis
Hm, fair enough. Seems like a stretch, though, especially given the need to interpret his "ETA in months" as "will be officially announced in months and released in a year".
5Vladimir_Nesov
There was also Murati in Jun 2024 predicting PhD level AI in 18 months. If they succeed in achieving parity with xAI in terms of safety procedures, they might even release a preview checkpoint in Dec 2025 for Pro users. So actual release in a year is not strictly necessary for this hypothesis, it's just closer to what they've done in the past.

if OpenAI follows the usual naming convention of roughly 100x in raw compute.

I doubt this is a real convention. I think OpenAI wanted to call Orion GPT-5 if they thought it was good enough to deserve the name.

4Vladimir_Nesov
I'm merely referring to the historical precedent, whether there are informal commitments in the minds of the leadership is not something I can speak to. This pattern might continue or it might break. What I'm guessing about training system buildout from vague clues seems to be consistent with it continuing, so the naming pattern can be used as another clue to make a point estimate prediction that's more concrete.

Stargate is evidence towards slower training system scaling. The rumored reason for starting the project is that Microsoft isn't building giant frontier training systems fast enough, probably because they aren't seeing the case for doing that faster. In which case other hyperscalers might think similarly, and they are the most well-positioned to build these systems, so this attitude might be indicative of how frontier training systems get built overall, which is notably slower than technically feasible.

The $80bn Microsoft capex is not relevant to this if it goes to many smaller systems[1], which is only natural as there are millions of datacenter GPUs but only a few 100K GPU frontier training systems, a tiny fraction of inference and smaller/research training compute. The $500bn figure is not relevant as for now it's only a vague plan. But Microsoft not agreeing to build training systems on OpenAI's schedule is some evidence.

OpenAI would want to get from under Microsoft's thumb anyway[2], and this gets ever more difficult over time, since frontier training systems get ever more expensive, so the sooner they try the more likely they are to succeed. But even this consideration is som... (read more)

When people are skeptical about the concept of AGI being meaningful or having clear boundaries, it could sometimes be downstream of skepticism about very fast and impactful R&D done by AIs, such as software-only singularity or things like macroscopic biotech where compute buildout happens at a speed impossible for human industry. Such events are needed to serve as landmarks, anchoring a clear concept of AGI, otherwise the definition remains contentious.

So AI company CEOs who complain about AGI being too nebulous to define might already be expecting a scaling slowdown, with their strategy being primarily about the fight for the soul of the 2028-2030 market. When scaling is slow, it'll become too difficult to gain a significant quality advantage sufficient to defeat the incumbents. So the decisive battle is happening now, with the rhetoric making it more palatable to push through the decisions to build the $140bn training systems of 2028.

This behavior doesn't need to be at all related to expecting superintelligence, it makes sense as a consequence of not expecting superintelligence in the near future.

2Noosphere89
As someone who thinks superintelligence could come in the near future, I basically agree with @snewman's view that AIs have to automate the entire economy, or automate a sector that could then automate everything else very fast, but unfortunately for us this basically gives us no good fire alarms for AGI unless @Ege Erdil and @Matthew Barnett et al are right that takeoff is slow enough that most value comes from broad automation, and external use dominates internal use: https://amistrongeryet.substack.com/p/defining-agi
1LWLW
I think short timelines just don’t square with the way intelligence agencies are behaving. The NSA took Y2K more seriously than it currently seems to be taking near-term AGI. You can make the argument that intelligence agencies are less competent than they used to be, but I don’t buy that they aren’t at least extremely paranoid and moderately competent: that seems like their job.
7Thane Ruthenis
Researchers at AGI labs seem to genuinely believe the hype they're selling, a significant fraction of non-affiliated top-of-the-line DL researchers is inclined to believe them as well, and basically all competent well-informed people agree that the short-timelines position is not unreasonable to hold. Dismissing short timelines based on NSA's behavior requires assuming that they're much more competent in the field of AI than everyone in the above list. After all, that'd require them to be strongly (and correctly) confident that all these superstar researchers above are incorrect. While that's not impossible, it seems highly unlikely to me. Much more likely that they're significantly less competent, and accordingly dismissive.

Yi-Lightning (01 AI) Chatbot Arena results are surprisingly strong for its price, which puts it at about 10B active parameters[1]. It's above Claude 3.5 Sonnet and GPT-4o in Math, above Gemini 1.5 Pro 002 in English and Hard Prompts (English). It's above all non-frontier models in Coding and Hard Prompts (both with Style Control), including Qwen-2.5-72B (trained on 18T tokens). Interesting if this is mostly a better methodology or compute scaling getting taken more seriously for a tiny model.


  1. The developer's site says it's a MoE model. Developer's API docs list it at ¥0.99/1M tokens. The currency must be Renminbi, so that's about $0.14. Together serves Llama-3-8B for $0.10-0.18 (per million tokens), Qwen-2.5-7B for $0.30, all MoE models up to 56B total (not active) parameters for $0.60. (The prices for open weights models won't have significant margins, and model size is known, unlike with lightweight closed models.) ↩︎

4Vladimir_Nesov
Kai-Fu Lee, CEO of 01 AI, posted on LinkedIn: Assuming it's trained in BF16 with 40% compute utilization, that's a 2e24 FLOPs model (Llama-3-70B is about 6e24 FLOPs, but it's not MoE, so the FLOPs are not used as well). Assuming from per token price that it has 10-20B active parameters, it's trained on 15-30T tokens. So not an exercise in extreme compute scaling, just excellent execution.
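A sketch of the inversion being done here, using the C ≈ 6·N·D approximation; the ~2e24 FLOPs figure and the 10-20B active-parameter guess are the ones in the comment above:

```python
C = 2e24                       # ~2e24 FLOPs, as estimated above
for n_active in (10e9, 20e9):  # the 10-20B active-parameter guess from the token price
    tokens = C / (6 * n_active)          # invert C ~= 6 * N_active * D
    print(f"{n_active / 1e9:.0f}B active -> ~{tokens / 1e12:.0f}T tokens")
# ~17T-33T tokens, on the order of the 15-30T quoted above.
```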

Cultural/moral maturity (in a civilization) has never been observed before, similarly to technological maturity. Scalable production of a new kind of thing brings its abundance in sight, which fails to be a concern earlier, while it couldn't be scaled. A moderate level of AI alignment or of cultural change is not an equilibrium if these things are anchored to scalable resources (effective cognition and coordination, fast subjective serial time). Instead they reach extremes of the kind never observed before those resources become scalable.

2Mateusz Bagiński
Are you trying to say that for any X, instead of X-maturity, we should instead expect X-foom until the marginal returns get too low?
2Vladimir_Nesov
A pre-abundance precedent about X offers poor framing for thinking about the consequences of discovering a scalable process of producing X. Before abundance, it's artisanal and quirky and path-dependent, the extremes are rare and dysfunctional, so people don't worry about it too much. There is security in it looking like an equilibrium, but not being truly settled, so that people can influence things. Abundance brings maturity, changes the character of the equilibrium. So not foom necessarily, just a promise of maturity at some point, which wouldn't have been as salient before there is a scalable process of production. And there is an excuse of ignoring the possibility even longer, because of the total lack of historical precedent (of the associated problems).
1Kaarel
i’d be interested in hearing why you think that cultural/moral/technological/mathematical maturity is even possible or eventually likely (as opposed to one just being immature forever[1]) (assuming you indeed do think that) ---------------------------------------- 1. which seems more likely to me ↩︎
2Vladimir_Nesov
I mean "maturity" merely compared to how we view what can currently be happening, such as a baseline level of competence in civilization-level governance, or what the individual people are capable of. Maturity compared to that baseline washes away all the currently relevant fiddly things, replacing them by settled processes. These new processes are truly settled, so whatever new concerns become important then, the new baseline won't be overturned. The analogy with technological maturity is that the laws of physics and ways of getting things done within them is a fixed problem statement, so new baselines of effectiveness get locked in.

Economics studies the scaling laws of systems of human industry. LLMs and multicellular organisms and tokamaks have their own scaling laws, the constraints ensuring optimality of their scaling don't transfer between these very different machines. A better design doesn't just choose more optimal hyperparameters or introduce scaling multipliers, it can occasionally create a new thing acting on different inputs and outputs, scaling in its own way, barely noticing what holds back the other things.

A reflectively stable agent prefers to preserve some property of itself. This doesn't in general prevent it from being able to self-improve, in the same way that unchanging laws of physics don't prevent presence of self-improving agents in the world.

The content of the world keeps changing under the unchanging laws of how it changes, and similarly a reflectively stable agent (against safety properties) has content (such as beliefs) that keeps changing, in principle enabling unfettered self-improvement. Mesa-agents existing in the form of the content of the ... (read more)

1CstineSublime
Are there pivotal ways this is different to the theories of Enactivism? (" Its authors define cognition as enaction, which they in turn characterize as the ‘bringing forth’ of domains of significance through organismic activity that has been itself conditioned by a history of interactions between an organism and its environment." which at first blush I'd say is a reflectively stable agent modifying or updating beliefs by means of enaction. Enactivism also rejects mind-body duality in favour of a more 'embodied' cognition approach together with a "deep continuity of the principles of self-organization from the simplest living things to more complex cognitive beings"), particularly autopoiesis.   An autopoietic system can be contrasted to an allopoietic system which creates objects different to itself, like a factory. Most living beings are autopoietic in that they either produce themselves, or things like them, which seems to be similar to a reflectively stable agent, particularly when we describe the more complicated cognitive beings in autopoietic terms. Luhmann argued that social systems too are self-organizing, self-reproducing systems, which brought the concepts of enactivism from biology and cognitive science into the social sciences.