Recursive self-improvement in AI probably comes before AGI. Evolution doesn't need to understand human minds to build them, and a parent doesn't need to be an AI researcher to make a child. The bitter lesson and the practice of recent years suggest that building increasingly capable AIs doesn't depend on understanding how they think.
Thus the least capable AI that can build superintelligence without human input only needs to be a competent engineer, able to scale and refine a sufficiently efficient AI design in an empirically driven, mundane way that doesn't depend on matching Grothendieck's capacity for conceptual invention. This makes the threshold of AGI less relevant for timelines of recursive self-improvement than I previously expected. With o1 and what straightforwardly follows, we plausibly already have all it takes to get recursive self-improvement, if the current designs get there with the next few years of scaling, even if the resulting AIs are merely competent engineers that fail to match humans at less legible technical skills.
No group of AIs needs to gain control before human irrelevance either. Like a runaway algal bloom, AIs might be able to bootstrap superintelligence without crossing the threshold of AGI being useful in helping them gain control over this process, any more than humans maintain such control at the outset. So it's not even multi-agent dynamics shaping the outcome; capitalism might just serve as the nutrients, until a much higher threshold of capability where a superintelligence can finally take control of this process.
Humans are capable of solving conceptually difficult problems, so they do. An easier path might be possible that doesn't depend on such capabilities, and doesn't stall for their lack, like evolution doesn't stall for lack of any mind at all. If there is more potential for making models smarter alien tigers by scaling RL in o1-like post-training, and the scaling proceeds to 1 gigawatt and then 35 gigawatt training systems, it might well be sufficient to get an engineer AI that can improve such systems further, at 400x and then 10,000x the compute of GPT-4.
Before o1, there was a significant gap, the mysterious absence of System 2 capabilities, with only a vague expectation that they might emerge or become easier to elicit from scaled up base models. This uncertainty no longer gates engineering capabilities of AIs. I'm still unsure that scaling directly can make AIs capable of novel conceptual thought, but AIs becoming able to experimentally iterate on AI designs seems likely, and that in turn seems sufficient to eventually mutate these designs towards the remaining missing capabilities.
(It's useful to frame most ideas as exploratory engineering rather than forecasting. The question of whe...
Cutting edge AI research seems remarkably and surprisingly easy compared to other forms of cutting edge science. Most things work on the first try, clever insights aren't required, it's mostly an engineering task of scaling compute.
The speed of scaling pretraining will go down ~3x in 2027-2029, reducing the probability of crossing transformative capability thresholds per unit of time after that point, if they haven't been crossed by then.
GPT-4 was trained in 2022 at ~2e25 FLOPs, Grok-3 and GPT-4.5 were trained in 2024 at ~3e26 FLOPs (or twice that in FP8) using ~100K H100 training systems (which cost ~$4-5bn to build). In 2026, Abilene site of Crusoe/Stargate/OpenAI will have 400K-500K Blackwell chips in NVL72 racks (which cost ~$22-35bn to build), enough to train a ~4e27 FLOPs model. So the recent trend is a 2-year ~6x increase in cost for a frontier training system and a 2-year ~14x increase in compute. But for 2028 this would mean a $150bn training system (which is a lot, so only borderline plausible), and then $900bn in 2030. At that point AI companies would need to either somehow figure out how to pool resources, or pretraining will stop scaling before 2030 (assuming AI still doesn't achieve transformative commercial success).
If funding stops increasing, what we are left with is the increase in price performance of ~2.2x every 2 years, which is ~3.3x slower than the 2-year ~14x at the current pace. (I'm estimating price performance for a whole datacenter or at least a rack, rather than only for chips.)
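A minimal sketch of this extrapolation (anchoring on ~$25bn within the $22-35bn range and ~4e27 FLOPs for the 2026 system, with the 2-year growth rates above; just compounding, not a forecast):

```python
import math

# Frontier training system trend from the estimates above (not authoritative figures).
cost_growth_per_2y = 6        # ~6x cost increase per 2 years
compute_growth_per_2y = 14    # ~14x compute increase per 2 years

cost_2026_bn = 25             # ~$25bn, within the $22-35bn range for the 2026 system
flops_2026 = 4e27

for year in (2028, 2030):
    steps = (year - 2026) // 2
    cost = cost_2026_bn * cost_growth_per_2y ** steps
    flops = flops_2026 * compute_growth_per_2y ** steps
    print(f"{year}: ~${cost:,.0f}bn training system, ~{flops:.0e} FLOPs")
# 2028: ~$150bn, 2030: ~$900bn, matching the figures above.

# If funding flattens, growth falls back to price performance alone, ~2.2x per 2 years,
# which in log terms is ~3.3x slower than the current ~14x per 2 years:
print(math.log(14) / math.log(2.2))  # ~3.3
```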
A surprising report by Bloomberg claims 16K GB200[1] by summer 2025 at Abilene site (pilot campus of Stargate) and merely 64K GB200 by end of 2026. This is way too little to be a training system, Colossus already has more compute (200K H100/H200) than the projected 64K GB200 at end of 2026.
If this is correct, OpenAI will be training with Azure rather than Stargate in 2025, so a raw-compute GPT-5 (2e27 FLOPs, 100x GPT-4) probably won't be out in 2025 and officially "GPT-5" will mean something else (since it's due "in months" in any case according to Altman). Also, a datacenter with 16K Blackwells only costs about $1bn and they have more money than that, which suggests Blackwell ramp-up trouble that might delay everyone else as well, though as a lower bound Nvidia reported $11bn in Blackwell sales for Nov 2024 - Jan 2025 (it's "Q4 2025" since their FY 2025 runs to the end of Jan 2025).
In principle "16K GB200" might mean more Blackwell chips than 16K, a compute tray has more than one chip, with variants marketed as named products like GB200 NVL4 "superchip", but even at 4 chips per tray/board we still get below 200K H100s in compute. And an NVL72 system has 72 chips (which brings the numbe
I think 'GB200' refers to this column (2 Blackwell GPU + 1 Grace CPU) so 16K GB200s ~= 32K B200s ~= 80K H100s. Agree that it is still very low.
My guess is that Bloomberg's phrasing is just misleading or the reporting is incomplete. For example, maybe they are only reporting the chips Oracle is contributing or something like that. I'd be very surprised if OpenAI don't have access to >200K GB200s ~= 1M H100s by the end of 2025. For reference, that is only ~$20B capex (assuming $100k total cost of ownership per GB200) or roughly 1/4 of what Microsoft alone plan to invest this year.
Once they have just 100K GB200s, that should train 2e27 FLOP in 4 months.[1]
There's a nice correspondence between H100s and FLOP/month (assuming 40% utilisation and 16-bit precision) of 1e21 FLOP/month/H100. So since 100K GB200s = 500K H100s, that's 5e26 FLOP/month.
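A back-of-envelope check of that correspondence and of the 2e27-in-4-months figure (assuming ~1e15 dense BF16 FLOP/s per H100 at peak, each GB200 counted as 2 Blackwell chips, and the ~2.5x-per-chip factor discussed just below):

```python
# Back-of-envelope: FLOP per month per H100, and what 100K GB200s can deliver.

h100_bf16_flops = 1e15          # ~1e15 dense BF16 FLOP/s per H100 (peak, rounded)
utilization = 0.40              # assumed compute utilization
seconds_per_month = 30 * 24 * 3600

flop_per_month_per_h100 = h100_bf16_flops * utilization * seconds_per_month
print(f"{flop_per_month_per_h100:.1e} FLOP/month per H100")   # ~1.0e21

# 100K GB200s, counting each GB200 as 2 Blackwell chips, each chip ~2.5x an H100:
h100_equivalents = 100_000 * 2 * 2.5                           # 500K H100-equivalents
months_for_2e27 = 2e27 / (h100_equivalents * flop_per_month_per_h100)
print(f"~{months_for_2e27:.1f} months for 2e27 FLOP")          # ~4 months
```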
The marketing terminology is inconvenient: a "superchip" can mean 2-GPU or 4-GPU boards and even a 72-GPU system (1 or possibly 2 racks). So it's better to talk in terms of chips (that are not "superchips"), which I think are all B200 run at slightly different clock speeds (not to be confused with B200A/B102/B20, which have half the compute). In GB200, the chips are 2.5x faster than H100/H200 (not 5x faster; so a 200K chip GB200 system has the same compute as a 500K chip H100 system, not a 1M chip H100 system). Power requirements are often a good clue that helps disambiguate; compute doesn't consistently help because it tends to get reported at arbitrarily chosen precision and sparsity[1].
Large scale-up worlds (or good chips) are not necessarily very important in pretraining, especially in the later steps of the optimizer when the critical batch size gets high enough, so it's not completely obvious that a training system will prefer to wait for NVL72 even if other packagings of Blackwell are more available earlier. Inference does benefit from NVL72 a lot, but for pretraining it's just cheaper per FLOP than H100 and faster in wall clock time during the first ~3T tokens when the whole...
Abilene site of Stargate will host 100K-128K chips in GB200 NVL72 racks by this summer, and a total of 400K-512K chips in 2026, based on a new post by Crusoe and a reinterpretation of the recent Bloomberg post in light of the Crusoe post. For 2025, it's less than 200K chips[1], but more than the surprising 16K-32K chips[2] that the Bloomberg post suggested. It can be a training system after all, but training a raw compute "GPT-5" (2e27 FLOPs) by the end of 2025 would require using FP8[3].
The Crusoe post says "initial phase, comprising two buildings at ... 200+ megawatts" and "each building is designed to operate up to 50,000 NVIDIA GB200 NVL72s". Dylan Patel's estimate (at 1:24:42) for all-in datacenter power per Blackwell GPU was 2.0 kW (which must mean per chip, or else it's way too much). At GTC 2025, Jensen Huang showed a slide (at 1:20:52) where the estimate is 2.3 kW per chip (100 MW per 85K dies, which is 42.5K chips).
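A quick sketch of what those power figures imply per building (per-chip power from the estimates just cited; "200+ MW for two buildings" read as roughly 100 MW per building, an assumption on my part):

```python
# How many Blackwell chips fit in ~100 MW of all-in datacenter power?

building_mw = 100                              # "two buildings at 200+ MW" => ~100+ MW each
kw_per_chip_low, kw_per_chip_high = 2.0, 2.3   # all-in power per chip (estimates above)

print(building_mw * 1000 / kw_per_chip_high)   # ~43K chips per building
print(building_mw * 1000 / kw_per_chip_low)    # ~50K chips per building

# So "50K GB200 NVL72s" per building only works if it counts chips:
# 50K superchips (2 chips each) would need ~200-230 MW per building,
# and 50K NVL72 racks (72 chips each) would need gigawatts.
print(50_000 * 2 * 2.15 / 1000)                # ~215 MW for 100K chips
```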
So the "50K GB200 NVL72s" per building from the Mar 2025 Crusoe post can only mean the number of chips (not dies or superchips), and the "100K GPUs" per building from the Jul 2024 Crusoe post must've meant 100K compute dies (which is 50K chips). It...
A MoE transformer can reach the same loss as a compute optimal dense model using 3x-6x less compute, but will need the same amount of data to do it. So compute optimal MoEs don't improve data efficiency and don't help mitigate data scarcity.
A new Jan 2025 paper offers straightforward compute multiplier results comparing dense transformers to MoE at various levels of sparsity, with isoFLOPs for various tokens/parameter ratios, using experiments of up to 1e21 FLOPs per datapoint. Compute multiplier results are in Figure 11, with about 3x compute multiplier for 87% (1:8) sparse MoE over dense, and about 6x-7x compute multiplier for 97% (1:32) sparse MoE (same sparsity as DeepSeek-V3).
But there's a catch. Greater sparsity makes it compute optimal to use fewer active parameters, and therefore more data (training with the same compute). This can be seen on isoFLOP plots in Figure 12, left. As sparsity goes from 0% (dense) to 95% (1:20), compute optimal number of active parameters for their 1e21 FLOPs experiments goes from 2.9B to 1.3B. For 97% (1:32) sparsity, interpolating from experiments on the other compute budgets, the ratio of the number of active parameters seems to be abo...
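A rough consistency check of the "same data" claim, using the 1e21 isoFLOP datapoints above and assuming Chinchilla-like scaling where compute-optimal parameters and tokens both grow as the square root of compute within a fixed sparsity level (that exponent is my assumption, not taken from the paper):

```python
# Compute-optimal datapoints quoted above, at a 1e21 FLOPs budget.
C = 1e21
dense_active = 2.9e9       # compute-optimal active params, dense
sparse_active = 1.3e9      # compute-optimal active params, 95% (1:20) sparse MoE

dense_tokens = C / (6 * dense_active)      # ~57B tokens
sparse_tokens = C / (6 * sparse_active)    # ~128B tokens
print(f"dense: {dense_tokens:.2e} tokens, sparse: {sparse_tokens:.2e} tokens")

# If the sparse MoE matches the dense model's loss at C/m FLOPs, and tokens scale as
# sqrt(compute), then it trains on sparse_tokens / sqrt(m) tokens. For that to equal
# the dense model's token count, the compute multiplier m must be:
m = (sparse_tokens / dense_tokens) ** 2
print(f"implied m ~ {m:.1f}")   # ~5x, consistent with the 3x-7x multipliers above
```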
Chatbot Arena results for DeepSeek-V3 are in. It placed 7th in Overall w/ Style Control, tied with Claude-3.5.Oct-Sonnet, and 3rd in Hard Prompts w/ Style Control, tied with Gemini-2.0-Flash and behind only Claude-3.5.Oct-Sonnet, mysterious Gemini-Exp-1206, o1, and Gemini-2.0-Flash-Thinking.
It's a MoE model with 37B active parameters trained for about 5e24 FLOPs, 10x less compute than Llama-3-405B and 20x less than what could plausibly be extracted from 30K H100s in BF16. The pretraining data is about 15T tokens, so at ~400 tokens per active parameter it's very overtrained, that is, nowhere near compute optimal.
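The ratios above are straightforward (the ~3-4 month run length in the 30K H100 comparison is my assumption):

```python
# Simple ratios for the DeepSeek-V3 figures quoted above.
active_params = 37e9
tokens = 15e12
flops_v3 = 5e24

print(tokens / active_params)        # ~400 tokens per active parameter

# What 30K H100s could plausibly extract in BF16 over an assumed ~3-4 month run,
# at ~1e21 FLOP/month per H100 (40% utilization):
flops_30k_h100 = 30_000 * 1e21 * 3.5
print(flops_30k_h100 / flops_v3)     # ~20x
```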
It has 256 routed experts per layer, 8 of which get activated per token. These results give some weight to the Feb 2024 paper that predicts that using more granular experts and activating a lot of them per token can give shocking compute multipliers[1], up to 20x-30x, much more than for MoE transformers that only activate 1-2 routed experts per token (Figure 1b). The paper itself only does experiments of up to about 5e19 FLOPs, in particular directly demonstrating a compute multiplier of 2x from using 8 experts per token instead of 2, with the numbers of total and active parameters k...
New AWS Trainium 2 cluster offers compute equivalent to 250K H100s[1], and under this assumption Anthropic implied[2] their previous compute was 50K H100s (possibly what was used to train Claude 3.5 Opus).
So their current or imminent models are probably 1e26-2e26 FLOPs (2-4 months on 50K H100s at 40% compute utilization in BF16)[3], and the upcoming models in mid to late 2025 will be 5e26-1e27 FLOPs, ahead of what the 100K H100 clusters of other players (possibly except Google) can deliver by that time.
SemiAnalysis gives an estimate of 24-27 kilowatts per 32 Trainium 2 chips, so 200K Trn2s need 150 megawatts. The 7 datacenter buildings in the northern part of the New Carlisle AWS site are 65 megawatts each according to SemiAnalysis. That's enough for 600K Trn2s, so the figure of 400K Trn2s probably refers to those buildings alone, rather than also to the second phase of the project scheduled for next year. At 0.65e15 dense BF16 FLOP/s each, 400K Trn2s produce as much compute as 250K H100s. ↩︎
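A sketch of the power and FLOP/s arithmetic behind this footnote (figures as quoted; ~1e15 dense BF16 FLOP/s per H100 is my assumption for the comparison):

```python
# Power and compute arithmetic for the New Carlisle Trn2 buildout (figures as quoted).
kw_per_trn2 = 25.5 / 32            # midpoint of the 24-27 kW per 32 chips estimate

print(200_000 * kw_per_trn2 / 1000)        # ~160 MW for 200K Trn2s
print(7 * 65_000 / kw_per_trn2)            # ~570K Trn2s fit in 7 x 65 MW buildings

trn2_bf16 = 0.65e15                # dense BF16 FLOP/s per Trn2
h100_bf16 = 1e15                   # ~1e15 dense BF16 FLOP/s per H100
print(400_000 * trn2_bf16 / h100_bf16)     # ~260K H100-equivalents from 400K Trn2s
```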
Anthropic's post: "This cluster will deliver more than five times the computing power used to train our current generation of leading AI models." ↩︎
At 4 months, with $2/hour, this takes $3
Are you saying Anthropic actually has more compute (in the relevant sense) than OpenAI right now? That feels like a surprising claim, big if true.
For OpenAI, there are currently 3 datacenter buildings[1] near Phoenix Goodyear Airport that Dylan Patel claims are 48 megawatts each and filled with H100s, for about 100K H100s. This probably came online around May 2024, which would be the reason for the announcement and the referent of Kevin Scott's blue whale slide.
There are claims about a future cluster of 300K B200s and a geographically distributed training system of 500K-700K B200s, but B200 deliveries in high volume to any given customer might only start in early to mid 2025, so these systems will probably only get online towards the end of 2025. In the meantime, Anthropic might have a lead in having the largest cluster, even if they spend less on compute overall for smaller experiments. It might take a while to get it working, but there might be a few months there. And given how good Claude 3.5 Sonnet is, together with the above musings on how it's plausibly merely 4e25 FLOPs based on Dario Amodei's (somewhat oblique) claim about cost, additionally getting a compute advantage in training a frontier model could carry them quite far.
There are 4.5 buildings now at that site, which you can see with Google Street View from Litchfield Rd
And in a way, they ought to be rolling in even more compute than it looks because they are so much more focused: Anthropic isn't doing image generation, it isn't doing voice synthesis, it isn't doing video generation... (As far as we know they aren't researching those, and definitely not serving it to customers like OA or Google.) It does text LLMs. That's it.
But nevertheless, an hour ago, working on a little literary project, I hit Anthropic switching my Claude to 'concise' responses to save compute. (Ironically, I think that may have made the outputs better, not worse, for that project, because Claude tends to 'overwrite', especially in what I was working on.)
Long reasoning training might fail to surpass pass@50-pass@400 capabilities of the base/instruct model. A new paper measured pass@k[1] performance for models before and after RL training on verifiable tasks, and it turns out that the effect of training is to lift pass@k performance at low k, but also to lower it at high k!
Location of the crossover point varies, but it gets lower with more training (Figure 7, bottom), suggesting that no amount of RL training of this kind lets a model surpass the pass@k performance of the base/instruct model at the crossover point reached with a small amount of RL training. (Would be interesting to know how the pass@k plots depend on the number of reasoning tokens, for models that allow control over the reasoning budget.)
A task is solved at pass@k if an oracle verifier claims at least one of k sampled solutions to be correct. See Figure 3, left in this Jul 2024 paper for how pass@k affects performance, depending on the model. ↩︎
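For reference, a common way to estimate pass@k from n samples per task is the unbiased estimator from the Codex/HumanEval methodology (a minimal sketch; not necessarily the exact procedure used in the paper above):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples per task, c of which are verified correct.

    1 - C(n-c, k) / C(n, k): the probability that a uniformly random subset of k
    of the n samples contains at least one correct solution.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per task, 5 judged correct by the oracle verifier.
print(pass_at_k(n=200, c=5, k=1))    # 0.025
print(pass_at_k(n=200, c=5, k=50))   # ~0.77
```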
o3 has a different base model (presumably).
All of the figures hold the base model fixed between the RL-trained and non-RL variants.
I would expect "this paper doesn't have the actual optimal methods" to be true; this is specifically a test of PPO on in-distribution actions. Concretely, there is a potential story here where PPO reinforces traces that hit during self-play, so there is a sense in which we would expect it to only select previously on-policy actions.
But given enough money, one could finetune GPT models and test that.
Also note that 10k submissions is about 2 OOM out of distribution for the charts in the paper.
Pass at infinite k includes every path with nonzero probability (given a policy of discarding exact repeat paths).
We know that RL decreases model entropy, so the k samples will be more diverse for a higher-variance model.
Pass@k is take-best, and for a normal distribution the expected best of k samples is approximately mean + std·sqrt(2·ln(samples)).
At very large k, we would expect variance to matter more than mean.
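A toy numeric illustration of such a crossover (hypothetical per-task success probabilities, not data from the paper): concentrating probability on fewer tasks or solution modes raises pass@k at small k but loses on coverage at large k.

```python
# Toy model: each task has a per-sample success probability p;
# pass@k for a task is 1 - (1 - p)**k, averaged over tasks.

def avg_pass_at_k(task_probs, k):
    return sum(1 - (1 - p) ** k for p in task_probs) / len(task_probs)

base_tasks = [0.01] * 100                 # base model: broad coverage, all tasks at p = 1%
rl_tasks = [0.5] * 30 + [1e-6] * 70       # RL'd model: 30% of tasks at p = 50%, rest ~0

for k in (1, 8, 32, 128, 1024):
    print(f"k={k:>4}  base={avg_pass_at_k(base_tasks, k):.3f}"
          f"  rl={avg_pass_at_k(rl_tasks, k):.3f}")
# The RL'd model wins at small k; the base model overtakes at large k.
```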
Google might start 2026 with the largest training system among the big labs, by a factor of about 2x, at about 1 GW.
OpenAI/Microsoft Stargate schism suggests that compute being built this year by Microsoft is unlikely to form part of a geographically distributed training system that also includes compute being built at Abilene site. Seems like OpenAI will be building its own training systems (through Stargate), while Microsoft will be serving inference (and possibly generation for RL training, but it remains unclear if it can be an important fraction of pretraining budget in 2025-2026). Thus only 400-600 MW of GB200s by end of 2025 for an OpenAI training system, not 1 GW.
Meta announced a 2 GW datacenter at Richland Parish site, but 1 GW for 2025 seems to be across all datacenters, not for a single training system. So the training system will be smaller by end of 2025.
What actually happens with xAI and Anthropic compute by end of 2025 is less clear. For xAI, 300K B200s figure was mentioned in June 2024. For Anthropic, Amodei said in a recent interview that
I would not be surprised if in 2026 we have more than a million of some kind of chip.
Meanwhile, xAI will have a 200K H100/H200 system, and Anthropic a 400K Trn2 system, which is about 250K H100s worth of FLOP/s (ready by a few months into 2025). The 400-600 MW at Abilene site for OpenAI are 200K-300K B200s, which is about 500K-750K H100s worth of FLOP/s.
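Converting those to H100-equivalents of FLOP/s (chip counts and per-chip ratios as given above; ~1e15 dense BF16 FLOP/s per H100 as the unit of comparison):

```python
# End-of-2025 training systems in H100-equivalents of FLOP/s.
xai = 200_000 * 1.0              # H100/H200
anthropic = 400_000 * 0.65       # Trn2 at ~0.65e15 vs ~1e15 BF16 FLOP/s
openai_low, openai_high = 200_000 * 2.5, 300_000 * 2.5   # B200 at ~2.5x an H100

print(f"xAI:       ~{xai:,.0f} H100-equivalents")
print(f"Anthropic: ~{anthropic:,.0f} H100-equivalents")
print(f"OpenAI:    ~{openai_low:,.0f}-{openai_high:,.0f} H100-equivalents")
```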
GPT-5 should be released late 2025 at the earliest if OpenAI follows the usual naming convention of roughly 100x in raw compute. With GPT-4 at 2e25 FLOPs, GPT-4.5 should have about 2e26 FLOPs and GPT-5 about 2e27 FLOPs. A 100K H100 training system, like the one in Goodyear (or Musk's Memphis datacenter as it was late 2024), can train a 3e26 FLOPs model, which fits the name of GPT-4.5, but it can't train a 2e27 FLOPs model.
The new Stargate site in Abilene might be preparing to host 200K-300K chips in GB200 NVL72 racks. These chips produce 2.5x more compute than H100s, so 200K would be sufficient to get 2e27 FLOPs and train a GPT-5. If there's already enough power (about 400 MW all-in for 200K chips), shipments of GB200 in bulk start in early 2025, get installed at xAI's pace, and go into pretraining for 4 months, then with 1 more month of post-training it's already November.
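The timeline arithmetic, as a sketch (200K chips, the ~2.5x-per-chip factor, and ~1e21 FLOP/month per H100 at 40% utilization, all as above):

```python
# Months of pretraining for a 2e27 FLOPs ("GPT-5"-sized) run on 200K GB200-rack chips.
chips = 200_000
h100_equivalents = chips * 2.5                 # each chip ~2.5x an H100
flop_per_month = h100_equivalents * 1e21       # ~1e21 FLOP/month per H100

pretrain_months = 2e27 / flop_per_month
print(pretrain_months)                         # ~4 months

# Bulk GB200 shipments from early 2025, installation at xAI's pace, 4 months of
# pretraining, plus ~1 month of post-training lands around November 2025.
```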
So the rumors about GPT-5 in late May 2025 either represent change in the naming convention, or correspond to some intermediate milestone in training GPT-5, likely the training system being in principle ready to start pretraining.
So the rumors about GPT-5 in late May 2025 either represent change in the naming convention
In both ChatGPT and our API, we will release GPT-5 as a system that integrates a lot of our technology, including o3. We will no longer ship o3 as a standalone model.
I think he's pretty plainly saying that this "GPT-5" will be a completely different thing from a 100x'd GPT-4.
if OpenAI follows the usual naming convention of roughly 100x in raw compute.
I doubt this is a real convention. I think OpenAI wanted to call Orion GPT-5 if they thought it was good enough to deserve the name.
Stargate is evidence towards slower training system scaling. The rumored reason for starting the project is that Microsoft isn't building giant frontier training systems fast enough, probably because they aren't seeing the case for doing that faster. In which case other hyperscalers might think similarly, and they are the most well-positioned to build these systems, so this attitude might be indicative of how frontier training systems get built overall, which is notably slower than technically feasible.
The $80bn Microsoft capex is not relevant to this if it goes to many smaller systems[1], which is only natural as there are millions of datacenter GPUs but only a few 100K GPU frontier training systems, a tiny fraction of inference and smaller/research training compute. The $500bn figure is not relevant as for now it's only a vague plan. But Microsoft not agreeing to build training systems on OpenAI's schedule is some evidence.
OpenAI would want to get from under Microsoft's thumb anyway[2], and this gets ever more difficult over time, since frontier training systems get ever more expensive, so the sooner they try the more likely they are to succeed. But even this consideration is som...
When people are skeptical about the concept of AGI being meaningful or having clear boundaries, it could sometimes be downstream of skepticism about very fast and impactful R&D done by AIs, such as software-only singularity or things like macroscopic biotech where compute buildout happens at a speed impossible for human industry. Such events are needed to serve as landmarks, anchoring a clear concept of AGI, otherwise the definition remains contentious.
So AI company CEOs who complain about AGI being too nebulous to define might already be expecting a scaling slowdown, with their strategy being primarily about the fight for the soul of the 2028-2030 market. When scaling is slow, it'll become too difficult to gain a significant quality advantage sufficient to defeat the incumbents. So the decisive battle is happening now, with the rhetoric making it more palatable to push through the decisions to build the $140bn training systems of 2028.
This behavior doesn't need to be at all related to expecting superintelligence, it makes sense as a consequence of not expecting superintelligence in the near future.
Yi-Lightning (01 AI) Chatbot Arena results are surprisingly strong for its price, which puts it at about 10B active parameters[1]. It's above Claude 3.5 Sonnet and GPT-4o in Math, above Gemini 1.5 Pro 002 in English and Hard Prompts (English). It's above all non-frontier models in Coding and Hard Prompts (both with Style Control), including Qwen-2.5-72B (trained on 18T tokens). Interesting if this is mostly a better methodology or compute scaling getting taken more seriously for a tiny model.
The developer's site says it's a MoE model. Developer's API docs list it at ¥0.99/1M tokens. The currency must be Renminbi, so that's about $0.14. Together serves Llama-3-8B for $0.10-0.18 (per million tokens), Qwen-2.5-7B for $0.30, all MoE models up to 56B total (not active) parameters for $0.60. (The prices for open weights models won't have significant margins, and model size is known, unlike with lightweight closed models.) ↩︎
Cultural/moral maturity (in a civilization) has never been observed before, similarly to technological maturity. Scalable production of a new kind of thing brings its abundance in sight, something that wasn't a concern earlier, while it couldn't yet be scaled. A moderate level of AI alignment or of cultural change is not an equilibrium if these things are anchored to scalable resources (effective cognition and coordination, fast subjective serial time). Instead they reach extremes of a kind never observed before those resources become scalable.
Economics studies the scaling laws of systems of human industry. LLMs and multicellular organisms and tokamaks have their own scaling laws, the constraints ensuring optimality of their scaling don't transfer between these very different machines. A better design doesn't just choose more optimal hyperparameters or introduce scaling multipliers, it can occasionally create a new thing acting on different inputs and outputs, scaling in its own way, barely noticing what holds back the other things.
A reflectively stable agent prefers to preserve some property of itself. This doesn't in general prevent it from being able to self-improve, in the same way that unchanging laws of physics don't prevent presence of self-improving agents in the world.
The content of the world keeps changing under the unchanging laws of how it changes, and similarly a reflectively stable agent (against safety properties) has content (such as beliefs) that keeps changing, in principle enabling unfettered self-improvement. Mesa-agents existing in the form of the content of the ...