All of Vladimir_Nesov's Comments + Replies

In the hypothetical where the paper's results hold, reasoning model performance at pass@k will match non-reasoning model performance with the number of samples closer to the crossover point between reasoning and non-reasoning pass@k plots. If those points for o1 and o3 are somewhere between 50 and 10K (say, at ~200), then pass@10K for o1 might be equivalent to ~pass@400 for o1's base model (looking at Figure 2), while pass@50 for o3 might be equivalent to ~pass@100 for its base model (which is probably different from o1's base model).

So the difference of 2... (read more)

1mrtreasure
If true, would this imply you want a base model to generate lots of solutions and a reasoning model to identify the promising ones and train on those?

It's evidence to the extent that the mere fact of publishing Figure 7 (hopefully) suggests that the authors (likely knowing relevant OpenAI internal research) didn't expect their pass@10K result for the reasoning model to be much worse than the language monkey pass@10K result for the underlying non-reasoning model. So maybe it's not actually worse.

Long reasoning training might fail to surpass pass@50-pass@400 capabilities of the base/instruct model. A new paper measured pass@k[1] performance for models before and after RL training on verifiable tasks, and it turns out that the effect of training is to lift pass@k performance at low k, but also to lower it at high k!

Location of the crossover point varies, but it gets lower with more training (Figure 7, bottom), suggesting that no amount of RL training of this kind lets a model surpass the pass@k performance of the base/instruct model at the crossover... (read more)
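
If it helps intuition, here is a minimal sketch (my toy numbers, not the paper's data) of why RL that sharpens the per-problem solve-rate distribution produces exactly this kind of crossover:

```python
import numpy as np

# Toy illustration (not from the paper, numbers made up): per-problem solve rates
# for a hypothetical base model vs. the same model after RL on verifiable tasks.
# RL is assumed to sharpen the distribution: higher solve rates on tractable
# problems, near-zero on problems the policy no longer explores.
rng = np.random.default_rng(0)
p_base = rng.beta(0.3, 3.0, size=1000)                 # many small but nonzero solve rates
p_rl = np.where(p_base > 0.05, np.minimum(1.0, 5 * p_base), 1e-6)

def pass_at_k(p, k):
    """Expected pass@k over the problem set, with i.i.d. samples per problem."""
    return float(np.mean(1 - (1 - p) ** k))

for k in (1, 10, 100, 1000, 10000):
    print(f"k={k:>5}  base={pass_at_k(p_base, k):.3f}  rl={pass_at_k(p_rl, k):.3f}")
# The RL'd model is ahead at small k, but the base model overtakes it at large k:
# the crossover described above.
```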

2Thane Ruthenis
Huh. This is roughly what I'd expected, but even I didn't expect it to be so underwhelming.[1] I weakly predict that the situation isn't quite as bad for capabilities as this makes it look. But I do think something-like-this is likely the case. 1. ^ Of course, moving a pass@400 capability to pass@1 isn't nothing, but it's clearly astronomically short of a Singularity-enabling technique that RL-on-CoTs is touted as.
5ryan_greenblatt
This seems relatively clearly false in the case of competition programming problems. Concretely, o3 with 50 submissions beats o1 with 10k submissions. (And o1 is presumably much better than the underlying instruct model.) I'd guess this paper doesn't have the actual optimal methods.

The state of the geopolitical board will influence how the pre-ASI chaos unfolds, and how the pre-ASI AGIs behave. Less plausibly, the intentions of the humans in charge might influence something about the path-dependent characteristics of ASI (by the time it takes control). But given the state of the "science" and the lack of will to be appropriately cautious and wait a few centuries before taking the leap, it seems more likely that the outcome will be randomly sampled from approximately the same distribution regardless of who sets off the intelligence explosion.

For me the main update from o3 is that since it's very likely GPT-4.1 with reasoning and is at Gemini 2.5 Pro level, the latter is unlikely to be a GPT-4.5 level model with reasoning. And so we still have no idea what a GPT-4.5 level model with reasoning can do, let alone when trained to use 1M+ token reasoning traces. As Llama 4 was canceled, irreversible proliferation of the still-unknown latent capabilities is not yet imminent at that level.

the entity in whose hands all power is concentrated are the people deciding on what goals/constraints to instill into the ASI

Its goals could also end up mostly forming on their own, regardless of intent of those attempting to instill them, with indirect influence from all the voices in the pretraining dataset.

Consider what it means for power to "never concentrate to an extreme degree", as a property of the civilization as a whole. This might also end up a property of an ASI as a whole.

There are new Huawei Ascend 910C CloudMatrix 384 systems that form scale-up worlds comparable to GB200 NVL72, which is key to being able to run long reasoning inference for large models much faster and cheaper than possible using systems with significantly smaller world sizes like the current H100/H200 NVL8 (and also makes it easier to run training, though not as essential unless RL training really does scale to the moon).

Apparently TSMC produced ~2.1M compute dies for these systems in 2024-2025, which is 1.1M chips, and an Ascend 910C chip is 0.8e15 dense... (read more)
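
Back-of-the-envelope on what that adds up to (per-chip throughput is the figure above; the H100 comparison rate is my assumption):

```python
# Aggregate throughput implied by the figures above.
dies = 2.1e6                 # compute dies reportedly produced by TSMC in 2024-2025
chips = dies / 2             # an Ascend 910C packages two compute dies (~the 1.1M chips above)
flops_per_chip = 0.8e15      # dense BF16 FLOP/s per 910C, figure quoted above

total = chips * flops_per_chip
print(f"aggregate: {total:.1e} FLOP/s")                      # ~8.4e20 FLOP/s

# For scale: at ~1e15 dense BF16 FLOP/s per H100 (my assumption), that's the raw
# throughput of roughly 800K H100s, before utilization and software differences.
print(f"~{total / 1e15 / 1e3:.0f}K H100-equivalents (raw FLOP/s only)")
```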

9Vladimir_Nesov
(The relevance is that whatever the plans are, they need to be grounded in what's technically feasible, and this piece of news changed my mind on what might be technically feasible in 2026 on short notice. The key facts are systems with a large scale-up world size, and enough compute dies to match the compute of Abilene site in 2026, neither of which was obviously possible without more catch-up time, by which time the US training systems would've already moved on to an even greater scale.)

Economics studies the scaling laws of systems of human industry. LLMs and multicellular organisms and tokamaks have their own scaling laws, the constraints ensuring optimality of their scaling don't transfer between these very different machines. A better design doesn't just choose more optimal hyperparameters or introduce scaling multipliers, it can occasionally create a new thing acting on different inputs and outputs, scaling in its own way, barely noticing what holds back the other things.

My first impression of o3 (as available via Chatbot Arena) is that when I show it my AI scaling analysis comments (such as this and this), it responds with confident unhinged speculation teeming with hallucinations, compared to the other recent models that usually respond with bland rephrasings that get almost everything right, with a few minor hallucinations or reasonable misconceptions carrying over from their outdated knowledge.

Don't know yet if it's specific to speculative/forecasting discussions, but it doesn't look good (for faithfulness of a... (read more)

3Paragox
The system card also contains some juicy regression hidden within the worst-graph-of-all-time in the SWE-Lancer section: If you can cobble together the will to work through the inane color scheme, it is very interesting to note that while the expected RL-able IC tasks show improvements, the Manager improvements are far less uniform, and in particular o1 (and 4o!) remains the stronger performer vs o3 when weighted by the (ahem, controversial) $$$-based benchmark. And this is all within the technical field of (essentially) system design, with verifiable answers, that the SWE-Lancer manager benchmark represents. So the finite set of activated weights is likely getting cannibalized further from pre-training generality, towards the increasingly evident fragility of RLed tasks. However, I feel it is also decent evidence towards the perennial question of activated weight size re. o1 vs o3, and that o3 is not yet the model designed to consume the extensive (yet expensive) shared world size capacity of OAI's shiny new NV72 racks.
As a separate aside, it was amusing setting o3 off to work on re-graphing this data into a sane format, and observing the RL-ed tool-use fragility: it took 50+ tries over 15 minutes of repeatedly failed panning, cropping, and zooming operations for it to diligently work out an accurate data extraction, but work out an extraction it did. Inference scaling in action!

Will Brown: it's simple, really. GPT-4.1 is o3 without reasoning ... o1 is 4o with reasoning ... and o4 is GPT-4.5 with reasoning.

Price and knowledge cutoff for o3 strongly suggest it's indeed GPT-4.1 with reasoning. And so again we don't get to see the touted scaling of reasoning models, since the base model got upgraded instead of remaining unchanged. (I'm getting the impression that GPT-4.5 with reasoning is going to be called "GPT-5" rather than "o4", similarly to how Gemini 2.5 Pro is plausibly Gemini 2.0 Pro with reasoning.)

In any case, the fact t... (read more)

3Rasool
Does this match your understanding?

| AI Company | Public/Preview Name | Hypothesized Base Model | Hypothesized Enhancement | Notes |
|---|---|---|---|---|
| OpenAI | GPT-4o | GPT-4o | None (Baseline) | The starting point, multimodal model. |
| OpenAI | o1 | GPT-4o | Reasoning | First reasoning model iteration, built on the GPT-4o base. Analogous to Anthropic's Sonnet 3.7 w/ Reasoning. |
| OpenAI | GPT-4.1 | GPT-4.1 | None | An incremental upgrade to the base model beyond GPT-4o. |
| OpenAI | o3 | GPT-4.1 | Reasoning | Price/cutoff suggest it uses the newer GPT-4.1 base, not GPT-4o + reasoning. |
| OpenAI | GPT-4.5 | GPT-4.5 | None | A major base model upgrade. |
| OpenAI | GPT-5 | GPT-4.5 | Reasoning | "GPT-5" might be named this way, but technologically be GPT-4.5 + Reasoning. |
| Anthropic | Sonnet 3.5 | Sonnet 3.5 | None | Existing model. |
| Anthropic | Sonnet 3.7 w/ Reasoning | Sonnet 3.5 | Reasoning | Built on the older Sonnet 3.5 base, similar to how o1 was built on GPT-4o. |
| Anthropic | N/A (Internal) | Newer Sonnet | None | Internal base model analogous to OpenAI's GPT-4.1. |
| Anthropic | N/A (Internal) | Newer Sonnet | Reasoning | Internal reasoning model analogous to OpenAI's "o3". |
| Anthropic | N/A (Internal) | Larger Opus | None | Internal base model analogous to OpenAI's GPT-4.5. |
| Anthropic | N/A (Internal) | Larger Opus | Reasoning | Internal reasoning model analogous to hypothetical GPT-4.5 + Reasoning. |
| Google | N/A (Internal) | Gemini 2.0 Pro | None | Plausible base model for Gemini 2.5 Pro according to the author. |
| Google | Gemini 2.5 Pro | Gemini 2.0 Pro | Reasoning | Author speculates it's likely Gemini 2.0 Pro + Reasoning, rather than being based on a GPT-4.5 scale model. |
| Google | N/A (Internal) | Gemini 2.0 Ultra | None | Hypothesized very large internal base model. Might exist primarily for knowledge distillation (Gemma 3 insight). |

To me these kinds of failures feel more "seem to be at the core of the way LLMs reason".

Right, I was more pointing out that if the analogy holds to some extent, then long reasoning training is crucial as the only locus of feedback (and also probably insufficient in current quantities relative to pretraining). The analogy I intended is this being a perception issue that can be worked around without too much fundamental difficulty, but only with sufficient intentional caution. Humans have the benefit of lifelong feedback and optimization by evolution, so ... (read more)

4Kaj_Sotala
Right, that sounds reasonable. One thing that makes me put less probability in this is that at least so far, the domains where reasoning models seem to shine are math/code/logic type tasks, with more general reasoning like consistency in creative writing not benefiting as much. I've sometimes enabled extended thinking when doing fiction-writing with Claude and haven't noticed a clear difference. That observation would at least be compatible with the story where reasoning models are good on things where you can automatically generate an infinite number of problems to automatically provide feedback on, but less good on tasks outside such domains. So I would expect reasoning models to eventually get to a point where they can reliably solve things in the class of the sliding square puzzle, but not necessarily get much better at anything else.
Though hmm. Let me consider this from an opposite angle. If I assume that reasoning models could perform better on these kinds of tasks, how might that happen?
* What I just said: "Though hmm. Let me consider this from an opposite angle." That's the kind of general-purpose thought that can drastically improve one's reasoning, and that the models could be taught to automatically do in order to e.g. reduce sycophancy. First they think about the issue from the frame that the user provided, but then they prompt themselves to consider the exact opposite point and synthesize those two perspectives.
* There are some pretty straightforward strategies for catching the things in the more general-purpose reasoning category:
  * Following coaching instructions - teaching the model to go through all of the instructions in the system prompt and individually verify that it's following each one. Could be parallelized, with different threads checking different conditions.
  * Writing young characters - teaching the reasoning model to ask itself something like "is there anything about this character's behavior that seems unrealistic given what

the fact that e.g. GPT-4.5 was disappointing

It's not a reasoning variant though, the only credible reasoning model at the frontier ~100K H100s scale that's currently available is Gemini 2.5 Pro (Grok 3 seems to have poor post-training, and is suspiciously cheap/fast without Blackwell or presumably TPUs, so likely rather overtrained). Sonnet 3.7 is a very good GPT-4 scale reasoning model, and the rest are either worse or trained for even less compute or both. These weird failures might be analogous to optical illusions (but they are textual, not known to... (read more)

2Kaj_Sotala
Yeah, to be clear in that paragraph I was specifically talking about whether scaling just base models seems enough to solve the issue. I discussed reasoning models separately, though for those I have lower confidence in my conclusions. To me "analogous to visual illusions" implies "weird edge case". To me these kinds of failures feel more "seem to be at the core of the way LLMs reason". (That is of course not to deny that LLMs are often phenomenally successful as well.) But I don't have a rigorous argument for that other than "that's the strong intuition I've developed from using them a lot, and seeing these kinds of patterns repeatedly".

I see what you mean (I did mostly change the topic to the slowdown hypothetical). There is another strange thing about AI companies: I think giving the ~50% cost-of-inference figure too much precision in the foreseeable future is wrong, as it's highly uncertain and malleable in a way that's hard for even the company itself to anticipate.

A ~2x difference in inference cost (or in the size of a model) can be merely hard to notice when nothing substantial changes in the training recipe (and training cost), and better post-training (which is relatively cheap) can get that... (read more)

2Randaly
Thanks for explaining. I now agree that the current cost of inference isn't a very good anchor for future costs in slowdown timelines. I'm uncertain, but I still think OpenAI is likely to go bankrupt in slowdown timelines. Here are some related thoughts:
1. OpenAI probably won't pivot to the slowdown in time.
   1. They'd have < 3 years to do so before running out of money.
   2. Budgets are set in advance. So they'd have even less time.
   3. All of the improvements you list cost time and money. So they'd need to continue spending on R&D, before that R&D has improved their cost of inference. In practice, they'd need to stop pushing the frontier even earlier, to have more time and money available.
   4. There's not many generations of frontier models left before Altman would need to halt scaling R&D.
   5. Altman is currently racing to AGI; and I don't think it's possible, on the slowdown hypothetical, for him to get enough evidence to convince him to stop in time.
2. Revenue (prices) may scale down alongside the cost of inference.
   1. Under perfect competition, widely-shared improvements in the production of a commodity will result in price decreases rather than profit increases.
   2. There are various ways competition here is imperfect; but I think that imperfect competition benefits Google more than OpenAI. That's really bad, since OpenAI's finances are also much worse.
   3. This still makes the cost of inference/cost of revenue estimates wrong; but OpenAI might not be able to make enough money to cover their debt and what you called "essential R&D". Dunno.
3. Everybody but Anthropic are already locked in to much of their inference (and R&D) spending via capex.
   1. AI 2027 assumes that OpenAI uses different chips specialized in training vs. inference.
   2. The cost of the datacenter and the GPU's it contains is fixed, and I believe it makes up most of the cost of inference today. OpenAI, via Project Stargate, is switching from renting GPU's t

We use reasoning models with more inference time compute to generate better data to train better base models to more efficiently reproduce similar capability levels with less compute to build better reasoning models.

This kind of thing isn't known to meaningfully work, as something that can potentially be done on pretraining scale. It also doesn't seem plausible without additional breakthroughs given the nature and size of verifiable task datasets, with things like o3-mini getting ~matched on benchmarks by post-training on datasets containing 15K-120K pr... (read more)

OpenAI continuing to lose money

They are losing money only if you include all the R&D (where the unusual thing is very expensive training compute for experiments), which is only important while capabilities keep improving. If/when the capabilities stop improving quickly, somewhat cutting research spending won't affect their standing in the market that much. And also after revenue grows some more, essential research (in the slow capability growth mode) will consume a smaller fraction. So it doesn't seem like they are centrally "losing money", the plau... (read more)

5Randaly
I want to clarify that I'm criticizing "AI 2027"'s projection of R&D spending, i.e. this table. If companies cut R&D spending, that falsifies the "AI 2027" forecast. In particular, the comment I'm replying to proposed that while the current money would run out in ~2027, companies could raise more to continue expanding R&D spending. Raising money for 2028 R&D would need to occur in 2027; and it would need to occur on the basis of financial statements of at least a quarter before the raise. So in this scenario, they need to slash R&D spending in 2027- something the "AI 2027" authors definitely don't anticipate. Furthermore, your claim that "they are losing money only if you include all the R&D" may be false. We lack sufficient breakdown of OpenAI's budget to be certain. My estimate from the post was that most AI companies have 75% cost of revenue; OpenAI specifically has a 20% revenue sharing agreement with Microsoft; and the remaining 5% needs to cover General and Administrative expenses. Depending on the exact percentage of salary and G&A expenses caused by R&D, it's plausible that OpenAI eliminating R&D entirely wouldn't make it profitable today. And in the future OpenAI will also need to pay interest on tens of billions in debt.

in real life no intelligent being ... can convert themselves into a rock

if they become a rock ... the other players will not know it

Refusing in the ultimatum game punishes the prior decision to be unfair, not what remains after the decision is made. It doesn't matter if what remains is capable of making further decisions, the negotiations backed by ability to refuse an unfair offer are not with them, but with the prior decision maker that created them.

If you convert yourself into a rock (or a utility monster), it's the decision to convert yourself th... (read more)

1Knight Lee
After thinking about it more, it's possible your model of why Commitment Races resolve fairly is more correct than my model of why Commitment Races resolve fairly, although I'm less certain they do resolve fairly.
My model's flaw
My model is that acausal influence does not happen until one side deliberately simulates the other and sees their commitment. Therefore, it is advantageous for both sides to commit up to but not exceeding some Schelling point of fairness, before simulating the other, so that the first acausal message will maximize their payoff without triggering a mutual disaster.
I think one possibly fatal flaw of my model is that it doesn't explain why one side shouldn't add the exception "but if the other side became a rock with an ultimatum, I'll still yield to them, conditional on the fact they became a rock with an ultimatum before realizing I will add this exception (by simulating me or receiving acausal influence from me)."
According to my model, adding this exception improves one's encounters with rocks with ultimatums by yielding to them, and does not increase the rate of encountering rocks with ultimatums (at least in the first round of acausal negotiation, which may be the only round), since the exception explicitly rules out yielding to agents affected by whether you make the exception.
This means that in my model, becoming a rock with an ultimatum may still be the winning strategy, conditional on the fact that the agent becoming a rock with an ultimatum doesn't know it is the winning strategy, and the Commitment Race problem may reemerge.
Your model
My guess of your model is that acausal influence is happening a lot, such that refusing in the ultimatum game can successfully punish the prior decision to be unfair (i.e. reduce the frequency of prior decisions to be unfair). In order for your refusal to influence their frequency of being unfair, your refusal has to have some kind of acausal influence on them, even if they are relatively simpler

LW doesn't punish, it upvotes-if-interesting and then silently judges.

confidence / effort ratio

(Effort is not a measure of value, it's a measure of cost.)

5Cole Wyeth
Yeah, I was thinking greater effort is actually necessary in this case. For context, my lower effort posts are usually more popular. Also the ones that focus on LLMs which is really not my area of expertise.

The other side is forced to agree to that, just to get a little.

That's not how the ultimatum game works in non-CDT settings, you can still punish the opponent for offering too little, even at the cost of getting nothing in the current possible world (thereby reducing its weight and with it the expected cost). In this case it deters commitment racing.

1Knight Lee
:) yes, I was illustrating what the Commitment Race theory says will happen, not what I believe (in that paragraph). I should have used quotation marks or better words. Punishing the opponent for offering too little is what my pie example was illustrating. ---------------------------------------- The proponents of Commitment Race theory will try to refute you by saying "oh yeah, if your opponent was a rock with an ultimatum, you wouldn't punish it. So an opponent who can make himself rock-like still wins, causing a Commitment Race." Rocks with ultimatums do win in theoretical settings, but in real life no intelligent being (who has actual amounts of power) can convert themselves into a rock with an ultimatum convincingly enough that other intelligent beings will already know they are rocks with ultimatums before they decide what kind of rock they want to become. Real life agents have to appreciate that even if they become a rock with an ultimatum, the other players will not know it (maybe due to deliberate self blindfolding), until the other players also become rocks with ultimatums. And so they have to choose an ultimatum which is compatible with other ultimatums, e.g. splitting a pie by taking 50%. Real life agents are the product of complex processes like evolution, making it extremely easy for your opponent to refuse to simulate you (and the whole process of evolution that created you), and thus refuse to see what commitment you made, until they made their commitment. Actually it might turn out quite tricky to avoid accurately imagining what another agent would do (and giving them acausal influence on you), but my opinion is it will be achievable. I'm no longer very certain.

The term is a bit conflationary. Persuasion for the masses is clearly a thing, its power is coordination of many people and turning their efforts to (in particular) enforce and propagate the persuasion (this works even for norms that have no specific persuader that originates them, and contingent norms that are not convergently generated by human nature). Individual persuasion with a stronger effect that can defeat specific people is probably either unreliable like cults or conmen (where many people are much less susceptible than some, and objective decept... (read more)

the impact of new Blackwell chips with improved computation

It's about world size, not computation, and has a startling effect that probably won't occur again with future chips, since Blackwell sufficiently catches up to models at the current scale.

But even then, OpenAI might get to ~$25bn annualized revenue that won't be going away

What is this revenue estimate assuming?

The projection for 2025 is $12bn at 3x/year growth (1.1x per month, so $1.7bn per month at the end of 2025, $3bn per month in mid-2026), and my pessimistic timeline above assumes... (read more)
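
A minimal sketch of that arithmetic (the 3x/year growth rate and the $12bn annual figure are from above; the rest follows from them):

```python
# $12bn of 2025 revenue growing ~3x/year, i.e. ~1.1x per month (3 ** (1/12) ~= 1.096).
annual_2025 = 12e9
g = 3 ** (1 / 12)

# December 2025 monthly revenue m such that the 12 months of 2025 sum to $12bn.
dec_2025 = annual_2025 / sum(g ** -j for j in range(12))
print(f"end of 2025: ${dec_2025 / 1e9:.1f}bn/month")              # ~$1.6bn (the ~$1.7bn above)
print(f"mid-2026:    ${dec_2025 * g ** 6 / 1e9:.1f}bn/month")     # ~$2.7bn, roughly the ~$3bn above
```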

1Remmelt
  Thanks, I got to say I’m a total amateur when it comes to GPU performance. So will take the time to read your linked-to comment to understand it better. 

Not knowing n(-) results in not knowing expected utility of b (for any given b), because you won't know how the terms a(n(a), n(a)) are formed.

(And also the whole being given numeric codes of programs as arguments thing gets weird when you are postulated to be unable to interpret what the codes mean. The point of Newcomblike problems is that you get to reason about behavior of specific agents.)

1Tapatakt
Basically you know if Omega's program is the same as you or not (assuming you actually are b and not a)

I can't think of any reason to give a confident, high precision story that you don't even believe in!

Datapoints generalize, a high precision story holds gears that can be reused in other hypotheticals. I'm not sure what you mean by the story being presented as "confident" (in some sense it's always wrong to say that a point prediction is "confident" rather than zero probability, even if it's the mode of a distribution, the most probable point). But in any case I think giving high precision stories is a good methodology for communicating a framing, point... (read more)

Question 1: Assume you are program b. You want to maximize the money you receive. What should you output if your input is (x,x) (i.e., the two numbers are equal)?

Question 2: Assume you are the programmer writing program b. You want to maximize the expected money program b receives. How should you design b to behave when it receives an input (x,x)?

Do you mean to ask how b should behave on input (n(b), n(b)), and how b should be written to behave on input (n(b), n(b)) for that b?

If x differs from n(b), it might matter in some subtle ways but not straig... (read more)

1Tapatakt
Yes. You can assume that programmer doesn't know how n works.

Official policy documents from AI companies can be useful in bringing certain considerations into the domain of what is allowed to be taken seriously (in particular, by the governments), as opposed to remaining weird sci-fi ideas to be ignored by most Serious People. Even declarations by AI company leaders or Turing award winners or Nobel laureates or some of the most cited AI scientists won't by themselves have that kind of legitimizing effect. So it's not necessary for such documents to be able to directly affect actual policies of AI companies, they can still be important in affecting these policies indirectly.

6Thane Ruthenis
Fair point. The question of the extent to which those documents can be taken seriously as statements of company policy (as opposed to only mattering in signaling games) is still worthwhile, I think.

I think it's overdetermined by Blackwell NVL72/NVL36 and long reasoning training that there will be no AI-specific "crash" until at least late 2026. Reasoning models want a lot of tokens, but their current use is constrained by cost and speed, and these issues will be going away to a significant extent. Already Google has Gemini 2.5 Pro (taking advantage of TPUs), and within a few months OpenAI and Anthropic will make reasoning variants of their largest models practical to use as well (those pretrained at the scale of 100K H100s / ~3e26 FLOPs, meaning GPT-... (read more)

1Remmelt
Thanks, I might be underestimating the impact of new Blackwell chips with improved computation.  I’m skeptical whether offering “chain-of-thought” bots to more customers will make a significant difference. But I might be wrong – especially if new model architectures would come out as well.  And if corporations throw enough cheap compute behind it plus widespread personal data collection, they can get to commercially very useful model functionalities. My hope is that there will be a market crash before that could happen, and we can enable other concerned communities to restrict the development and release of dangerously unscoped models.   What is this revenue estimate assuming?

I think the idea of effective FLOPs has more narrow applicability than what you are running with, many things that count as compute multipliers don't scale. They often only hold for particular capabilities that stop being worth boosting separately at greater levels of scale, or particular data that stops being available in sufficient quantity. An example of a scalable compute multiplier is MoE (even as it destroys data efficiency, and so damages some compute multipliers that rely on selection of high quality data). See Figure 4 in the Mamba paper for anoth... (read more)

1Benjamin_Todd
Hmm interesting. I see the case for focusing only on compute (since relatively objective), but it still seems important to try to factor in some amount of algorithmic progress to pretraining (which means that the cost to achieve GPT-6 level performance will be dropping over time). The points on Epoch are getting outside of my expertise – I see my role as to synthesise what experts are saying. It's good to know these critiques exist and it would be cool to see them written up and discussed.

spending tens of billions of dollars to build clusters that could train a GPT-6-sized model in 2028

Traditionally steps of GPT series are roughly 100x in raw compute (I'm not counting effective compute, since it's not relevant to cost of training). GPT-4 is 2e25 FLOPs. Which puts "GPT-6" at 2e29 FLOPs. To train a model in 2028, you would build an Nvidia Rubin Ultra NVL576 (Kyber) training system in 2027. Each rack holds 576 compute dies at about 3e15 BF16 FLOP/s per die[1] or 1.6e18 FLOP/s per rack. A Blackwell NVL72 datacenter costs about $4M per rack t... (read more)
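
For concreteness, a sketch of what 2e29 FLOPs demands of such racks, with training duration, utilization and rack price as my assumptions (not figures from this comment):

```python
target_flops = 2e29          # "GPT-6" at 100x per GPT step from 2e25
rack_flops = 1.6e18          # BF16 FLOP/s per Rubin Ultra NVL576 rack, from above
utilization = 0.4            # assumed MFU
seconds = 4 * 30 * 86400     # assumed ~4-month training run

racks = target_flops / (rack_flops * utilization * seconds)
print(f"~{racks:,.0f} NVL576 racks")                              # ~30,000 racks
for rack_cost in (4e6, 6e6):                                      # assumed all-in $/rack
    print(f"at ${rack_cost / 1e6:.0f}M/rack: ~${racks * rack_cost / 1e9:.0f}bn of racks")
```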

1Benjamin_Todd
Thanks, useful to have these figures and an independent data on these calculations. I've been estimating it based on a 500x increase in effective FLOP per generation, rather than 100x of regular FLOP. Rough calculations are here. At the current trajectory, the GPT-6 training run costs $6bn in 2028, and GPT-7 costs $130bn in 2031.  I think that makes GPT-8 a couple of trillion in 2034. You're right that if you wanted to train GPT-8 in 2031 instead, then it would cost roughly 500x more than training GPT-7 that year.

probability mass for AI that can automate all AI research is in the 2030s ... broadly due to the tariffs and ...

Without AGI, scaling of hardware runs into the financial ~$200bn individual training system cost wall in 2027-2029. Any tribulations on the way (or conversely efforts to pool heterogeneous and geographically distributed compute) only delay that point slightly (when compared to the current pace of increase in funding), and you end up in approximately the same place, slowing down to the speed of advancement in FLOP/s per watt (or per dollar). Without transformative AI, anything close to the current pace is unlikely to last into the 2030s.

With AI assistance, the degree to which an alternative is ready-to-go can differ a lot compared to its prior human-developed state. Also, an idea that's ready-to-go is not yet an edifice of theory and software that's ready-to-go in replacing 5e28 FLOPs transformer models, so some level of AI assistance is still necessary with 2 year timelines. (I'm not necessarily arguing that 2 year timelines are correct, but it's the kind of assumption that my argument should survive.)

The critical period includes the time when humans are still in effective control of the... (read more)

The most important thing about Llama 4 is that the 100K H100s run that was promised got canceled, and its flagship model Behemoth will be a 5e25 FLOPs compute optimal model[1] rather than a ~3e26 FLOPs model that a 100K H100s training system should be able to produce. This is merely 35% more compute than Llama-3-405B from last year, while GPT-4.5, Grok 3 and Gemini 2.5 Pro are probably around 3e26 FLOPs or a bit more. They even explicitly mention that it was trained on 32K GPUs (which must be H100s). Since Behemoth is the flagship model, a bigger model got... (read more)

haven't heard this said explicitly before

Okay, this prompted me to turn the comment into a post, maybe this point is actually new to someone.

prioritization depends in part on timelines

Any research rebalances the mix of currently legible research directions that could be handed off to AI-assisted alignment researchers or early autonomous AI researchers whenever they show up. Even hopelessly incomplete research agendas could still be used to prompt future capable AI to focus on them, while in the absence of such incomplete research agendas we'd need to rely on AI's judgment more completely. So it makes sense to still prioritize things that have no hope at all of becoming practical for decades ... (read more)

1cdt
  This is a key insight and I think that operationalising or pinning down the edges of a new research area is one of the longest time-horizon projects there is. If the METR estimate is accurate, then developing research directions is a distinct value-add even after AI research is semi-automatable. 
2abramdemski
This sort of approach doesn't make so much sense for research explicitly aiming at changing the dynamics in this critical period. Having an alternative, safer idea almost ready-to-go (with some explicit support from some fraction of the AI safety community) is a lot different from having some ideas which the AI could elaborate. 
6Cole Wyeth
I haven't heard this said explicitly before but it helps me understand your priorities a lot better. 

"Revenue by 2027.5" needs to mean "revenue between summer 2026 and summer 2027". And the time when the $150bn is raised needs to be late 2026, not "2027.5", in order to actually build the thing by early 2027 and have it completed for several months already by mid to late 2027 to get that 5e28 BF16 FLOPs model. Also Nvidia would need to have been expecting this or similar sentiment elsewhere months to years in advance, as everyone in the supply chain can be skeptical that this kind of money actually materializes by 2027, and so that they need to build addit... (read more)

A 100K H100s training system is a datacenter campus that costs about $5bn to build. You can use it to train a 3e26 FLOPs model in ~3 months, and that time costs about $500M. So the "training cost" is $500M, not $5bn, but in order to do the training you need exclusive access to a giant 100K H100s datacenter campus for 3 months, which probably means you need to build it yourself, which means you still need to raise the $5bn. Outside these 3 months, it can be used for inference or training experiments, so the $5bn is not wasted, it's just a bit suboptimal to ... (read more)
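
A quick sanity check of those two numbers (utilization and the amortization period are my assumptions):

```python
n_gpus = 100_000
h100_bf16 = 1e15             # ~dense BF16 FLOP/s per H100
utilization = 0.4            # assumed MFU
seconds = 3 * 30 * 86400     # ~3 months

print(f"{n_gpus * h100_bf16 * utilization * seconds:.1e} FLOPs")   # ~3e26 FLOPs

# The ~$500M for 3 months is what a $5bn campus costs if amortized over ~2.5 years.
print(f"${5e9 * 3 / 30 / 1e6:.0f}M")
```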

1SorenJ
Thank you very much, this is so helpful! I want to know if I am understanding things correctly again, so please correct me if I am wrong on any of the following: By "used for inference," this just means basically letting people use the model? Like when I go to the chatgpt website, I am using the datacenter campus computers that were previously used for training? (Again, please forgive my noobie questions.) For 2025, Abilene is building a 100,000-chip campus. This is plausibly around the same number of chips that were used to train the~3e26 FLOPs GPT4.5 at the Goodyear campus. However, the Goodyear campus was using H100 chips, but Abilene will be using Blackwell NVL72 chips. These improved chips means that for the same number of chips we can now train a 1e27 FLOPs model instead of just a 3e26 model. The chips can be built by summer 2025, and a new model trained by around end of year 2025.  1.5 years after the Blackwell chips, the new Rubin chip will arrive. The time is now currently ~2027.5.  Now a few things need to happen: 1. The revenue growth rate from 2024 to 2025 of 3x/year continues to hold. In that case, after 1.5 years, we can expect $60bn in revenue by 2027.5. 2. The 'raised money' : 'revenue' ratio of $30bn : $12bn in 2025 holds again. In that case we have $60bn x 2.5 = $150bn. 3. The decision would need to be made to purchase the $150 bn worth of Rubin chips (and Nvidia would need to be able to supply this.) More realistically, assuming (1) and (2) hold, it makes more sense to wait until the Rubin Ultra comes out to spend the $150bn on.  Or, some type of mixed buildout would occur, some of that $150bn in 2027.5 would use the Rubin non-Ultra to train a 2e28 FLOPs model, and the remainder would be used to build an even bigger model in 2028 that uses Rubin Ultra. 

GPT-4.5 might've been trained on 100K H100s of the Goodyear Microsoft site ($4-5bn, same as first phase of Colossus), about 3e26 FLOPs (though there are hints in the announcement video it could've been trained in FP8 and on compute from more than one location, which makes up to 1e27 FLOPs possible in principle).

Abilene site of Crusoe/Stargate/OpenAI will have 1 GW of Blackwell servers in 2026, about 6K-7K racks, possibly at $4M per rack all-in, for the total of $25-30bn, which they've already raised money for (mostly from SoftBank). They are projecting abo... (read more)
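
Cross-checking the Abilene figures (per-rack power and facility overhead are my assumptions; rack count and cost are from above):

```python
racks = 6_500                # midpoint of the 6K-7K range
cost_per_rack = 4e6          # all-in, figure quoted above
rack_power_kw = 130          # assumed ~GB200 NVL72 rack power
pue = 1.15                   # assumed facility overhead

print(f"capex: ~${racks * cost_per_rack / 1e9:.0f}bn")            # ~$26bn, in the $25-30bn range
print(f"power: ~{racks * rack_power_kw * pue / 1e6:.2f} GW")      # ~1 GW
```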

2SorenJ
Do I have the high level takeaways here correct? Forgive my use of the phrase "Training size," but I know very little about different chips, so I am trying to distill it down to simple numbers.
2024:
a) OpenAI revenue: $3.7 billion.
b) Training size: 3e26 to 1e27 FLOPs.
c) Training cost: $4-5 billion.
2025 Projections:
a) OpenAI revenue: $12 billion.
b) Training size: 5e27 FLOPs.
c) Training cost: $25-30 billion.
2026 Projections:
a) OpenAI revenue: ~$36 billion to $60 billion.
At this point I am confused: why are you saying Rubin arriving after Blackwell would make the revenue more like $60 billion? Again, I know very little about chips. Wouldn't the arrival of a different chip also change OpenAI's cost?
b) Training size: 5e28 FLOPs.
c) Training cost: $150 billion.
Assuming investors are willing to take the same ratio of revenue : training cost as before, this would predict $70 billion to $150 billion. In other words, to get to the $150 billion mark requires that Rubin arrives after Blackwell, OpenAI makes $60 billion in revenue, and investors take a 2.5 multiplier for $60 x 2.5 = $150 billion. Is there anything else that I missed?

Your point is one of the clues I mentioned that I don't see as comparably strong to the May 2023 paper, when it comes to prediction of loss/perplexity. The framing in your argument appeals to things other than the low-level metric of loss, so I opened my reply with focusing on it rather than the more nebulous things that are actually important in practice. Scaling laws work with loss the best (holding across many OOMs of compute), and repeating 3x rather than 7x (where loss first starts noticeably degrading) gives some margin of error. That is, a theoretic... (read more)

I meant "realiable agents" in the AI 2027 sense, that is something on the order of being sufficient for automated AI research, leading to much more revenue and investment in the lead-up rather than stalling at ~$100bn per individual training system for multiple years. My point is that it's not currently knowable if it happens imminently in 2026-2027 or at least a few years later, meaning I don't expect that evidence exists that distinguishes these possibilities even within the leading AI companies.

The reason Rubin NVL576 probably won't help as much as the current transition from Hopper is that Blackwell NVL72 is already ~sufficient for the model sizes that are compute optimal to train on $30bn Blackwell training systems (which Rubin NVL144 training systems probably won't significantly leapfrog before Rubin NVL576 comes out, unless there are reliable agents in 2026-2027 and funding goes through the roof).

when we get 576 (194 gpus)

The terminology Huang was advocating for at GTC 2025 (at 1:28:04) is to use "GPU" to refer to compute dies rather than... (read more)

The solution is increase in scale-up world size, but the "bug" I was talking about is in how it used to be too small for the sizes of LLMs that are compute optimal at the current level of training compute. With Blackwell NVL72, this is no longer the case, and shouldn't again become the case going forward. Even though there was a theoretical Hopper NVL256, for whatever reason in practice everyone ended up with only Hopper NVL8.

The size of the effect of insufficient world size[1] depends on the size of the model, and gets more severe for reasoning models on ... (read more)
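
To make the world-size point concrete, a sketch comparing HBM within one scale-up world against the weight memory of a large model (chip memory figures are approximate public specs; the model size is a hypothetical):

```python
hbm_per_gpu_gb = {"H100 NVL8": 80, "H200 NVL8": 141, "GB200 NVL72": 186}
gpus_per_world = {"H100 NVL8": 8, "H200 NVL8": 8, "GB200 NVL72": 72}

weights_tb = 2e12 * 1 / 1e12     # hypothetical ~2T-parameter MoE at 1 byte/param (FP8) -> ~2 TB

for name, hbm in hbm_per_gpu_gb.items():
    world_tb = hbm * gpus_per_world[name] / 1e3
    fits = "fits" if world_tb > weights_tb else "does not fit"
    print(f"{name}: {world_tb:4.1f} TB HBM per scale-up world "
          f"({fits} {weights_tb:.0f} TB of weights, before any KV cache)")
```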

The loss goes down; whether that helps in some more legible way that also happens to be impactful is much harder to figure out. The experiments in the May 2023 paper show that training on some dataset and training on a random quarter of that dataset repeated 4 times result in approximately the same loss (Figure 4). Even 15 repetitions remain useful, though at that point somewhat less useful than 15 times more unique data. There is also some sort of double descent where loss starts getting better again after hundreds of repetitions (Figure 9 in Appendix D).... (read more)

1Petropolitan
I'm afraid you might have missed the core thesis of my comment, so let me reword. I'm arguing one should not extrapolate findings from that paper to what Meta is training now.
The Llama 4 model card says the herd was trained on "[a] mix of publicly available, licensed data and information from Meta’s products and services. This includes publicly shared posts from Instagram and Facebook and people’s interactions with Meta AI": https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md
To use a term from information theory, these posts probably have much lower factual density than curated web text in C4. There's no public information on how fast the loss goes down even on the first epoch of this kind of data, let alone several ones.
I generated a slightly more structured write-up of my argument and edited it manually, hope it will be useful. Let's break down the extrapolation challenge:
* Scale Difference:
  * Muennighoff et al.: Studied unique data budgets up to 178 billion tokens and total processed tokens up to 900 billion. Their models were up to 9 billion parameters.
  * Llama 4 Behemoth: Reportedly trained on >30 trillion tokens (>30,000 billion). The model has 2 trillion total parameters (~288B active).
  * The Gap: We're talking about extrapolating findings from a regime with ~170x fewer unique tokens (comparing 178B to 30T) and models ~30x smaller (active params). While scaling laws can be powerful, extrapolating across 2 orders of magnitude in data scale carries inherent risk. New phenomena or different decay rates for repeated data could emerge.
* Data Composition and Quality:
  * Muennighoff et al.: Used C4 (filtered web crawl) and OSCAR (less filtered web crawl), plus Python code. They found filtering was more beneficial for the noisier OSCAR.
  * Llama 4 Behemoth: The >30T tokens includes a vast amount of web data, code, books, etc., but is also likely to contain a massive proportion of public Facebook and Instagram data.
  * The

I think Blackwell will change the sentiment by late 2025 compared to 2024, with a lot of apparent progress in capabilities and reduced prices (which the public will have a hard time correctly attributing to Blackwell). In 2026 there will be some Blackwell-trained models, using 2x-4x more compute than what we see today (or what we'll see more of in a few weeks to months with the added long reasoning option, such as GPT-4.5 with reasoning).

But then the possibilities for 2027 branch on whether there are reliable agents, which doesn't seem knowable either way ... (read more)

2Roman Leventov
Very reliable, long-horizon agency is already in the capability overhang of Gemini 2.5 pro, perhaps even the previous-tier models (gemini 2.0 exp, sonnet 3.5/3.7, gpt-4o, grok 3, deepseek r1, llama 4). It's just the matter of harness/agent-wrapping logic and inference-time compute budget. Agency engineering is currently in the brute-force stage. Agent engineers over rely on a "single LLM rollout" to be robust, but also often use LLM APIs that sometimes lack certain nitty-gritty affordances for implementing reliable agency, such as "N completions" with timely self-consistency pruning and perhaps scaling N up again when model's own uncertainty is up. This somewhat reminds me of the early LLM scale-up era where LLM engineers over relied on "stack more layers" without digging more into the architectural details. The best example is perhaps Megatron, a trillion-parameter model from 2021 whose performance is probably abysmal relative to the 2025 models of ~10B parameters (perhaps even 1B). So, the current agents (such as Cursor, Claude Code, Replit, Manus) are in the "Megatron era" of efficiency. In four years, even with the same raw LLM capability, agents will be very reliable. To give a more specific example when robustness is a matter of spending more on inference, let's consider Gemini 2.5 pro: contrary to the hype, it often misses crucial considerations or acts strangely stupidly on modestly sized contexts (less than 50k tokens). However, seeing these omissions, it's obvious to me that if someone applied ~1k token-sized chunks of that context to 2.5-pro's output and asked a smaller LLM (flash or flash lite) "did this part of the context properly informed that output", flash would answer No when 2.5-pro indeed missed something important from that part of the context. This should trigger a fallback on N-completions, 2.5 self-review with smaller pieces of the context, breaking down the context hierarchically, etc.
3Paragox
>Blackwell is a one-time thing that essentially fixes a bug in Ampere/Hopper design (in efficiency for LLM inference)
Wait, I feel I have my ear pretty close to the ground as far as hardware is concerned, and I don't know what you mean by this? Supporting 4-bit datatypes within tensor units seems unlikely to be the end of the road, as exponentiation seems most efficient at factor of 3 for many things, and presumably nets will find their eventual optimal equilibrium somewhere around 2 bits/parameter (explicit trinary seems too messy to retrofit on existing gpu paradigms). Was there some "bug" with the hardware scheduler or some low level memory system, or perhaps an issue with the sparsity implementation that I was unaware of? There were of course general refinements across the board for memory architecture, but nothing I'd consider groundbreaking enough to call it "fixing a bug". I reskimmed through hopper/blackwell whitepapers and LLM/DR queried and really not sure what you are referring to. If anything, there appear to be some rough edges introduced with NV-HBI and relying on a virtualized monolithic gpu in code vs the 2x die underlying. Or are you perhaps arguing that going MCM and beating the reticle limit was itself the one time thing?

A power seeker is ambitious without an ambition, which is not an implication of being agentic.

The announcement post says the following on the scale of Behemoth:

we focus on efficient model training by using FP8 precision, without sacrificing quality and ensuring high model FLOPs utilization—while pre-training our Llama 4 Behemoth model using FP8 and 32K GPUs, we achieved 390 TFLOPs/GPU. The overall data mixture for training consisted of more than 30 trillion tokens

This puts Llama 4 Behemoth at 5e25 FLOPs (30% more than Llama-3-405B), trained on 32K H100s (only 2x more than Llama-3-405B) instead of the 128K H100s (or in any case, 100K+) they shou... (read more)
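
The 5e25 figure is just the standard C ≈ 6·N·D estimate applied to the announced numbers (active parameter and token counts as quoted above and in the model card; Llama-3-405B figures are its public specs):

```python
# Behemoth: ~288B active parameters per token, ">30 trillion tokens" of training data.
print(f"Behemoth:     ~{6 * 288e9 * 30e12:.1e} FLOPs")      # ~5.2e25
# Llama-3-405B: dense 405B parameters, ~15.6T tokens.
print(f"Llama-3-405B: ~{6 * 405e9 * 15.6e12:.1e} FLOPs")    # ~3.8e25, i.e. ~35% less
```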

2Petropolitan
Muennighoff et al. (2023) studied data-constrained scaling on C4 up to 178B tokens while Meta presumably included all the public Facebook and Instagram posts and comments. Even ignoring the two OOM difference and the architectural dissimilarity (e. g., some experts might overfit earlier than the research on dense models suggests, perhaps routing should take that into account), common sense strongly suggests that training twice on, say, a Wikipedia paragraph must be much more useful than training twice on posts by Instagram models and especially comments under those (which are often as like as two peas in a pod).

For me a specific crux is scaling laws of R1-like training, what happens when you try to do much more of it, which inputs to this process become important constraints and how much they matter. This working out was extensively brandished but not yet described quantitatively, all the reproductions of long reasoning training only had one iteration on top of some pretrained model, even o3 isn't currently known to be based on the same pretrained model as o1.

The AI 2027 story heavily leans into RL training taking off promptly, and it's possible they are resonati... (read more)

Non-Google models of late 2027 use Nvidia Rubin, but not yet Rubin Ultra. Rubin NVL144 racks have the same number of compute dies and chips as Blackwell NVL72 racks (change in the name is purely a marketing thing, they now count dies instead of chips). The compute dies are already almost reticle sized, can't get bigger, but Rubin uses 3nm (~180M Tr/mm2) while Blackwell is 4nm (~130M Tr/mm2). So the number of transistors per rack goes up according to transistor density between 4nm and 3nm, by 1.4x, plus better energy efficiency enables higher clock speed, m... (read more)
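
Spelling out the per-rack arithmetic (the ~2x total appears in the reply's quote of this comment below; the residual is what clock speed and architecture would have to contribute):

```python
tr_per_mm2_blackwell = 130e6     # ~4nm-class density, from above
tr_per_mm2_rubin = 180e6         # ~3nm-class density, from above
die_count_ratio = 1.0            # Rubin NVL144 has the same compute die count as Blackwell NVL72

transistor_gain = tr_per_mm2_rubin / tr_per_mm2_blackwell * die_count_ratio
print(f"transistors per rack: ~{transistor_gain:.2f}x")                        # ~1.4x
print(f"clock/architecture uplift implied by a ~2x total: ~{2 / transistor_gain:.2f}x")
```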

romeo100

Thanks for the comment Vladimir! 

[...] for the total of 2x in performance.

I never got around to updating based on the GTC 2025 announcement but I do have the Blackwell to Rubin efficiency gain down as ~2.0x adjusted by die size so looks like we are in agreement there (though I attributed it a little differently based on information I could find at the time). 

So the first models will start being trained on Rubin no earlier than late 2026, much more likely only in 2027 [...]

Agreed! I have them coming into use in early 2027 in this chart.

This predic

... (read more)

Beliefs held by others are a real phenomenon, so tracking them doesn't give them unearned weight in attention, as long as they are not confused with someone else's beliefs. You can even learn things specifically for the purpose of changing their simulated mind rather than your own (in whatever direction the winds of evidence happen to blow).

The scale of training and R&D spending by AI companies can be reduced on short notice, while global inference buildout costs much more and needs years of use to pay for itself. So an AI slowdown mostly hurts clouds and makes compute cheap due to oversupply, which might be a wash for AI companies. Confusingly major AI companies are closely tied to cloud providers, but OpenAI is distancing itself from Microsoft, and Meta and xAI are not cloud providers, so wouldn't suffer as much. In any case the tech giants will survive, it's losing their favor that seems more likely to damage AI companies, making them no longer able to invest as much in R&D.

3Remmelt
This is a solid point that I forgot to take into account here.
What happens to GPU clusters inside the data centers built out before the market crash?
If user demand slips and/or various companies stop training, that means that compute prices will slump. As a result, cheap compute will be available for remaining R&D teams, for the three years at least that the GPUs last.
I find that concerning. Because not only is compute cheap, but many of the researchers left using that compute will have reached an understanding that scaling transformer architectures on internet-available data has become a dead end. With investor and managerial pressure to release LLM-based products gone, researchers will explore their own curiosities. This is the time you’d expect the persistent researchers to invent and tinker with new architectures – that could end up being more compute and data efficient at encoding functionality.
~ ~ ~
I don’t want to skip over your main point. Is your argument that AI companies will be protected from a crash, since their core infrastructure is already built?
Or more precisely:
* that since data centers were built out before the crash, compute prices end up converging on mostly just the cost of the energy and operations needed to run the GPU clusters inside,
* which in turn acts as a financial cushion for companies like OpenAI and Anthropic, for whom inference costs are now lower,
* where those companies can quickly scale back expensive training and R&D, while offering their existing products to remaining users at lower cost,
* as a result of which, those companies can continue to operate during the period that funding has dried up, waiting out the 'AI winter' until investors and consumers are willing to commit their money again.
That sounds right, given that compute accounts for over half of their costs. Particularly if the companies secure another large VC round ahead of a crash, then they should be able to weather the storm. E.g. the
1funnyfranco
Meditations on Moloch is an excellent piece - but it’s not the argument I’m making. Scott describes how competition leads to suboptimal outcomes, yes. But he stops at describing the problem. He doesn’t draw the specific conclusion that AGI alignment is structurally impossible because any attempt to slow down or “align” will be outcompeted by systems that don’t bother. He also doesn’t apply that conclusion to the AGI race with the same blunt finality I do: this ends in extinction, and it cannot be stopped. So unless you can point to the section where Scott actually follows the AGI race dynamics to the conclusion that alignment will be systematically optimised away - rather than just made “more difficult” - then no, that essay doesn’t make my argument. It covers part of the background context. That’s not the same thing. This kind of reply - “here’s a famous link that kind of gestures in the direction of what you’re talking about” - is exactly the vague dismissal I’ve been calling out. If my argument really has been made before, someone should be able to point to where it’s clearly laid out. So far, no one has. The sidestepping and lack of direct engagement in my arguments in this comment section alone has to be studied.

if we didn't have a capitalist system, then the entire point about profit motives, pride, and race dynamics wouldn't apply

Presence of many nations without a central authority still contributes to race dynamics.

2Rafael Harth
Yeah, valid correction.
2funnyfranco
Exactly. That’s the point I’ve been making - this isn’t about capitalism as an ideology, it’s about competition. Capitalism is just the most efficient competitive structure we’ve developed, so it accelerates the outcome. But any decentralised system with multiple actors racing for advantage - whether nation-states or corporations - will ultimately produce the same incentives. That’s the core of the argument.