Vladimir_Nesov
Posts (sorted by new)

Musings on Reported Cost of Compute (Oct 2025) · 103 points · 25d · 11 comments
Permanent Disempowerment is the Baseline · 80 points · 3mo · 23 comments
Low P(x-risk) as the Bailey for Low P(doom) · 50 points · 4mo · 29 comments
Musings on AI Companies of 2025-2026 (Jun 2025) · 66 points · 5mo · 4 comments
Levels of Doom: Eutopia, Disempowerment, Extinction · 34 points · 5mo · 1 comment
Slowdown After 2028: Compute, RLVR Uncertainty, MoE Data Wall · 196 points · 7mo · 25 comments
Short Timelines Don't Devalue Long Horizon Research · 176 points · 7mo · 24 comments
Technical Claims · 19 points · 8mo · 0 comments
What o3 Becomes by 2028 · 151 points · 11mo · 15 comments
Musings on Text Data Wall (Oct 2024) · 41 points · 1y · 2 comments
Vladimir_Nesov's Shortform · 10 points · 1y · 146 comments

Comments (sorted by newest)

KAP's Shortform
Vladimir_Nesov · 3h

Unipolarity is about the characteristic time to takeover vs. the time to emergence of worthy rivals. Currently multiple AI companies are robustly within months of each other in capabilities. So an AI can only be in a unipolar situation if it can disarm the other AI companies before they get similarly capable AIs, that is, within months. Superpersuasion might be too slow for that on its own (unless it also manages to manipulate the relevant governments), though it could be a step in a larger plan that escalates to something else.

I think superpersuasion (even in milder senses) would in principle be sufficient for takeover on its own if there was enough time, because it could direct the world towards a gradual disempowerment path. Since there isn't enough time, there needs to be a second step that enables a faster takeover to preserve unipolarity, and superpersuasion would still be helpful in getting its creator AI company to play along with the second step. But the issue with many possibilities for this second step is that the AI doesn't necessarily have the option of recursive self-improvement to advance its own capabilities, because the AI might be unable to quickly develop smarter AIs that are aligned with it.

KAP's Shortform
Vladimir_Nesov · 5h

AI is not one agent (at least before the dust settles): both human developers and self-improvement create new agents that could be misaligned with existing AIs. The issue of misaligned AIs is urgent for existing AIs, and soft takeovers of gradual disempowerment (where superpersuasion might play a role) are likely too slow. But recursive self-improvement isn't necessarily useful for AIs in resolving this problem quickly, if alignment is hard. This motivates a quick takeover without superintelligence.

KAP's Shortform
Vladimir_Nesov · 1d

> aligned with, say, the bay area intellectual's worldview, then it may seem like a tyrant to other people

Unless "bay area intellectual's worldview" itself respects human self-determination. Even if respect for autonomy could be sufficient almost on its own in some ways, it might also turn out to be a major aspect of most other reasonable alignment targets.

Diagonalization: A (slightly) more rigorous model of paranoia
Vladimir_Nesov · 2d

> I am more talking about the broader phenomenon of "simulating other agents adversarially in order to circumvent their predictions"

The idea of "simulating adversarially" might be a bit confusing in the context of diagonalization, since it's the diagonalization that is adversarial, not the simulation. In particular, you'd want mutual simulation (or rather more abstract reasoning) for coordination. If you merely succeed in acting contrary to a prediction, making the prediction wrong, that's not diagonalization. What diagonalization does is make the prediction not-happen in the first place (or in the case of putting a credence on something, for the credence to remain at some weaker prior). So diagonalization is something done against a predictor whose prediction is targeted, rather than something done by the predictor. A diagonalizer might itself want to be a predictor, but that is not necessary if the prediction is just given to it.

julius vidal's Shortform
Vladimir_Nesov · 2d

> Instead of trying to present any kind of utopian vision of the benefits of AI, someone at Anthropic decided to sell us the image of an internet dominated by endless cyberwar trapped in a perverse feedback loop in escalating speed and incomprehensibility.

Good. If this is what the authors believe the future holds, it's much better that they say it than search for a rosy-sounding justification.

Diagonalization: A (slightly) more rigorous model of paranoia
Vladimir_Nesov · 2d

> We can proof by contradiction that if one agent is capable of predicting another agent, the other agent cannot in turn do the same.

Only if one of them is diagonalizing the other (acting contrary to what the other would've predicted about its actions). If this isn't happening, maybe there is no problem.

For example, the halting problem is unsolvable because you are asking for a single predictor that predicts the behavior of every program, and among all programs there is at least one (easy to construct) that diagonalizes the predictor's prediction of its behavior: it predicts the predictor and does the opposite, acting contrary to whatever the predictor would've predicted. But proving that a specific program halts (or doesn't) is often possible; that's not the halting problem.
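To make the construction concrete, here is a minimal sketch in Python (not from the original comment): `halts` is a hypothetical oracle standing in for the claimed predictor, and the diagonalizer consults it about its own behavior and then does the opposite.

```python
def halts(program_source: str, input_data: str) -> bool:
    """Hypothetical halting oracle (assumed, not implementable in general):
    would return True iff running program_source on input_data eventually halts."""
    raise NotImplementedError("no total halting predictor exists")


def diagonalizer(own_source: str) -> None:
    """Acts contrary to whatever the oracle predicts about this very program."""
    if halts(own_source, own_source):
        while True:      # predicted to halt, so loop forever instead
            pass
    else:
        return           # predicted to loop, so halt immediately
```

Feeding `diagonalizer` its own source contradicts whatever `halts` says about it, so no total `halts` can be correct on every program, even though many specific programs can still be proven to halt or not.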

> If the smaller agent was also perfectly predicting the bigger agent, then the bigger agent couldn't be perfectly predicting the smaller agent, as doing so would trigger an infinite regress

There is no infinite regress, and probably no useful ordering of agents/programs by how big they are in this way. It's perfectly possible for agents to reason about each other, including about their predictions about themselves or each other. And where there is diagonalization, it doesn't exactly say which agent was bigger (an agent can even diagonalize itself, to make its own actions unpredictable to itself).

See for example the ASP problem, a variant of Newcomb's problem where the predictor is "smaller" and predictable by stipulation (rather than an all-powerful Omega), and so the "bigger" box-choosing agent needs to avoid any sudden movements in its thoughts in order to remain predictable and get the big box filled by the predictor.

Maybe quines can illustrate how there is no by-default infinite regress. You can write a program in Python that prints a program in Java that in turn prints the original program in Python. Neither of the programs is "bigger" than the other.
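A minimal single-language sketch of the same point (Python for both programs, rather than the Python/Java pair described above, and not taken from the comment): ignoring comments, each of these two three-line programs prints the other's code exactly, and neither contains the other the way a bigger thing contains a smaller one.

```python
# Program A: running it prints the three code lines of program B.
x = 'A'
t = 'x = %r\nt = %r\nprint(t %% (("B" if x == "A" else "A"), t))'
print(t % (("B" if x == "A" else "A"), t))

# Program B is A's output, identical except for its first line (x = 'B');
# running B prints the code of A again, closing the two-step cycle.
```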

> When a larger agent contains a smaller agent this way, the smaller agent can simply be treated like any other part of the environment. If you want to achieve a goal, you simply figure what action of yours produces the best outcome, including the reaction from the smaller agent.

Other than blinding itself to the bigger agent's actions, there might be alternative, safer ways of observing the bigger agent: reasoning about it rather than directly observing what it actually does. Even a "big" agent doesn't contain or control all the reasoning about it; a theory of an agent is bigger than the agent itself, and others can pick and choose what to reason about. Also, self-contained reasoning that produces some conclusion can itself make use of observations of the "big" agent, as long as the observations are not used for anything else. So it's not even necessarily about blinding, but rather about compartmentalized reasoning, where the observations (tainted data) don't get indiscriminate influence but can still be carefully used to learn things.

> Ok, but please, does anyone have a suggestion for a better term than "diagonalization"?

It's from Cantor's diagonal argument. See also the diagonal argument more generally, and Lawvere's fixed point theorem. It's just this: you construct an endomap without fixed points, and that breaks stuff. This works as well for maps that are defined/enacted by agents in their behavior, mapping beliefs/observations to actions; you just need to close the loop so that beliefs/observations start talking about the same things as the actions.
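For reference, a compressed version of that construction (standard textbook material, not from the comment), with Lawvere's theorem stated as the positive form of the same diagonal trick:

```latex
Given $f : X \to \mathcal{P}(X)$, define the diagonal set
\[
  D = \{\, x \in X : x \notin f(x) \,\}.
\]
If $D = f(x_0)$ for some $x_0$, then $x_0 \in D \iff x_0 \notin f(x_0) = D$,
a contradiction, so $f$ is not surjective. Lawvere's fixed point theorem is the
positive form: if $e : A \to B^A$ is point-surjective, then every $g : B \to B$
has a fixed point, since picking $a$ with $e(a) = \big(x \mapsto g(e(x)(x))\big)$
gives $e(a)(a) = g(e(a)(a))$. An endomap $g$ with no fixed point (negation on
truth values, in Cantor's case) therefore rules out any such surjection.
```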

Don't use the phrase "human values"
Vladimir_Nesov · 3d

Stating things explicitly is a tradeoff that must be decided by success or failure in conveying the intended point, not by stricture of form.

By "human values" being distinct from arbitrary values I simply mean that anything called "human values" is less likely to be literal paperclipping than values-in-general, it's suggesting a distribution over values that's human-specific in some way. By "preferences" also gesturing at their further development on reflection I'm pointing out that this is a strong possibility for what the term might mean, so unless a clarification rules it out, it remains a possible intended meaning. (More specifically, I meant the whole process of potential ways of developing values/preferences, not some good-enough end-point, so not just thinking for many hours, but also not disregarding current wishes/wants/beliefs, as they too are part of this process.)

Don't use the phrase "human values"
Vladimir_Nesov · 3d

I think "someone's preferences", or "moral goodness" are approximately the same as "human values" in meaning and ambiguity unless clarified, and the clarifications would work similarly well or poorly for either of them. What "human values" gesture at is distinction from values-in-general, while "preferences" might be about arbitrary values. Taking current wishes/wants/beliefs as the meaning of "preferences" or "values" (denying further development of values/preferences as part of the concept) is similarly misleading as taking "moral goodness" as meaning anything in particular that's currently legible, because the things that are currently legible are not where potential development of values/preferences would end up in the limit.

"But You'd Like To Feel Companionate Love, Right? ... Right?"
Vladimir_Nesov · 3d

"human values" being maximized by a Singleton forever would importantly fall short of my ideal future

I would expect that letting (other) people define themselves is part of "human values", and so maximizing the influence of such values on the world would let decisions of individual existing people screen off the Singleton's decisions, at least when it comes to their own development. Any decision of a Singleton about how a person's thinking and values should be developing is not legitimate to that person's values if it doesn't ultimately follow that person's own decisions in some way. Values don't define preference over just the end states of a world; they define how the initial conditions that are already in place should develop, and existing people are part of the initial conditions.

This works even if a Singleton literally writes down all of the future, including people and their thoughts, in the same way as this goes with physics writing down the future. Decisions of people embedded in a Singleton can still remain their own, the same as with people embedded in physics; it's just another setting for making decisions within a lawful environment/substrate.

"But You'd Like To Feel Companionate Love, Right? ... Right?"
Vladimir_Nesov · 3d

One step further: why treat any feelings/emotions you do have as your own values? Maybe they gesture at something you endorse, maybe they don't, but they certainly shouldn't suffice by themselves. Even though it's something happening in your own brain, it's still an external influence until you accept it as a part of you, and even then you might change your mind at some point.

Wikitag Contributions

Well-being · 2 months ago · (+58/-116)
Sycophancy · 2 months ago · (-231)
Quantilization · 2 years ago · (+13/-12)
Bayesianism · 3 years ago · (+1/-2)
Bayesianism · 3 years ago · (+7/-9)
Embedded Agency · 3 years ago · (-630)
Conservation of Expected Evidence · 4 years ago · (+21/-31)
Conservation of Expected Evidence · 4 years ago · (+47/-47)
Ivermectin (drug) · 4 years ago · (+5/-4)
Correspondence Bias · 4 years ago · (+35/-36)