All of Paragox's Comments + Replies

The system card also contains a juicy regression hidden within the worst-graph-of-all-time in the SWE-Lancer section:

If you can cobble together the will to work through the inane color scheme, it is very interesting to note that while the expected RL-able IC tasks show improvements, the Manager improvements are far less uniform, and in particular o1 (and 4o!) remains the stronger performer vs o3 when weighted by the (ahem, controversial) $$$-based benchmark. And this is all within the technical field of (essentially) system design, with verifiable answ... (read more)

Well yes, but that is just because they are whitelisting it to work with NVLink-72 switches. There is no reason a Hopper GPU could not interface with NVLink-72 if Nvidia didn't artificially limit it.

Additionally, by saying
>can't be repeated even with Rubin Ultra NVL576

I think they are indicating there is something else improving besides world-size increases, since they're saying this improvement won't recur even two GPU generations from now, when we get 576 GPUs (144 chips) worth of mono-addressable pooled VRAM and the giant world / model-head sizes that will enable.
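To make the pooled-VRAM jump concrete, here is a minimal back-of-envelope sketch; the per-package HBM capacities are illustrative assumptions, not official figures:

```python
# Rough sketch of how scale-up world size multiplies the HBM pool a single
# model instance can address. Per-package capacities are assumptions.

def pooled_hbm_tb(packages: int, hbm_gb_per_package: float) -> float:
    """Total HBM addressable across one NVLink scale-up domain, in TB."""
    return packages * hbm_gb_per_package / 1000

print(pooled_hbm_tb(8, 80))      # Hopper NVL8 node, ~80 GB/package   -> ~0.6 TB
print(pooled_hbm_tb(72, 192))    # Blackwell NVL72 rack, ~192 GB/pkg  -> ~13.8 TB
print(pooled_hbm_tb(144, 1024))  # Rubin Ultra NVL576, ~1 TB/package  -> ~147 TB
```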

4Vladimir_Nesov
The reason Rubin NVL576 probably won't help as much as the current transition from Hopper is that Blackwell NVL72 is already ~sufficient for the model sizes that are compute optimal to train on $30bn Blackwell training systems (which Rubin NVL144 training systems probably won't significantly leapfrog before Rubin NVL576 comes out, unless there are reliable agents in 2026-2027 and funding goes through the roof). The terminology Huang was advocating for at GTC 2025 (at 1:28:04) is to use "GPU" to refer to compute dies rather than chips/packages, and in these terms a Rubin NVL576 rack has 144 chips and 576 GPUs, rather than 144 GPUs. Even though this seems contentious, the terms compute die and chip/package remain less ambiguous than "GPU".
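For a sense of what "model sizes that are compute optimal" means in practice, a hedged Chinchilla-style sketch; the compute budget and the ~20 tokens/parameter ratio are assumptions for illustration, not figures from the comment:

```python
import math

# Chinchilla-style heuristic: training compute C ~ 6*N*D with D ~ 20*N tokens,
# so N_opt ~ sqrt(C / 120). C below is an assumed round number for a large
# Blackwell-era training run, not an actual budget.
def compute_optimal_params(total_flops: float, tokens_per_param: float = 20.0) -> float:
    return math.sqrt(total_flops / (6 * tokens_per_param))

C = 5e27
N = compute_optimal_params(C)
print(f"~{N / 1e12:.1f}T params trained on ~{20 * N / 1e12:.0f}T tokens")
```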

>Blackwell is a one-time thing that essentially fixes a bug in Ampere/Hopper design (in efficiency for LLM inference)

Wait, I feel I have my ear pretty close to the ground as far as hardware is concerned, and yet I don't know what you mean by this?

Supporting 4-bit datatypes within tensor units seems unlikely to be the end of the road, as exponentiation seems most efficient at a factor of 3 for many things, and presumably nets will find their eventual optimal equilibrium somewhere around 2 bits/parameter (explicit ternary seems too messy to retrofit on existing... (read more)
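As a rough sketch of why bits/parameter matters downstream, here are weight footprints at a few quantization levels, using the ~1.8 trillion parameter figure cited later in this thread as an assumed model size (ignoring activations, KV-cache, and the scale/zero-point overhead real low-bit formats carry):

```python
def weight_tb(params: float, bits_per_param: float) -> float:
    """Weight storage in terabytes at a given precision."""
    return params * bits_per_param / 8 / 1e12

for bits in (16, 8, 4, 2):
    print(f"{bits:>2} bits/param: {weight_tb(1.8e12, bits):.2f} TB of weights")
# 16 -> 3.60 TB, 8 -> 1.80 TB, 4 -> 0.90 TB, 2 -> 0.45 TB
```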

5Vladimir_Nesov
The solution is an increase in scale-up world size, but the "bug" I was talking about is in how it used to be too small for the sizes of LLMs that are compute optimal at the current level of training compute. With Blackwell NVL72, this is no longer the case, and shouldn't again become the case going forward. Even though there was a theoretical Hopper NVL256, for whatever reason in practice everyone ended up with only Hopper NVL8.

The size of the effect of insufficient world size[1] depends on the size of the model, and gets more severe for reasoning models on long context, where with this year's models each request would want to ask the system to generate (decode) on the order of 50K tokens while needing to maintain access to on the order of 100K tokens of KV-cache per trace. This might be the reason Hopper NVL256 never shipped, as this use case wasn't really present in 2022-2024, but in 2025 it's critically important, and so the incoming Blackwell NVL72/NVL36 systems will have a large impact.

(There are two main things a large world size helps with: it makes more HBM for KV-cache available, and it enables more aggressive tensor parallelism. When generating a token, the data for all previous tokens (KV-cache) needs to be available to process the attention blocks, and tokens for a given trace need to be generated sequentially, one at a time (or something like 1-4 at a time with speculative decoding). Generating one token only needs a little bit of compute, so it would be best to generate tokens for many traces at once, one for each, using more compute across these many tokens. But for this to work, all the KV-caches for all these traces need to sit in HBM. If the system runs out of memory, it needs to constrain the number of traces it'll process within a single batch, which means the cost per trace (and per generated token) goes up, since the cost to use the system's time is the same regardless of what it's doing. Tensor parallelism lets matrix multiplications g…
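A back-of-envelope sketch of the constraint described above; the model dimensions, cache precision, and HBM capacities are assumptions for illustration, not the actual architecture of any deployed system:

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """KV-cache for one trace: keys and values, per layer, per KV head."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value / 1e9

per_trace = kv_cache_gb(tokens=100_000, layers=90, kv_heads=8, head_dim=128)
print(f"~{per_trace:.0f} GB of KV-cache per trace")   # ~37 GB

# Assume roughly half the pooled HBM is left for KV-cache after weights.
for domain_hbm_tb, label in ((0.64, "Hopper NVL8"), (13.8, "Blackwell NVL72")):
    usable_gb = 0.5 * domain_hbm_tb * 1000
    print(f"{label}: room for ~{usable_gb / per_trace:.0f} concurrent traces")
```

Under these assumed numbers the NVL8 domain fits single-digit long-context traces per batch while the NVL72 domain fits on the order of a couple hundred, which is the cost-per-token effect being described.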
1anaguma
My guess is that he’s referring to the fact that Blackwell offers much larger world sizes than Hopper and this makes LLM training/inference more efficient. Semianalysis has argued something similar here: https://semianalysis.com/2024/12/25/nvidias-christmas-present-gb300-b300-reasoning-inference-amazon-memory-supply-chain

This is the most moving piece I've read since Situational Awareness. Bravo! Emotionally, I was particularly moved by the final two sentences in the "race" ending -- hats off to that bittersweet piece of prose. Materially, this is my favorite holistic amalgamation of competently weighted data sources and arguments woven into a cohesive narrative, and it personally has me meaningfully reconsidering some of the more bearish points in my model (like the tractability of RL on non-verifiable tasks: Gwern etc. have made individual points, but something about this co... (read more)

How do you square these claims with the GLP-1 drug consumption you've mentioned in a previous post? I'd wager that powers the Pareto majority of your perceived ease of leanness vs. the average population, and that you are somewhat pointlessly sacrificing the virtues of variety in reverse-Pareto fashion.

0sapphire
I haven't taken any in a long time. When I quit, my body fat was much higher. GLP-1 dosage ramps up a lot for many people; it definitely did for me. It's not exactly cheap, and if supply is disrupted you might not even be able to get it at current prices. 2.5mg a week of a GLP-1 agonist is not something I want to pay for, so I decided I didn't want to depend on it and quit.

For funding timelines, I think the main question increasingly becomes: how much of the economic pie could be eaten by narrowly superhuman AI tooling? It doesn't take hitting an infinity/singularity/fast takeoff for plausible scenarios under this bearish reality to nevertheless squirm through the economy at Cowen-approved diffusion rates and gradually eat insane $$$ worth of value, and therefore prop up 100b+ buildouts. OAI's latest sponsored psyop leak today seems right in line with bullet point numero uno under real world predictions, that they a... (read more)

9Vladimir_Nesov
That's why I used the "no new commercial breakthroughs" clause: $300bn training systems by 2029 seem in principle possible both technically and financially without an intelligence explosion, just not with the capabilities legibly demonstrated so far.

On the other hand, pre-training as we know it will end[1] in any case soon thereafter, because at ~current pace a 2034 training system would need to cost $15 trillion (it's unclear if manufacturing can be scaled at this pace, and also what to do with that much compute, because there isn't nearly enough text data, but maybe pre-training on all the video will be important for robotics). How far RL scales remains unclear, and even at the very first step of scaling, o3 doesn't work as clear evidence because it's still unknown whether it's based on GPT-4o or GPT-4.5 (it'll become clearer once there's an API price and more apples-to-apples speed measurements).

----------------------------------------

1. This is of course a quote from Sutskever's talk. It was widely interpreted as saying it has just ended, in 2024-2025, but he never put a date on it. I don't think it will end before 2027-2028. ↩︎
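As a quick check of the pace implied by those two figures, taking the comment's $300bn-in-2029 and $15tn-in-2034 numbers at face value:

```python
# Implied annual growth in frontier training-system cost between the two
# data points given in the comment above.
start_cost, end_cost, years = 300e9, 15e12, 2034 - 2029
annual_growth = (end_cost / start_cost) ** (1 / years)
print(f"~{annual_growth:.2f}x per year")   # ~2.19x per year, i.e. ~50x over 5 years
```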

My other comment was bearish, but in the bullish direction, I'm surprised Zvi didn't include any of Gwern's threads, like this or this, which, apropos of Karpathy's blind test, I think have been the clearest examples of superior "taste" or quality from 4.5, and which actually swapped my preferences on 4.5 vs 4o when I looked closer.

As text prediction becomes ever more superhuman, I would actually expect improvements in many domains to become increasingly non-salient, as it takes ever-increasing thoughtfulness and language nuance to appreciate the gains.

But b... (read more)

Is this actually the case? Not explicitly disagreeing, but I just want to point out that there is still a niche community that prefers using the oldest available 0314 gpt-4 checkpoint via the API, which, by the way, is still almost the same price as 4.5 (hardware improvements notwithstanding) and is pretty much the only way to still get access to a model that presumably makes use of the full ~1.8 trillion parameters 4th-gen gpt was trained with.

Speaking of conflation, you see it everywhere in papers: somehow most people now entirely conflate gpt-4 with gpt-4 turbo, w... (read more)

Great observation, but I will note that OAI indicates the (hidden) CoT tokens are discarded between each new prompt on the o1 APIs, and my impression from hours of interacting with the ChatGPT version vs. the API is that it likely retains this API behavior. In other words, the "depth" of the search appears to be reset with each prompt, if we assume the model hasn't learned meaningfully improved CoT from the standard non-RLed + non-hidden tokens.

So I think it might be inaccurate to describe it as "investing 140s of search", or rather the implication ... (read more)
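A conceptual sketch of the behavior being described here; this is not the actual OpenAI client API, just an illustration of which tokens re-enter the context on the next turn:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    user: str
    hidden_cot: str        # reasoning tokens: used for this turn, then discarded
    visible_answer: str

@dataclass
class Conversation:
    turns: list[Turn] = field(default_factory=list)

    def next_context(self, new_prompt: str) -> list[str]:
        """What the model conditions on next turn: only the visible messages."""
        context: list[str] = []
        for t in self.turns:
            context += [t.user, t.visible_answer]   # t.hidden_cot is never re-sent
        return context + [new_prompt]
```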

5gwern
I don't think it is inaccurate. If anything, starting each new turn with a clean scratchpad enforces depth as it can't backtrack easily (if at all) to the 2 earlier versions. We move deeper into the S-poem game tree and resume search there. It is similar to the standard trick with MCTS of preserving the game tree between each move, and simply lopping off all of the non-chosen action nodes and resuming from there, helping amortize the cost of previous search if it successfully allocated most of its compute to the winning choice (except in this case the 'move' is a whole poem). Also a standard trick with MCMC: save the final values, and initialize the next run from there. This would be particularly clear if it searched for a fixed time/compute-budget: if you fed in increasingly correct S-poems, it obviously can search deeper into the S-poem tree each time as it skips all of the earlier worse versions found by the shallower searches.
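For readers unfamiliar with the MCTS trick being referenced, a minimal sketch of subtree reuse between moves (a hypothetical Node structure, not any particular library):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    visits: int = 0
    value: float = 0.0
    children: dict = field(default_factory=dict)   # move -> Node

def advance_root(root: Node, chosen_move) -> Node:
    """Commit to a move: the chosen child's subtree (with its accumulated
    visit/value statistics) becomes the new root, the non-chosen siblings are
    dropped, and the next search resumes warm instead of from scratch."""
    return root.children.get(chosen_move, Node())  # fresh node if never expanded
```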