Well yes, but that is just because they are whitelisting it to work with NVLink-72 switches. There is no reason a Hopper GPU could not interface with NVLink-72 if Nvidia didn't artificially limit it.
Additionally, by saying
>can't be repeated even with Rubin Ultra NVL576
I think they are indicating there is something else improving besides world-size increases, as this improvement supposedly would not recur even two GPU generations from now, when we get 576 dies' (144 GPU packages') worth of mono-addressable pooled VRAM, and the giant world / model-head sizes it will enable.
>Blackwell is a one-time thing that essentially fixes a bug in Ampere/Hopper design (in efficiency for LLM inference)
Wait, I feel I have my ear pretty close to the ground as far as hardware is concerned, and I don't know what you mean by this?
Supporting 4-bit datatypes within tensor units seems unlikely to be the end of the road, as number representation seems most efficient at a radix of 3 for many things, and presumably nets will find their eventual optimal equilibrium somewhere around 2 bits/parameter (explicit ternary seems too messy to retrofit on existing...
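If I'm reading the "radix of 3" point right, it's the classic radix-economy argument; a quick sketch of the arithmetic:

```python
import math

# Radix economy: the hardware cost of representing a large N in base b
# scales as (states per digit) * (digits needed) = b * log_b(N)
#                                                = (b / ln b) * ln N.
# The base-dependent factor b / ln b is minimized at b = e ~ 2.718,
# so 3 is the most economical integer base (and 4 exactly ties 2).
for b in (2, 3, 4, 8):
    print(f"base {b}: cost factor {b / math.log(b):.3f}")
```

That b = 3 optimum is also where the ~1.58 bits/parameter (= log2 3) of the ternary-weights literature comes from, which is roughly consistent with an equilibrium near 2 bits/parameter on binary-addressable hardware.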
This is the most moving piece I've read since Situational Awareness. Bravo! Emotionally, I was particularly moved by the final two sentences in the "race" ending -- hats off to that bittersweet piece of prose. Materially, this is my favorite holistic amalgamation of competently weighted data sources and arguments woven into a cohesive narrative, and it personally has me meaningfully reconsidering some of the more bearish points in my model (like the tractability of RL on non-verifiable domains: Gwern et al. have made individual points, but something about this co...
How do you square these claims with the GLP-1 drug use you mentioned in a previous post? I'd wager that powers the Pareto majority of your perceived ease of leanness vs. the average population, and that you are somewhat pointlessly sacrificing the virtues of variety in reverse-Pareto fashion.
For funding timelines, I think the main question increasingly becomes: how much of the economic pie could be eaten by narrowly superhuman AI tooling? It doesn't take hitting an infinity/singularity/fast takeoff for plausible scenarios under this bearish reality to nevertheless squirm through the economy at Cowen-approved diffusion rates and gradually eat insane $$$ worth of value, and therefore prop up $100B+ buildouts. OAI's latest sponsored psyop leak today seems right in line with bullet point numero uno under real-world predictions, that they a...
My other comment was bearish, but in the bullish direction, I'm surprised Zvi didn't include any of Gwern's threads, like this or this, which, apropos of Karpathy's blind test, I think have been the clearest examples of superior "taste" or quality from 4.5, and which actually swapped my preference on 4.5 vs. 4o when I looked closer.
As text prediction becomes ever more superhuman, I would actually expect improvements in many domains to become increasingly non-salient, as it takes ever-increasing thoughtfulness and nuance with language to appreciate the gains.
But b...
Is this actually the case? Not explicitly disagreeing, but I just want to point out there is still a niche community that prefers using the oldest available 0314 GPT-4 checkpoint via the API, which, by the way, is still almost the same price as 4.5, hardware improvements notwithstanding, and is pretty much the only way to still get access to a model that presumably makes use of the full ~1.8 trillion parameters the fourth-generation GPT was trained with.
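For anyone who wants to poke at it themselves, a minimal sketch using the OpenAI Python client (assuming the dated snapshot is still being served to your account):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Requesting the dated snapshot rather than the bare "gpt-4" alias is
# what pins you to the original March 2023 checkpoint instead of the
# later turbo-family replacements.
resp = client.chat.completions.create(
    model="gpt-4-0314",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```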
Speaking of conflation, you see it everywhere in papers: somehow most people now entirely conflate GPT-4 with GPT-4 Turbo, w...
Great observation, but I will note that OAI indicates the (hidden) CoT tokens are discarded between each new prompt on the o1 APIs, and my impression from hours of interacting with the ChatGPT version vs. the API is that ChatGPT likely retains this behavior. In other words, the "depth" of the search appears to reset with each prompt, assuming the model hasn't learned meaningfully improved CoT from the standard non-RLed, non-hidden tokens.
So I think it might be inaccurate to consider it as "investing 140s of search", or rather the implication ...
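To make the reset concrete, here is a minimal sketch of what I understand multi-turn o1 usage to look like over the chat completions API (the `reasoning_tokens` usage field is my reading of the docs; treat this as illustrative, not gospel):

```python
from openai import OpenAI

client = OpenAI()
history = []  # only the *visible* messages are ever carried forward

for prompt in ["Hard question...", "Follow-up..."]:
    history.append({"role": "user", "content": prompt})
    resp = client.chat.completions.create(model="o1", messages=history)
    history.append({"role": "assistant",
                    "content": resp.choices[0].message.content})
    # The hidden CoT is billed but never re-enters `history`, so each
    # turn's "search" starts from scratch:
    print(resp.usage.completion_tokens_details.reasoning_tokens)
```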
The system card also contains a juicy regression hidden within the worst-graph-of-all-time in the SWE-Lancer section:
If you can cobble together the will to work through the inane color scheme, it is very interesting to note that while the expected RL-able IC tasks show improvements, the Manager improvements are far less uniform; in particular, o1 (and 4o!) remains the stronger performer vs. o3 when weighted by the (ahem, controversial) $$$-based benchmark. And this is all within the technical field of (essentially) system design, with verifiable answ...