All of Fergus Argyll's Comments + Replies

I think he was saying:

By the time the new chip is ready, that will be 1.5 years later, which implies ~5x growth if we assume 3x per year. So, by the time OpenBrain is ready to build the next datacenter, we're in mid-to-late 2026 instead of the beginning of 2026.
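For what it's worth, the 5x just comes from compounding the assumed 3x/year growth over 1.5 years (the smooth-compounding assumption is mine, not spelled out in the original):

```python
# Compounding the assumed 3x/year growth over the ~1.5 years until the next chip generation.
print(3 ** 1.5)  # ≈ 5.2, i.e. the "5x growth" figure
```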

Aside from that, the idea that investment will scale proportionally seems like a huge leap of faith. If the next training run does not deliver the goods, there is no way SoftBank et al. pour in $100B.

I have a question that I didn't see anyone ask, but I don't frequent this site enough to know if it was mentioned somewhere.

Are we sure there will be a 2-OOMs-bigger training run at all?

After the disappointment that was GPT-4.5, will investors give them the $100B (according to TFA) they need for that? In general, I'd like to see more discussion about the financial side of the AGI race. How will OpenBrain get the funding to train Agent-4?

I've been looking for markets on Manifold to bet on this and I couldn't find a good one. I would bet we don't get a 2-OOMs-Big...

Vladimir_Nesov
GPT-4.5 might've been trained on 100K H100s at the Goodyear Microsoft site ($4-5bn, same as the first phase of Colossus), about 3e26 FLOPs (though there are hints in the announcement video that it could've been trained in FP8 and on compute from more than one location, which makes up to 1e27 FLOPs possible in principle).

The Abilene site of Crusoe/Stargate/OpenAI will have 1 GW of Blackwell servers in 2026, about 6K-7K racks, possibly at $4M per rack all-in, for a total of $25-30bn, which they've already raised money for (mostly from SoftBank). They are projecting about $12bn in revenue for 2025. If used as a single training system, it's enough to train models for 5e27 BF16 FLOPs (or 1e28 FP8 FLOPs).

The AI 2027 timeline assumes reliable agentic models work out, so revenue continues scaling, with the baseline guess of 3x per year. If Rubin NVL144 arrives 1.5 years after Blackwell NVL72, that's about a 5x increase in expected revenue. If that somehow translates into proportional investment in datacenter construction, that might be enough to buy $150bn worth of Rubin NVL144 racks, say at $5M per rack all-in, which is 30K racks and 5 GW. Compared to Blackwell NVL72, that's 2x more BF16 compute per rack (and 3.3x more FP8 compute). This makes the Rubin datacenter of early 2027 sufficient to train a 5e28 BF16 FLOPs model (or 1.5e29 FP8 FLOPs) later in 2027, which is a bit more than 100x the estimate for GPT-4.5.

(I think this is borderline implausible technologically if only the AI company believes in the aggressive timeline in advance, and ramping Rubin to 30K racks for a single company will take more time. Getting 0.5-2 GW of Rubin racks by early 2027 seems more likely. Using Blackwell at that time means ~2x lower performance for the same money, undercutting the amount of compute that will be available in 2027-2028 in the absence of an intelligence explosion, but at least it's something money will be able to buy. And of course this still hinges on the revenue actually continuing
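A quick back-of-the-envelope script redoing the arithmetic in the comment above, using only its stated assumptions (the rack counts, per-rack prices, and FLOPs figures are all that comment's guesses, not independently sourced; the variable names are just illustrative):

```python
# Back-of-the-envelope recomputation of the figures in the parent comment.
# All inputs are the parent comment's stated assumptions, not independently verified.

# GPT-4.5 reference point (parent's estimate)
gpt45_flops_bf16 = 3e26

# Abilene 2026: ~1 GW of Blackwell NVL72
blackwell_racks = 6_500              # "about 6K-7K racks"
blackwell_cost_per_rack = 4e6        # "$4M per rack all-in"
blackwell_capex = blackwell_racks * blackwell_cost_per_rack   # ~$26bn, matching "$25-30bn"
blackwell_train_bf16 = 5e27          # parent's estimate as a single training system

# Revenue-driven scale-up: 3x/year compounded over ~1.5 years until Rubin NVL144
revenue_growth = 3.0 ** 1.5                       # ~5.2x, the "about 5x" figure
rubin_capex = blackwell_capex * revenue_growth    # ~$150bn if investment scales with revenue
rubin_cost_per_rack = 5e6                         # "$5M per rack all-in"
rubin_racks = rubin_capex / rubin_cost_per_rack   # ~30K racks

# Rubin NVL144: ~2x the BF16 compute per rack of Blackwell NVL72
rubin_train_bf16 = blackwell_train_bf16 * (rubin_racks / blackwell_racks) * 2

print(f"Blackwell capex: ${blackwell_capex/1e9:.0f}bn for {blackwell_racks:,} racks")
print(f"Rubin capex: ${rubin_capex/1e9:.0f}bn for {rubin_racks:,.0f} racks")
print(f"Rubin training run: {rubin_train_bf16:.1e} BF16 FLOPs "
      f"(~{rubin_train_bf16/gpt45_flops_bf16:.0f}x the GPT-4.5 estimate)")
```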

If I understand correctly (I very well might not), a "one-bit LLM" has to be trained as a "one-bit LLM" in order to then run inference on it as one. I.e., this isn't a new quantization scheme.

So I think training and inference are tied together here, meaning that if this replicates, works, etc., we will probably have new hardware for both stages.
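My rough mental model of why they're tied together, assuming a BitNet-style setup: the ternary constraint is imposed on the weights during training via a straight-through estimator, so an ordinary full-precision model can't simply be converted afterwards. A minimal sketch of that idea (not the actual BitNet b1.58 implementation; the layer name and details are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinear(nn.Module):
    """Linear layer whose weights are quantized to {-1, 0, +1} (times a scale) in the
    forward pass, while gradients flow to the latent full-precision weights
    via a straight-through estimator."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-8)              # per-tensor scale
        w_q = torch.round(w / scale).clamp(-1, 1) * scale   # ternary values {-1, 0, +1} * scale
        # Straight-through estimator: quantized weights forward, full-precision grads backward.
        w_ste = w + (w_q - w).detach()
        return F.linear(x, w_ste)

# Toy training step: the quantization constraint is baked into every update,
# which is why a model trained in ordinary FP16/BF16 can't just be converted afterwards.
layer = TernaryLinear(16, 8)
opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
x, target = torch.randn(4, 16), torch.randn(4, 8)
loss = F.mse_loss(layer(x), target)
loss.backward()
opt.step()
```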

lemonhope
I don't see them mention anything about training efficiency anywhere, so I don't think it is really legit 1.58-bit training in a meaningful sense.