Vladimir_Nesov


This might require an alien ASI that only pursues prevention of ASIs and doesn't exploit the cosmic wealth; otherwise the stars would be going out. Alternatively, it could be falsifying astronomical observations and human eyesight, which is close to humanity living in a simulation outright.

For Claude 3.5, Amodei says training cost "a few $10M's", which translates to between 1e25 FLOPs (H100, $40M, $4/hour, 30% utilization, BF16) and 1e26 FLOPs (H100, $80M, $2/hour, 50% utilization, FP8); my point estimate is 4e25 FLOPs.
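
As a minimal sketch of that conversion (the ~1e15 and ~2e15 FLOP/s peak per-H100 throughputs for dense BF16 and FP8 are assumed round numbers, and the function name is just for illustration):

```python
# Back-of-the-envelope: training FLOPs from cost, GPU rental price, and utilization.
def flops_from_cost(cost_usd, usd_per_gpu_hour, utilization, peak_flops_per_s):
    gpu_seconds = (cost_usd / usd_per_gpu_hour) * 3600
    return gpu_seconds * utilization * peak_flops_per_s

low = flops_from_cost(40e6, 4.0, 0.30, 1e15)   # ~1e25 FLOPs (BF16)
high = flops_from_cost(80e6, 2.0, 0.50, 2e15)  # ~1.4e26 FLOPs, i.e. order 1e26 (FP8)
print(f"{low:.1e} to {high:.1e}")
```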

GPT-4o was trained around the same time (late 2023 to very early 2024), and given that the current OpenAI training system seems to take the form of three buildings totaling 100K H100s (the Goodyear, Arizona site), they probably had one of those buildings, about 32K H100s, which in 3 months at 40% utilization in BF16 gives 1e26 FLOPs.
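
The same sketch in terms of GPU count and duration (again assuming ~1e15 FLOP/s peak per H100 in dense BF16 and 3 months ≈ 90 days):

```python
# Training FLOPs from GPU count, duration, and utilization.
def flops_from_cluster(num_gpus, days, utilization, peak_flops_per_s=1e15):
    return num_gpus * days * 24 * 3600 * utilization * peak_flops_per_s

print(f"{flops_from_cluster(32_000, 90, 0.40):.1e}")  # ~1e26 FLOPs
```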

Gemini 2.0 was released concurrently with the announcement of general availability of 100K TPUv6e clusters (the instances you can book are much smaller), so Google probably has several of them, and Jeff Dean's remarks suggest they might've been able to connect some of them for purposes of pretraining. Each one can contribute 3e26 FLOPs (conservatively assuming BF16). Hassabis noted on a podcast a few months back that scaling compute 10x each generation seems like a good target to aim for while fighting through the engineering challenges. Gemini 1.0 Ultra was trained on either 77K TPUv4 (according to The Information) or 14 pods of 4096 TPUv4 (according to EpochAI's quote from SemiAnalysis), so my point estimate for Gemini 1.0 Ultra is 8e25 FLOPs.

This gives 6e26-9e26 FLOPs for Gemini 2.0 (from 2-3 of the 100K TPUv6e clusters). But it's unclear whether this is what went into Gemini 2.0 Pro or whether there is also an unmentioned Gemini 2.0 Ultra down the line.
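
A sketch of how the TPU estimates combine, assuming ~9.2e14 FLOP/s peak per TPUv6e and ~2.75e14 FLOP/s per TPUv4 in BF16, with an assumed ~3 months at ~40% utilization (my assumptions, not stated figures):

```python
def cluster_flops(chips, peak_flops_per_s, days=90, utilization=0.40):
    return chips * peak_flops_per_s * days * 24 * 3600 * utilization

per_v6e_cluster = cluster_flops(100_000, 9.2e14)  # ~3e26 FLOPs per 100K TPUv6e cluster
gemini_1_ultra = cluster_flops(77_000, 2.75e14)   # ~7e25 FLOPs, near the 8e25 point estimate
gemini_2_range = (2 * per_v6e_cluster, 3 * per_v6e_cluster)  # ~6e26 to ~9e26 FLOPs
```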

Pretraining on a $150bn system in 2028 gives 150x the compute of Grok 3 (which seems to be a 3e26 FLOPs model). We haven't yet seen what happens if DeepSeek-V3 methods are used in pretraining on the $5bn system that trained Grok 3 in 2025 (which would give roughly 100x DeepSeek-V3's compute), or on a $20bn system in 2026 (a further 8x the FLOPs).
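
A sketch of where those multipliers could come from, under my assumption (for illustration only) that compute scales with training-system cost times the FLOP/s-per-dollar improvement of newer chips, roughly 2x by 2026 and 5x by 2028:

```python
grok3_flops = 3e26        # point estimate for Grok 3
grok3_system_cost = 5e9   # ~$5bn training system

factor_2026 = (20e9 / grok3_system_cost) * 2    # $20bn system, ~2x better chips -> 8x
factor_2028 = (150e9 / grok3_system_cost) * 5   # $150bn system, ~5x better chips -> 150x
print(factor_2026, factor_2028)                 # 8.0 150.0
print(f"{grok3_flops * factor_2028:.1e}")       # 4.5e+28, i.e. ~5e28 FLOPs in 2028
```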

Chinchilla scaling law only applies to perplexity, not necessarily to practical (e.g. benchmark) performance

I think perplexity is a better measure of general intelligence than any legible benchmark. There are rumors that in some settings R1-like methods only started showing signs of life for GPT-4 level models, whereas exactly the same thing didn't work for weaker models[1]. Something else might first start working with the kind of perplexity that a competent lab can concoct in a 5e27 FLOPs model, even if it can later be adopted for weaker models.

lack of high quality training data

This is an example of a compute multiplier that doesn't scale, and the usual story is that there are many algorithmic advancements with the same character: they help at 1e21 FLOPs but become mostly useless at 1e24 FLOPs. The distinction between perplexity and benchmarks in measuring compute multipliers (keeping the dataset unchanged) might be a good proxy for predicting which is which.

you don't get performance that is significantly smarter than the humans who wrote the text in the pretraining data

Prediction of details can make use of arbitrarily high levels of capability, vastly exceeding that of the authors of the predicted text. What the token prediction objective gives you is generality and grounding in the world, even if it seems to be inefficient compared to imagined currently-unavailable alternatives.


  1. Before 2024, only OpenAI (and briefly Google) had a GPT-4 level model, while in 2024 GPT-4 level models became ubiquitous. This might explain how a series of reproductions of o1-like long reasoning performance followed in quick succession, in a way that doesn't significantly rely on secrets leaking from OpenAI. ↩︎

In principle, sufficiently granular MoEs keep matrices at a manageable size, and critical minibatch size grows quickly enough in the first several trillion tokens of pretraining that relatively small scale-up world sizes (from poor inter-chip networking and weaker individual chips) are not a barrier. So unconscionable numbers of weaker chips should still be usable (at good compute utilization) in frontier training going forward. It's still a major hurdle, though, and an even more expensive and complicated one.
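
A toy illustration of the matrix-size point (all dimensions are made up for the example, not taken from any particular model):

```python
d_model = 8192
dense_ffn_width = 4 * d_model          # dense FFN: one 8192 x 32768 projection matrix
n_experts, expert_width = 64, 512      # granular MoE: 64 experts, each only 8192 x 512

# Same total width per projection, but each individual matrix multiplication is
# 64x smaller, so it fits more comfortably on weaker chips with less memory.
assert n_experts * expert_width == dense_ffn_width
```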

I think there are currently 5 live players: Google, Anthropic, OpenAI, xAI, and Meta (but not DeepSeek or SSI), because frontier training compute is necessary and only these 5 seem to have a prospect of keeping up in 2025-2026. This can change if someone else gets enough funding or access to chips (as it quickly did with xAI), but that's still a major additional hurdle no matter how competent a company is in other ways.

Llama-3-405B, with known details and the handicap of being a dense model, demonstrates that the rumored compute multipliers of other AI companies don't have enough oomph to really matter. Probably numbers like 4x per year refer to benchmark performance rather than perplexity, so most of it doesn't directly help with general intelligence and doesn't scale when much more data becomes necessary with more compute. The low spread between different frontier AI companies is a similar observation.

They don't claim that Grok 3 was trained on 200K GPUs, and that can't actually be the case given other things they say. The first 100K H100s were up by early Sep 2024, and the subsequent 100K H200s took them 92 days to set up, so early Dec 2024 at the earliest if they started immediately, which they didn't necessarily. But pretraining of Grok 3 was done by Jan 2025, so there wasn't enough time with the additional H200s.
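
The date arithmetic, with an exact "early Sep" start date assumed purely for illustration:

```python
from datetime import date, timedelta

h100_cluster_done = date(2024, 9, 2)    # first 100K H100s, early Sep 2024 (assumed exact date)
h200_expansion_done = h100_cluster_done + timedelta(days=92)
print(h200_expansion_done)              # 2024-12-03: early Dec at the earliest
```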

There is also a plot where Grok 2 compute is shown slightly above that of GPT-4, so maybe 3e25 FLOPs, and Grok 3 compute is said to be either 10x or 15x that of Grok 2. The 15x figure is given by Musk, who also discussed how Grok 2 was trained with fewer than 8K GPUs, so possibly he was just talking about the number of GPUs, as opposed to the 10x figure named by a team member, which was possibly about the amount of compute. This points to 3e26 FLOPs for Grok 3, which on 100K H100s at 40% utilization would take 3 months, a plausible amount of time if everything worked almost on the first try.
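
Checking that timing (assuming ~1e15 FLOP/s peak per H100 in dense BF16):

```python
target_flops = 3e26
cluster_rate = 100_000 * 1e15 * 0.40      # 100K H100s at 40% utilization, FLOP/s
days = target_flops / cluster_rate / (24 * 3600)
print(f"{days:.0f} days")                 # ~87 days, about 3 months
```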

Time needed to build a datacenter given the funding and chips isn't particularly important for timelines, only for catching up to the frontier (as long as it's 3 months vs. 6 months and not 18 months). Timelines are constrained by securing more funding for a training system, and by designing and manufacturing better chips. Another thing in that presentation was a claim of starting work on another 1.2 GW GB200/GB300 datacenter, which translates to about 600K chips. This appears to be more than other LLM labs (except Google[1]) will construct this year, which might be only about 0.5 GW, but then Musk didn't name a deadline for the 1.2 GW either. It's only more concrete than Meta's 2 GW site in specifying that the chips are Blackwell, so it can't be about plans for 2027, when better chips will be available.
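
The power-to-chip conversion, assuming roughly 2 kW of all-in datacenter power per Blackwell chip (my assumption, covering networking, cooling, and other overhead):

```python
site_power_watts = 1.2e9           # 1.2 GW
watts_per_chip_all_in = 2_000      # assumed all-in power per GB200/GB300 chip
print(int(site_power_watts / watts_per_chip_all_in))  # 600000 chips
```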


  1. On a recent podcast, Jeff Dean stated more clearly that their synchronous multi-datacenter training works between metro areas (not just for very-nearby datacenters), and in Dec 2024 they started general availability of 100K TPUv6e clusters. A TPUv6e has similar performance to an H100, and there are two areas being built up in 2025, each with 1 GW of Google datacenters near each other. So there's potential for 1M H100s or 400K B200s worth of compute, or even double that if these areas or others can be connected with sufficient bandwidth. ↩︎

A lot of free will confusions are sidestepped by framing decisions so that the agent thinks of itself as "I am an algorithm" rather than "I am a physical object". This works well for bounded individual decisions (rather than for long stretches of activity in the world), and the things that happen in the physical world can then be thought of as instantiations of the algorithm and its resulting decision, which the algorithm controls from its abstract headquarters that are outside of physical worlds and physical time.

For example, this way you don't control the past or the future, because the abstract algorithm is not located at some specific time, and all instances of it at various times within the physical world are related to the abstract algorithm in a similar way. For coordination of multiple possible worlds, an abstract algorithm is not anchored to a specific world, so there is no additional conceptual strangeness of controlling one possible world from another: in this framing you instead control both from the same algorithm that is not intrinsically part of either of them. There are also thought experiments where the existence of an instance of the decision maker in some world depends on their own decision (so that for some possible decisions, the instance never existed in the first place), and extracting the decision making into an algorithm that's unbothered by the nonexistence of its instances in real worlds makes this more straightforward.

everyone else is their slave

Post-AGI humans can't be centrally slaves, because human labor won't be valuable.

It's an argument about long reasoning traces having sufficient representational capacity to bootstrap general intelligence, not forecasting that the bootstrapping will actually occur. It's about a necessary condition for straightforward scaling to have a chance of getting there, at an unknown level of scale.
