Vladimir_Nesov

From footnote 2 to The state of AI today:

GPT-2 cost an estimated $43,000 to train in 2019; today it is possible to train a 124M parameter GPT-2 for $20 in 90 minutes.

Isn't $43,000 the estimate for replicating the 1.5B-parameter GPT-2 rather than the 124M one? If so, this phrasing is somewhat misleading. Training even the 1.5B version now costs only about $250, but still.

From chapter The state of AI today:

Later this year, the first 100,000 GPU cluster will go online

It's not the first: there's the xAI cluster from September, and likely a Microsoft cluster from May.

Even the cited article from The Information says about the Meta cluster in question that

The previously unreported cluster, which could be fully completed by October or November, comes as two other companies have touted their own.

From chapter The state of AI today:

The most likely and proximal blocker is power consumption (data-centers training modern AIs use enormous amounts of electricity, up to the equivalent of the yearly consumption of 1000 average US households) and ...

Clusters like xAI's Memphis datacenter with 100K H100s consume about 150 megawatts. An average US household consumes 10,800 kilowatt-hours a year, which is 1.23 kilowatts on average. So the power consumption of a 100K H100 cluster is equivalent to that of 121,000 average US households, not 1,000. Even a cluster of 16K H100s, like the one that trained Llama-3-405B, draws about 24 megawatts, equivalent to 19,000 average US households.
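
A minimal sketch of that arithmetic, using only the figures above (150 MW and 24 MW for the clusters, 10,800 kWh/year per household):

```python
# Cluster power draw compared to average US household power draw.
HOUSEHOLD_KWH_PER_YEAR = 10_800
household_avg_kw = HOUSEHOLD_KWH_PER_YEAR / (365 * 24)   # ~1.23 kW

for name, cluster_mw in [("100K H100s (e.g. xAI Memphis)", 150),
                         ("16K H100s (Llama-3-405B)", 24)]:
    households = cluster_mw * 1_000 / household_avg_kw
    print(f"{name}: {cluster_mw} MW ~ {households:,.0f} average US households")
# Roughly 121,000 and 19,000 households respectively, not 1,000.
```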

So you likely mean the amount of energy (as opposed to power) consumed in training a model ("yearly consumption of 1000 average US households"). The all-in power consumption of a cluster of H100s comes to about 1,500 watts per GPU, and a GPU at 40% compute utilization produces 0.4e15 FLOP/s of useful dense BF16 compute. Thus about 3.75e-12 joules are expended per FLOP that goes into training a model. For the 4e25 FLOPs of Llama-3-405B, that's 1.5e14 joules, or 41e6 kilowatt-hours, which is what 3,800 average US households consume in a year[1].
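
The same estimate as a short sketch (the inputs are the ones above: 1,500 W per GPU all-in, 0.4e15 useful FLOP/s at 40% utilization, 4e25 FLOPs for Llama-3-405B):

```python
# Energy consumed by a training run, expressed in household-years.
WATTS_PER_GPU_ALL_IN = 1_500             # includes the rest of the datacenter
USEFUL_FLOP_PER_SEC = 0.4e15             # H100 dense BF16 at 40% utilization
HOUSEHOLD_KWH_PER_YEAR = 10_800

joules_per_flop = WATTS_PER_GPU_ALL_IN / USEFUL_FLOP_PER_SEC    # ~3.75e-12 J/FLOP
energy_kwh = 4e25 * joules_per_flop / 3.6e6                     # Llama-3-405B: ~41e6 kWh
print(f"{energy_kwh:.1e} kWh ~ {energy_kwh / HOUSEHOLD_KWH_PER_YEAR:,.0f} household-years")
# ~4.2e7 kWh, about 3,800 average US households for a year.
```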

This interpretation fits the numbers better, but it's a bit confusing, since the model is trained for much less than a year, while the clusters go on consuming power all year long. And the constraints that are a plausible proximal blocker of scaling are about power, not energy.


  1. If we instead take the 2e25 FLOPs attributed to the original GPT-4, and the 700 watts of a single H100, while ignoring the surrounding machinery of a datacenter (even though you are talking about what a datacenter consumes in this quote, so this is an incorrect way of estimating energy consumption), and train on H100s (instead of the A100s used for the original GPT-4), then this gives 9.7e6 kilowatt-hours, or the yearly consumption of 900 average US households. With A100s, we instead have 400 watts and 0.3e15 FLOP/s (becoming 0.12e15 FLOP/s at 40% utilization), which gets us 18.5e6 kilowatt-hours for a 2e25 FLOPs model, or the yearly consumption of 1,700 average US households (again, ignoring the rest of the datacenter, which is not the correct thing to do). ↩︎
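
The footnote's two alternative estimates come out of the same formula with GPU-only power draw and 2e25 FLOPs; a sketch using only the footnote's own inputs:

```python
# Footnote variants: 2e25 FLOPs (original GPT-4) and GPU-only power draw.
HOUSEHOLD_KWH_PER_YEAR = 10_800
TRAINING_FLOPS = 2e25

def household_years(watts_per_gpu, useful_flop_per_sec):
    joules = TRAINING_FLOPS * watts_per_gpu / useful_flop_per_sec
    return joules / 3.6e6 / HOUSEHOLD_KWH_PER_YEAR

print(household_years(700, 0.4e15))      # H100s at 700 W: ~900 household-years
print(household_years(400, 0.12e15))     # A100s at 400 W: ~1,700 household-years
```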

A posthuman king is not centrally a king (not mortal, very different incentives), and "an AI" is a very vague bag-of-everything that might include things like simulated worlds or bureaucracies with checks and balances as special cases. The reason His Majesty's Democratic Government doesn't really work while the king retains ultimate authority is that the next king can be incompetent or malevolent, or the government's activities start threatening the king's position, so the king is motivated to restrict them. So even "giving the keys to the universe back" is not necessarily that important in the case of a posthuman god-king, but it remains a possibility after the acute risk period passes and it's clearer how to make the next thing work.

I do think that these things are relevant to 'compute it takes to get to a given capability level'.

In practice, there are no 2e23 FLOPs models that cost $300K to train and are anywhere close to Llama-3-405B in capability. If leading labs had such models (based on unpublished experimental results and more algorithmic insights), they would be much smarter than Llama-3-405B when trained with the 8e25 FLOPs the labs have to give, rather than the reference 2e23 FLOPs. A better choice of ways of answering questions doesn't get us far in actual technical capabilities.

(Post-training like o1's is a kind of "better choice of ways of answering questions" that might help, but we don't know how much compute it saves. Noam Brown gestures at 100,000x from his earlier work, but we haven't seen Llama 4 yet; it might just spontaneously become capable of coherent long reasoning traces as a result of more scale, the bitter lesson making the Strawberry Team's efforts moot.)

Many improvements observed at smaller scale disappear at greater scale, or don't stack with each other. Many papers have horrible methodologies, plausibly born of scarcity of research compute, that don't even try (or make it possible) to estimate the compute multiplier. Most of them will eventually be forgotten, for good reason. So most papers that seem to demonstrate improvements are not strong evidence for the hypothesis of a 1000x cumulative compute efficiency improvement, while that hypothesis predicts observations about what should already be possible in practice that we are not in fact getting, which is strong evidence against it. There are multiple competent teams that don't have Microsoft compute, and they don't win over Llama-3-405B, which we know doesn't have all of these speculative algorithmic improvements and uses 4e25 FLOPs (2.5 months on 16K H100s, rather than 1.5 months on 128 H100s for 2e23 FLOPs).
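
For reference, the training-time figures in that parenthetical work out as follows, assuming ~0.4e15 useful FLOP/s per H100 (40% utilization of dense BF16), as above:

```python
# How long a given training compute budget takes on a given cluster.
USEFUL_FLOP_PER_SEC_PER_GPU = 0.4e15     # H100 dense BF16 at ~40% utilization
SECONDS_PER_MONTH = 30 * 24 * 3600

def training_months(training_flops, n_gpus):
    return training_flops / (n_gpus * USEFUL_FLOP_PER_SEC_PER_GPU) / SECONDS_PER_MONTH

print(training_months(4e25, 16_000))     # Llama-3-405B: ~2.4 months on 16K H100s
print(training_months(2e23, 128))        # hypothetical 2e23 FLOPs run: ~1.5 months on 128 H100s
```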

In other words, the importance of Llama-3-405B for the question about speculative algorithmic improvements is that the detailed report shows it has no secret sauce; it merely competently uses about as much compute as the leading labs, in very conservative ways. And yet it's close in capabilities to all the other frontier models. Which means the leading labs don't have significantly effective secret sauce either, which means nobody does, since the leading labs would've already borrowed it if it were that effective.

There's clearly a case in principle for it being possible to learn with much less data, anchoring to humans blind from birth. But there's probably much more compute happening in a human brain per the proverbial external data token. And a human has the advantage of not learning everything about everything, with greater density of capability over encyclopedic knowledge, which should help save on compute.

I'm talking about the compute multiplier, as a measure of algorithmic improvement: how much less compute it takes to get to the same place. Half of these things are not relevant to it. Maybe another relevant datapoint is Mosaic's failure with DBRX, when their entire thing was hoarding compute multipliers.

Consider Llama-3-405B, a 4e25 FLOPs model that is just Transformer++ from the Mamba paper I referenced above, not even MoE. Since Transformer++ already captures roughly a 5x multiplier over the original transformer, a compute multiplier of 1000x over the original transformer would be a 200x multiplier over this Llama, meaning matching its performance with 2e23 FLOPs (1.5 months of training on 128 H100s). Yi-Lightning is exceptional for its low 2e24 FLOPs compute (10x more than our target), but it feels like a lot of it is better post-training; subjectively it doesn't appear quite as smart, so it would probably lose the perplexity competition.
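
Spelling that arithmetic out (the ~5x Transformer++ multiplier is the rough figure suggested by the Mamba paper's Transformer vs. Transformer++ comparison; treat it as an approximate assumption):

```python
# What a 1000x cumulative multiplier over the original transformer would imply.
llama_flops = 4e25                       # Llama-3-405B training compute
multiplier_over_original = 1000          # the hypothesis under discussion
transformer_pp_multiplier = 5            # assumed: roughly what Transformer++ already captures

multiplier_over_llama = multiplier_over_original / transformer_pp_multiplier  # 200x
equivalent_flops = llama_flops / multiplier_over_llama                        # 2e23 FLOPs
print(f"{multiplier_over_llama:.0f}x over Llama-3-405B -> {equivalent_flops:.0e} FLOPs")
# i.e. matching Llama-3-405B with 2e23 FLOPs, about 1.5 months on 128 H100s.
```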

Bug: I can no longer see the number of agreement votes (which is distinct from the number of Karma votes). Hovering over the agreement score only shows the Agreement Downvote tooltip (the equivalent hover for the Karma score works correctly, saying for example "This comment has 31 overall karma (17 Votes)").

Edit: The number of agreement votes can be seen when hovering over two narrow strips, probably 1 pixel high, one right above and one right below the agreement rating.

but have so far only found relatively incremental improvements to transformers (in the realm of 1000x improvement)

What 1000x improvement? Better hardware and larger scale are not algorithmic improvements. Careful study of scaling laws to get Chinchilla scaling and set tokens per parameter more reasonably[1] is not an algorithmic improvement either. There was maybe a 5x-20x algorithmic improvement, meaning the compute multiplier: how much less compute one would need to get the same perplexity on some test data. The upper bound is speculation, based on published research for which there are no public results of large-scale experiments (including for combinations of multiple methods), and on the absence of very strong compute multiplier results from developers of open-weights models who publish detailed reports, like DeepSeek and Meta. The lower bound can be observed in the Mamba paper (Figure 4, Transformer vs. Transformer++), though it doesn't test MoE over a dense transformer (which should be a further 2x or so, but I still don't know of a paper that demonstrates this clearly).
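
As a toy illustration of what the compute multiplier means, and of why small per-paper claims can seem to stack up to a large cumulative number (all numbers here are hypothetical):

```python
import math

# Compute multiplier: how much less compute reaches the same perplexity.
def compute_multiplier(baseline_flops, improved_flops_same_perplexity):
    return baseline_flops / improved_flops_same_perplexity

# Hypothetical numbers: a method matching the baseline's perplexity
# with 5x less compute has a 5x compute multiplier.
print(compute_multiplier(1e21, 2e20))   # -> 5.0

# Multipliers from genuinely independent improvements compound multiplicatively,
# which is how many small unverified claims can appear to add up to ~1000x.
print(math.prod([2.0] * 10))            # ten independent 2x improvements -> 1024x
```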

The recent Yi-Lightning is an interesting example: it wins on Chatbot Arena in multiple categories over all but a few of the strongest frontier GPT-4 level models (the original GPT-4 itself is far behind). It was trained for about 2e24 FLOPs, 10x less than the original GPT-4, and it's a small overtrained model, so its tokens per parameter are very unfavorable; that is, it would have been possible to make it even more capable with the same compute.


  1. It's not just 20 tokens per parameter. ↩︎

The point is that the "controller" of a "controllable AI" is a role that can be filled by an AI, not only by a human or a human institution. AI is going to quickly grow the pie to the point where current industry and economy (controlled by humans) become a rounding error, so it seems unlikely that among the entities vying for control over controllable AIs, humans and human institutions are going to be worth mentioning. It's not even about a takeover; Google didn't take over Gambia.

If your work makes AI systems more controllable, who will ultimately wield that control?

A likely answer is "an AI".
