Vladimir_Nesov

DeepSeek-V3 is a MoE model with 37B active parameters trained for 15T tokens, so at about 400 tokens per parameter it's very overtrained and could've been smarter with similar compute if hyperparameters were compute optimal. It's probably the largest model known to be trained in FP8; it extracts 1.4x more compute per H800 than most models trained in BF16 get from an H100, for about 6e24 FLOPs total[1], about as much as Llama-3-70B. And it activates 8 routed experts per token (out of 256 total routed experts), which a Feb 2024 paper[2] suggests is a directionally correct thing to do (compared to the popular practice of activating only 2 experts), with about 64 experts per token being optimal around 1e24-1e25 FLOPs. Taken together, these advantages predict that it should be smarter than Llama-3-70B, if done well.
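A quick arithmetic check of these figures (6ND is the usual rule of thumb for dense training compute; the Llama-3-70B token count is its reported ~15T):

```python
# Quick check of the figures above.
active_params = 37e9            # DeepSeek-V3 active parameters
tokens = 15e12                  # training tokens
print(tokens / active_params)   # ~405 tokens per parameter, hence "very overtrained"
print(6 * 70e9 * 15e12)         # ~6.3e24 FLOPs for Llama-3-70B at ~15T tokens (6*N*D)
```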

Models that are smarter than Llama-3-70B can show impressive benchmark performance that then doesn't cash out in the hard-to-operationalize impression of being as smart as Claude 3.5 Sonnet. The jury is still out, but it's currently available even in Direct Chat on Chatbot Arena, so there will be more data on this soon. It would be shocking if a 37B active parameter model actually managed that, though.


  1. H800 seems to produce 1.4e15 dense FP8 FLOP/s, the model was trained for 2.8e6 H800-hours, and I'm assuming 40% compute utilization; the arithmetic is spelled out in the sketch after these footnotes. ↩︎

  2. That same paper estimates the compute multiplier of a compute optimal MoE at about 20x compared to a dense model (see Figure 1b), which is hard to believe; it's based on experiments of up to about 3e19-4e20 FLOPs per datapoint. Still, the claim that activating many more than 2 experts is better might survive in practice. ↩︎
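A sketch of the estimate in footnote 1, using the per-chip throughput and utilization assumed there:

```python
# Footnote-1 estimate: H800-hours -> total training FLOPs.
flop_per_s = 1.4e15     # assumed dense FP8 throughput per H800
gpu_hours = 2.8e6
utilization = 0.4
print(flop_per_s * 3600 * gpu_hours * utilization)   # ~5.6e24, i.e. about 6e24 FLOPs
```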

Answer by Vladimir_Nesov

Aggregating from independent reasoning traces is a well-known technique that helps somewhat but quickly plateaus, which is the reason o1/o3 are an important innovation: they use additional tokens much more efficiently and reach greater capability, as long as those tokens are within a single reasoning trace. Once a trace is done, more compute can only go to consensus or best-of-k aggregation over multiple traces, which is more wasteful in compute and quickly plateaus.
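A minimal sketch of what such aggregation looks like (a toy illustration, not the o1/o3 method; the `score` function stands in for a hypothetical verifier or reward model):

```python
# Toy illustration of aggregating over independent traces.
from collections import Counter

def consensus(final_answers):
    # Majority vote over the final answers extracted from k independent traces.
    return Counter(final_answers).most_common(1)[0][0]

def best_of_k(final_answers, score):
    # `score` is a hypothetical verifier/reward model ranking candidate answers.
    return max(final_answers, key=score)

answers = ["42", "42", "41", "42", "7"]   # answers from five sampled traces (toy data)
print(consensus(answers))                  # "42"
```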

The $4000 high resource config of o3 for ARC-AGI was using 1024 traces of about 55K tokens, the same length as with the low resource config that runs 6 traces. Possibly longer reasoning traces don't work yet, otherwise a "pour money on the problem" option would've used longer traces. So a million dollar config would just use 250K reasoning traces of length 55K, which is probably slightly better than what 1K traces already produce.
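The arithmetic behind that extrapolation, assuming cost scales linearly with the number of ~55K-token traces:

```python
# Extrapolating the high resource config to a hypothetical million dollar budget.
traces_high = 1024       # traces in the $4000 high resource config
cost_high = 4000         # dollars
budget = 1_000_000       # hypothetical budget
print(traces_high * budget / cost_high)   # 256000, i.e. ~250K traces of ~55K tokens each
```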

I think explicitly computing details in full (as opposed to abstract reasoning about approximate properties) has no bearing on moral weight (degree of being real), but some kind of computational irreducibility forces the simulation of interesting things to get quite close to low level detail in order to figure out most global facts about what's going on there, such as the values/culture of people living in a world after significant time has passed.

They've probably scaled up 2x-4x compared to the previous scale of about 8e25 FLOPs; it's not that far (from 30K H100 to 100K H100). One point, as I mentioned in the post, is the inability to reduce minibatch size, which might make this scaling step even less impactful than compute alone would suggest, though that doesn't apply to Google.

In any case this doesn't matter yet, since the 1 GW training systems are already being built (in the case of Nvidia GPUs, with larger scale-up worlds of GB200 NVL72); the decision to proceed to the yet-unobserved next level of scaling doesn't depend on what's observed right now. The 1 GW training systems allow training up to about 5e27 FLOPs, about 60x[1] the compute of currently deployed models, a more significant change. We'll see its impact in late 2026.


  1. The number of chips increases 5x from 100K H100 to 500K B200, and the new chips are 2.5x faster. If 1 GW systems are not yet expected to be quickly followed by larger systems, more time will be given to individual frontier model training runs, let's say 1.5x more. And there was that 3x factor from 30K H100 to 100K H100. ↩︎
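Multiplying out the factors in this footnote (my arithmetic, using the ~8e25 FLOPs figure for currently deployed models from the comment above):

```python
# Multiplying out the footnote's factors.
more_chips = 500_000 / 100_000   # 5x: 100K H100 -> 500K B200
faster_chip = 2.5                # B200 vs H100
longer_run = 1.5                 # more time given to a frontier training run
prior_step = 3.0                 # the earlier 30K H100 -> 100K H100 step
factor = more_chips * faster_chip * longer_run * prior_step
print(factor)                    # ~56x, i.e. roughly 60x
print(8e25 * factor)             # ~4.5e27, consistent with "up to about 5e27 FLOPs"
```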

It's as efficient to work on many frames while easily switching between them. Some will be poorly developed, but they won't require commitment and can anchor curiosity and progress on blind spots of other frames.

Don't just disagree and walk away!

Feeding this norm creates friction and filters the evidence elicited in agreement-voting. If there is a sense that a vote needs to be explained, it often won't be cast.

Are there any signs to be found in public that anyone is training 10B+ LLMs in a precision that is not 16 bits? There are experiments that are specifically about precision on smaller LLMs, but they don't seem to get adopted in practice for larger models, despite the obvious advantage of getting up to 2x the compute.

In general, I don't understand linking scaling difficulties to max scale-up world size. I believe the bandwidth/latency of IB H100 clusters does not present a hard problem for current hyperscalers on other parallelisms.

Pipeline parallelism doesn't reduce batch size, it just moves the processing of a given sequence around the cluster in stages; the number of sequences being processed by the cluster at a given time doesn't change (the time needed to process a layer for some sequence doesn't change, so the time between optimizer steps doesn't change, other than through bubbles).

Tensor parallelism spreads the processing of a sequence across multiple GPUs, so there are fewer sequences processed at once within the cluster, which can be used to reduce the batch size (the time needed to process a layer for some sequence is divided by the degree of tensor parallelism, so the time between optimizer steps goes down, and so does the total compute expended in a batch, which is proportional to the total number of sequences in it). You can only do tensor parallelism within a scale-up world without murdering compute utilization, which puts a bound on how much you can reduce the batch size.
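A toy way to see the bound (my illustration, not anything from the comment): the number of sequences in flight, and hence the smallest batch that keeps the cluster busy, scales as num_gpus / TP, and the pipeline degree drops out.

```python
# Toy model of minimum sequences in flight as a function of parallelism.
# Only tensor parallelism shares one sequence's work across GPUs; a pipeline of pp
# stages needs ~pp microbatches in flight to stay busy but also uses pp times more
# GPUs per pipeline, so pp cancels out of the count.

def min_sequences_in_flight(num_gpus: int, tp: int) -> int:
    return num_gpus // tp

print(min_sequences_in_flight(100_000, tp=8))    # 12500
print(min_sequences_in_flight(100_000, tp=16))   # 6250: larger TP allows a smaller batch
```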

I believe the l3 paper indicates the training seqlen was increased mid-training.

Section 3.4 says they start with sequences of length 4K, move to sequences of length 8K after 250M tokens, then increase the batch to 16M tokens after 2.9T tokens, and finally switch to long context training for the last 800B tokens (out of about 15T tokens in total). So 11T out of 15T tokens were learned in batches of 2K sequences of length 8K.
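The arithmetic behind the last sentence:

```python
# "Batches of 2K sequences of length 8K" and "11T out of 15T".
print(16e6 / 8192)               # ~1950, i.e. about 2K sequences per 16M-token batch
print(15e12 - 2.9e12 - 0.8e12)   # ~11.3e12 tokens trained in that regime
```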

I think it's plausible the combination of torus topology + poor PCIe5.0 bw/latency will make a full TP=64 Trn2 config underperform your expectations

Good catch: TP=32 on 400K Trn2 gives the same batch size as TP=8 on 100K H100, so there is only an advantage with TP=64, which is not a priori a sure thing to work well. And a hypothetical non-Ultra 400K Trn2 cluster with its 16 GPU scale-up worlds is worse, even though there's more compute in 16 Trn2 than in 8 H100. Though it would be surprising if the Rainier cluster doesn't have the Ultra config; what else is it supposed to be for?
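Checking that comparison with the same num_chips / TP rule of thumb as in the earlier sketch:

```python
# Minimum sequences in flight ~ num_chips / TP.
print(100_000 / 8)     # 12500 on 100K H100 with TP=8
print(400_000 / 32)    # 12500 on 400K Trn2 with TP=32: no reduction
print(400_000 / 64)    # 6250 with TP=64, the config that would actually help
```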

given that this is RL, there isn't any clear reason this won't work (with some additional annoyances) for scaling through very superhuman performance

Not where they don't have a way of generating verifiable problems. Improvement where they merely have some human-written problems is likely bounded by the number of such problems.

Answer by Vladimir_Nesov

An AGI broadly useful for humans needs to be good at general tasks for which there is currently no way of finding legible problem statements (where System 2 reasoning is useful) with verifiable solutions. Currently LLMs are slightly capable at such tasks, and there are two main ways in which they become more capable: scaling and RL.

Scaling is going to continue rapidly showing new results at least until 2026-2027, probably also 2028-2029. If there's no AGI or something like a $10 trillion AI company by then, there won't be a trillion dollar training system and the scaling experiments will fall back to the rate of semiconductor improvement.

Then there's RL, which as o3 demonstrates applies to LLMs as a way of making them stronger, not merely eliciting capabilities formed in pretraining. But it only works directly around problem statements with verifiable solutions, and it's unclear how to generate them for more general tasks, or how far the capabilities will generalize from the training problems that are possible to construct in bulk. (Arguably self-supervised learning is good at instilling general capabilities because the task of token prediction is very general; it subsumes all sorts of things. But it's not legible.) Here too scale might help, both with generalization stretching further from the training problems and with building verifiable problem statements for more general tasks, and we won't know how much it will help until the experiments are done.

So my timelines are concentrated on 2025-2029; after that, the rate of change in capabilities goes down. Probably 10 more years of semiconductor and algorithmic progress after that are sufficient to wrap it up, though, so 2040 without AGI seems unlikely.
