Taken in isolation, DeepSeek-V3 looks like a 15x compute multiplier. But if a lot of the gain comes from data, the multiplier won't scale: when you need much more data, it necessarily gets worse, unless you already have a teacher model that's better. In any case, this raises the ceiling for what 5 GW training systems can do (at which point there's either almost-AGI or scaling slows down a lot). And at that point the 15x multiplier of DeepSeek-V3 needs to be compared with the algorithmic advancements of 2025-2028, which would've included most of the things in DeepSeek-V3 anyway, so the counterfactual impact is small.
32B active parameters instead of likely ~220B for GPT4
It's 37B instead of maybe 280B (non-expert parameters also count), but in any case the question is how this manages to maintain quality. If this wasn't an issue, why not 8B active parameters, or 1M active parameters?
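As a rough sanity check on that 37B figure, here is a back-of-envelope tally. All the architecture numbers below are my recollection of the released config (hidden size, layer count, MLA ranks, expert width, vocab), so treat them as assumptions rather than quoted values; the point is just that the non-expert parameters (attention, dense layers, embeddings) make up a meaningful share of the active count.

```python
# Back-of-envelope active parameter count for DeepSeek-V3 (all numbers assumed from
# the released config: hidden 7168, 61 layers, first 3 layers dense with FFN width
# 18432, remaining 58 layers MoE with 8 routed + 1 shared active experts of width
# 2048, MLA with 128 heads of dim 128, q/kv low-rank dims 1536/512, rope dim 64,
# vocab ~129k, untied output head).

d, layers, vocab = 7168, 61, 129280
heads, head_dim, rope_dim = 128, 128, 64
q_rank, kv_rank = 1536, 512

attn = layers * (
    d * q_rank + q_rank * heads * (head_dim + rope_dim)                   # query down/up projections
    + d * (kv_rank + rope_dim) + kv_rank * heads * (head_dim + head_dim)  # joint KV down/up projections
    + heads * head_dim * d                                                # output projection
)
dense_ffn = 3 * 3 * d * 18432      # first 3 dense layers: gate/up/down matrices
moe_ffn = 58 * 9 * 3 * d * 2048    # 58 MoE layers, 8 routed + 1 shared active experts each
embed = 2 * vocab * d              # input embedding plus untied output head

total = attn + dense_ffn + moe_ffn + embed
print(f"~{total / 1e9:.1f}B active parameters")   # comes out around 37B
```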
32B active parameters instead of likely ~220B for GPT4 => 6.8x lower training ... cost
Doesn't follow: training cost scales with active parameters times the number of training tokens, and DeepSeek-V3 uses maybe 1.5x-2x more tokens than the original GPT-4.
The training compute is maybe 5e24 FLOPs vs 2e25 FLOPs, a 4x difference. DeepSeek-V3 is better than the original GPT-4, though, so the comparison should be with GPT-4o, which almost certainly uses more compute in training than the original GPT-4 (maybe 4x more, so maybe 16x more than DeepSeek-V3).
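For what it's worth, a rough cross-check of the ~5e24 figure, under my own assumptions about H800 peak FP8 throughput and utilization (neither is from the paper); the 6ND estimate and the GPU-hours estimate roughly bracket it:

```python
# Two rough estimates of DeepSeek-V3 training compute, compared with a commonly
# cited estimate for original GPT-4.

active_params, tokens = 37e9, 14.8e12          # paper's activated params and token count
compute_6nd = 6 * active_params * tokens       # ~3.3e24 FLOPs for the main next-token loss

gpu_hours = 2.788e6                            # reported H800 GPU-hours
peak_fp8, utilization = 1.98e15, 0.30          # assumed dense FP8 peak per H800 and MFU (my guesses)
compute_from_hours = gpu_hours * 3600 * peak_fp8 * utilization   # ~6e24 FLOPs

gpt4_estimate = 2e25                           # commonly cited estimate for original GPT-4
print(f"{compute_6nd:.1e}  {compute_from_hours:.1e}  ratio to GPT-4: {gpt4_estimate / compute_from_hours:.1f}x")
```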
8bits training instead of 16bits => 4x lower training cost
FLOP/s for FP8 are almost always 2x the FLOP/s for BF16, not 4x.
Multi-token training => ~2x training efficiency
You still train on every token. There is an additional "layer" in model parameters that predicts the token-after-next (Figure 3 in the paper), so there's a bit of overhead in training (not much, with 61 total layers). The results are better, but not that much better (Table 4).
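A minimal sketch of how I read the depth-1 MTP module in Figure 3: the main model keeps its usual next-token loss at every position, and one extra block combines the final hidden state at position t with the embedding of token t+1 to predict token t+2, sharing the embedding and output head with the main model. The block type, norms, and loss weight here are illustrative stand-ins, not the paper's exact modules.

```python
# Sketch of depth-1 multi-token prediction: one extra transformer block on top of the
# main model's final hidden states predicts the token-after-next, with shared embedding
# and output head. nn.TransformerEncoderLayer and LayerNorm stand in for the paper's
# block and RMSNorm; dimensions are illustrative.
import torch
import torch.nn as nn

class MTPHead(nn.Module):
    def __init__(self, d_model: int, n_heads: int, embed: nn.Embedding, unembed: nn.Linear):
        super().__init__()
        self.embed, self.unembed = embed, unembed        # shared with the main model
        self.norm_h = nn.LayerNorm(d_model)
        self.norm_e = nn.LayerNorm(d_model)
        self.proj = nn.Linear(2 * d_model, d_model, bias=False)
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, h: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # h: [B, T, d] final hidden states of the main model; tokens: [B, T] input ids.
        # At position t, combine h_t with the embedding of token t+1 to predict token t+2.
        h_t = self.norm_h(h[:, :-2, :])
        e_next = self.norm_e(self.embed(tokens[:, 1:-1]))
        x = self.proj(torch.cat([h_t, e_next], dim=-1))
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        x = self.block(x, src_mask=mask, is_causal=True)
        return self.unembed(x)                           # logits for tokens at positions 2..T-1

# Loss: the usual next-token loss on every token, plus a small weighted MTP term, e.g.
#   loss = ce(main_logits[:, :-1], tokens[:, 1:]) + lam * ce(mtp_logits, tokens[:, 2:])
```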
training on O1 outputs
Outputs of o1 don't include the reasoning traces, so they're not particularly useful compared to outputs of chatbot models, and they're very expensive, so only a modest amount can be collected.
Imitation helps with post-training, but the compute-heavy part is pretraining, and obtaining good quality with so little pretraining compute is a novel feat that isn't known to be explainable by good post-training, or by including a lot of outputs from good models in the pretraining/annealing mix.
This seems unlikely to be a neglected concern, unless there are specific signs that it is.
could end up being the most important thing I’ve ever written
The $6 million figure is disputed by a video arguing that DeepSeek used far more compute than they admit to.
The prior reference is a Dylan Patel tweet from Nov 2024, in the wake of the R1-Lite-Preview release:
Deepseek has over 50k Hopper GPUs to be clear.
People need to stop acting like they only have that 10k A100 cluster.
They are omega cracked on ML research and infra management but they aren't doing it with that many fewer GPUs
DeepSeek explicitly states that
DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training.
This seems unlikely to be a lie: if it were, the reputational risk would've motivated simply not mentioning the amount of compute. But the most interesting thing about DeepSeek-V3 is precisely this claim, that its quality is possible with so little compute.
Certainly designing the architecture, the data mix, and the training process that made it possible required much more compute than the final training run, so in total it cost much more to develop than $6 million. And the 50K H100/H800 system is one way to go about that, though renting a bunch of 512-GPU instances from various clouds probably would've sufficed as well.
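For reference, the headline number is just the rental-price arithmetic from the report (the ~$2/GPU-hour price and the 2048-GPU cluster size are my recollection of what it states; the 512-GPU framing is hypothetical):

```python
# Where the ~$6M headline comes from, and how long the run takes on clusters of
# different sizes (assumed: ~$2 per H800 GPU-hour, a 2048-GPU training cluster,
# and hypothetical rented 512-GPU instances).

gpu_hours = 2.788e6
print(f"${gpu_hours * 2 / 1e6:.1f}M")                                           # ~$5.6M
print(f"{gpu_hours / 2048 / 24:.0f} days on 2048 GPUs")                         # ~57 days
print(f"{gpu_hours / (8 * 512) / 24:.0f} days on 8 rented 512-GPU instances")   # ~28 days
```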
Found the following in the Jan 23 newsletter:
AI doesn’t accelerate my writing much, although it is often helpful in parsing papers and helping me think through things. But it’s a huge multiplier on my coding, like more than 10x.
What actually happens with xAI and Anthropic compute by the end of 2025 is less clear. For xAI, a 300K B200s figure was mentioned in June 2024. For Anthropic, Amodei said in a recent interview that
I would not be surprised if in 2026 we have more than a million of some kind of chip.
Meanwhile, xAI will have a 200K H100/H200 system, and Anthropic a 400K Trn2 system, which is about 250K H100s worth of FLOP/s (ready a few months into 2025). The 400-600 MW at the Abilene site for OpenAI correspond to 200K-300K B200s, which is about 500K-750K H100s worth of FLOP/s.
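The conversions above come out roughly as follows if one uses dense BF16 peak throughput per chip; all three per-chip numbers are my own datasheet-style assumptions (H100/H200 ~0.99 PFLOP/s, Trainium2 ~0.65 PFLOP/s, B200 as configured in GB200 racks ~2.5 PFLOP/s):

```python
# H100-equivalents by dense BF16 peak FLOP/s per chip (all per-chip figures assumed).

H100 = 0.99e15
systems = {
    "xAI: 200K H100/H200":           (200_000, 0.99e15),
    "Anthropic: 400K Trn2":          (400_000, 0.65e15),
    "OpenAI: 200K B200 (low end)":   (200_000, 2.5e15),
    "OpenAI: 300K B200 (high end)":  (300_000, 2.5e15),
}
for name, (n, flops) in systems.items():
    print(f"{name}: ~{n * flops / H100 / 1e3:.0f}K H100-equivalents")
```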
Google might start 2026 with the largest training system among the big labs, by a factor of about 2x, at about 1 GW.
The OpenAI/Microsoft Stargate schism suggests that the compute being built this year by Microsoft is unlikely to form part of a geographically distributed training system that also includes the compute being built at the Abilene site. It seems like OpenAI will be building its own training systems (through Stargate), while Microsoft will be serving inference (and possibly generation for RL training, though it remains unclear whether that can be an important fraction of the pretraining budget in 2025-2026). Thus only 400-600 MW of GB200s by the end of 2025 for an OpenAI training system, not 1 GW.
Meta announced a 2 GW datacenter at the Richland Parish site, but the 1 GW for 2025 seems to be spread across all of its datacenters, not concentrated in a single training system. So the training system will be smaller than that by the end of 2025.
The bet that "makes sense" is that the quality of Claude 3.6 Sonnet, GPT-4o, and DeepSeek-V3 is the best we're going to get in the next 2-3 years, and that DeepSeek-V3 gets it much cheaper (fewer active parameters, smaller margins because of open weights), also "suggesting" that quality is compute-insensitive across a large range, so we don't need that much compute.
But if quality instead improves soon, and that improvement either makes it necessary to use more compute per token, or motivates using inference for more tokens even with models that have the same active parameter count (as in the Jevons paradox), then that argument doesn't work. Also, the ceiling on quality at the possible scaling-slowdown point depends on the efficiency of training applied to the largest training system that AI economics will support (maybe 5-15 GW short of almost-AGI), and the improved efficiency demonstrated by DeepSeek-V3 raises that ceiling.