A new open-source model has been announced by the Chinese lab DeepSeek: DeepSeek-V3. It reportedly outperforms both Sonnet 3.5 and GPT-4o on most tasks and is almost certainly the most capable fully open-source model to date.
Beyond the implications of open-sourcing a model of this caliber, I was surprised to learn that they trained it using only 2,000 H800 GPUs! This suggests that, with an exceptionally competent team of researchers, it’s possible to overcome computational limitations.
Here are two potential implications:
DeepSeek-V3 is a MoE model with 37B active parameters trained for 15T tokens. At about 400 tokens per active parameter it's heavily overtrained, and could have been smarter at similar compute if its hyperparameters had been compute optimal. It's probably the largest model known to have been trained in FP8, which lets it extract 1.4x more compute per H800 than most BF16-trained models get from an H100, for about 6e24 FLOPs total[1], roughly as much as Llama-3-70B. And it activates 8 routed experts per token (out of 256 routed experts in total), which a Feb 2024 paper[2] suggests is a directionally correct thing to do (compared to the popular practice of activating only 2 experts), with about 64 experts per token being optimal around 1e24-1e25 FLOPs. Taken together, these advantages predict that it should be smarter than Llama-3-70B, if done well.
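The overtraining claim is easy to sanity-check: a minimal sketch, using only the 37B active parameters and 15T tokens stated above (the commonly cited compute-optimal reference point is on the order of 20 tokens per parameter):

```python
# Back-of-the-envelope check of the tokens-per-active-parameter ratio.
active_params = 37e9   # 37B active parameters (MoE)
train_tokens = 15e12   # 15T training tokens

ratio = train_tokens / active_params
print(f"{ratio:.0f} tokens per active parameter")  # ~405, i.e. "about 400"
```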
Models smarter than Llama-3-70B can show impressive benchmark performance that then doesn't cash out in the hard-to-operationalize impression of being as smart as Claude 3.5 Sonnet. The jury is still out: it's currently available even in Direct Chat on Chatbot Arena, so there will be more data on this soon. It would be shocking if a 37B active parameter model actually managed that, though.
An H800 seems to produce 1.4e15 dense FP8 FLOP/s, the model was trained for 2.8e6 H800-hours, and I'm assuming 40% compute utilization. ↩︎
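The footnote's arithmetic can be reproduced directly; a minimal sketch, assuming only the three numbers stated in the footnote:

```python
# Total training compute implied by the footnote's assumptions.
fp8_flops_per_s = 1.4e15  # dense FP8 throughput per H800, FLOP/s
gpu_hours = 2.8e6         # reported H800-hours of training
utilization = 0.4         # assumed 40% compute utilization

total_flops = fp8_flops_per_s * gpu_hours * 3600 * utilization
print(f"{total_flops:.2e} FLOPs")  # ~5.6e24, i.e. about 6e24
```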
That same paper estimates the compute multiplier of a compute optimal MoE at about 20x compared to a dense model, see Figure 1b, which is hard to believe. It's based on experiments of up to about 3e19-4e20 FLOPs per datapoint. Still, the claim of many more activated experts than 2 being better might survive in practice. ↩︎
> The jury is still out, but it's currently available even in Direct Chat on Chatbot Arena, there will be more data on this soon.
Fyi, it's also available on https://chat.deepseek.com/, as is their reasoning model DeepSeek-R1-Lite-Preview ("DeepThink"). (I suggest signing up with a throwaway email and not inputting any sensitive queries.) From quickly throwing at it a few requests I had recently asked 3.5 Sonnet, DeepSeek-V3 seems slightly worse, but nonetheless solid.
Has anyone tried clawdbot yet? I tried it today after hearing a lot of hype on X, and it really gives me the vibe of the drop-in virtual remote worker from Situational Awareness. It's basically an open-source agent that takes full control of your local computer environment and communicates with you in a human-like fashion over WhatsApp, Telegram, or Slack. I know there have been many attempts at this kind of agent before, but this is the first one that really seems to work.
I'm not sure what the implications of this kind of tool are for AI timelines, but it really does seem like one of the crucial pieces of a fully automated economy has just been deployed.