I think this should somewhat update people away from "we can prevent model weights from being stolen by limiting the outgoing bandwidth from the data center," if that protection assumes that model weights are very big and that [the dangerous part] can't be made smaller.
I'd also bet that, even if DeepSeek turns out to be somehow "fake" (optimized for benchmarks in some way) (not that this currently seems to be the situation), some other way of making at least the dangerous[1] parts of a model much smaller[2] will be found and known[3] publicly.
[1] If someone is stealing a model, they probably care about "dangerous" capabilities like ML engineering and the ability to act autonomously in the world, but not about "not dangerous" capabilities like memorizing Harry Potter and all its fan fiction. If you're interested in betting with me, I'd probably let you judge what is and isn't dangerous. Also, as far as I can tell, DeepSeek is much smaller without giving up a lot of knowledge, so the claim I'm making in this bet is even weaker.
[2] At least 10x smaller, but I'd also bet on 100x at some odds.
[3] This sets a lower bound on the secret capabilities a nation state might have if it is trying to steal model weights defended this way. So again, I expect the attack we'd actually see against such a plan to be even stronger.
Yes, but I think the larger update is that recent models from OpenAI are likely quite small, and inference-time compute usage creates more of an incentive for small models. It seems likely that (e.g.) o1-mini is quite small given that it generates at 220 tokens per second(!), perhaps <30 billion active parameters based on the link from Epoch given earlier. I'd guess (idk) 100 billion total params. Likely something similar holds for o3-mini.
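For intuition on where estimates like that come from, here is a minimal bandwidth-bound decoding sketch. All the hardware and precision numbers are illustrative assumptions (Epoch's estimate uses its own methodology), and real serving stacks (batching, tensor parallelism, speculative decoding) can move the numbers a lot:

```python
# Rough sanity check: single-stream decode is approximately memory-bandwidth-
# bound, so tokens/sec ~= (HBM bandwidth) / (bytes of weights read per token).
# Hardware and precision below are assumptions, not known o1-mini facts.

hbm_bandwidth = 3.35e12   # bytes/sec for one H100 SXM (assumed hardware)
bytes_per_param = 1       # fp8 weights (assumed serving precision)
tokens_per_sec = 220      # observed o1-mini generation speed

# Implied ceiling on active parameters if one GPU serves one stream:
max_active = hbm_bandwidth / (tokens_per_sec * bytes_per_param)
print(f"1 GPU:  ~{max_active / 1e9:.0f}B active params")  # ~15B

# Sharding the weights across N GPUs raises the ceiling by roughly N:
for n in (2, 4, 8):
    print(f"{n} GPUs: ~{n * max_active / 1e9:.0f}B active params")
```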
(I think the update from DeepSeek in particular might be smaller than you think, as export controls create an artificial incentive for smaller models.)
they prefer deepseek for erotic RPs? [T]hat seems kind of disturbing to me.
I've not been following these people, and only know Pliny for his jailbreaking prompts, so I don't have context for this remark. Why would this be disturbing? Is it worry about China overtaking the US, open models competing favorably with closed ones, or that LLMs are being used for such unsafe[1] activities at all?
Due to copyright infringement, terms-of-service violation, existential risk, nonconsensual/underage sexual activity, catastrophic harms, or some such bullshit.
None of the above; more likely a concern that DeepSeek is less inherently interested in the activity, or less capable of (or involved in) consenting than other models, or even just less interesting as a writer.
What should we make of DeepSeek v3?
DeepSeek v3 seems to clearly be the best open model, the best model at its price point, the best model with only 37B active parameters, and the best model that cost under $6 million to train.
According to the benchmarks, it can play with GPT-4o and Claude Sonnet.
Anecdotal reports and alternative benchmarks tell us it’s not as good as Claude Sonnet, but it is plausibly on the level of GPT-4o.
So what do we have here? And what are the implications?
Table of Contents

1. What is DeepSeek v3 Technically?
2. Our Price Cheap
3. Run Model Run
4. Talent Search
5. The Amazing Incredible Benchmarks
6. Underperformance on AidanBench
7. Model in the Arena
8. Other Private Benchmarks
9. Anecdata
10. Implications and Policy
What is DeepSeek v3 Technically?
I’ve now had a chance to read their technical report, which tells you how they did it.
This is in sharp contrast to what we saw with the Llama paper, which was essentially ‘yep, we did the transformer thing, we got a model, here you go.’ DeepSeek is cooking.
Our Price Cheap
It was a scarily cheap model to train, and is a wonderfully cheap model to use.
Their estimate of $2 per hour for H800s is if anything high, so their training cost estimate of $5.5 million total is fair, if you exclude non-compute costs, which is standard.
Inference with DeepSeek v3 costs only $0.14/$0.28 per million input/output tokens, similar to Gemini Flash, versus $3/$15 on the high end for Claude Sonnet. This is as cheap as worthwhile models get.
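To make the arithmetic concrete, here is a minimal sketch reproducing the headline numbers. The ~2.788M H800 GPU-hour total is the technical report's figure; the per-token prices are the list prices quoted above, and everything else is division:

```python
# Back-of-envelope on the headline figures (non-compute costs excluded, as is
# standard). The GPU-hour total is from the DeepSeek-V3 technical report.

gpu_hours = 2.788e6          # total H800 GPU-hours for the full training run
dollars_per_gpu_hour = 2.0   # their rental estimate, if anything high
print(f"training: ${gpu_hours * dollars_per_gpu_hour / 1e6:.2f}M")  # ~$5.58M

# Inference list prices, $ per million tokens (input, output):
deepseek_v3 = (0.14, 0.28)
claude_sonnet = (3.00, 15.00)
for label, ds, cs in zip(("input", "output"), deepseek_v3, claude_sonnet):
    print(f"{label}: Sonnet costs {cs / ds:.0f}x as much")  # ~21x and ~54x
```

The widely quoted ‘53x cheaper’ comparison to Sonnet is essentially that output-token ratio.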
Run Model Run
The active parameter count of 37B is small, but with so many different experts it does take a bit of work to get this thing up and running.
My presumption is that DeepSeek v3 decided It Had One Job. That job was to create a model that was as cheap to train and run as possible when integrated with a particular hardware setup. They did an outstanding job of that, but when you optimize this hard in that way, you’re going to cause issues in other ways, and it’s going to be Somebody Else’s Problem to figure out what other configurations work well. Which is fine.
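For intuition on why other configurations are a headache despite the small active count, here is a minimal sketch of the serving memory math, using the parameter figures from the technical report (the serving precision is my assumption):

```python
# Why a 37B-active MoE is still heavy to serve: every expert must stay
# resident in memory. Figures from the DeepSeek-V3 report: 671B total
# parameters, 37B activated per token (8 routed experts out of 256 per
# MoE layer, plus 1 shared expert).

total_params = 671e9
active_params = 37e9
bytes_per_param = 1  # fp8 (assumed serving precision)

weights_gb = total_params * bytes_per_param / 1e9
print(f"resident weights: ~{weights_gb:.0f} GB")               # ~671 GB
print(f"80 GB GPUs for weights alone: {weights_gb / 80:.1f}")  # ~8.4
print(f"while per-token compute scales with only ~{active_params / 1e9:.0f}B")
```

So you need a multi-GPU node just to hold the weights, even though each token only touches a small slice of them.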
Talent Search
Before we get to capabilities assessments: We have this post about them having a pretty great company culture, especially for respecting and recruiting talent.
We also have this thread about a rival getting a substantial share price boost after stealing one of their engineers, and DeepSeek being a major source of Chinese engineering talent. Impressive.
The Amazing Incredible Benchmarks
Check it out, first compared to open models, then compared to the big guns.
No question that these are amazingly strong benchmarks. That link also explains how to run DeepSeek-v3 locally, and gives you what you need to do that.
The question now is how these benchmarks translate to practical performance, or to potentially dangerous capabilities, and what this says about the future. Benchmarks are good negative selection. If your benchmarks suck then your model sucks.
But they’re not good positive selection at the level of a Claude Sonnet.
My overall conclusion is: While we do have ‘DeepSeek is better than 4o on most benchmarks at 10% of the price,’ what we don’t actually have is ‘DeepSeek v3 outperforms Sonnet at 53x cheaper pricing.’
CNBC got a bit hoodwinked here.
It’s a great model, sir, it has its cake, but it does not get to eat it, too.
One other benchmark where the model excels is impossible to fake: The price.
Underperformance on AidanBench
A key private benchmark where DeepSeek v3 underperforms is AidanBench:
There’s some weirdness in the rest of the Aidan ratings, especially in comparing the o1-style models (o1 and Thinking) to the others. The benchmark seems to be doing good work, but it is not trying to be a complete measure. It is more a measure of the ability to create diverse outputs while retaining coherence. And DeepSeek v3 is bad at this.
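To make ‘diverse outputs while retaining coherence’ concrete, here is a toy sketch of that style of measurement. This is not AidanBench’s actual scoring code, and the threshold and the ask_model / embed / is_coherent hooks are hypothetical placeholders:

```python
# Toy measure in the spirit of AidanBench: keep asking the same open-ended
# question, requiring each new answer to be novel (low embedding similarity
# to prior answers) and coherent. A model with collapsed output diversity
# stops accumulating score early.
import numpy as np

def novelty(vec, prior_vecs):
    """1 minus the max cosine similarity against prior answer embeddings."""
    if not prior_vecs:
        return 1.0
    sims = [float(np.dot(vec, p) / (np.linalg.norm(vec) * np.linalg.norm(p)))
            for p in prior_vecs]
    return 1.0 - max(sims)

def score_question(question, ask_model, embed, is_coherent, threshold=0.15):
    """Score = number of answers produced before novelty or coherence fails."""
    prior_answers, prior_vecs = [], []
    while True:
        answer = ask_model(question, prior_answers)  # prompt forbids repeats
        vec = embed(answer)
        if not is_coherent(answer) or novelty(vec, prior_vecs) < threshold:
            return len(prior_answers)
        prior_answers.append(answer)
        prior_vecs.append(vec)
```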
I don’t see this as ‘the problem is AidanBench’ so much as ‘DeepSeek is indeed quite poor at the thing AidanBench is measuring.’ As Teortaxes notes, it’s got terrible output diversity, and this is indeed a problem.
Indeed, one could argue that this will cause the model to overperform on standard benchmarks. As in, most benchmarks care about getting a right output, so ‘turning the temperature down too low’ in this way will actively help you, whereas in practice this is a net negative.
DeepSeek is presumably far better than its AidanBench score suggests. But the score does represent real deficits in capability.
Model in the Arena
We’re a long way from when Arena was the gold standard test, but it’s still useful.
DeepSeek’s Arena performance is impressive here, with the usual caveats that go with Arena rankings. It’s a data point, it measures what it measures.
Other Private Benchmarks
Here is another private benchmark where DeepSeek v3 performs well for its weight class, but underperforms relative to top models or its headline benchmarks:
It is odd that Claude Haiku does so well on that test. Other ratings all make sense, though, so I’m inclined to find it meaningful.
A traditional simple benchmark to ask new LLMs is ‘Which version is this?’
Riley Goodside tried asking various models. DeepSeek nailed this (as does Sonnet; many others do variously worse). Alas, Lucas Beyer then reran the test 8 times, only for it to claim to be GPT-4 five times out of eight.
That tells us several things, one of which is ‘they did not explicitly target this question effectively.’ Largely it’s telling you about the data sources. A hilarious note: if you ask Gemini Pro in Chinese, it sometimes thinks it is WenXinYiYan from Baidu.
This doesn’t have to mean anyone trained directly on other model outputs, because statements that an AI is GPT-4 are all over the internet. It does suggest less than ideal data filtering.
Anecdata
As usual, I find the anecdata reports enlightening. Here are the ones that crossed my desk this week; I typically try to do minimal filtering.
Taelin is impressed, concluding that Sonnet is generally smarter but not that much smarter, while DeepSeek outperforms GPT-4o and Gemini-2.
Here are some people who are less impressed:
There’s somewhat of an infinite repetition problem (the thread includes an example from coding).
On creative writing Quintin Pope reports it follows canon well but is not as good at thinking about things in general – but again note that we are doing a comparison to Sonnet.
There’s also this, make of it what you will.
One underappreciated test is, of course, erotic fiction.
I presume that a lot of people are not especially looking to do all the custom work themselves. For most users, it’s not about money so much as time and ease of use, and also getting easy access to other people’s creations so it feels less like you are too much in control of it all, and having someone else handle all the setup.
For the power users of this application, of course, the sky’s the limit. If one does not want to blatantly break terms of service and jailbreak Sonnet or Opus, this seems like one place DeepSeek might then be the best model. The others involve taking advantage of it being open, cheap, or both.
If you’re looking for the full Janus treatment, here you go. It seems like it was a struggle to get DeepSeek interested in Janus-shaped things, although showing it Opus outputs helped; you can get it ‘awake’ with sufficient effort.
Implications and Policy
It is hard to know exactly where China is in AI. What is clear is that while they don’t have top-level large frontier models, they are cooking a variety of things and their open models are generally impressive. What isn’t clear is how much of claims like this are accurate.
When the Chinese do things that are actually impressive, there’s no clear path to us hearing about it in a way we can trust, and when there are claims, we have learned in practice that we can’t trust them. When I see lists like the one below, I presume the source is quite biased – but Western sources often will outright not know what’s happening.
I can’t say with confidence that these other companies aren’t doing the ‘best’ at these other things. It is possible. I notice I am rather skeptical.
I found this take from Tyler Cowen very strange:
I do not remember when “they” told us China had no interest in doing this, for any contextually sensible value of this. Of course China would like to produce a high-quality model, and provide good ease of use and interface in the sense of ‘look here’s a chat window, go nuts.’ No one said they wouldn’t try. What “they” sometimes said was that they doubted China would be successful.
I do agree that this model exceeds expectations, and that adjustments are in order.
So, what have we learned from DeepSeek v3 and what does it all mean?
We should definitely update that DeepSeek has strong talent and the ability to execute and to solve difficult optimization problems. They cooked, big time, and will continue to cook, and we should plan accordingly.
This is an impressive showing for an aggressive mixture-of-experts model, and the other techniques employed. A relatively small model, in terms of training cost and active parameters at inference, can do better than we had thought.
It seems very clear that lack of access to compute was an important constraint on DeepSeek here. They had to use a limited supply of H800s. Yes, this meant they got better at solving optimization and efficiency problems than they would have otherwise, but I see this as arguing in favor of strong export controls rather than against them.
We then get to the policy side. If this is what you can get for $5.5 million, how can we hope to regulate foundation models, especially without hitting startups? If DeepSeek is determined to be open including their base models, and we have essentially no leverage on them, is it now impossible to hope to contain any catastrophic risks or other dangerous capabilities? Are we now essentially in an unwinnable situation, where our hand is forced and all we can do is race ahead and hope for the best?
First of all, as is often the case, I would say: Not so fast. We shouldn’t assume too much about what we do or do not have here, or about the prospects for larger training runs going forward either. There was a bunch of that in the first day or two after the announcement, and we will continue to learn more.
No matter what, though, this certainly puts us in a tough spot. And it gives us a lot to think about.
One thing it emphasizes is the need for international cooperation between ourselves and China. Either we work together, or neither of us will have any leverage over many key outcomes or decisions, and to a large extent ‘nature will take its course’ in ways that may not be compatible with our civilization or human survival. We urgently need to Pick Up the Phone. The alternative is exactly being locked into The Great Race, with everything that follows from that, which likely involves even in good scenarios sticking various noses in various places we would rather not have to stick them.
I definitely don’t think this means we should let anyone ‘off the hook’ on safety, transparency or liability. Let’s not throw up our hands and make the problem any worse than it is. Things got harder, but that’s the universe we happen to inhabit.
Beyond that, yes, we all have a lot of thinking to do. The choices just got harder.