DeepSeek beats o1-preview on math, ties on coding; will release weights

Zach Stein-Perlman

51 DeepSeek beats o1-preview on math, ties on coding; will release weights

by Zach Stein-Perlman

20th Nov 2024

1 min read

51

DeepSeek-R1-Lite-Preview was announced today. It's available via chatbot. (Post; translation of Chinese post.)

DeepSeek says it will release the weights and publish a report.

The model appears to be stronger than o1-preview on math, similar on coding, and weaker on other tasks.

DeepSeek is Chinese. I'm not really familiar with the company. I thought Chinese companies were at least a year behind the frontier; now I don't know what to think and hope people do more evals and play with this model. Chinese companies tend to game benchmarks more than the frontier Western companies, but I think DeepSeek hasn't gamed benchmarks much historically.

The post also shows inference-time scaling, like o1:

Note that o1 is substantially stronger than o1-preview; see the o1 post:

(Parts of this post and some of my comments are stolen from various people-who-are-not-me.)

51

DeepSeek beats o1-preview on math, ties on coding; will release weights

New Comment

6 comments, sorted by

top scoring

Click to highlight new comments since: Today at 7:39 AM

[-]David Matolcsi8h226

I experimented a bunch with DeepSeek today, it seems to be exactly on the same level in highs school competition math as o1-preview in my experiments. So I don't think it's benchmark-gaming, at least in math. On the other hand, it's noticeably worse than even the original GPT-4 at understanding a short story I also always test models on.

I think it's also very noteworthy that DeepSeek gives everyone 50 free messages a day (!) with their CoT model, while OpenAI only gives 30 o1-preview messages a week to subscribers. I assume they figured out how to run it much cheaper, but I'm confused in general.

A positive part of the news is that unlike o1, they show their actual chain of thought, and they promise to make their model open-source soon. I think this is really great for the science of studying faithful chain of thought.

From the experiments I have run, it looks like it is doing clear, interpretable English chain of thought (though with an occasional Chinese character once in a while), and I think it didn't really yet start evolving into optimized alien gibberish. I think this part of the news is a positive update.

[-]Daniel Kokotajlo7h2410

Yeah, I really hope they do actually open-weights it because the science of faithful CoT would benefit greatly.

[-]nikola7h82

One weird detail I noticed is that in DeepSeek's results, they claim GPT-4o's pass@1 accuracy on MATH is 76.6%, but OpenAI claims it's 60.3% in their o1 blog post. This is quite confusing as it's a large difference that seems hard to explain with different training checkpoints of 4o.

[-]Lech Mazur2h20

It seems that 76.6% originally came from the GPT-4o announcement blog post. I'm not sure why it dropped to 60.3% by the time of o1's blog post.

[-]Zach Stein-Perlman5h40

Claim by SemiAnalysis guy: DeepSeek has over 50,000 H100s. (And presumably they don't spend much compute on inference.)

[-]Zach Stein-Perlman5h20

Some context: June FT article on DeepSeek; parent hedge fund Wikipedia page.

Moderation Log