MiloSal's Shortform

MiloSal

MiloSal's Shortform

1st Feb 2025

1 min read

2

This is a special post for quick takes by MiloSal. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

5 comments, sorted by

top scoring

Click to highlight new comments since: Today at 6:41 PM

[-]MiloSal1mo30

What is o3's base model?

To create DeepSeek-R1, they:

Start with DeepSeek-V3-Base as a base model
Fine-tune base model on synthetic long CoT problem solving examples
Run RL to convergence on challenging verifiable math/coding/etc. problems, with reward for (a) formatting and (b) correctness

Therefore, I roughly expect o1's training process was:

Start with 4o as a base model
Some sort of SFT on problem solving examples
Run RL on verifiable problems with some similar reward setup.

An important question for the near-term scaling picture is whether o3 uses 4o as its base model. This question arises because we need some way to explain the capability gains from o1 to o3. A convenient explanation is that o3 was trained using approximately the same process as above, except the base model is something like GPT-4.5 or GPT-5.

However, some recent evidence has come to light against this view. As a friend points out, o3-mini has the same knowledge cutoff date as 4o and o1 (late 2023). This seems like strong evidence that o3 uses 4o as the base model. Additionally, I would expect o3 to be more performant than it currently is if it used GPT-5 as a base model.

My current best guess is that o3 actually comes from a process like this:

Start with 4o+ as a base model (that is, 4o fine-tuned with some o1 distillation)
Some sort of SFT on problem solving examples, as before
A somewhat improved RL setup, again on verifiable problems. I am imagining a setup that also takes slightly better advantage of compute/bitter lesson. This is because o1 feels like it was a bit of an experiment, while o3 probably got "full-scale" compute resources.

In other words, I suspect o3's base model is 4o+ (that is, 4o fine-tuned with some o1 distillation). If this view is correct, it has startling consequences for near-time scaling. Once the reasoning paradigm is plugged into GPT-5, we'll have big problems.

[-]Vladimir_Nesov1mo40

DeepSeek-R1 ... Run RL to convergence

Not to convergence, the graphs in the paper keep going up. Which across the analogy might explain some of the change from o1 to o3 (the graphs in the o1 post also keep going up), though new graders coded for additional verifiable problems are no doubt a large part of it as well.

o3-mini has the same knowledge cutoff date as 4o and o1 (late 2023)

It seems like o1-mini is its own thing, might even start with a base model that's unrelated to GPT-4o-mini (it might be using its own specialized pretraining data mix). So a clue about o3-mini data doesn't obviously transfer to o3.

if it used GPT-5 as a base model

The numbering in GPT-N series advances with roughly 100x in raw compute at a time. If original GPT-4 is 2e25 FLOPs, then a GPT-5 would need 2e27 FLOPs, and a 100K H100s training system (like the Microsoft/OpenAI system at the site near the Goodyear airport) can only get you 3e26 FLOPs or so (in BF16 in 3 months). The initial Stargate training system at Abilene site, after it gets 300K B200s, will be 7x stronger than that, so will be able to get 2e27 FLOPs. Thus I expect GPT-5 in 2026 if OpenAI keeps following the naming convention, while the new 100K H100s model this year will be GPT-4.5o or something like that.

[-]MiloSal1mo30

Thanks for your comments!

Not to convergence, the graphs in the paper keep going up.

On page 10, when describing the training process for R1, they write: "We then apply RL training on the fine-tuned model until it achieves convergence on reasoning tasks." I refer to this.

I basically agree with your analysis of GPT-5--which is worrying for short-term scaling, as I tried to argue.

[-]Vladimir_Nesov1mo40

they write: "We then apply RL training on the fine-tuned model until it achieves convergence on reasoning tasks."

Ah, I failed to take a note of that when reading the paper. My takeaway was the opposite. In Figure 2 for R1-Zero, the first impression is convergence, both from saturation of the benchmark, and in the graph apparently leveling off. But if replotted in log-steps instead of linear steps, there isn't even any leveling off for pass@1, despite near-saturation of the benchmark for cons@16: accuracy for pass@1 is 0.45 after 2K steps, 0.55 (+0.10) after 4K steps, then 0.67 (+0.12) after 8K steps, it just keeps going up by +0.10 every doubling in training steps. And the plots-that-don't-level-off in the o1 post are in log-steps. Also, the average number of reasoning steps for R1-Zero in Figure 3 is a straight line that's probably good for something if it goes further up. So I guess I might disagree with the authors even, in characterizing step 10K as "at convergence", though your quote is about R1 rather than R1-Zero for which there are plots in the paper...

your analysis of GPT-5--which is worrying for short-term scaling

Well, I mostly argued about naming, not facts, though the recent news seem to be suggesting that the facts are a bit better than I expected only a month ago, namely 1 GW training systems might only get built in 2026 rather than in 2025, except possibly at Google. And as a result even Google might feel less pressure to actually get this done in 2025.

[-]MiloSal1mo10

Another possibility is that only o3-mini has this knowledge cutoff and the full o3 has a later knowledge cutoff. This could happen if o3-mini is distilled into an older model (e.g., 4o-mini). If the full o3 turns out to have a knowledge cutoff later than 2023, I'd take that as convincing evidence 4o is not the base model.

Moderation Log