All of MiloSal's Comments + Replies

MiloSal30

Thanks for your comments!

> Not to convergence, the graphs in the paper keep going up.

On page 10, when describing the training process for R1, they write: "We then apply RL training on the fine-tuned model until it achieves convergence on reasoning tasks." This is the passage I was referring to.

I basically agree with your analysis of GPT-5, which is worrying for short-term scaling, as I tried to argue.

4Vladimir_Nesov
Ah, I failed to take note of that when reading the paper. My takeaway was the opposite. In Figure 2 for R1-Zero, the first impression is convergence, both from saturation of the benchmark and from the graph apparently leveling off. But if you replot in log-steps instead of linear steps, there isn't even any leveling off for pass@1, despite near-saturation of the benchmark for cons@16: pass@1 accuracy is 0.45 after 2K steps, 0.55 (+0.10) after 4K steps, then 0.67 (+0.12) after 8K steps; it just keeps going up by roughly +0.10 with every doubling of training steps. And the plots-that-don't-level-off in the o1 post are in log-steps. Also, the average number of reasoning steps for R1-Zero in Figure 3 is a straight line that's probably good for something if it keeps going up.

So I might even disagree with the authors in characterizing step 10K as "at convergence", though your quote is about R1 rather than R1-Zero, which is the model the paper has plots for...

Well, I mostly argued about naming, not facts, though the recent news seems to suggest that the facts are a bit better than I expected only a month ago: 1 GW training systems might only get built in 2026 rather than in 2025, except possibly at Google. And as a result even Google might feel less pressure to actually get this done in 2025.
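To make the replotting point concrete, here is a minimal sketch using just the three pass@1 numbers quoted above; the linear-in-log2(steps) trend and the naive 16K-step extrapolation are illustrative assumptions, not data from the paper.

```python
# Illustrative only: pass@1 points quoted above (R1-Zero, Figure 2).
import math

points = [(2000, 0.45), (4000, 0.55), (8000, 0.67)]  # (training steps, pass@1)

# Gain per doubling of training steps: roughly constant (+0.10 to +0.12),
# i.e. accuracy looks linear in log2(steps) rather than leveling off.
for (s0, a0), (s1, a1) in zip(points, points[1:]):
    doublings = math.log2(s1 / s0)
    print(f"{s0} -> {s1} steps: +{a1 - a0:.2f} over {doublings:.0f} doubling")

# Naive linear-in-log2(steps) extrapolation (an assumption, not a prediction from the paper).
avg_gain = (points[-1][1] - points[0][1]) / math.log2(points[-1][0] / points[0][0])
print(f"extrapolated pass@1 at 16K steps: ~{points[-1][1] + avg_gain:.2f}")
```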
MiloSal10

Another possibility is that only o3-mini has this knowledge cutoff and the full o3 has a later knowledge cutoff. This could happen if o3-mini is distilled into an older model (e.g., 4o-mini). If the full o3 turns out to have a knowledge cutoff later than 2023, I'd take that as convincing evidence 4o is not the base model. 

MiloSal30

What is o3's base model?

To create DeepSeek-R1, they:

  1. Start with DeepSeek-V3-Base as a base model
  2. Fine-tune the base model on synthetic long-CoT problem-solving examples
  3. Run RL to convergence on challenging verifiable math/coding/etc. problems, with reward for (a) formatting and (b) correctness (a rough sketch of such a reward follows below)
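A minimal sketch of what a rule-based reward of this kind could look like; the <think>/<answer> tag format, the exact-match grading, and the equal weighting are illustrative assumptions, not DeepSeek's actual implementation.

```python
import re

def reasoning_reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: a format term plus a correctness term.

    Assumes a <think>...</think><answer>...</answer> output format and a
    verifiable ground-truth answer; both are illustrative assumptions.
    """
    # (a) Format reward: did the model wrap its reasoning and answer in the expected tags?
    match = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>", completion, re.DOTALL)
    format_reward = 1.0 if match else 0.0

    # (b) Correctness reward: exact-match check against the verified answer
    # (a real verifier would normalize math expressions or run unit tests for code).
    answer = match.group(1).strip() if match else ""
    correctness_reward = 1.0 if answer == reference_answer.strip() else 0.0

    return format_reward + correctness_reward

# Example:
print(reasoning_reward("<think>2+2=4</think><answer>4</answer>", "4"))  # 2.0
```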

Therefore, I roughly expect o1's training process was:

  1. Start with 4o as a base model
  2. Some sort of SFT on problem-solving examples
  3. Run RL on verifiable problems with some similar reward setup.

An important question for the near-term scaling picture is whether o3 uses 4o as ... (read more)

4Vladimir_Nesov
Not to convergence, the graphs in the paper keep going up. Which, across the analogy, might explain some of the change from o1 to o3 (the graphs in the o1 post also keep going up), though new graders coded for additional verifiable problems are no doubt a large part of it as well.

It seems like o1-mini is its own thing, and might even start with a base model that's unrelated to GPT-4o-mini (it might be using its own specialized pretraining data mix). So a clue about o3-mini data doesn't obviously transfer to o3.

The numbering in the GPT-N series advances with roughly 100x in raw compute at a time. If the original GPT-4 is 2e25 FLOPs, then a GPT-5 would need 2e27 FLOPs, and a 100K H100s training system (like the Microsoft/OpenAI system at the site near the Goodyear airport) can only get you 3e26 FLOPs or so (in BF16, in 3 months). The initial Stargate training system at the Abilene site, after it gets 300K B200s, will be 7x stronger than that, so it will be able to get 2e27 FLOPs. Thus I expect GPT-5 in 2026 if OpenAI keeps following the naming convention, while the new 100K H100s model this year will be GPT-4.5o or something like that.
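A back-of-the-envelope sketch of these compute numbers; the per-GPU BF16 throughput (~1e15 FLOP/s for an H100) and the 40% utilization are illustrative assumptions, while the 3-month window, the cluster sizes, and the 7x factor come from the comment above.

```python
# Rough compute estimates; per-GPU throughput and utilization are illustrative assumptions.
SECONDS_3_MONTHS = 90 * 24 * 3600   # ~7.8e6 s training window (from the comment)
UTILIZATION = 0.4                   # assumed model FLOPs utilization

def training_flops(num_gpus: int, peak_flops_per_gpu: float) -> float:
    return num_gpus * peak_flops_per_gpu * UTILIZATION * SECONDS_3_MONTHS

# 100K H100s at ~1e15 BF16 FLOP/s each (dense, assumed)
print(f"100K H100s: ~{training_flops(100_000, 1e15):.1e} FLOPs")      # ~3e26

# 300K B200s, taken here as ~7x the H100 cluster, per the comment
print(f"300K B200s: ~{7 * training_flops(100_000, 1e15):.1e} FLOPs")  # ~2e27
```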
MiloSal-30

This is really cool research! I look forward to seeing what you do in future. I think you should consider running human baselines, if that becomes possible in the future. Those help me reason about and communicate timelines and takeoff a lot.

1Håvard Tveit Ihle
Thank you! It would be really great to have human baselines, but it’s very hard to do in practice: for a human, one of these tasks would take several hours. I don’t really have any funding for this project, but I might find someone who wants to do one task for fun, or make my own best attempt at a fresh task when I create one. What we would really want is to have several top researchers/ML engineers do it, and I know that METR is working on that, so that is probably the best source we have for a realistic comparison at the moment.
MiloSal63

Great post! Glad to see more discussion on LW of the implications of short timelines for prioritizing impactful work.


> These last two categories—influencing policy discussions and introducing research agendas—rely on social diffusion of ideas, and this takes time. With shorter timelines in mind, this only makes sense if your work can actually shape what other researchers do before AI capabilities advance significantly.

Arguably this is not just true of those two avenues for impactful work, but rather of all avenues. If your work doesn't cause someone in a ... (read more)

MiloSal10

I'm fairly confident that this would be better than the current situation, primarily because of something that others haven't touched on here.

The reason is that, regardless of who develops them, the first (militarily and economically) transformative AIs will cause extreme geopolitical tension and instability that is challenging to resolve safely. Resolving such a situation safely requires a well-planned off-ramp, which must route through extremely major national- or international-level decisions. Only governments are equipped to make decisions like the... (read more)

MiloSal63

Akash, your comment raises the good point that a short-timelines plan that doesn't recognize governments as a really important lever here is missing a lot of opportunities for safety. Another piece of the puzzle that comes out when you consider what governance measures we'd want to include in a short-timelines plan is the "off-ramps problem" that's sort of touched on in this post.

Basically, our short timelines plan needs to also include measures (mostly governance/policy, though also technical) that get us to a desirable off-ramp from geopolitical t... (read more)

MiloSal21

I think it is much less clear than you portray that pluralism is good. I would not, for example, want other weapons of mass destruction to be pluralized.

1rosehadshar
I also don't want that! I think something more like:

* Pluralism is good for reducing power concentration, and maybe for AI safety (as you get more shots on goal)
* There are probably some technologies that you really don't want widely shared, though
* The question is whether it's possible to restrict these technologies via regulation and infosecurity, without restricting the number of projects or access to other safe technologies

Note also that it's not clear what the offence-defence balance will be like. Maybe we will be lucky, and defence-dominant tech will get developed first. Maybe we will get unlucky, and need to restrict offence-dominant tech (either until we develop defensive tech, or permanently). We need to be prepared for both eventualities, but it's not yet clear how big a problem this will end up being.