Will Brown: it's simple, really. GPT-4.1 is o3 without reasoning ... o1 is 4o with reasoning ... and o4 is GPT-4.5 with reasoning.
Price and knowledge cutoff for o3 strongly suggest it's indeed GPT-4.1 with reasoning. And so again we don't get to see the touted scaling of reasoning models, since the base model got upgraded instead of remaining unchanged. (I'm getting the impression that GPT-4.5 with reasoning is going to be called "GPT-5" rather than "o4", similarly to how Gemini 2.5 Pro is plausibly Gemini 2.0 Pro with reasoning.)
In any case, the fact that o3 is not GPT-4.5 with reasoning means that there is still no word on what GPT-4.5 with reasoning is capable of. For Anthropic, Sonnet 3.7 with reasoning is analogous to o1 (it's built on the base model of the older Sonnet 3.5, similarly to how o1 is built on the base model of GPT-4o). Internally, they probably already have a reasoning model for some larger Opus model (analogous to GPT-4.5) and for a newer Sonnet (analogous to GPT-4.1) with a newer base model different from that of Sonnet 3.5.
This also makes it less plausible that Gemini 2.5 Pro is based on a GPT-4.5 scale model (even though TPUs might have made its price/speed feasible even if it were), so there might be a Gemini 2.0 Ultra internally after all, at least as a base model. One of the new algorithmic secrets disclosed in the Gemma 3 report was that pretraining knowledge distillation works even when the teacher model is much larger (rather than modestly larger) than the student model; the student just needs to be trained on enough tokens for this to become an advantage rather than a disadvantage (Figure 8), something that, for example, Llama 3.2 from Sep 2024 still wasn't taking advantage of. This makes it useful to train the largest possible compute-optimal base model regardless of whether its better quality justifies its inference cost, merely to make the smaller overtrained base models better by pretraining them on the large model's logits via knowledge distillation instead of on raw tokens.
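For intuition, here is a minimal sketch of what that distillation objective looks like (my own PyTorch illustration, not anything from the Gemma 3 report): the student is trained to match the teacher's full next-token distribution instead of the one-hot raw-token targets.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """KL divergence from the teacher's softened next-token distribution to the student's.

    Both logits tensors have shape (batch, seq_len, vocab_size); the teacher is the
    large compute-optimal model, the student the smaller overtrained one.
    """
    vocab = student_logits.size(-1)
    # Flatten so every token position counts equally in the average.
    s = student_logits.reshape(-1, vocab) / temperature
    t = teacher_logits.reshape(-1, vocab) / temperature
    return F.kl_div(
        F.log_softmax(s, dim=-1),   # student log-probabilities
        F.softmax(t, dim=-1),       # teacher probabilities (soft targets)
        reduction="batchmean",
    ) * temperature ** 2

# During pretraining the student minimizes this instead of (or blended with)
# the usual cross-entropy on raw next tokens.
```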
Does this match your understanding?
| AI Company | Public/Preview Name | Hypothesized Base Model | Hypothesized Enhancement | Notes |
|---|---|---|---|---|
| OpenAI | GPT-4o | GPT-4o | None (Baseline) | The starting point, multimodal model. |
| OpenAI | o1 | GPT-4o | Reasoning | First reasoning model iteration, built on the GPT-4o base. Analogous to Anthropic's Sonnet 3.7 w/ Reasoning. |
| OpenAI | GPT-4.1 | GPT-4.1 | None | An incremental upgrade to the base model beyond GPT-4o. |
| OpenAI | o3 | GPT-4.1 | Reasoning | Price/cutoff suggest it uses the newer GPT-4.1 base, not GPT-4o + reasoning. |
| OpenAI | GPT-4.5 | GPT-4.5 | None | A major base model upgrade. |
| OpenAI | GPT-5 | GPT-4.5 | Reasoning | "GPT-5" might be named this way, but technologically be GPT-4.5 + Reasoning. |
| Anthropic | Sonnet 3.5 | Sonnet 3.5 | None | Existing model. |
| Anthropic | Sonnet 3.7 w/ Reasoning | Sonnet 3.5 | Reasoning | Built on the older Sonnet 3.5 base, similar to how o1 was built on GPT-4o. |
| Anthropic | N/A (Internal) | Newer Sonnet | None | Internal base model analogous to OpenAI's GPT-4.1. |
| Anthropic | N/A (Internal) | Newer Sonnet | Reasoning | Internal reasoning model analogous to OpenAI's "o3". |
| Anthropic | N/A (Internal) | Larger Opus | None | Internal base model analogous to OpenAI's GPT-4.5. |
| Anthropic | N/A (Internal) | Larger Opus | Reasoning | Internal reasoning model analogous to hypothetical GPT-4.5 + Reasoning. |
| Google | N/A (Internal) | Gemini 2.0 Pro | None | Plausible base model for Gemini 2.5 Pro according to the author. |
| Google | Gemini 2.5 Pro | Gemini 2.0 Pro | Reasoning | Author speculates it's likely Gemini 2.0 Pro + Reasoning, rather than being based on a GPT-4.5 scale model. |
| Google | N/A (Internal) | Gemini 2.0 Ultra | None | Hypothesized very large internal base model. Might exist primarily for knowledge distillation (Gemma 3 insight). |
Where does o4-mini fall in this table? 4.1 + reasoning + enhancements, or 4.5 + reasoning - size?
I gotta say, I have no idea why people are putting Claude 3.7 in the same league as recent GPT models or Gemini 2.5. My experience is that Claude 3.7 deeply struggles with a range of tasks. I've been trying to use it for grant writing -- shortening text, defining terms in my field, suggesting alternative ways to word things. It gets definitions wrong, offers nonsensical alternative wordings, and gets stuck repeating the same "shortened," nuance-stripped text over and over despite me asking it to try another way.
By contrast, I threw an entire draft of my grant proposal into Gemini 2.5 and got a substantially shorter and more clear new version out, first try.
Interesting, my experience is roughly the opposite re Claude-3.7 vs the GPTs (no comment on Gemini, I've used it much less so far). Claude is my main workhorse; good at writing, good at coding, good at helping think things through. Anecdote: I had an interesting mini-research case yesterday ('What has Trump II done that liberals are likely to be happiest about?') where Claude did well albeit with some repetition and both o3 and o4-mini flopped. o3 was initially very skeptical that there was a second Trump term at all.
Hard to say if that's different prompting, different preferences, or even chance variation, though.
Gemini seems to do a better job of shortening text while maintaining the nuance I expect grant reviewers to demand. Claude seems to focus entirely on shortening text. For context, I'm feeding a specific aims page for my PhD work that I've written about 15 drafts of already, so I have detailed implicit preferences about what is and is not an acceptable result.
Yesterday’s news alert, nevertheless: The verdict is in. GPT-4.1-Mini in particular is an excellent practical model, offering strong performance at a good price. The full GPT-4.1 is an upgrade to OpenAI’s more expensive API offerings; it is modestly better than the mini version but costs 5x as much. Both are worth considering for coding and various other API uses. If you have an agent or other app, it’s at least worth plugging these in and seeing how they do.
This post does not cover OpenAI’s new reasoning models. That was today’s announcement, which will be covered in full in a few days, once we know more.
Introducing GPT-4.1, 4.1-mini and 4.1-nano
That’s right, 4.1.
Here is their livestream, in case you aren’t like me and want to watch it.
On the one hand, I love that they might finally use a real version number with 4.1.
On the other hand, we would now have a GPT-4.1 that is being released after they previously released a GPT-4.5. The whole point of version numbers is to go in order.
The new cheat sheet for when to use GPT-4.1:
I mean, I think that’s wrong, but I’m not confident I have the right version of it.
They are not putting GPT-4.1 in ChatGPT, only in the API. I don’t understand why.
The best news is, Our Price Cheap, combined with the 1M token context window and max output of 32k tokens.
Based on the benchmarks and the reports elsewhere, the real release here is GPT-4.1-mini. Mini is 20% of the cost for most of the value. The full GPT-4.1 looks to be in a weird spot, where you probably want to either go big or go small. Nano might have its uses too, but involves real tradeoffs.
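If you do want to plug them in and compare, the swap is essentially just the model string. A minimal sketch using the OpenAI Python SDK (the prompt and max_tokens value are illustrative, not from OpenAI's docs):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Swap between "gpt-4.1", "gpt-4.1-mini", and "gpt-4.1-nano" to trade
# capability against cost; the rest of the call stays the same.
response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Explain what this regex does: ^\\d{4}-\\d{2}-\\d{2}$"},
    ],
    max_tokens=512,  # illustrative; the announced output cap is 32k tokens
)
print(response.choices[0].message.content)
```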
On Your Marks
We start with the official ones.
They lead with coding, SWE-bench in particular.
I almost admire them saying no, we don’t acknowledge that other labs exist.
They have an internal ‘instruction following’ eval. Here the full GPT-4.1 is only okay, but mini and nano are upgrades within the OpenAI ecosystem. It’s their benchmark, so it’s impossible to know if these scores are good or not.
Next up is MultiChallenge.
This is an outside benchmark, so we can see that these results are mid. Gemini 2.5 Pro leads the way with 51.9, followed by Claude 3.7 Thinking. GPT-4.5 is the best non-thinking model, with various Sonnets close behind.
They check IFEval and get 87%, which is probably okay; o3-mini-high gets 94%. The mini version gets 84%, so the pattern of ‘4.1 does okay but 4.1-mini only does slightly worse’ continues.
All three model sizes have mastered needle-in-a-haystack all the way to 1M tokens. That’s great, but doesn’t tell you if they’re actually good in practice in long context.
Then they check something called Graphwalks, then MMMU, MathVista, CharXiv-Reasoning and Video long context.
Their charts are super helpful, check ‘em out:
Other Benchmarks
Mostly things have been quiet, but from the results we do have it is clear that GPT-4.1 is a very good value, and a clear improvement for most API use over previous OpenAI models.
Where we do have reports, we continue to see the pattern that OpenAI’s official statistics report. Not only does GPT-4.1-mini not sacrifice much performance versus GPT-4.1, in some cases the mini version is actively better.
We see this for EpochAI’s tests, and also for WeirdML.
Huh, I hadn’t previously seen these strong math results for Grok 3.
Artificial Analysis confirms OpenAI’s claims with its ‘intelligence index’ and other measures (their website is here, the quotes are from their thread):
There are obvious reasons to be skeptical of this index, I mean Gemini Flash 2.0 is not as smart as Claude 3.7 Sonnet, but it’s measuring something real. It illustrates that GPT-4.1 is kind of expensive for what you get, whereas GPT-4.1-mini is where it is at.
Reactions
This is the kind of thing people who try to keep up say these days:
Then you have the normal sounding responses, also positive.
I think mostly doing unprompted tables is good.
Here is a bold but biased claim.
And here’s a bold censorship claim and a counterclaim, the only words I’ve heard on the subject. For coding and similar purposes no one seems to be having similar issues.
Justice for GPT-4.5
OpenAI has announced the scheduled deprecation of API access for GPT-4.5. So GPT-4.5 will be ChatGPT only, and GPT-4.1 will be API only.
When I heard it was a full deprecation of GPT-4.5 I was very sad. Now that I know it is staying in ChatGPT, I think this is reasonable. GPT-4.5 is too expensive to serve at API scale while GPUs are melting, except when a rival is trying to distill its outputs. Why help them do that?
Safety Third
This space intentionally left blank.
As in, I could find zero mention of OpenAI discussing any safety concerns whatsoever related to GPT-4.1, in any way, shape or form. It’s simply, hey, here’s a model, use it.
For GPT-4.1 in particular, for all practical purposes, This Is Fine. There’s very little marginal risk in this room given what else has already been released. Everyone doing safety testing is presumably and understandably scrambling to look at o3 and o4-mini.
I assume. But, I don’t know.
Improved speed and cost can cause what are effectively new risks, by tipping actions into the practical or profitable zone. Quantity can have a quality all its own. Also, we don’t know that the safeguards OpenAI applied to its other models have also been applied successfully to GPT-4.1, or that it is hitting their previous standards on this.
I mean, again, I assume. But, I don’t know.
I also hate the precedent this sets. That they did not even see fit to give us a one-sentence update that ‘we have run all our safety tests and procedures, and find GPT-4.1 performs well on all safety metrics and poses no marginal risks.’
We used to have this principle where, when OpenAI or other frontier labs release plausibly frontier models, we get a model card and a full report on what precautions have been taken. Also, we used to have a principle that they took real and actually costly precautions.
Those days seem to be over. Shame. Also, uh oh.