I’m really curious what people’s theories are on why OpenAI released this and not o3.
My old main theory was that they would have to charge so much for o3 that it would create bad PR, but this is now much less likely.
My first remaining guess is that they don’t want competitors extracting full o3 reasoning traces to train on. I guess it's also possible that o3 is just dangerous. On the other side of capabilities, it's technically possible that o3 is benchmark gamed so hard that its outputs are not usable.
My model was just that o3 was undergoing safety evals still, and quite plausibly running into some issues with the preparedness framework. My model of OpenAI Preparedness (epistemic status: anecdata+vibes) is that they are not Prepared for the hard things as we scale to ASI, but they are relatively competent at implementing the preparedness framework and slowing down releases if there are issues. It seems intuitively plausible that it's possible to badly jailbreak o3 into doing dangerous things in the "high" risk category.
AFAIK the only info about the (lack of) release of o3 comes from this tweet - and it does seem like o3 will be available explicitly via the API.
I think it's what you say - pricing causing bad PR. It seems that even for pro users, if you use o1 pro too much they will silently switch you to using o3 mini without telling you even if you select o1 pro.
So I think they want to continue doing this while not looking so bad doing it.
The discourse around this model would benefit a lot from (a greater number of) specific examples where the GPT-4.5 response is markedly and interestingly different from the response of some reference model.
Karpathy's comparisons are a case in point (of the absence I'm referring to). Yes, people are vehemently disputing which responses were better, and whether the other side has "bad taste"... but if you didn't know what the context was, the most obvious property of the pairs would be how similar they are.
And how both options are bad (unfunny standup, unmetrical or childish poetry), and how they are both bad in basically the same way.
Contrast this with the GPT-3 and GPT-4 releases: in those cases people had no trouble finding many, many examples of obviously distinctive behavior from the new model, and these were rapidly and profusely shared in the usual venues.
As Karpathy says, with GPT-4 it was "subtler" than it had been before, at least in some sense. But the difference was not that there weren't any clear examples of better or different behavior – it was just that the cases where the new model behaved very differently tended to be obscure or tricky or otherwise "off the beaten path" somehow, so that if you weren't actively looking for them, the user experience could feel deceptively similar to the one we had with earlier models.
But we were actively looking for those special cases, and we had no trouble finding them.
For instance, looking through my blog archives, I find this thread from shortly after the GPT-4 release, highlighting some puzzle-like questions that GPT-3.5 failed and GPT-4 aced. Summing up the trend, I wrote:
Subjectively, I’ve found that GPT-4 feels much more “attentive” and harder to trick than GPT-3.5.
When I’ve seen it make errors, they usually involve things on the edges of its knowledge – topics that are either academically advanced, or just not very widely known.
[...]
These cases are kind of tricky to discover.
On the one hand, GPT-4 does know a lot of stuff, including obscure stuff – this was the first obvious difference I noticed from GPT-3.5, and I later saw I wasn’t alone in that.
So you have to hunt for things obscure enough that it won’t know them. But if you start asking for really obscure stuff, it will often tell you (whether rightly or wrongly) that it doesn’t know the answer.
There’s still a “wedge” of cases where it will start confidently blabbing about something it doesn’t really understand, but the wedge has gotten much narrower.
Maybe the "wedge" was already so small before GPT-4.5 that it's now simply very difficult to find anything that's still a part of it?
But I dunno, that just doesn't feel like the right explanation to me. For one thing, GPT-4.5 still gets a lot of (semi-)obscure-knowledge stuff wrong. (In one case I asked it about a piece of rationalist community trivia, and in the course of giving an inaccurate answer, it referred to "the Israeli blogger and activist Eliezer Yudkowsky"... like, come on, lmao.)
I'm open to the idea that this is no different from earlier scale-ups, mutatis mutandis – that it really is dramatically better in certain cases, like GPT-3 and 3.5 and 4 were, and those (perhaps obscure) cases simply haven't diffused across the community yet.
But all of this "taste" stuff, all of this stuff where people post bog-standard AI slop and claim it has ineffably better vibes, just feels like an accidental admission of defeat re: the original question. It was never like that with previous scale-ups; we didn't need "taste" then; in the cases that got highlighted, the difference was obvious.
(OTOH, if you look at two models that are differently scaled, but not "enough" – like just a 2x compute difference, say – typically it will be very hard to find unequivocal wins for the bigger model, with the latter winning at most in some vague aggregate vibes sense. One might then argue that this reflects something about the concave shape of the "log-compute vs. noticeable behavior" curve: 10x is the new 2x, and only with even more scale will we get something for which obvious wins are easy to evince.)
I think most of the trouble is conflating recent models like GPT-4o with GPT-4, when they are instead ~GPT-4.25. It's plausible that some already use 4x-5x compute of original GPT-4 (an H100 produces 3x compute of an A100), and that GPT-4.5 uses merely 3x-4x more compute than any of them. The distance between them and GPT-4.5 in raw compute might be quite small.
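For a rough sense of what those (entirely speculative) multipliers imply, here is a back-of-the-envelope sketch; the specific numbers are just midpoints of the guesses above, not known figures:

```python
# Back-of-the-envelope sketch of the speculative compute guesses above;
# original GPT-4 is normalized to 1x. None of these are known figures.
gpt4 = 1.0
gpt4o = 4.5 * gpt4    # "some already use 4x-5x compute of original GPT-4"
gpt45 = 3.5 * gpt4o   # "GPT-4.5 uses merely 3x-4x more compute than any of them"

print(f"GPT-4.5 vs original GPT-4: ~{gpt45 / gpt4:.0f}x")   # ~16x
print(f"GPT-4.5 vs GPT-4o:         ~{gpt45 / gpt4o:.1f}x")  # ~3.5x
```

On that arithmetic GPT-4.5 would sit a full ~10x-plus generation above the original GPT-4 while being only ~3.5x above 4o, which is exactly the regime where improvements look subtle.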
It shouldn't be at all difficult to find examples where GPT-4.5 is better than the actual original GPT-4 of March 2023, it's not going to be subtle. Before ChatGPT there were very few well-known models at each scale, but now the gaps are all filled in by numerous models of intermediate capability. It's the sorites paradox, not yet evidence of slowdown.
Is this actually the case? Not explicitly disagreeing, but I just want to point out that there is still a niche community who prefer using the oldest available 0314 gpt-4 checkpoint via API. That checkpoint is, by the way, still almost the same price as 4.5, hardware improvements notwithstanding, and it is pretty much the only way to still get access to a model that presumably makes use of the full ~1.8 trillion parameters 4th-gen gpt was trained with.
Speaking of conflation, you see it everywhere in papers: somehow most people now entirely conflate gpt-4 with gpt-4 turbo, which replaced the full gpt-4 on chatgpt very quickly, and forget that there were many complaints back then that the faster (shrinking) model iterations were losing the "big model smell", despite climbing the benchmarks.
And so when lots of people seem to describe 4.5's advantages vs 4o as coming down to a "big model smell", I think it is important to remember that 4-turbo and later 4o are clearly optimized for speed, price and benchmarks far more than original-release gpt-4 was, and comparisons on taste/aesthetics/intangibles may be more fitting when using the original, non-goodharted, full-scale gpt-4 model. At the very least, it should fully and properly represent what it looks like to have a clean ~10x less training compute vs 4.5.
Hard disagree, this is evidence of slowdown.
As the model updates grow denser, I also check out; a large jump in capabilities between the original gpt-4 and gpt-4.5 would still be salient to me. This is not salient.
My other comment was bearish, but in the bullish direction, I'm surprised Zvi didn't include any of Gwern's threads, like this or this, which, apropos of Karpathy's blind test, I think have been the clearest examples of superior "taste" or quality from 4.5, and which actually swapped my preferences on 4.5 vs 4o when I looked closer.
As text prediction becomes ever-more superhuman, I would actually expect improvements in many domains to become increasingly non-salient, as it takes ever increasing thoughtfulness / language nuance to appreciate the gains.
But back to bearishness, it is unclear to me how much this mode-collapse improvement could just be dominated by post-training improvements instead of the pretraining scaleup. And of course, one has to wonder how superhuman text prediction improvement will ever pragmatically alleviate the regime's weaknesses in the many known economic and benchmarked domains, especially if Q-Star fails to generalize much at scale, just like multimodality failed to generalize much at scale before it.
We are currently scaling superhuman predictors of textual, visual, and audio datasets. The datasets themselves, primarily composed of the internet plus increasingly synthetically varied copies, are so generalized and varied that this prediction ability, by default, cannot escape including human-like problem solving and other agentic behaviors, as Janus helped model with simulacra some time ago. But as they engorge themselves with increasingly opaque and superhuman heuristics towards that sole goal of predicting the next token, to expect that the intrinsically discovered methods will continue trending towards classically desired agentic and AGI-like behaviors seems naïve. The current convenient lack of a substantial gap between being good at predicting the internet and being good at figuring out a generalized problem will probably dissipate, and Goodhart will rear its nasty head as the ever-optimized-for objective diverges ever-further from the actual AGI goal.
As someone tasked with deciding what AI tools the company I work for should be using, and training people to use them, the version names and numbers have been tons of fun. "Deep Research, not DeepSeek. No the other one. No no, the other other one."
Although, today I did remind myself that (over a much longer timespan) the version names/numbers for Windows major releases have been 3.1, 95, NT, 98 Second Edition, 2000, ME, XP, Vista, 7, 8, 10, and 11. And also almost no founder should ever be allowed to name their company.
A measurable uptick in persuasive ability, combined with middling benchmark scores but a positive eval of "taste" and "aesthetics", should raise some eyebrows. I wonder how we can distinguish good (or the 'correct') output from output that is simply pleasant.
If you believe OpenAI that their top priority is building superintelligence (pushing the edge of the envelope of what is possible with AI), then presumably this model was built under the thesis that it is an important step to making much smarter models.
One possible model of how people do their best thinking is that they learn/focus in on the context they need, goal included, refining the context. Then they manage to synthesize a useful next step.
So doing a good job thinking involves successfully taking a series of useful thinking steps. Since you are bottlenecked on successive leaps of insight, getting the chance of an insight up even a little bit dramatically improves the probability of your success in a chain of thought, where the insight chance is multiplied by itself over and over.
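A minimal numeric sketch of that compounding claim (the per-step probabilities and the 20-step chain length are illustrative assumptions, not numbers from the comment):

```python
# Illustrative only: a modest bump in per-step insight probability compounds
# dramatically over a 20-step chain of thought.
chain_length = 20
for p in (0.80, 0.85, 0.90):
    print(f"per-step {p:.2f} -> whole-chain {p ** chain_length:.3f}")
# per-step 0.80 -> whole-chain 0.012
# per-step 0.85 -> whole-chain 0.039
# per-step 0.90 -> whole-chain 0.122
```

Going from 0.80 to 0.90 per step is only a 12.5% improvement, but it makes the whole chain roughly ten times more likely to land.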
Better humor, less formulaic writing, etc. are forms of insight. I expect GPT-4.5 and 5 to supercharge the progress being made by thinking and runtime compute.
TLDR my reaction is I don’t really know how good these models are right now.
I felt exactly the same after the Claude 3.7 post.
But actually... hasn't LiveBench solved the evals crisis?
It specifically targets the “subjective” and “cheating/hacking” problems.
It also covers a pretty broad set of capabilities.
It’s happening.
The question is, what is the it that is happening? An impressive progression of intelligence? An expensive, slow disappointment? Something else?
The evals we have available don’t help us that much here, even more than usual.
My tentative conclusion is it’s Secret Third Thing.
It’s a different form factor, with unique advantages, that is hard to describe precisely in words. It appears so far that GPT-4.5 has advantages in places like verbal intelligence, contextual adaptation, detailed knowledge, and a kind of abstract writing skill. It has better taste and aesthetics.
It is the first model I asked to help edit its own review, and it was (slightly) helpful.
It’s a different way to use a lot of compute to get more intelligence. When you need a reasoning model, or you need code written, do not call on GPT-4.5.
You need to carve out a new ‘place in your rotation’ for it.
Table of Contents
Introducing GPT-4.5
Altman tells us up front: It’s a different kind of intelligence.
The official announcement is here.
The System Card
We get to be brief on this one. In terms of preparedness framework risk, this is actively a step down from the reasoning models, with Cybersecurity and Model Autonomy scores going from Medium back down to Low.
The safety assessment was that there were no new risks in the room. I buy it.
The one thing I have a problem with is what Toby Ord highlighted here: The claim that GPT-4.5 is ‘not a frontier model’ despite being the largest trained by OpenAI.
This is despite saying it was ‘really at the frontier of unsupervised learning’ in their livestream, and despite it having obviously new capabilities. They then removed the sentence.
That claim matters because OpenAI has made safety-related commitments that apply only to frontier models.
Thus, I would like to see an official statement from OpenAI affirming that:
Otherwise, yeah, all seems straightforward enough.
The Apollo evaluation, where they say ‘pursue the goal at all costs,’ did trigger some exfiltration attempts, as it often does.
Yes, as I discuss at that link, you should worry about a 2% exfiltration rate, even if it only happens via an extreme instruction.
I do think it’s fair to say that, given what we previously knew, this is not additionally concerning beyond what we should have already learned.
There are some signs in the tests of potential over-refusals, perhaps? But that’s it.
Hallucinations are relatively rare. It’s easy to jailbreak, but that’s fully expected.
New supervision techniques and alignment techniques, eh? I’m very curious.
GPT-4.5 did show noticeable steps up in persuasion, but nothing too worrisome.
In the hard capability areas that create danger, GPT-4.5 is a step down from Deep Research and o3.
The question is what this would mean if you incorporated GPT-4.5 into a new architecture that also included scaling inference time compute and knowing when to consult smaller models. That’s what they plan on calling (no this isn’t confusing at all! what are you talking about?) GPT-5.
Also, they included another reminder that testing on some important threats, such as radiological, nuclear and biological threats, can only be done properly with access to classified information. Which means you need the US AISI involved.
Our Price Not Cheap
This is a big model. It is priced accordingly. Indeed, it is so expensive to serve that OpenAI mentions it is not certain it will continue serving it via API at all. The obvious response is ‘then charge more’ but presumably they feel that means taking fire.
I mean sure that’s relatively a lot but also it’s eleven bucks. So it depends on use case.
Seriously, please, let us pay for the good stuff. If I don’t want it, I won’t pay. Fine.
The cost to serve the model plausibly means GPT-4.5 is actually rather old. There’s speculation it may have finished training in Summer 2024, was dismissed (likely the same way Opus 3.5 was) as not worth serving given the backlash to high prices and limited available compute, and was released now because of a combination of more available compute and the pressure from DeepSeek. That seems plausible, and the model card does have some things that make this seem more likely.
Pay Up You Cheapskate
Straight talk. Stop whining about the relative price. The absolute price is dirt cheap.
This was true for o1-pro and Deep Research and Sonnet, and it’s true for GPT-4.5.
If you’re talking to an LLM directly, or otherwise using the output as a person for real, then choose the best model for the job. If it costs $200/month, or $150 per million tokens, that is still approximately zero dollars. Consider what you get.
Consider what it would cost to get this amount of intelligence from a human. Pay up.
GPT-4.5 will often be the wrong tool for the job. It’s not a reasoning model. It’s not a coding model. It’s definitely not Deep Research. And no, it’s not the quantum leap you might have hoped for here.
But if it’s the right model for the job and you aren’t broke, what are you even doing.
While Supplies Last
OpenAI’s announcement of GPT-4.5 said they were considering not offering it in the API going forward. So it makes sense that a lot of people tried to prevent this.
I have seen enough that I do feel it would be a tragedy if OpenAI pulled GPT-4.5 without replacing it with another model that did similar things. But yes, fandom has long taught us that if you offer something cool and then threaten to take it away, there will be those stepping up to try and stop you.
Benchmarks
Sam Altman warned that GPT-4.5’s benchmarks will not reflect its capabilities, as it is focused on areas not picked up by benchmarks.
I want to be clear up front: This was not cope from Altman. He’s right. Benchmarks most definitely don’t tell the story here.
Despite that, it’s still worth noting the benchmarks.
SimpleQA is 62.5% vs. 47% for o1, 38.2% for 4o and 15% (!) for o3-mini.
Hallucination rate on SimpleQA is 37.1%, lower than the others.
GPT-4.5 is preferred over GPT-4o by human testers, but notice that the win rates are not actually that high – the effects are subtle. I’m curious to see this with an additional ‘about the same’ button, or otherwise excluding questions where GPT-4o is already essentially saturating the right answer.
The story they’re telling is that GPT-4.5 has higher EQ. That helps, but it does not consistently help. Many queries don’t care about EQ, and sometimes people are weird.
While GPT-4.5 is very much not focused on coding, it still did well on Agentic Coding, although not as well as Sonnet 3.7.
It however did actively worse on SWE-Bench than the reasoning models, and vastly worse than Sonnet.
METR tests on their time horizon tasks, and finds GPT-4.5 falls below o1 and 3.6.
GPT-4.5 takes the top spot on WeirdML. A cool note is that Claude 3.7 here tends to use a lot of lines of code, and GPT-4.5 reliably uses relatively very little code. The code runs faster too. It does not try too hard.
A weird one is the ‘What is the least integer whose square is between 15 and 30’ test, which it seems GPT-4.5 failed, and where OpenAI models consistently do worse.
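For reference, a quick brute-force check of the trick question (reading ‘between’ as exclusive, though it does not change the answer): the tempting answer is 4, but negative integers have squares in range too, so the least is -5.

```python
# Brute-force the trick question: integers whose square lies between 15 and 30.
candidates = [n for n in range(-10, 11) if 15 < n * n < 30]
print(candidates)       # [-5, -4, 4, 5]
print(min(candidates))  # -5, not the tempting 4
```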
GPT-4.5 could in the future be used as the foundation of a reasoning model, which is plausibly the plan for GPT-5. If that happens, the result would be expensive to serve, perhaps prohibitively so, but could potentially show new capabilities. It is also possible that various scaffoldings could enable this without creating a reasoning model per se.
If one were to make a model like GPT-4.5 open weights, those would be big worries. Since GPT-4.5 is closed, we can count on OpenAI to take precautions in such cases.
Ask a silly question: Rs in strawberry (gets it wrong), 9.9>9.11 (gets it right).
On the all-new Being Tyler Cowen benchmark, Sonnet 3.7 also gave a solid performance.
On the Being Zvi Mowshowitz benchmark, neither did as well, and I’m not sure which response was better, and I feel like I now better understand when I want 4.5 vs. 3.7.
Still no word on the Minecraft benchmark.
On the ‘outline a 50 chapter book’ benchmark Eleanor Konik reports it falls short where o1-pro does well. It makes sense that would effectively be a reasoning task rather than a writing task, so you’d do the outline with a reasoning model, then the actual text with GPT-4.5?
We Need Better Evaluations
So what do we do about the eval situation?
If you want an approximation, we can still get that. Beyond that, it’s getting harder.
Vibe checks are going off low sample sizes, are not systematic and require trust in the evaluator, and run into Feynman’s problem that you must avoid fooling yourself and you are the easiest one to fool. Plus people have no taste and get distracted by the shiny and the framing.
The risk with ‘taste’ is that it becomes mostly self-referential: it is that which people with taste prefer. That doesn’t help. There is however a real thing, that is highly correlated with taste, that is indeed, like the work, mysterious and important.
Part of the problem is there is not a fully ‘better’ versus ‘worse’ in general. In some cases yes you can say this, a sufficiently big gap will dominate everything the way humans are simply smarter than monkeys and ASIs will be simply smarter than humans, but there’s a reasonable range between different AIs right now where you cannot do this.
I can sort of think about how to do an eval to capture GPT-4.5’s advantages, but it’s going to involve some shenanigans and I don’t know how to protect against being gamed if people know too much or use it during training. This seems really hard.
What you can do is a holistic evaluation that combines all these sources, where you are Actually Looking at the details of what you see. Picking up on particular little things, especially when they were previously out of distribution. Tricky.
This type of automatic solution seems doomed:
Things are moving too fast. Benchmarks get saturated, different capabilities show up. Any systematic evaluation is going to lose relevance quickly. Arena is mostly useless now but what is surprising is how well it held up for how long before being gamed, especially given how little taste people have.
Positive Reactions
He then lists examples, where 4.5’s is clearly better than 4’s, but it’s not like 4.5’s answer was actively good or anything.
The biggest fan so far is Tyler Cowen, which makes sense.
I mostly don’t think this is cope. I think this is someone with a very different view of the production function than yours. The same things driving him to think travel to Manhattan is more important than living in Manhattan are making him highly value a model with better aesthetics.
Where I definitely disagree with him is in the idea that the model is only attempting to improve on the aesthetic dimension. I have no doubt OpenAI had much higher hopes for what GPT-4.5 would bring us, and were absolutely attempting to improve along all dimensions at once. That doesn’t take away the value of the aesthetics.
The ones who are high on 4.5 are mostly very confident they are right.
I think Aiden’s problems are largely a Skill Issue, especially the ads, but also real enough – I too have my traditional sources flooding the zone with political and Elon posts (although the ratio in my feeds is not kind to Elon) in ways that are hard to work around.
I note that while I think GPT-4.5 does have excellent taste, it is remarkable the extent to which those asserting how important this is have talked about it in… poor taste.
I am willing to believe that 4.5 has writing taste in a way other models don’t, for whatever that is worth.
Andrej Karpathy initially thought that 4→4.5 is roughly similar to 3.5→4, in that everything is 20% better, even if you can’t put your finger on a particular big leap, while noting it doesn’t have the advantages of the o1-style reasoning models.
He then illustrates outputs of 4 vs. 4.5 across five prompts.
Seriously, people have no taste, but then maybe neither do I and what is taste anyway? People got 4/5 of these actively wrong if you presume 4.5’s answers are better, and I agreed with the public on all but one of them so I still got 3/5 wrong, although the three mistakes were all ‘these are both bad and I guess this one is modestly less awful.’ I wasn’t trying to figure out who was 4.5 per se.
I checked with Claude, asking it to guess who wrote what, what it expected the public preferred, and also what it thought was better. And it was all pretty random on all counts. So yeah, this is actually a super disappointing result.
Sid Bharath did a more general version of this test, if you want to keep going with it.
How good is GPT-4.5 at writing?
That’s too many 9s at the end, but the 99th percentile claim is not crazy. Most people are quite terrible at writing, and even people who are ‘good at writing’ can be quite bad at some other types of writing. Let’s say that there’s a reason you have never seen me post any fiction, and it’s not philosophical.
There is consensus that 4.5 has a lot of ‘big model smell.’
Another feature is that as a huge model, GPT-4.5 knows more things.
Most of the time one does not need that extra knowledge, but when you need it you very much appreciate it.
Teortaxes notes that previously only Sonnet could do Base64 decoding, and GPT-4.5 is the first model to surpass it.
Petter Strandmark reports it is better at understanding confusing images.
Josh finds GPT-4.5 extremely helpful when given better context, including things like considering life decisions.
Negative Reactions
Does it count as a reaction if it’s made before having actually looked at the outputs?
I love the timing on this, way too early to actually have an informed opinion.
The benchmarks, and general performance on non-aesthetic tasks, are clearly disappointing, in ways that should inform our expectations.
Others latch onto vibes very quickly and call it failure.
Was GPT-4.5 ‘too early’ in terms of spending this much training compute? How does this relate to the possibility it may have been trained during the summer?
Perhaps. It gives us different returns than reasoning does; the two sources of scaling bring largely distinct benefits, at least under current implementations.
It could also be the case that OpenAI didn’t do such a great job here. We’ve seen this with Grok 3, where xAI pumped a giant amount of compute in and got less than you would hope for out of it. Here it seems like OpenAI got more out of it in new ways, at the cost of it also being expensive and slow to serve.
I don’t think GPT-4.5 is undercooked so much as GPT-4o is overbaked? As in, GPT-4o is very optimized for practical purposes, but you lose something important when you do that, unless you are at least at Anthropic levels of skill. I like GPT-4.5 the way it is.
So, a funny story from one early conversation I had with 4.5:
So I put that into my custom instructions; we’ll see how much things change. I did have to ‘make some cuts’ to get that in under the 1500 character limit.
Those are all also a way of saying we should be very curious what Claude 4.0 brings.
Teortaxes notes that he would use Sonnet over GPT-4.5.
How much does taste matter? What does it take to make it matter?
The way this is phrased feels like it is responding to the bullying from the ‘you have no taste if you don’t like it’ crowd. There’s definitely something there but it’s not easy to make it work.
Dominik Lukes charts the evolution of his takes.
I definitely agree that the preview system does OpenAI no favors. Every time, there’s some slow boring video I can’t bring myself to watch. I tried this time and it was painful. Then a lot of people compared this to the Next Big Thing, because it’s GPT-4.5, and got disappointed.
Then there are those who are simply unimpressed.
Predictions
The Lighter Side
I will not be explaining.