This is not o3; it is what they'd internally called Orion, a larger non-reasoning model.

They say this is their last fully non-reasoning model, but that research on both types will continue.

They say it's currently limited to Pro users, but the model hasn't yet shown up on the chooser (edit: it is available in the app). They say it will be shared with Plus and Enterprise users next week.

OpenAI claims it is more accurate on standard questions, with a lower hallucination rate than any previous OAI model (and presumably any others).

"Alignment" was done by both supervised fine-tuning from an unspecified dataset, and RLHF  (this really only training refusals, which is pretty different from alignment in the classical sense, but could potentially help with real alignment if it's used that way - see System 2 Alignment).

The main claims are better world knowledge, better understanding of human intentions (it is modestly but distinctly preferred over 4o in their tests), and being better at writing. This suggests to me that their recent stealth upgrades of 4o might've been this model.

It does web search, uses Canvas, and handles images.

Here's the start of the system card:

OpenAI GPT-4.5 System Card

OpenAI
February 27, 2025

1 Introduction

We’re releasing a research preview of OpenAI GPT-4.5, our largest and most knowledgeable model yet. Building on GPT-4o, GPT-4.5 scales pre-training further and is designed to be more general-purpose than our powerful STEM-focused reasoning models. We trained it using new supervision techniques combined with traditional methods like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), similar to those used for GPT-4o. We conducted extensive safety evaluations prior to deployment and did not find any significant increase in safety risk compared to existing models.

Early testing shows that interacting with GPT-4.5 feels more natural. Its broader knowledge base, stronger alignment with user intent, and improved emotional intelligence make it well-suited for tasks like writing, programming, and solving practical problems—with fewer hallucinations. We’re sharing GPT-4.5 as a research preview to better understand its strengths and limitations. We’re still exploring its capabilities and are eager to see how people use it in ways we might not have expected.

This system card outlines how we built and trained GPT-4.5, evaluated its capabilities, and strengthened safety, following OpenAI’s safety process and Preparedness Framework.

2 Model data and training

Pushing the frontier of unsupervised learning

We advance AI capabilities by scaling two paradigms: unsupervised learning and chain-of-thought reasoning. Scaling chain-of-thought reasoning teaches models to think before they respond, allowing them to tackle complex STEM or logic problems. In contrast, scaling unsupervised learning increases world model accuracy, decreases hallucination rates, and improves associative thinking. GPT-4.5 is our next step in scaling the unsupervised learning paradigm.

New alignment techniques lead to better human collaboration

As we scale our models, and they solve broader, more complex problems, it becomes increasingly important to teach them a greater understanding of human needs and intent. For GPT-4.5, we developed new, scalable alignment techniques that enable training larger and more powerful models with data derived from smaller models. These techniques allowed us to improve GPT-4.5’s steerability, understanding of nuance, and natural conversation.

Internal testers report GPT-4.5 is warm, intuitive, and natural. When tasked with emotionally charged queries, it knows when to offer advice, diffuse frustration, or simply listen to the user. GPT-4.5 also shows stronger aesthetic intuition and creativity. It excels at helping users with their creative writing and design.

GPT-4.5 was pre-trained and post-trained on diverse datasets, including a mix of publicly available data, proprietary data from data partnerships, and custom datasets developed in-house, which collectively contribute to the model’s robust conversational capabilities and world knowledge.

Safety is limited to refusals, notably including refusals for medical or legal advice. Have they deliberately restricted those abilities to avoid lawsuits or to limit public perceptions of expertise being overtaken rapidly by AI?

They report no real change from previous safety evaluations, which seems reasonable as far as it goes. We're not to the really scary models yet, although it will be interesting to see if this produces noticeably better tool use and the type of recursive self-checking that's crucial for powering competent agents. They say it has those, and improved planning and "execution":

Based on early testing, developers may find GPT‑4.5 particularly useful for applications that benefit from its higher emotional intelligence and creativity—such as writing help, communication, learning, coaching, and brainstorming. It also shows strong capabilities in agentic planning and execution, including multi-step coding workflows and complex task automation.

They also say it's compute intensive, so not a replacement for 4o. This could be why they hadn't released Orion earlier. I wonder if this release is in response to Claude 3.7 taking top spots for most non-reasoning-appropriate tasks.

GPT‑4.5 is a very large and compute-intensive model, making it more expensive than and not a replacement for GPT‑4o. Because of this, we’re evaluating whether to continue serving it in the API long-term as we balance supporting current capabilities with building future models.

12 comments

GPT-4.5's API price for input tokens is $75/1M versus GPT-4o's $2.5/1M, and the knowledge cutoff is Oct 2023 for both. This strongly suggests that GPT-4.5 is compute optimal while GPT-4o is overtrained[1], so there might be a future overtrained variant of GPT-4.5 in the works.

The mystery of the base model for o3 remains unresolved, but the unchanged knowledge cutoff negates one of the arguments against it being GPT-4.5[2], while the plot of ARC-AGI success rate as a function of problem size continues to argue in favor (o3 maintains good performance on much larger puzzles than o1 does).


  1. If GPT-4o is a 10x overtrained 1e26 FLOPs model (a natural scale for 30K H100s), its compute optimal variant with the same compute would need 3x fewer tokens and 3x more active parameters. So if GPT-4.5 is a 3e26 FLOPs compute optimal model (a natural scale for 100K H100s), it needs 1.7x more parameters and data than a compute optimal 1e26 FLOPs model, or 5x more parameters and 1.7x less data than a 10x overtrained 1e26 FLOPs model. So the cost for input tokens should be 5x more, but it needs even less data, so could just reuse the better parts of the same older dataset (explaining the unchanged knowledge cutoff). (A quick numeric check of these ratios is sketched after the footnotes.)

    The actual difference in price is 30x, but GPT-4o might be overtrained even more than 10x, and up to 4x out of the 30x could well be margin (higher for GPT-4.5) and optimization (better for GPT-4o), with uncertainty about demand explaining the rest (it's more expensive to serve per token if there aren't enough users). ↩︎

  2. A recent OpenAI paper stated that

    For our test set we use “Division 1” contests from late 2023 and 2024, all of which occurred after the o3 training set data cut-off.

    This weakly suggested that GPT-4.5 might be too new to be the base model for o3. But we now know its dataset is not that new after all. ↩︎
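To sanity-check the ratios in footnote 1, here's a minimal sketch. It assumes Chinchilla-style square-root scaling (compute-optimal parameters and tokens both scaling as √C, with C ≈ 6ND), and treats the 1e26/3e26 FLOPs figures and the 10x overtraining factor as the commenter's hypotheticals rather than confirmed numbers.

```python
import math

# Rough check of footnote 1, assuming Chinchilla-style scaling: compute-optimal
# params N* and tokens D* both scale as sqrt(C) (with C = 6*N*D), and a "k-times
# overtrained" model at the same compute has sqrt(k) fewer params and sqrt(k) more tokens.

def overtrained(n_opt, d_opt, k):
    """Params and tokens of a k-times overtrained model at the same compute."""
    return n_opt / math.sqrt(k), d_opt * math.sqrt(k)

# Normalize the compute-optimal 1e26 FLOPs model to N = D = 1 (arbitrary units).
n_opt_1e26, d_opt_1e26 = 1.0, 1.0
n_4o, d_4o = overtrained(n_opt_1e26, d_opt_1e26, k=10)        # hypothetical GPT-4o

# A compute-optimal 3e26 FLOPs model: N and D both grow by sqrt(3) ~ 1.7x.
n_45, d_45 = n_opt_1e26 * math.sqrt(3), d_opt_1e26 * math.sqrt(3)

print(f"optimal 1e26 vs overtrained 1e26: {n_opt_1e26 / n_4o:.1f}x more params, "
      f"{d_4o / d_opt_1e26:.1f}x fewer tokens")                # ~3.2x and ~3.2x
print(f"GPT-4.5 vs optimal 1e26: {n_45 / n_opt_1e26:.1f}x params and data")  # ~1.7x
print(f"GPT-4.5 vs overtrained 1e26: {n_45 / n_4o:.1f}x more params, "
      f"{d_4o / d_45:.1f}x less data")                         # ~5.5x and ~1.8x
```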

I don't think GPT-4o was trained on 1e26 FLOP or particularly close to it. Overtraining is common but GPT-4o being overtrained by 10x for 1e26 FLOP is kind of a strong and surprising claim (some models like Llama 3 8b are extremely overtrained but they're small so this overtraining is cheap). I think a more natural explanation is that it improves on GPT-4 because of superior post-training and other innovations.

There is a report that OpenAI might've been intending to spend $3bn on training in 2024 (presumably mostly for many smaller research experiments), and a claim that the Goodyear site has 3 buildings hosting 100K H100s. One of these buildings is 32K H100s, which at 40% utilization in 3 months produces 1e26 FLOPs (in BF16), which in GPU-time at $2/hour costs $140m. So it seems plausible that Azure already had one of these (or identical) datacenter buildings when GPT-4o was ready to train, and that $140m wasn't too much for a flagship model that carries the brand for another year.
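As a quick sanity check on those figures (a minimal sketch; the ~989 TFLOP/s dense BF16 peak per H100 and the flat 90-day window are my assumptions, not from the comment):

```python
# Rough check of the cluster estimate above. Assumptions not taken from the comment:
# ~989 TFLOP/s dense BF16 peak per H100, and the cluster reserved 24/7 for 90 days.
H100_BF16_PEAK = 0.989e15              # FLOP/s
gpus = 32_000
utilization = 0.40
seconds = 90 * 24 * 3600               # ~3 months

total_flops = gpus * H100_BF16_PEAK * utilization * seconds
gpu_hours = gpus * seconds / 3600
cost = gpu_hours * 2.0                 # $2 per GPU-hour

print(f"{total_flops:.1e} FLOPs")      # ~9.8e25, i.e. roughly 1e26
print(f"${cost / 1e6:.0f}M")           # ~$138M, i.e. roughly $140m
```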

With this amount of compute and the price of $2.5 per 1M input tokens, it's unlikely to be compute optimal. For MoEs at 1e26 FLOPs, it might be compute optimal to have 120-240 tokens/parameter (for 1:8-1:32 sparsity), which is 370B active parameters for a 1:8 sparse MoE or 260B for a 1:32 sparse MoE. Dense Llama-3-405B was $5 per 1M input tokens at probably slimmer margins, so GPT-4o needs to be more like 100B active parameters. Thus 3x less parameters than optimal and 3x more data than optimal (about ~~45T-65T~~ 135T-190T trained-on tokens, which is reasonable as ~~3-4~~ 5 epochs of ~~15T-20T~~ 25T-40T unique tokens), giving 10x overtraining in the value of tokens/parameter compared to compute optimal.
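A minimal sketch of that arithmetic, assuming the standard C ≈ 6ND approximation (N = active parameters, D = training tokens); the 120-240 tokens/parameter range and the 100B-parameter guess are the commenter's:

```python
import math

# Sketch of the arithmetic above, using C = 6 * N * D with N = active parameters,
# D = training tokens, and r = D / N (tokens per parameter).
C = 1e26  # FLOPs

def compute_optimal(C, r):
    """Active params and tokens if r tokens/param is compute optimal at compute C."""
    n = math.sqrt(C / (6 * r))
    return n, r * n

for r in (120, 240):
    n, d = compute_optimal(C, r)
    print(f"r = {r}: ~{n / 1e9:.0f}B active params, ~{d / 1e12:.0f}T tokens")
# Matches the ~370B (1:8 sparsity) and ~260B (1:32 sparsity) figures above, within
# rounding, and gives ~45T-63T compute-optimal tokens.

# If GPT-4o is instead ~100B active params at the same 1e26 FLOPs:
n_actual = 100e9
d_actual = C / (6 * n_actual)
print(f"100B params at 1e26 FLOPs: ~{d_actual / 1e12:.0f}T tokens, "
      f"~{d_actual / n_actual:.0f} tokens/param")
# ~167T tokens and ~1700 tokens/param, i.e. roughly 10x the optimal 120-240 range.
```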

The penalty from 10x overtraining is a compute multiplier of about 0.5x, so a 5e25 FLOPs compute optimal model would have similar performance, but it would have 2x more active parameters than a 10x overtrained 1e26 FLOPs model, which at $70m difference in cost of training should more than pay for itself.

65T tokens doesn't get you to 1e26 FLOP with 100B active params? You'd need well over 100T tokens: 6 * 100 billion * 65 trillion is 3.9e25 FLOP.

GPT-4.5 being trained on fewer tokens than GPT-4o doesn't really make sense. GPT-4.5 only having 5x more active params than GPT-4o doesn't quite make sense either, though I'm not as confident that's wrong.

1e26 FLOP would have had a significant opportunity cost. Remember that OpenAI was and is very GPU constrained and may have valued GPU hours in a large-scale cluster a lot more than $2/hour. It would be worth it to make your flagship model good, but not worth it if it barely has any effect on your flagship model. I don't think it's a good idea to reason backwards from alleging some compute budget that OpenAI might have had at X date, to inferring the training FLOP of a model trained then.

1e26 FLOP would have had a significant opportunity cost.

At the end of 2023 Microsoft had 150K+ H100s, so reserving 30K doesn't seem like too much (especially as they can use non-H100 and possibly non-Microsoft compute for research experiments). It's difficult to get a lot of a new chip when it just comes out, or to get a lot in a single training system, or to suddenly get much more if demand surges. But for a frontier training run, there would've been months of notice. And the opportunity cost of not doing this is being left with an inferior model (or a less overtrained model that costs more in inference, and so requires more GPUs to serve for inference).

I don't think it's a good idea to reason backwards from alleging some compute budget that OpenAI might have had at X date, to inferring the training FLOP of a model trained then.

The main anchors are 32K H100s in a single training system, and frontier training compute scaling 4x per year. Currently, a year later, 3e26-6e26 FLOPs models are getting released (based on 100K H100s in Colossus and numbers in the Grok 3 announcement, 100K H100s at Goodyear site, 100K TPUv6e datacenters, Meta's 128K H100s). The $3bn figure was just to point out that $140m following from such anchors is not a very large number.

65T tokens doesn't get you to 1e26 FLOP with 100B active params?

Right, 45T-65T is for a compute optimal 1e26 model; I did the wrong calculation when editing in this detail. For a 10x overtrained model, it's 3x more data than that, so for 150T total tokens you'd need 5 epochs of 30T tokens, which is still feasible (with almost no degradation compared to 150T unique tokens of that quality). The aim was to calculate this from 260B and 370B reduced 3x (rather than from 100B).

GPT-4.5 being trained on fewer tokens than GPT-4o doesn't really make sense.

How so? If it uses 3x more compute but isn't 10x overtrained, that means less data (with multiple epochs, it would probably use exactly the same unique data, repeated a bit less). The video presentation on GPT-4.5 mentioned work on lower precision in pretraining, so it might even be a 6e26 FLOPs model (though a priori it would be surprising if the first foray into this scale isn't taken at the more conservative BF16). And it would still be less data (square root of 6x is less than 3x). Overtraining has a large effect on both the number of active parameters and the needed number of tokens, at a relatively minor cost in effective compute, thus it's a very salient thing for use in production models.

It's better than 4o across four of my benchmarks: Confabulations, Creative Writing, Thematic Generalization, and Extended NYT Connections. However, since it's an expensive and huge model, I think we'd be talking about AI progress slowing down at this point if it weren't for reasoning models.

I think we'd be talking about AI progress slowing down at this point if it weren't for reasoning models.

Possibly, but 1) There are reasoning models, 2) Value per token may still rise faster than cost per token for non-reasoning models, which could be enough to sustain progress, and 3) It's possible that a more expensive non-reasoning model makes reasoning more efficient and/or effective by increasing the quality and complexity of each reasoning step.


At this point I pretty much never use 4o for anything. It's o1, o1-pro, or o3-mini-high. Looking forward to testing 4.5 though.

>Safety is limited to refusals, notably including refusals for medical or legal advice. Have they deliberately restricted those abilities to avoid lawsuits or to limit public perceptions of expertise being overtaken rapidly by AI? 

I think it's been well over a year since I've had an issue with getting an LLM to give me medical advice, including GPT-4o and other SOTA models like Claude 3.5/7, Grok 3 and Gemini 2.0 Pro. I seem to recall that the original GPT-4 would occasionally refuse, but could be coaxed into it. 

I am a doctor, and I tend to include that information either in model memory or in a prompt (mostly to encourage the LLM to assume background knowledge and ability to interpret facts). Even without it, my impression is that most models simply append a "consult a human doctor" boilerplate disclaimer instead of refusing. 

I would be rather annoyed if GPT 4.5 was a reversion in that regard, as I find LLMs immensely useful for quick checks on topics I'm personally unfamiliar with (and while hallucinations happen, they're quite rare now, especially with search, reasoning and grounding). I don't think OAI or other AI companies have faced any significant amount of litigation from either people who received bad advice, or doctors afraid of losing a job. 

I'm curious about whether anyone has had any issues in that regard, though I'd expect not. 

it is what they'd internally called Orion, a larger non-reasoning model.


What's the source for this?

If true, giving the model a special name like "Orion" seems to suggest that OA had bigger hopes for it than just a half-successor to GPT-4.

From Sam Altman - https://x.com/sama/status/1889755723078443244

"We will next ship GPT-4.5, the model we called Orion internally, as our last non-chain-of-thought model."

They say this is their last fully non-reasoning model, but that research on both types will continue. 

No, they said that GPT-4.5 and GPT-5 will be their last non-reasoning models.

They say it's currently limited to Pro users,

Meh, it's coming to plus users in ~a week.


OpenAI claims it is more accurate on standard questions, with a lower hallucination rate than any previous OAI model (and presumably any others).

I think this is a big point: better world knowledge is going to prove tremendously useful when it comes to applying RL to base models, and a lower hallucination rate leads to more effective exploration of the reasoning space plus a better dataset after rejection sampling. That should lead to lots of gains over models trained with RL on top of 4o.
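For what it's worth, here's a generic sketch of what rejection sampling for building such a dataset usually looks like; this is not OpenAI's actual pipeline, and `sample_completions` / `verify` are hypothetical stand-ins for the base model and an answer checker.

```python
# A generic sketch of rejection sampling for building an RL / fine-tuning dataset;
# not OpenAI's actual pipeline. `sample_completions` and `verify` are hypothetical
# stand-ins for the base model and an answer checker.
from typing import Callable, List, Tuple

def rejection_sample(prompts: List[str],
                     sample_completions: Callable[[str, int], List[str]],
                     verify: Callable[[str, str], bool],
                     n: int = 16) -> List[Tuple[str, str]]:
    """For each prompt, draw n completions and keep only the ones the verifier accepts."""
    dataset = []
    for prompt in prompts:
        for completion in sample_completions(prompt, n):
            if verify(prompt, completion):
                dataset.append((prompt, completion))
    return dataset

# The parent comment's point: a base model that hallucinates less passes `verify`
# more often, so the same sampling budget yields a larger, cleaner dataset for the
# subsequent RL / distillation step.
```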

Not to speak of alignment, but it looks like a big W for OpenAI, especially if they're going to raise in the near future (<6 months).
