This is not o3; it is what they'd internally called Orion, a larger non-reasoning model.
They say this is their last fully non-reasoning model, though research on both types will continue.
They say it's currently limited to Pro users, but the model hasn't yet shown up in the model picker (edit: it is available in the app); it will be shared with Plus and Enterprise users next week.
OpenAI claims it is more accurate on standard questions, with a lower hallucination rate than any previous OpenAI model (and presumably any others).
"Alignment" was done by both supervised fine-tuning from an unspecified dataset, and RLHF (this really only training refusals, which is pretty different from alignment in the classical sense, but could potentially help with real alignment if it's used that way - see System 2 Alignment).
The main claims are better world knowledge, better understanding of human intentions (it is modestly but distinctly preferred over 4o in their tests), and being better at writing. This suggests to me that their recent stealth upgrades of 4o might've been this model.
It does web search, uses Canvas, and handles images.
Here's the start of the system card:
OpenAI GPT-4.5 System Card
OpenAI
February 27, 2025
1 Introduction
We’re releasing a research preview of OpenAI GPT-4.5, our largest and most knowledgeable model yet. Building on GPT-4o, GPT-4.5 scales pre-training further and is designed to be more general-purpose than our powerful STEM-focused reasoning models. We trained it using new supervision techniques combined with traditional methods like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), similar to those used for GPT-4o. We conducted extensive safety evaluations prior to deployment and did not find any significant increase in safety risk compared to existing models.
Early testing shows that interacting with GPT-4.5 feels more natural. Its broader knowledge base, stronger alignment with user intent, and improved emotional intelligence make it well-suited for tasks like writing, programming, and solving practical problems—with fewer hallucinations. We’re sharing GPT-4.5 as a research preview to better understand its strengths and limitations. We’re still exploring its capabilities and are eager to see how people use it in ways we might not have expected.
This system card outlines how we built and trained GPT-4.5, evaluated its capabilities, and strengthened safety, following OpenAI’s safety process and Preparedness Framework.
2 Model data and training
Pushing the frontier of unsupervised learning
We advance AI capabilities by scaling two paradigms: unsupervised learning and chain-of-thought reasoning. Scaling chain-of-thought reasoning teaches models to think before they respond, allowing them to tackle complex STEM or logic problems. In contrast, scaling unsupervised learning increases world model accuracy, decreases hallucination rates, and improves associative thinking. GPT-4.5 is our next step in scaling the unsupervised learning paradigm.
New alignment techniques lead to better human collaboration
As we scale our models, and they solve broader, more complex problems, it becomes increasingly important to teach them a greater understanding of human needs and intent. For GPT-4.5, we developed new, scalable alignment techniques that enable training larger and more powerful models with data derived from smaller models. These techniques allowed us to improve GPT-4.5’s steerability, understanding of nuance, and natural conversation.
Internal testers report GPT-4.5 is warm, intuitive, and natural. When tasked with emotionally charged queries, it knows when to offer advice, defuse frustration, or simply listen to the user. GPT-4.5 also shows stronger aesthetic intuition and creativity. It excels at helping users with their creative writing and design.
GPT-4.5 was pre-trained and post-trained on diverse datasets, including a mix of publicly available data, proprietary data from data partnerships, and custom datasets developed in-house, which collectively contribute to the model’s robust conversational capabilities and world knowledge.
Safety training is limited to refusals, notably including refusals for medical or legal advice. Have they deliberately restricted those abilities to avoid lawsuits, or to limit public perception of how rapidly AI is overtaking human expertise?
They report no real change from previous safety evaluations, which seems reasonable as far as it goes. We're not at the really scary models yet, although it will be interesting to see if this produces noticeably better tool use and the type of recursive self-checking that's crucial for powering competent agents. They say it has those, and improved planning and "execution":
Based on early testing, developers may find GPT‑4.5 particularly useful for applications that benefit from its higher emotional intelligence and creativity—such as writing help, communication, learning, coaching, and brainstorming. It also shows strong capabilities in agentic planning and execution, including multi-step coding workflows and complex task automation.
They also say it's compute intensive, so not a replacement for 4o. This could be why they hadn't released Orion earlier. I wonder if this release is in response to Claude 3.7 taking top spots for most non-reasoning-appropriate tasks.
GPT‑4.5 is a very large and compute-intensive model, making it more expensive than and not a replacement for GPT‑4o. Because of this, we’re evaluating whether to continue serving it in the API long-term as we balance supporting current capabilities with building future models.
The GPT-4.5 API price for input tokens is $75/1M, compared to GPT-4o's $2.50/1M, and the knowledge cutoff is Oct 2023 for both. This strongly suggests that GPT-4.5 is compute optimal while GPT-4o is overtrained[1], so there might be a future overtrained variant of GPT-4.5 in the works.
The mystery of the base model for o3 remains unresolved, but the unchanged knowledge cutoff negates one of the arguments against it being GPT-4.5[2], while the plot of ARC-AGI success rate as a function of problem size continues to argue in favor (o3 maintains good performance on much larger puzzles than o1 does).
If GPT-4o is a 10x overtrained 1e26 FLOPs model (a natural scale for 30K H100s), its compute optimal variant with the same compute would need 3x fewer tokens and 3x more active parameters. So if GPT-4.5 is a 3e26 FLOPs compute optimal model (a natural scale for 100K H100s), it needs 1.7x more parameters and data than a compute optimal 1e26 FLOPs model, or 5x more parameters and 1.7x less data than a 10x overtrained 1e26 FLOPs model. So the cost for input tokens should be about 5x more, and since it needs less data than GPT-4o used, it could just reuse the better parts of the same older dataset (explaining the unchanged knowledge cutoff).
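A minimal sanity check of this arithmetic, assuming the standard C ≈ 6·N·D compute approximation and a Chinchilla-style frontier where compute-optimal N and D each scale as √C, with "10x overtrained" read as a tokens-per-parameter ratio 10x the compute-optimal one. The constants are this footnote's guesses, not disclosed figures:

```python
from math import sqrt

# Assumptions from this footnote, not disclosed figures:
OVERTRAIN = 10  # GPT-4o's assumed tokens-per-param ratio vs compute optimal
C_RATIO = 3     # assumed GPT-4.5 / GPT-4o training compute: 3e26 / 1e26 FLOPs

# At fixed compute C = 6*N*D, multiplying D/N by k multiplies D by sqrt(k)
# and divides N by sqrt(k). So the compute-optimal variant of a 10x
# overtrained model has:
print(f"{sqrt(OVERTRAIN):.1f}x more params, {sqrt(OVERTRAIN):.1f}x fewer tokens")
# -> 3.2x ("3x" above)

# On the compute-optimal frontier, N and D each grow as sqrt(C):
print(f"{sqrt(C_RATIO):.1f}x more params and data at 3x the compute")
# -> 1.7x

# GPT-4.5 (3e26, optimal) vs GPT-4o (1e26, 10x overtrained):
more_params = sqrt(C_RATIO) * sqrt(OVERTRAIN)  # ~5.5x ("5x" above)
less_data = sqrt(OVERTRAIN) / sqrt(C_RATIO)    # ~1.8x ("1.7x" above)
print(f"{more_params:.1f}x more params, {less_data:.1f}x less data")
```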
The actual difference in price is 30x, but GPT-4o might be overtrained even more than 10x, and up to 4x out of the 30x could well be margin (higher for GPT-4.5) and optimization (better for GPT-4o), with uncertainty about demand explaining the rest (it's more expensive to serve per token if there aren't enough users). ↩︎
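A rough restatement of how the 30x could decompose under the assumptions above; the 4x margin/optimization figure is the speculative upper bound just mentioned, not anything known:

```python
price_ratio = 75 / 2.5   # GPT-4.5 vs GPT-4o input-token price, $/1M
serving_cost = 5.5       # ~sqrt(10)*sqrt(3)x more active params to run
margin_and_opt = 4.0     # assumed upper bound: margin + inference optimization
residual = price_ratio / (serving_cost * margin_and_opt)
print(f"{price_ratio:.0f}x total, ~{residual:.1f}x left for demand/utilization")
# -> 30x total, ~1.4x left for demand/utilization
```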
A recent OpenAI paper weakly suggested that GPT-4.5 might be too new to be the base model for o3, but we now know its dataset is not that new after all. ↩︎
At the end of 2023 Microsoft had 150K+ H100s, so reserving 30K doesn't seem like too much (especially as they can use non-H100 and possibly non-Microsoft compute for research experiments). It's difficult to get a lot of a new chip when it just comes out, or to get a lot in a single training system, or to suddenly get much more if demand surges. But for a frontier training run, there would've been months of notice. And the opportunity cost of not doing this is being left with an inferior model (or a...