This is not o3; it is what they'd internally called Orion, a larger non-reasoning model.
They say this is their last fully non-reasoning model, but that research on both types will continue.
They say it's currently limited to Pro users, though the model hasn't yet shown up in the model picker (edit: it is available in the app). They say it will roll out to Plus and Enterprise users next week.
OpenAI claims it is more accurate on standard questions and has a lower hallucination rate than any previous OAI model (and presumably anyone else's).
"Alignment" was done by both supervised fine-tuning from an unspecified dataset, and RLHF (this really only training refusals, which is pretty different from alignment in the classical sense, but could potentially help with real alignment if it's used that way - see System 2 Alignment).
The main claims are better world knowledge, better understanding of human intentions (it is modestly but distinctly preferred over 4o in their tests), and being better at writing. This suggests to me that their recent stealth upgrades of 4o might've been this model.
It can search the web, use Canvas, and handle images.
Here's the start of the system card:
OpenAI GPT-4.5 System Card
OpenAI
February 27, 2025
1 Introduction
We’re releasing a research preview of OpenAI GPT-4.5, our largest and most knowledgeable model yet. Building on GPT-4o, GPT-4.5 scales pre-training further and is designed to be more general-purpose than our powerful STEM-focused reasoning models. We trained it using new supervision techniques combined with traditional methods like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), similar to those used for GPT-4o. We conducted extensive safety evaluations prior to deployment and did not find any significant increase in safety risk compared to existing models.
Early testing shows that interacting with GPT-4.5 feels more natural. Its broader knowledge base, stronger alignment with user intent, and improved emotional intelligence make it well-suited for tasks like writing, programming, and solving practical problems—with fewer hallucinations. We’re sharing GPT-4.5 as a research preview to better understand its strengths and limitations. We’re still exploring its capabilities and are eager to see how people use it in ways we might not have expected.
This system card outlines how we built and trained GPT-4.5, evaluated its capabilities, and strengthened safety, following OpenAI’s safety process and Preparedness Framework.
2 Model data and training
Pushing the frontier of unsupervised learning
We advance AI capabilities by scaling two paradigms: unsupervised learning and chain-of-thought reasoning. Scaling chain-of-thought reasoning teaches models to think before they respond, allowing them to tackle complex STEM or logic problems. In contrast, scaling unsupervised learning increases world model accuracy, decreases hallucination rates, and improves associative thinking. GPT-4.5 is our next step in scaling the unsupervised learning paradigm.
New alignment techniques lead to better human collaboration
As we scale our models, and they solve broader, more complex problems, it becomes increasingly important to teach them a greater understanding of human needs and intent. For GPT-4.5, we developed new, scalable alignment techniques that enable training larger and more powerful models with data derived from smaller models. These techniques allowed us to improve GPT-4.5’s steerability, understanding of nuance, and natural conversation.
Internal testers report GPT-4.5 is warm, intuitive, and natural. When tasked with emotionally charged queries, it knows when to offer advice, defuse frustration, or simply listen to the user. GPT-4.5 also shows stronger aesthetic intuition and creativity. It excels at helping users with their creative writing and design.
GPT-4.5 was pre-trained and post-trained on diverse datasets, including a mix of publicly available data, proprietary data from data partnerships, and custom datasets developed in-house, which collectively contribute to the model’s robust conversational capabilities and world knowledge.
Safety work is limited to refusals, notably including refusals to give medical or legal advice. Have they deliberately restricted those abilities to avoid lawsuits, or to limit public perception of how rapidly AI is overtaking human expertise?
They report no real change from previous safety evaluations, which seems reasonable as far as it goes. We're not at the really scary models yet, although it will be interesting to see whether this produces noticeably better tool use and the kind of recursive self-checking that's crucial for powering competent agents. They say it has those, plus improved planning and "execution":
Based on early testing, developers may find GPT‑4.5 particularly useful for applications that benefit from its higher emotional intelligence and creativity—such as writing help, communication, learning, coaching, and brainstorming. It also shows strong capabilities in agentic planning and execution, including multi-step coding workflows and complex task automation.
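For a concrete picture of what "multi-step workflows" and "execution" cash out as through the API, here's a minimal tool-use loop sketch. The gpt-4.5-preview model identifier, the run_shell tool, and its stubbed output are all my assumptions for illustration, not anything the system card specifies.

```python
# Minimal sketch of an agentic tool-use loop; model name and the
# run_shell tool are assumptions, not OpenAI's documented setup.
import json
from openai import OpenAI

client = OpenAI()

def run_shell(cmd: str) -> str:
    # Hypothetical sandboxed executor, stubbed for illustration.
    return f"(pretend output of: {cmd})"

tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"cmd": {"type": "string"}},
            "required": ["cmd"],
        },
    },
}]

messages = [{"role": "user", "content": "List the Python files here, then count them."}]
while True:
    resp = client.chat.completions.create(
        model="gpt-4.5-preview", messages=messages, tools=tools
    )
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:
        print(msg.content)
        break
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_shell(args["cmd"]),
        })
```

"Planning and execution" is essentially how well the model drives a loop like this: deciding which call to make, reading the result, catching its own mistakes, and knowing when it's done.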
They also say it's compute intensive, so not a replacement for 4o. This could be why they hadn't released Orion earlier. I wonder if this release is in response to Claude 3.7 taking top spots for most non-reasoning-appropriate tasks.
GPT‑4.5 is a very large and compute-intensive model, making it more expensive than and not a replacement for GPT‑4o. Because of this, we’re evaluating whether to continue serving it in the API long-term as we balance supporting current capabilities with building future models.
It's better than 4o across four of my benchmarks: Confabulations, Creative Writing, Thematic Generalization, and Extended NYT Connections. However, since it's an expensive and huge model, I think we'd be talking about AI progress slowing down at this point if it weren't for reasoning models.
Possibly, but: 1) there are reasoning models; 2) value per token may still rise faster than cost per token for non-reasoning models, which could be enough to sustain progress (see the toy calculation below); and 3) it's possible that a more expensive non-reasoning model makes reasoning more efficient and/or effective by increasing the quality and complexity of each reasoning step.
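To make point 2 concrete with purely made-up numbers: a model can cost far more per attempt and still be the better buy if its success rate on a valuable task rises enough.

```python
# Purely hypothetical numbers illustrating value vs. cost per token.
task_value = 100.0        # dollars earned if the task succeeds

# (cost per attempt in dollars, probability the attempt succeeds)
cheap_model = (0.10, 0.30)
big_model = (1.50, 0.90)  # 15x the cost per attempt

for name, (cost, p_success) in [("cheap", cheap_model), ("big", big_model)]:
    expected_profit = p_success * task_value - cost
    print(f"{name}: expected profit per attempt = ${expected_profit:.2f}")
# cheap: $29.90, big: $88.50 - the 15x-pricier model still wins on value.
```

Token prices can climb steeply and still be a rounding error next to the value of a correct answer, so "too expensive to replace 4o" and "progress continuing" are compatible.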
At this point I pretty much never use 4o for anything. It's o1, o1-pro, or o3-mini-high. Looking forward to testing 4.5 though.