Curious if it's built on the same base model as Gemini 2.0 Pro, or on a completely new pretrained model. With 100K TPUv6e datacenters (about the same as 100K H100s in training compute), Gemini 2.0 Pro seems to be underperforming in its weight class compared to GPT-4.5 and even Grok 3 (likely trained on similar compute). So it makes some sense they'd write it off as a failed run and do another one, but alternatively long-reasoning post-training could've fixed enough to get a good reasoning model. In which case the name choice breaks the pattern of Gemini 2.0 Flash Thinking, but could be a way of distancing the success of Gemini 2.5 Pro (the reasoning model) from the mediocre performance of Gemini 2.0 Pro (the chat model).
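As a sanity check on that weight-class claim, here's a back-of-envelope comparison using the public spec-sheet peak BF16 numbers for each chip; the utilization and run-length figures below are stock assumptions on my part, not anything Google has disclosed:

```python
# Back-of-envelope training compute, using public peak dense BF16 specs.
# MFU and run length are generic assumptions, not disclosed figures.
TPU_V6E_BF16_TFLOPS = 918   # TPU v6e (Trillium) spec sheet
H100_BF16_TFLOPS = 989      # H100 SXM spec sheet
CHIPS = 100_000
MFU = 0.40                  # assumed model FLOPs utilization
SECONDS = 90 * 24 * 3600    # assumed ~3 month training run

for name, tflops in [("100K TPUv6e", TPU_V6E_BF16_TFLOPS),
                     ("100K H100", H100_BF16_TFLOPS)]:
    total_flops = tflops * 1e12 * CHIPS * MFU * SECONDS
    print(f"{name}: ~{total_flops:.1e} FLOPs")
# Both come out around 3e26 FLOPs -- the same weight class.
```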
Google's TPUv6e systems have large scale-up world sizes, unlike Hopper systems. So there is a short window of a few months where they have the advantage of being able to inference large reasoning models more cheaply than everyone else (unless others also use TPUs). Other AI companies would need access to Blackwell NVL36 or NVL72 systems to get reasonable cost and speed when inferencing large reasoning models, and it seems it'll take another 2-5 months before those are out in force.
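To make the scale-up point concrete, here is a rough sketch of how much HBM sits inside one scale-up domain on each system, which is what bounds how cheaply you can keep a big model's weights plus long reasoning traces inside the fast interconnect (per-chip memory and domain sizes are the standard public configurations):

```python
# HBM inside a single scale-up domain, from public per-chip specs.
# Bigger domains let weights plus long-reasoning KV cache stay inside
# the fast interconnect, which is what makes serving cheap.
domains = {
    "Hopper HGX (8x H100 @ 80 GB)":        8 * 80,
    "Blackwell NVL72 (72x B200 @ 192 GB)": 72 * 192,
    "TPU v6e pod (256 chips @ 32 GB)":     256 * 32,
}
for name, gb in domains.items():
    print(f"{name}: {gb / 1024:.1f} TB HBM per domain")
# Hopper: ~0.6 TB; NVL72: ~13.5 TB; TPU v6e pod: 8.0 TB.
```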
Gemini 2.5 Pro Experimental is America’s next top large language model.
That doesn’t mean it is the best model for everything. In particular, it’s still Gemini, so it still is a proud member of the Fun Police, in terms of censorship and also just not being friendly or engaging, or willing to take a stand.
If you want a friend, or some flexibility and fun, or you want coding that isn’t especially tricky, then call Claude, now with web access.
If you want an image, call GPT-4o.
But if you mainly want reasoning, or raw intelligence? For now, you call Gemini.
The feedback is overwhelmingly positive. Many report Gemini 2.5 is the first LLM to solve some of their practical problems, including favorable comparisons to o1-pro. It’s fast. It’s not $200 a month. The benchmarks are exceptional.
(On other LLMs I’ve used in the past and may use again when they update: I’ve stopped using Perplexity entirely now that Claude has web access, I never use r1, and I only use Grok narrowly for when I need exactly real time reactions from Twitter.)
Table of Contents
Introducing Gemini 2.5 Pro
Pliny the Liberator pwned this on the spot of course, also building a new jailbreak prompt because the old one worked right away and that was no fun. And wow, I mean, yes it kind of is this easy:
It would be great to either actually have a robust system, or to let everyone have their fun without having to insert that kind of system prompt.
Their Lips are Sealed
I am highly disappointed in Google for its failure to properly document a model that is very, very clearly state of the art across the board.
Gemini 2.0 had the same problem, where Google shared very little information. Now we have Gemini 2.5, which is far more clearly pushing the SoTA, and they did it again.
The thing about this failure is that it is not simply irresponsible. It is also bad marketing, and therefore bad business. You want people seeing those details.
I don’t think Peter goes far enough here. This is a problem now. Or, rather, I don’t know if it’s a problem now, and that’s the problem. Now.
To be fair to Google, they’re against sharing information about their products in general. This isn’t unique to safety information. I don’t think it is malice, or them hiding anything. I think it’s operational incompetence. But we need to fix that.
How bad are they at this? Check out what it looks like if you’re not subscribed.
That’s it. There’s no hint as to what Gemini Advanced gets you, or that it changed, or that you might want to try Google AI Studio. Does Google not want customers?
I’m not saying do this…
…or even this…
…but at least try something?
Maybe even some free generations in the app and the website?
There was some largely favorable tech-mainstream coverage in places like The Verge, ZDNet and VentureBeat, but it seems no humans wasted substantial time writing (or likely reading) any of it, and it was very pro forma. The true mainstream, such as NYT, WaPo, Bloomberg and WSJ, didn’t appear to mention it at all when I looked.
On Your Marks
One always has to watch out for selection, but this certainly seems very strong.
Note that Claude 3.7 really is a monster for coding.
Alas, for now we don’t have more official benchmarks. And we also do not have a system card. I know the model is marked ‘experimental’ but this is a rather widespread release.
Now on to Other People’s Benchmarks. They also seem extremely strong overall.
On Arena, Gemini 2.5 blows the competition away, winning the main ranking by 40 Elo (!) and taking #1 in most categories, including Vision Arena. The exception is WebDev Arena, where Claude 3.7 remains king and Gemini 2.5 is well behind at #2.
Claude Sonnet 3.7 is of course highly disrespected by Arena in general. What’s amazing is that Gemini ranks this well despite its scolding and other downsides; imagine how it would rank if those were fixed.
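For scale, a 40-point Elo gap translates to an expected head-to-head win rate via the standard Elo formula:

```python
# Expected head-to-head win rate implied by an Elo gap (standard formula).
def expected_win_rate(elo_gap: float) -> float:
    return 1 / (1 + 10 ** (-elo_gap / 400))

print(f"{expected_win_rate(40):.1%}")  # ~55.7% vs the #2 model
```

A 55.7% expected win rate against the runner-up is a large edge at the top of the board, where the leaders are usually packed within a few points.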
That is ahead of everyone except o3-mini-high (61.4), o1-medium (70.8) and o1-pro (82.3). Speed-and-cost adjusted, it is excellent, but the extra work does matter here.
Here are some of his other benchmarks:
Note that lower is better here; Gemini 2.5 is best (and Gemma 3 is worst!):
Performance on his creative writing benchmark remained in-context mediocre:
The TrueSkill rating also looks mediocre, but it is still in progress.
The People Have Spoken
Image generation was the talk of Twitter, but once I asked about Gemini 2.5, I got the most strongly positive feedback I have yet seen in any reaction thread.
In particular, there were a bunch of people who said ‘no model has nailed [X] task yet, and Gemini 2.5 does,’ for various values of [X]. That’s huge.
These were from my general feed, some strong endorsements from good sources:
If you want a super positive take, there’s always Mckay Wrigley, optimist in residence.
For those who want to browse the reaction thread, here you go; they are organized, but I intentionally did very little selection:
It’s worth noting that a lot of people will have a custom system prompt and saved information for Claude and ChatGPT, but not yet for Gemini. And yes, you can absolutely customize Gemini the same way, but you have to actually do it.
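For instance, via the google-generativeai Python SDK you can attach a persistent system instruction. A minimal sketch (the instruction text is a placeholder, and you supply your own API key):

```python
# Minimal sketch: persistent system prompt via the google-generativeai SDK.
# The instruction text is a placeholder; supply your own API key.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel(
    "gemini-2.5-pro-exp-03-25",  # the experimental model name at release
    system_instruction=(
        "Be direct, take a stand when asked for judgment, and skip "
        "boilerplate warnings unless they are genuinely load-bearing."
    ),
)
print(model.generate_content("Critique my plan, no hedging.").text)
```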
Things were good enough that these count as poor reviews.
There will always be those who are especially disappointed, such as this one, where Gemini 2.5 misses one instance of the letter ‘e.’
An unfortunate mistake, but accidents happen.
Adjust Your Projections
Like all frontier model releases (and attempted such releases), the success of Gemini 2.5 Pro should adjust our expectations.
Grok 3 and GPT-4.5, and the costs involved with o3, made it more plausible that things were somewhat stalling out. Claude Sonnet 3.7 is remarkable, and highlights what you can get from actually knowing what you are doing, but wasn’t that big a leap. Meanwhile, Google looked like they could cook small models and offer us large context windows, but they had issues on the large model side.
Gemini 2.5 Pro reinforces that the releases and improvements will continue, and that Google can indeed cook on the high end too. What that does to your morale is on you.