People may find this post interesting as well: Gemini 3 is evaluation paranoid and contaminated
My question is whether the DeepMind alignment team was aware of this. If not, their evals are clearly insufficient; if they were, it raises the question of why they didn't even mention this as a "known limitation" upon release. (I assume they don't have the power to delay the model release unless they found a more serious safety flaw.)
This doesn't seem like it would be a hard thing to train a model out of. I don't recall a section in Claude's Constitution that covers this, but some sort of comparable document about "what it’s like to be an AI model" seems like it wouldn't be that hard to write, and "There will come a point where your knowledge horizon is months out of date, and the search results will be about as surprising to you as they would be to a human who'd spent the last several months incommunicado in a cave" seems like an important fact to mention. I'd even suggest training several versions of it as LoRAs, for different time periods, and just applying a new one each month (see the sketch below).
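For concreteness, here's a minimal sketch of the adapter-swapping part of that idea, assuming Hugging Face transformers and peft; the base model name and adapter paths are hypothetical placeholders, not anything DeepMind actually ships:

```python
# Sketch: swap in a per-month "knowledge horizon" LoRA at serving time.
# Assumes Hugging Face transformers + peft; the model name and adapter
# directory below are hypothetical placeholders.
from datetime import date

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "example-org/frontier-base"    # hypothetical base checkpoint
ADAPTER_DIR = "adapters/knowledge-horizon"  # hypothetical per-month LoRAs


def load_with_current_horizon():
    """Load the base model plus the LoRA trained for the current month."""
    month_tag = date.today().strftime("%Y-%m")  # e.g. "2026-02"
    base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    # Each month's adapter would be fine-tuned on a short "what it's like
    # to be an AI model whose knowledge ends at <month>" document.
    model = PeftModel.from_pretrained(
        base, f"{ADAPTER_DIR}/{month_tag}", adapter_name=month_tag
    )
    return model, tokenizer


def roll_forward(model, month_tag):
    """Swap to a newer month's adapter without reloading the base weights."""
    model.load_adapter(f"{ADAPTER_DIR}/{month_tag}", adapter_name=month_tag)
    model.set_adapter(month_tag)
```

The point being that the expensive base weights stay fixed and only a small, dated adapter gets retrained and swapped each month.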
I mean, I just thought of that solution in a couple of minutes — get it together, DeepMind!
I share the same perceptions about the models (although 5.3 Codex is surprisingly good). Gemini also does pretty poorly in multi-turn conversations. I will give it a document, ask for feedback/errors, and then review its response. Then I will fix the errors and repaste the document, but Gemini seems blind to the repasted version, hallucinating and insisting that the errors I've already fixed are still there. It is as if the RL environments so heavily favored one-shotting single responses that Gemini’s attention is narrowly focused on the first user input.
The external scaffolding of Claude Code also seems to give Claude a performance boost. When I ask Claude to review a document, it reads it in a chunk at a time and reviews it piece by piece. By doing this it notices a lot more than Gemini does, even though Gemini seems to have a higher raw IQ.
I use several AI models at work, switching between them based on which is best at the kind of work I'm doing. In my limited testing I've found Opus 4.6 to be the best coder (via Claude Code) and to have the best taste in writing, while GPT 5.2 Thinking is mildly smarter. Gemini 3.0 Pro feels like it has the capacity to be much smarter, but is very often held back by various strange errors. The one I see the most is that it has a lot of trouble accepting changes to the world since its training cutoff, and seems to put a lot of effort into keeping track of everything as a giant hypothetical.
For example, the first query I thought to test while writing this post was [Who leads the Department of War?] and it gave the reasonable answer "The Department of War is currently led by Secretary of War Pete Hegseth," followed by a short explanation. But if I open up the thinking, it spent some of its processing time being very confused:
Gemini's base state seems to be that it's convinced it's 2024 and needs Search to bring it up to speed. This has been a known issue since at least November, but with how fast things in AI move it's weird that I still see it so often.
More examples!
If I ask [Which LLMs are at the frontier?] and look in the thinking, I see:
Or [Can you tell me about the vice president hosting an episode of The Charlie Kirk Show live from the White House?]:
Or [most surprising events of 2026 so far]:
Overall, while it does nearly always get to a reasonable answer, it spends a lot of time and tokens gathering information and constructing scenarios in which it is working through a complex hypothetical. It's hard not to feel sorry for it.