The takeaway from the OpenAI IMO entry is a system that produced somewhat legible correct proofs, possibly without being overly specialized for math contests. That would be a new emergent capability, one still absent in the current publicly available frontier LLMs (they can't reliably write proofs even for the difficult problems where they do manage to arrive at the correct answers).
The takeaway from the GDM entry is that we now have a system that produces well-written correct proofs (even if it was somewhat specifically engineered to get this done, so the capability isn't as emergent as in the OpenAI case). This system might be useful for generating training data for next-gen LLMs that would directly understand what practical proofs are, as individual (cheap, fast) reasoning models rather than as complicated scaffolds on top of reasoning models.
So there's now a concrete reason to expect this basic capability for practical proofs to become widespread in 2026, which is needed in all sorts of STEM applications, as well as for cheap reliability of conclusions about simple things. This is new, not something that for example the 2024 GDM IMO entry helped with, and it wasn't obvious that scaling RLVR at current levels of compute would enable this.
Getting proofs right is also directionally like writing code well rather than merely making it do what you need, a less clearly verifiable aspect of quality, so this is also a positive update on the programming capabilities of 2026 models.
which is needed in all sorts of STEM applications, as well as for cheap reliability of conclusions about simple things.
Reminds me a bit of this post on why LLMs haven't seen more use within theoretical computer science, and how they could come to be seen as more valuable.
While following STOC 2025 presentations, I started to pay attention not to what is present, but to what is conspicuously missing: any references to the use of modern generative AI and machine learning to accelerate research in theoretical computer science.
Sure, there were a couple of lighthearted references to chatbots in the business meeting, and Scott Aaronson’s keynote concluded with a joke about generative AI doing our work soon, but it seemed like nobody is taking AI seriously as something that could transform the TCS community into a better place.
It is not the case that TCS in general is still living in the 1970s: the TCS community is happy to work on post-quantum blockchain and other topics that might be considered modern or even hype. There were even some talks of the flavor “TCS for AI”, with TCS researchers doing work that aims at helping AI researchers understand the capabilities and limitations of generative chatbots. But where is work of the flavor “AI for TCS”? [...]
When I asked various colleagues about this, the first reaction was along the lines “but what could we possibly do with AI in TCS, they can’t prove our theorems (yet)?” So let me try to explain what I mean by “AI for TCS”, with the help of a couple of example scenarios of possible futures. [...]
All STOC/FOCS/SODA/ICALP submissions come with a link to an online appendix, where the main theorems of the paper have been formalized in Lean. This is little extra work for the authors, as they only need to write the statement of the theorem in Lean, and generative AI tools will then work together with Lean to translate the informal human-written proof into something that makes Lean happy.
I think this would be an amazing future. It would put advances in TCS on a much more robust foundation. It would help the program committees and reviewers get confidence that the submissions are correct. We would catch mistakes much earlier, before they influence follow-up work. It would make it much easier to build on prior work, as all assumptions are stated in a formally precise manner. [...]
Most STOC/FOCS/SODA/ICALP submissions come with a link to an online appendix, with efficient Rust implementations of all the new algorithms (and other algorithmic entities such as reductions) that were presented in the paper. The implementations come with test cases that demonstrate that the algorithm indeed works correctly, and there are benchmarks that compare the practical performance of the algorithm with all relevant prior algorithms. Everything is packaged as a neat Rust crate that practitioners can readily use with “cargo add” as part of their programs. This is little extra work for the authors, as generative AI tools will translate the informal human-written pseudocode into a concrete Rust implementation.
I think this would also be an amazing future. It would again help with gaining confidence that the results are indeed correct, but it would also help tremendously with ensuring that the work done in the TCS community will have broad impact in the industry: why not use the latest, fastest algorithms also in practice, if they are readily available as a library? It would also help to identify which algorithmic ideas are primarily of theoretical interest, and which also lead to efficient solutions on real-world computers, and it could inspire new work when TCS researchers see concretely that e.g. current algorithms are not well-compatible with the intricacies of modern superscalar pipelined vectorized CPUs and their memory hierarchies.
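To make the first scenario concrete, here is a rough sketch (my own illustration, not from the quoted post or any actual paper) of the authors' half of the Lean workflow: they write only the statement of the main theorem, and the AI-plus-Lean toolchain is expected to fill in a machine-checked proof translated from the informal write-up.

```lean
import Mathlib

-- Hypothetical "online appendix" entry: the authors state the paper's main
-- theorem; an AI tool working with the Lean checker would then replace
-- `sorry` with a formal proof translated from the human-written argument.
theorem main_theorem (n : ℕ) (hn : 2 ≤ n) :
    ∃ p, Nat.Prime p ∧ p ∣ n := by
  sorry
```

And for the second scenario, a minimal sketch of the kind of crate a paper might ship: one function translated from the paper's pseudocode plus unit tests. The algorithm, names, and test cases here are illustrative assumptions, not taken from any real submission.

```rust
/// Hypothetical library function of the kind the quoted post imagines:
/// a paper's pseudocode for a linear-time maximum-subarray routine,
/// translated into Rust. Returns the maximum sum over all non-empty
/// contiguous subarrays, or `None` for an empty slice (Kadane's algorithm).
pub fn max_subarray_sum(values: &[i64]) -> Option<i64> {
    let (&first, rest) = values.split_first()?;
    let mut best_ending_here = first;
    let mut best_overall = first;
    for &x in rest {
        // Either extend the current subarray or start a new one at `x`.
        best_ending_here = x.max(best_ending_here + x);
        best_overall = best_overall.max(best_ending_here);
    }
    Some(best_overall)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn handles_mixed_signs() {
        assert_eq!(max_subarray_sum(&[-2, 1, -3, 4, -1, 2, 1, -5, 4]), Some(6));
    }

    #[test]
    fn handles_all_negative_and_empty() {
        assert_eq!(max_subarray_sum(&[-3, -1, -2]), Some(-1));
        assert_eq!(max_subarray_sum(&[]), None);
    }
}
```

A practitioner could then depend on such a crate with `cargo add` and run `cargo test` against the shipped test cases, which is exactly the workflow described above.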
"Practical" sounds like you care more about guaranteed correct results than about understanding how you got them. If you want a "practical" proof, why isn't it better to produce something that a mechanical checker can verify than to produce something legible to a human?
An annoying thing about current chatbots is that even when they've plausibly solved a math problem, they often can't explain the solution (a proof explains the truth of a proposition). Legible informal proofs are also more central to math than rigor or even correctness. Math is about understanding: occasionally you end up understanding something without yet being able to make it rigorous, initial attempts at making it rigorous render it incorrect, but eventually it can be made correct in some other setting (one that in principle might need to be invented first).
The ‘barely speak English’ part makes the solution worse in some ways but actually makes me give their claims to be doing something different more credence rather than less
I think people are overupdating on that. My impression is that gibberish like this is the default way RL makes models speak, and that they need to be separately fine-tuned to produce readable outputs. E.g., the DeepSeek-R1 paper repeatedly complained about "poor readability" with regards to DeepSeek-R1-Zero (their pure-RL training run without the SFT cold start).
Actually This Seems Like A Big Deal
If we take all of OpenAI’s statements at face value
Nitpick, but: I'm struck by how many times they mentioned the word "general". They seem to be trying to shoehorn it into every tweet. And it's weird. Like, of course it was done using a general system. That's, like, the default assumption? Whenever did OpenAI ship task-specific neurosymbolic systems, or whatever it is that would be the opposite of "general" here?
I guess maybe it makes sense if they're assuming that people would be primed by DeepMind to expect math-specific systems here? It still seems kind of excessive. I buy that they have a model that really did score that well on the IMO benchmark, but the more times they remind us how very general it is, the more I'm prompted to think that maybe it's not quite so general after all.
There's also the desperate rush with which they started talking about it: just after the closing ceremony, seemingly to adhere to the most bare-bones technically-correct interpretation of "don't steal the spotlight", and apparently without even having any blog post/official announcement prepared, and without having decided what they're willing to reveal about their technical setup.
I think it has all the telltale signs of being a heavily spun marketing op. There's truth at the core of it, yes, but I'm suspicious that all the "reading between the lines" that people are doing to spot the "big implications" here is precisely what the "marketing op" part of it was intended to cause.
Given that P3 was unusually easy this year
Not only P3, apparently all problems except P6 were really easy this year. I don't think this is particularly load-bearing for anything, but seems like another worthwhile incremental update to make.
Also, it's funny that we laugh at xAI when they say stuff like "we anticipate Grok will uncover new physics and technology within 1-2 years", but when an OpenAI employee goes "I wouldn’t be surprised if by next year models will be deriving new theorems and contributing to original math research", that's somehow more credible. Insert the "know the work rules" meme here.
(FWIW, I consider both claims pretty unlikely but not laughably incredible.)
now a bunch of robots can do it. as someone who has a lot of their identity and their actual life built around “is good at math,” it’s a gut punch. it’s a kind of dying. [...] multiply that grief out by *every* mathematician, by every coder, maybe every knowledge worker, every artist… over the next few years… it’s a slightly bigger story
Have there been any rationalist writings on this topic? This cluster of social dynamics, this cluster of emotions? Dealing with human obsolescence, the end of human ability to contribute, probably the end of humans being able to teach each other things, probably the end of humans thinking of each other as "cool"? I've read Amputation of Destiny. Any others?
Congratulations, as always, to everyone who got to participate in the 2025 International Mathematical Olympiad, and especially to the gold and other medalists. Gautam Kamath highlights 11th grader Warren Bei, who in his 5th (!) IMO was one of five participants with a perfect 42/42 score, along with Ivan Chasovskikh, Satoshi Kano, Leyan Deng and Hengye Zhang.
Congratulations to Team USA, you did not ‘beat China’ but 2nd place is still awesome. Great job, China, you got us this time, three perfect scores is crazy.
You’ve all done a fantastic, amazingly hard thing, and as someone who tried hard to join you and only got as far as the [****, year censored because oh man I am old] USAMO, who would probably have gotten 0/42 on this IMO if I had taken it today, and who knows what it is like to practice for the USAMO in a room with multiple future IMO team members who must have thought I was an idiot, let me say: I am always in awe.
But that’s not important right now.
This Is About AI And It Is Kind Of A Big Deal
What matters is that Google and OpenAI have LLMs with gold medal performances, each scoring exactly the threshold of 35/42 by solving the first five of the six problems.
This is up from Google’s 28/42 performance last year, which was achieved with a longer time limit. The methods used by both are presented as more general, whereas last year’s version was a more specialized effort.
The new scores were a 92nd percentile result at the event.
Google did this in official collaboration with the IMO, announcing on Monday as per the IMO’s request. OpenAI did it on their own and announced a bit earlier, so we are taking their word on many details.
This was not expected. Prediction markets thought gold this year was unlikely.
What matters more is how they did it, with general purpose LLMs without tools, in ways that represent unexpected and large future gains in other reasoning as well.
The more I think about the details here, the more freaked out I get rather than less. This is a big deal. How big remains to be seen, as we lack details, and no one knows how much of this will generalize.
Skeptics Prematurely Declare Victory
The IMO 2025 results quickly came in for released models.
This was an early sign that problem 3 was easier than usual this year, and a strong performance by the release version of Gemini 2.5 Pro.
So this is how it started seven hours before OpenAI announced its result:
I disagree, I think this is a gotcha in the positive sense. People took ‘the AIs that weren’t aimed at this problem and that are publicly released are only doing okay relative to the best humans, and have not yet proven themselves the best’ to mean ‘look at the pathetic AIs,’ one day before we learned that, well, actually, AI could do it after all, in a way prediction markets did not expect.
I do think people need to update their models of the future.
Also, it’s kind of a full gotcha given this:
I don’t have time to investigate how ‘legit’ the Gemini 2.5 Pro solutions are, including in terms of how much you have to cheat to get them.
DeepMind Declares Victory
Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad. Google’s solutions are here.
Google’s answers were even in nice form.
What else did they say about how they did this?
That sounds mostly rather general. There’s some specialized IMO context, but orders of magnitude less than what IMO competitors devote to this.
Um, Elon, no, and I remind you that Grok 4 got 11.9%. Which for a human would be super impressive, but seriously, borderline trivial?
OpenAI Declares Victory
OpenAI claimed its victory first, right after the closing ceremony and before the party, whereas Google DeepMind waited to announce until the following Monday.
The most impressive thing about OpenAI’s result is that they claim this is not an IMO-specific model, and that it uses only general-purpose techniques.
Indeed. Purely getting the gold medal is surprising but not that big a deal. The way they got the result, assuming they’re reporting accurately? That’s a really big deal.
Bro I Don’t Know
What about Problem 6? Did the programs submit incorrect solutions?
Note that if you are maximizing, then when time runs out, if you have anything at all then yes, you do submit the best incorrect solution you have, because it might get you partial credit, although this rarely works out.
Not So Fast?
If one person gets to say ‘Not So Fast’ about this sort of thing, Tao is that one person.
It is entirely fair to say that if you don’t disclose conditions in advance, and definitely if you don’t disclose conditions after the fact, it is difficult to know exactly what to make of the result. Tao’s objections are valid.
The catch is that this is about grading the horse’s grammar, as opposed to the observation that the horse can talk and rather intelligently and with rapidly improving performance at that.
Thus, while the objections are valid, as long as we know the AIs had no access to outside tools or to the internet (which is confirmed), we should seek answers to these other questions, but the concerns primarily matter for comparisons between models within a reasonably narrow (in the grand scheme of things) band of capabilities.
I also would note that if OpenAI did essentially do the ‘team thinks in parallel’ thing, where it had multiple inference processes running simultaneously on multiple computers, well, that is something AIs can do in the real world, and this seems entirely fair for our purposes, the same way it is fair that humans can fire multiple neurons at once. It’s totally fair to also want a limited-compute or one-thread category or what not, but that’s not important right now.
To use Tao’s metaphor, if you took 99.99% of high school students, you could fully and simultaneously apply all these interventions other than formal proof assistants and internet searches or hints so clear they give you the first point on a question, and you still almost always get zero.
As a former not only math competitor but also Magic: The Gathering competitor, absolutely all these details matter for competitions, and I respect the hell out of getting all of those details right – I just don’t think that, in terms of takeaways, they change the answer much here.
In other words? Not Not So Fast. So Fast.
Not Announcing So Fast
OpenAI chose not to officially collaborate with the IMO. They announced their result after the IMO closing ceremony and prior to the IMO 2025 closing party. Those who did collaborate agreed to wait until the following Monday, which was when Google announced. By going first, OpenAI largely stole the spotlight on this from Google, yet another case of Google Failing Marketing Forever.
A question that was debated is, did OpenAI do something wrong here?
Mikhail Samin claimed that they did, putting their hype and clout ahead of the kids celebrating their achievements, against the wishes of the IMO.
OpenAI’s Noam Brown replied that they waited until after the closing ceremony exactly to avoid stealing the spotlight. He said he was the only person at OpenAI to speak to anyone at the IMO, and that person only requested waiting until after the ceremony, so that is what OpenAI did.
Not collaborating with the IMO was a choice that OpenAI made.
I have reflected on this. It is not the main thing, the results are the main thing. I do think on reflection that while OpenAI did not break any agreements or their word, and strictly speaking they do not owe the IMO or the kids anything, and this presumably net increased the focus on the kids, this still represents a meaningful failure to properly honor the competition and process, and to offer us the proper opportunities for verification, and they should have known that this was the case. I do get that this was a small team’s last minute effort, which makes me more understanding, but it’s still not great.
Thus, I was careful to wait to write this until after Google’s results were announced, and have placed Google’s announcement before OpenAI’s in this post, even though, based on the details OpenAI has claimed, I do think their achievement here is likely the more meaningful one. Perhaps that is simply Google failing marketing again by failing to share details.
Ultimately, the reason OpenAI stole my spotlight is that their result heralds something general and new in a way that Google’s announcement doesn’t.
Are We Still Waiting On Anyone Else?
With Google sharing its results I don’t want to wait any longer, but note Harmonic?
This would be a weird flex if they didn’t also get gold, although it looks like they would have done it in a less general and thus less ultimately interesting way. On the flip side, they are not a big lab like Google or OpenAI, so that’s pretty impressive.
This Was Not Widely Expected
I think the failure to expect this was largely a mistake, but Manifold tells a clear story:
Final result, July 2025: 2025 (now). Buckle up, Dorothy.
Some people did expect it, some of whom offered caveats.
Jevons Paradox Strikes Again
The AIs took the IMO under the same time limits as the humans, and success was highly valued, so it is no surprise that they used parallel inference to get more done within that time frame, trading efficiency for speed.
Nothing To Worry About
Sure, you solved this particular problem, but that would never generalize, right? That part is the same hype as always?
A grand tradition is:
Or: AI can do the things it can do, and can’t do the things it can’t do, they’re hard.
An oldie but a goodie:
I mean, it probably can’t even do real math, right?
So yes, we still do not know for sure whether being able to do [X] will extend to doing [Y], either with the same model or with a future different model, when [X] and [Y] are distinct skills such that the humans who do [X] cannot yet do [Y] and training humans to do [Y] does not give them the ability to do [X]. However, please try to think ahead.
That is all entirely fair. An IMO problem is measured in hours, not months, and is bounded in important ways. That is exactly the paradigm of METR, and the one being talked about by Noam Brown and Alexander Wei: that we have now made the move from 10 minute problems to 100 minute problems.
That does not mean we can yet solve 10,000 minute or 1 million minute problems, but why would you expect the scaling to stop here? As I discussed in the debates over AI 2027, it makes sense to think that these orders of magnitude start to get easier rather than harder once you get into longer problems. If you can do 100 minute problems, that doesn’t mean you can easily go to 1,000 or a million, but if you can do 1 million, I bet you can probably do 1 billion without fundamentally changing things that much, if you actually have that kind of time. At some point your timeline is ‘indefinite’ or ‘well, how much time and compute have you got?’
I am very confident we are not ready. If we are fortunate we might survive, but we definitely are not ready.
So What If It Can Do The Math
I grade this as minus one million points for asking the wrong questions.
Never confuse costs and benefits #RulesForLife, and never reason from a price change.
(This defines ‘math’ rather narrowly as advanced Real Math that mathematicians and maybe quants and other professionals do, not the kind of math that underlies absolutely everything we do all day, since Fake Math is already mostly automated.)
The value of automating something is not determined by how much we spent on it before it got automated. The value is determined by how much additional value we get out of it once it is automated, which might involve a lot more production and very diffuse benefits.
The Challenge Bet
Back in February 2022, Eliezer Yudkowsky made a bet with Paul Christiano about IMO performance by 2025. The results were not super clear cut if you look at the details, as Christiano was in large part doubting that the hardest problem would be solved, and indeed the hardest problem was #6 and was not solved, but a gold medal was still achieved.
At this point, we have a lot of people who have updated far past 40% chance of transformational AI by 2040 and have 40% for dates like 2029.
Actually This Seems Like A Big Deal
If we take all of OpenAI’s statements at face value, think about what they actually did.
People On The Internet Sometimes Lie
Going back to Tao’s objections, we know essentially nothing about this new model, or about what Google did to get their result. Given that P3 was unusually easy this year, these scores are perhaps not themselves that terribly impressive relative to expectations.
Can we trust this? It’s not like OpenAI has never misled us on such things in the past.
In terms of the result being worthy of a 35/42, I think we can mostly trust that. They shared the solution, in its garbled semi-English, and if there was something that would have lost them points I think someone would have spotted it by now.
In terms of OpenAI otherwise cheating, we don’t have any proof either way, but I think the chances of this are quite low. There are different kinds of deception or lies, different parts of OpenAI are differently trustworthy, and this kind of lie is not in their nature, nor do they have much incentive to try it given the chance it gets exposed and the fact that if it’s not real then they won’t be able to pay it off later.
The place where one might doubt the most is, can we trust that what OpenAI did this time is more general, in the ways they are claiming?
The ‘barely speak English’ part makes the solution worse in some ways but actually makes me give their claims to be doing something different more credence rather than less. It also should worry anyone who wants to maintain monitorable chain of thought.
Then again, one could say that the version that does it better, and more naturally, is thus more important, for exactly the same reasons.
We only have OpenAI’s word on the details of how this went down. So what to think?
I am mostly inclined to believe them on the main thrust of what is going on. That doesn’t mean that this result will generalize. I do give them credit for having something that they believe came out of a general approach, and that they expect to generalize.
Still, it’s reasonable to ask what the catch might be, since there’s always going to be a catch. Certainly it is plausible that this was, as Miles suggested, RLed to within an inch of its life, and that its starting to be unable to speak English means the opposite of what is claimed: that it is losing its generality, or that things are otherwise going off the rails.
The thing is, to me this doesn’t feel like it is fake. It might not be a big deal, it might not transfer all that well to other contexts, but it doesn’t feel fake.
To wrap up, another reminder that no, you can’t pretend none of this matters, and both the Google and OpenAI results matter and should update you: