Remember that line in The Big Short: "How are you fucking us?" That's how I feel about announcements from OpenAI these days. There's some bullshit in there somewhere, because there's always some bullshit hiding in OpenAI announcements. There might also be a real, substantive, important advancement, but until someone figures out where the bullshit is hiding, I'm gonna stay real skeptical of the claims.
The claim I'm squinting at real hard is this one:
We developed new techniques that make LLMs a lot better at hard-to-verify tasks.
Like, there's some murkiness: they apparently awarded the gold to themselves rather than the IMO organizers doing it, and the other competitive-programming contest where (presumably) the same model did well was OpenAI-funded. But whatever, I'm willing to buy that they have a model that legitimately achieved roughly this performance (even if a fairer set of IMO judges would've docked points to slightly below the unimportant "gold" threshold).
But since when are math proofs, or competitive programming, considered good examples of hard-to-verify tasks? I'm surprised I haven't seen anyone challenge that. (FYI, I do not think they are good examples of hard-to-verify tasks, fight me. Edit: Also, inasmuch as proofs do have hard-to-verify properties like "being well-written", their model sure failed at that.) So pending some results in an actual hard-to-verify domain, or convincing technical arguments that this new approach really generalizes, I'm not buying it.
Overall... This is the performance I would have expected out of them just mindlessly RL...
The proofs look very different from how LLMs typically write, and I wonder how that emerged. Much more concise. Most sentences aren't fully grammatically complete. A bit like how a human would write if they didn't care about form and only cared about content and being logically persuasive.
The writing style looks fairly similar to the examples shown in Baker et al. (2025), so it seems plausible that this is a general consequence of doing a lot of RL training rather than something specific to the methodology used for this model. It's still concerning, but I'm glad it doesn't look noticeably less readable than the examples in the Baker et al. paper.
Well, maybe there's some transfer? Maybe habits picked up from the CoT die hard & haven't been trained away with RLHF yet?
A curious thing about current LLMs is that they still don't reliably understand what a proof is (in the informal, practical sense), or when it's appropriate to aim for one, even though they can solve problems that would benefit a lot from doing proofs. It's much easier for a human to reach the point of robustly understanding what a proof is than to sustain practicing proof-based math, and at the margin LLMs might be failing at harder math (or in less popular settings) because they still aren't getting this right. It also makes them much less useful even when they ...
Arguments made in https://epoch.ai/gradient-updates/what-will-the-imo-tell-us-about-ai-math-capabilities were so prescient!
...If the 2025 IMO happens to contain 0 or 1 hard combinatorics problems, it’s entirely possible that AlphaProof will get a gold medal just by grinding out 5 or 6 of the problems—especially if there happens to be a tilt toward hard geometry problems. This would grab headlines, but wouldn’t be much of an update over current capabilities. Still, it seems pretty likely: I give a 70% chance to AlphaProof winning a gold medal overall.
The headline result was obviously going to happen, not an update for anyone paying attention.
The other claims are interesting. GPT-5 release will be a valuable data point and will allow us to evaluate the claim that this reasoning training was not task-specific.
I don’t know if LLMs will become autonomous math researchers, but it seems likely to happen before other kinds of agency, since math has the best feedback loops and is perhaps just the domain best suited to text-based reasoning. Might mean that I’m out of a job.
The headline result was obviously going to happen, not an update for anyone paying attention.
“Obviously going to happen” is very different from ‘happens at this point in time rather than later or sooner, and with this particular announcement by this particular company’. You should still update off this. Hell, I was pretty confident this would be done first by Google DeepMind, so it's a large update for me (I don’t know what for yet, though)!
Your claim that it's “not an update for anyone paying attention” also seems false. I’m sure there are many people who were paying attention and are updating off this for whatever reason, as they likely should.
I generally dislike this turn of phrase: it serves literally no purpose but to denigrate people who are changing their minds in light of evidence, which is just a bad thing to do.
Don't have the link, but it seems DeepMind researchers on X have tacitly confirmed they had already reached gold. What we don't know is whether it was done with a general LLM, like OpenAI's, or a narrower one.
GPT-5 release will be a valuable data point
Doesn't seem like it'll be very informative about this, given this from the OP: "Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don’t plan to release anything with this level of math capability for several months."
Agent foundations research is what I'm talking about, yup. What do you ask the AI in order to make significant progress on agent foundations, and how do you make sure it did so correctly? Are there questions where, even if we don't know the entire theorem we want to ask for a proof of, we can show there aren't many ways to fill in the whole theorem that could be of interest, so that we could, e.g., ask an AI to enumerate which theorems could combine a given set of agency-relevant properties? Something like that (a toy sketch of the shape I mean is below). I've been procrastinating on making a whole post pitching this because I myself am not sure the idea has merit, but maybe there's something to be done here, and if there is, it seems like it could be a huge deal. It might then become possible to ask for significantly more complicated math to be solved, say by framing the request as a search for plausible compressions, simplifications, or generalizations of an expression.
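To make the "fill in the theorem" idea a bit more concrete, here's a toy Lean sketch of the kind of statement schema I have in mind. Everything in it is hypothetical: `Policy`, `Coherent`, and `GoalStable` are made-up placeholders for agency-relevant properties, and the point is only that you can fix the shape of a theorem while leaving a hypothesis `H` as the hole an AI would be asked to enumerate candidates for.

```lean
-- Toy sketch only: all names here are hypothetical placeholders.
-- The idea: fix the agency-relevant conclusions we care about, leave the
-- hypothesis H as a hole, and ask an AI to enumerate candidate H's for
-- which the resulting theorem is both true and interesting.

variable (Policy : Type)
variable (Coherent GoalStable : Policy → Prop)

-- A "theorem schema": any H strong enough to give both properties
-- in particular gives coherence. The enumeration query would be:
-- which H's (beyond the trivial conjunction) make the premise true?
example (H : Policy → Prop)
    (h : ∀ p, H p → Coherent p ∧ GoalStable p) :
    ∀ p, H p → Coherent p := by
  intro p hp
  exact (h p hp).1
```

Obviously the real work is in whether any such enumeration is tractable and whether "of interest" can be pinned down; the sketch is just the shape of the ask.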
Terence Tao has been responding on Mathstodon:
https://mathstodon.xyz/@tao/114881418225852441
He seems to have moved from "AI is decent but not that special" to "you can't compare AI to humans", i.e., from appreciation to denial 2 in the AI cycle, which tends to go:
Intrigue -> Denial -> Appreciation -> Denial 2 -> Fear
Tao's reaction sounds like it might have something to do with this:
According to a friend, the IMO asked AI companies not to steal the spotlight from kids and to wait a week after the closing ceremony to announce results. OpenAI announced the results BEFORE the closing ceremony.
According to a Coordinator on Problem 6, the one problem OpenAI couldn't solve, "the general sense of the IMO Jury and Coordinators is that it was rude and inappropriate" for OpenAI to do this.
OpenAI wasn't one of the AI companies that cooperated with the IMO on testing their models, so unlike the likely upcoming Google DeepMind results, we can't even be sure OpenAI's "gold medal" is legit. Still, the IMO organizers directly asked OpenAI not to announce their results immediately after the olympiad.
Sadly, OpenAI desires hype and clout a lot more than it cares about letting these incredibly smart kids celebrate their achievement, and so it announced the results yesterday.
He's saying that you can compare AI to humans, but that to do it to a professional research standard you need to meet the criteria he lays out in his posts.
OpenAI is currently not attempting to meet this standard. They are hyping their product. That is their prerogative, but Tao is under no obligation to participate in the hype, and he's being clear about the conditions under which he would comment.
"We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling"
My reactions