Remember that line in The Big Short: "How are you fucking us?" That's how I feel about announcements from OpenAI these days. There's some bullshit in there somewhere, because there's always some bullshit hiding in OpenAI announcements. There might also be a real, substantive, important advancement, but until someone figures out where the bullshit is hiding, I'm gonna stay real skeptical of the claims.
The claim I'm squinting at real hard is this one:
We developed new techniques that make LLMs a lot better at hard-to-verify tasks.
Like, there's some murkiness: they apparently awarded the gold to themselves rather than the IMO organizers doing it, and the other competitive-programming contest where (presumably) the same model did well was OpenAI-funded. But whatever, I'm willing to buy that they have a model that legitimately achieved roughly this performance (even if a fairer set of IMO judges would've docked points to slightly below the unimportant "gold" threshold).
But since when are math proofs, or competitive programming, considered good examples of hard-to-verify tasks? I'm surprised I haven't seen anyone challenge that. (FYI, I do not think they are good examples of hard-to-verify tasks, fight me. Edit: Also, inasmuch as proofs do have hard-to-verify properties like "being well-written", their model sure failed at that.) So pending some results in an actual hard-to-verify domain, or convincing technical arguments that this new approach really generalizes, I'm not buying it.
Overall... This is the performance I would have expected out of them just mindlessly RL...
The proofs look very different from how LLMs typically write, and I wonder how that emerged. Much more concise. Most sentences aren't fully grammatically complete. A bit like how a human would write if they didn't care about form and only cared about content and being logically persuasive.
The writing style looks fairly similar to the examples shown in Baker et al. (2025), so it seems plausible that this is a general consequence of doing a lot of RL training rather than something specific to the methodology used for this model. It's still concerning, but I'm glad it doesn't look noticeably less readable than the examples in the Baker et al. paper.
Well, maybe there's some transfer? Maybe habits picked up from the CoT die hard & haven't been trained away with RLHF yet?
A curious thing about current LLMs is that they still don't reliably understand what a proof is (in the informal, practical sense), or when it's appropriate to aim for one, even though they can solve problems that would benefit a lot from doing proofs. It's much easier for a human to reach the point of robustly understanding what a proof is than to sustain practicing proof-based math, and at the margin LLMs might be failing at harder math (or in less popular settings) because they still aren't getting this right. It also makes them much less useful even when they ...
Arguments made in https://epoch.ai/gradient-updates/what-will-the-imo-tell-us-about-ai-math-capabilities were so prescient!
...If the 2025 IMO happens to contain 0 or 1 hard combinatorics problems, it’s entirely possible that AlphaProof will get a gold medal just by grinding out 5 or 6 of the problems—especially if there happens to be a tilt toward hard geometry problems. This would grab headlines, but wouldn’t be much of an update over current capabilities. Still, it seems pretty likely: I give a 70% chance to AlphaProof winning a gold medal overall.
The headline result was obviously going to happen, not an update for anyone paying attention.
The other claims are interesting. GPT-5 release will be a valuable data point and will allow us to evaluate the claim that this reasoning training was not task-specific.
I don’t know if LLMs will become autonomous math researchers, but it seems likely to happen before other kinds of agency, since math has the best feedback loops and is perhaps just the domain best suited to text-based reasoning. Might mean that I’m out of a job.
The headline result was obviously going to happen, not an update for anyone paying attention.
“Obviously going to happen” is very different from ‘happens at this point in time rather than later or sooner, and with this particular announcement by this particular company’. You should still update off this. Hell, I was pretty confident this would be done first by Google DeepMind, so it's a large update for me (I don’t know what for yet, though)!
Your claim that it's “not an update for anyone paying attention” also seems false. I’m sure there are many people who were paying attention and are updating off this for whatever reason, as they likely should.
I generally dislike this turn of phrase: it serves literally no purpose but to denigrate people who are changing their minds in light of evidence, which is just a bad thing to do.
Don't have the link, but it seems DeepMind researchers on X have tacitly confirmed they had already reached gold. What we don't know is whether it was done with a general LLM, like OpenAI's, or a narrower one.
GPT-5 release will be a valuable data point
Doesn't seem like it'll be very informative about this, given this from the OP: "Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don’t plan to release anything with this level of math capability for several months."
Agent foundations research is what I'm talking about, yup. What do you ask the AI in order to make significant progress on agent foundations, and how do you make sure it did so correctly? Are there questions where, even if we don't know the entire theorem we want to ask for a proof of, we can show there aren't many ways to fill in the whole theorem that could be of interest, so that we could, e.g., ask an AI to enumerate which theorems could combine a given set of agency-relevant properties? Something like that (a toy sketch of the shape I mean is below). I've been procrastinating on making a whole post pitching this because I myself am not sure the idea has merit, but maybe there's something to be done here, and if there is, it seems like it could be a huge deal. It might then become possible to ask for significantly more complicated math to be solved, say by framing the request as a search for plausible compressions, simplifications, or generalizations of an expression.
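To make the "fill in the theorem" idea a bit more concrete, here's a toy Lean sketch of the kind of statement schema I have in mind. Everything in it is hypothetical: `Policy`, `Coherent`, and `GoalStable` are made-up placeholders for agency-relevant properties, and the point is only that you can fix the shape of a theorem while leaving a hypothesis `H` as the hole an AI would be asked to enumerate candidates for.

```lean
-- Toy sketch only: all names here are hypothetical placeholders.
-- The idea: fix the agency-relevant conclusions we care about, leave the
-- hypothesis H as a hole, and ask an AI to enumerate candidate H's for
-- which the resulting theorem is both true and interesting.

variable (Policy : Type)
variable (Coherent GoalStable : Policy → Prop)

-- A "theorem schema": any H strong enough to give both properties
-- in particular gives coherence. The enumeration query would be:
-- which H's (beyond the trivial conjunction) make the premise true?
example (H : Policy → Prop)
    (h : ∀ p, H p → Coherent p ∧ GoalStable p) :
    ∀ p, H p → Coherent p := by
  intro p hp
  exact (h p hp).1
```

Obviously the real work is in whether any such enumeration is tractable and whether "of interest" can be pinned down; the sketch is just the shape of the ask.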
Terence Tao has been responding on Mathstodon:
https://mathstodon.xyz/@tao/114881418225852441
He seems to have moved from "AI is decent but not that special" to "you can't compare AI to humans", i.e., from appreciation to denial 2 in the AI cycle, which tends to go:
Intrigue -> Denial -> Appreciation -> Denial 2 -> Fear
Tao's reaction sounds like it might have something to do with this:
According to a friend, the IMO asked AI companies not to steal the spotlight from kids and to wait a week after the closing ceremony to announce results. OpenAI announced the results BEFORE the closing ceremony.
According to a Coordinator on Problem 6, the one problem OpenAI couldn't solve, "the general sense of the IMO Jury and Coordinators is that it was rude and inappropriate" for OpenAI to do this.
OpenAI wasn't one of the AI companies that cooperated with the IMO on testing their models, so unlike the likely upcoming Google DeepMind results, we can't even be sure OpenAI's "gold medal" is legit. Still, the IMO organizers directly asked OpenAI not to announce their results immediately after the olympiad.
Sadly, OpenAI desires hype and clout a lot more than it cares about letting these incredibly smart kids celebrate their achievement, and so it announced the results yesterday.
He's saying that you can compare AI to humans, but that to do it to a professional research standard you need to meet the criteria he lays out in his posts.
OpenAI is currently not attempting to meet this standard. They are hyping their product. That is their prerogative, but Tao is under no obligation to participate in the hype, and he's being clear about the conditions under which he would comment.
"We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling"
My reactions