I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).

We evaluated our models on the 2025 IMO problems under the same rules as human contestants: two 4.5-hour exam sessions, no tools or internet, reading the official problem statements, and writing natural language proofs.


Why is this a big deal? First, IMO problems demand a new level of sustained creative thinking compared to past benchmarks. In reasoning time horizon, we’ve now progressed from GSM8K (~0.1 min for top humans) → MATH benchmark (~1 min) → AIME (~10 mins) → IMO (~100 mins).

Second, IMO submissions are hard-to-verify, multi-page proofs. Progress here calls for going beyond the RL paradigm of clear-cut, verifiable rewards. By doing so, we’ve obtained a model that can craft intricate, watertight arguments at the level of human mathematicians.

https://github.com/aw31/openai-imo-2025-proofs/blob/main/problem_1.txt

Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.

 In our evaluation, the model solved 5 of the 6 problems on the 2025 IMO. For each problem, three former IMO medalists independently graded the model’s submitted proof, with scores finalized after unanimous consensus. The model earned 35/42 points in total, enough for gold! 🥇

Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don’t plan to release anything with this level of math capability for several months.

 Still—this underscores how fast AI has advanced in recent years. In 2021, my PhD advisor @JacobSteinhardt had me forecast AI math progress by July 2025. I predicted 30% on the MATH benchmark (and thought everyone else was too optimistic). Instead, we have IMO gold.

If you want to take a look, here are the model’s solutions to the 2025 IMO problems! The model solved P1 through P5; it did not produce a solution for P6. (Apologies in advance for its … distinct style—it is very much an experimental model 😅)

https://github.com/aw31/openai-imo-2025-proofs/

74 comments

You remember that line in The Big Short, "How are you fucking us?"? That's how I feel about announcements from OpenAI these days. Somehow, there's some bullshit in there, because there's always some bullshit hiding in OpenAI announcements. There might also be a real substantive important advancement, but until someone figures out where the bullshit is hiding, I'm gonna be real skeptical of the claims.

The claim I'm squinting at real hard is this one:

We developed new techniques that make LLMs a lot better at hard-to-verify tasks. 

Like, there's some murkiness with them apparently awarding gold to themselves instead of IMO organizers doing it, and with that other competitive-programming contest at which presumably-the-same model did well being OpenAI-funded. But whatever, I'm willing to buy that they have a model that legitimately achieved roughly this performance (even if a fairer set of IMO judges would've docked points to slightly below the unimportant "gold" threshold).

But since when are math proofs, or competitive programming, considered good examples of hard-to-verify tasks? I'm surprised I haven't seen anyone challenging that. (FYI, I do not think they are good examples of hard-to-verify tasks, fight me. Edit: Also, inasmuch as proofs have hard-to-verify properties like "being well-written", their model sure failed at that.) So pending some results in an actual hard-to-verify domain, or convincing technical arguments that this new approach would really generalize, I'm not buying that.

Overall... This is the performance I would have expected out of them just mindlessly RL... (read more)

6ryan_greenblatt
When Noam Brown says "hard-to-verify", I think he means that natural language IMO proofs are "substantially harder to verify": he says "proofs are pages long and take experts hours to grade". (Yes, there are also things which are much harder to verify, like claims that experts still strongly disagree about after years of discussion. Also, for IMO proofs, "hours to grade" is probably overstated?)

Also, I interpreted this as mostly in contrast to cases where outputs are trivial to programmatically verify (or reliably verify with a dumb LLM) in the context of large-scale RL. E.g., you can trivially verify the answers to purely numerical math problems (or competitive programming or other programming situations where you have test cases). Indeed, OpenAI LLMs have historically been much better at numerical math problems than proofs, though possibly this gap has now been closed (at least partially).

I think this is overall reasonable if you interpret "hard-to-verify" as "substantially harder to verify", and I think this is probably how many people would read it by default. I don't have a strong view about whether this method will actually generalize to other cases where experts can verify things with high agreement in a few hours.

(Noam Brown doesn't say anything about competitive programming, so I'm not sure why you mentioned that. Competitive programming is trivial to verify.)
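
To make the contrast concrete, here is a minimal sketch of the "trivially verifiable" reward regime; purely illustrative, not OpenAI's actual setup, with `numeric_reward`, `code_reward`, and the test-case format invented for this example:

```python
import subprocess

def numeric_reward(model_answer: str, reference: str) -> float:
    """Reward 1.0 iff the final answer string exactly matches the reference."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

def code_reward(program: str, test_cases: list[tuple[str, str]]) -> float:
    """Fraction of (stdin, expected-stdout) test cases a candidate program
    passes. Sandboxing and error handling are omitted for brevity."""
    passed = 0
    for stdin, expected in test_cases:
        result = subprocess.run(
            ["python3", "-c", program],
            input=stdin, capture_output=True, text=True, timeout=5,
        )
        passed += result.stdout.strip() == expected.strip()
    return passed / len(test_cases)
```

A multi-page natural-language proof admits no check this cheap: the signal has to come from grading prose (here, three former IMO medalists per problem), which is the sense in which it is "hard(er) to verify".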
6Thane Ruthenis
Not sure about this. The kind of "hard-to-verify" I care about is e.g. agenty behavior in real-world conditions. I assume many other people are also watching out for that specifically, and that capability researchers are deliberately aiming for it. And I don't think the proofs are any evidence for that.

The issue is that there exists, in principle, a way to easily verify math proofs: translate them into a formal language and run them through a theorem-verifier. So the "correct" way for gradient descent to solve this was to encode some sort of internal theorem-verifier into the LLM.

Even more broadly, we know that improved performance at IMO could be achieved by task-specific models (AlphaGeometry, AlphaProof), which means that much of the IMO benchmark is not a strong signal of general intelligence. A general intelligence can solve it, but one needn't be a general intelligence to do so, and gradient descent prefers shortcuts and shallow solutions...

They say they're not using task-specific methodology, but what does this actually mean? Does it mean they did not even RL the model on math-proof tasks, but RL'd it on something else, and the gold-level IMO performance arose by transfer learning? Doubt it. Which means this observation doesn't really distinguish between "the new technique is fully general and works in all domains" and "the new technique looks like it should generalize because of secret technical details, but it doesn't actually; it only worked here because it exploited the underlying easy-to-verify properties of the task".
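
To illustrate the theorem-verifier point: once a statement is formalized, checking a candidate proof is mechanical type-checking rather than expert judgment. A minimal Lean 4 sketch (illustrative only; the theorem names are made up, and nothing here is a claim about how the model was actually trained):

```lean
-- The kernel type-checks the proof term; no human grader is involved.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Even a small case analysis is discharged by a decision procedure
-- and machine-checked the same way.
theorem mod_two_cases (n : Nat) : n % 2 = 0 ∨ n % 2 = 1 := by
  omega
```

The catch, of course, is that the IMO entries were graded as natural-language proofs, not formal ones, so any such verifier would have to live inside the training pipeline (or the model itself) rather than in the grading.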
2Amalthea
Sure, math is not an example of a hard-to-verify task, but I think you're getting unnecessarily hung up on these things. It does sound like it may be a new and, in a narrow sense, unexpected technical development, and it's unclear how significant it is. I wouldn't try to read into their communications much more.
5Thane Ruthenis
I buy that, sure. I even buy that they're as excited about it as they present, that they believe/hope it unlocks generalization to hard-to-verify domains. And yes, they may or may not be right. But I'm skeptical on priors/based on my model of ML, and their excitement isn't very credible evidence, so I've not moved far from said priors.
3Amalthea
Got it! I'm more inclined to generally expect that various half-decent ideas may unlock surprising advances (for no good reason in particular), so I'm less skeptical that this may be true. Also, while math is of course easy to verify, if they haven't significantly used verification in the training process, that makes their claims more reasonable.

The proofs look very different from how LLMs typically write, and I wonder how that emerged. Much more concise. Most sentences are not fully grammatically complete. A bit like how a human would write if they don't care about form and only care about content and being logically persuasive. 

The writing style looks fairly similar to the examples shown in Baker et al. (2025), so it seems plausible that this is a general consequence of doing a lot of RL training, rather than something specific to the methodology used for this model. It's still concerning, but I'm happy that it doesn't look noticeably less readable than the examples in the Baker et al. paper.

8MondSemmel
Maybe tokens are expensive, so there's optimization pressure towards conciseness?
2Seth Herd
Yes, I suspect this is the root of the issue. There are strong economic incentives to optimize for shorter sequences that produce correct answers. It's great that this hasn't harmed legibility of the chain of thought yet, but this pressure will likely create use of jargon that could quickly become a human-unreadable CoT.

I see this as one of the main dangers for effectively faithful CoT. And most of the reasonable hopes for aligning LLM-based AGI that I can see route through faithful CoT.

There's still the possibility that a fresh version of the same model will understand and be happy to interpret the CoT if it's become a unique language of thought. But that's a lot shakier than a CoT that can be read by any model.
8Leon Lang
As I understand it, we don’t actually see the chain of thought here but only the final submitted solution. And I don’t think that a pressure to save tokens would apply to that.

Well, maybe there's some transfer? Maybe habits picked up from the CoT die hard & haven't been trained away with RLHF yet?

3Thane Ruthenis
I'd guess it has something to do with whatever they're using to automatically evaluate performance in "hard-to-verify domains". My understanding is that, during training, those entire proofs would have been the final outputs which the reward function (or whatever) would have taken in and mapped to training signals. So their shape is precisely what the training loop optimized; if so, this shape is downstream of some peculiarities on that end, with the training loop preferring/enforcing this output format.
4ryan_greenblatt
If the AI is iterating on solutions, there is actually pressure to reduce the length of draft/candidate solutions. Then, it might be that OpenAI didn't implement a clean up pass on the final solution (even though there wouldn't be any real pressure to save tokens in the final clean up).
1Thomas Dybdahl Ahle
I think pressure on the final submitted solution is likely. That will encourage more insightful proofs over long monotonous case analyses.
5David Johnston
It's a research prototype, so probably not style tuned
5Tao Lin
This is OpenAI's CoT style. See it in the original o1 blog post: https://openai.com/index/learning-to-reason-with-llms/
7Leon Lang
Here is a screenshot of a chain of thought from the blog post you link: [screenshot omitted]

This looks different from the IMO solutions to me and doesn't have the patterns I mentioned. E.g., the sentences are grammatically complete.
3Mikhail Samin
This is false. It is exactly how RLed LLMs write.
4Leon Lang
Fwiw, I've recently used o3 a lot for requesting proofs, and it writes very differently. Could you give an example of an RLed LLM that writes like these examples? Though I agree with Rauno's comment that it does look like the chain of thought examples from the Baker et al. paper. 
6Mikhail Samin
A friend runs a startup where they do a lot of RL on CoT for a narrow class of tasks; she shares those with me, but I can't share them, sorry. The style is incredibly similar/immediately pattern-matched, though. I think r1 should also have a similar style in its CoT. The proofs o3 outputs are going to be different from the proofs it writes as it thinks about them. This is the style RLed models think in.
3ACCount
Did you have access to the full o3 reasoning trace, or just the final output? The two are not the same style at all.
4Leon Lang
Only the output! I thought Mikhail was referring to the output here, as this is what we see for the IMO problems. But as I see it now, the consensus seems to be something like "The chain of thought of new models does look like the IMO problem solutions, and if you don't train the model to produce final answers that look nice to humans, then they will look like the chain of thought. Probably the experimental model's answers were not yet trained to look nice". Is this your position? I think that's pretty plausible. 
1ACCount
You get the gist. I don't think I've ever seen this specific style, but raw reasoning traces can end up looking even weirder.
3AlphaAndOmega
It sounds a lot like what a human trying to tackle a difficult maths problem might mutter as they did so, without those bits trimmed out of the final product for clarity. 

A curious thing about current LLMs is that they still don't reliably understand what a proof is (in the informal practical sense), or when it's appropriate to aim for one, even though they can solve problems that would benefit a lot from doing proofs. It's much easier for humans to reach the point of robustly understanding what a proof is than to keep practicing math with proofs, and at the margin LLMs might be failing at harder math (or in less popular settings) because they are still not doing it right. It also makes them much less useful even when they ... (read more)

1chasmani
I’m not sure I agree that it is easy for humans to robustly understand proofs. I think it takes really a lot of training to get humans to that point.

Arguments made in https://epoch.ai/gradient-updates/what-will-the-imo-tell-us-about-ai-math-capabilities were so prescient! 

 

If the 2025 IMO happens to contain 0 or 1 hard combinatorics problems, it’s entirely possible that AlphaProof will get a gold medal just by grinding out 5 or 6 of the problems—especially if there happens to be a tilt toward hard geometry problems. This would grab headlines, but wouldn’t be much of an update over current capabilities. Still, it seems pretty likely: I give a 70% chance to AlphaProof winning a gold medal overa

... (read more)
3Afterimage
I have no idea of the maths, but reading through the epoch article it seems to me that this result is entirely unexpected: "but this year I’d only give a 5% chance to either a qualitatively creative solution or a solution to P3 or P6 from an LLM." Sure, it's an unreleased LLM, but it still seems to be an LLM.
9tdko
Worth noting this year's P3 was really easy: Gemini 2.5 Pro even got it some of the time, and Grok 4 Heavy and Gemini Deep Think got problems rated as harder. Still an achievement, though.

From the author of the epoch article:
https://x.com/GregHBurnham/status/1946655635400950211
https://x.com/GregHBurnham/status/1946725960557949227
https://x.com/GregHBurnham/status/1946567312850530522
8Afterimage
This is important context not only for evaluating Greg Burnham's accuracy but also for the Gold Medal headline. If this difficulty chart is accurate (still no idea on the maths), getting 5/6 is not much of a surprise. Even questions 2 and 5 seem abnormally easy relative to previous years.
3Aaron Staley
To put this into perspective: there was only an 8% chance P3 would be this easy, which puts substantial weight on the "unexpected" part being the problem being so easy. It's also the first time in 20 years (5% chance) that 5 problems were of difficulty <= 25. Indeed, knowing that Gemini 2.5 Deep Think could solve an N25 (IMO result from Gemini 2.5 Pro) and an A30 (known from the Gemini 2.5 Deep Think post), I'm somewhat less impressed.

The only barriers were a medium-ish geometry problem (P2), which of course AlphaGeometry could solve, and an easy combinatorics problem (P1).

The two most impressive things, factoring in this write-up by Ralph Furman:

  • OpenAI's LLM was able to solve a medium-level geometry problem (guessing DeepMind just used AlphaGeometry again). Furman thought this would be hard for informal methods.
  • OpenAI's LLM is strong enough to get the easy combinatorics problem (Furman noted informal methods would likely outperform formal ones on this one; it was just a matter of whether the LLM was smart enough).
-6Kabir Kumar

The headline result was obviously going to happen, not an update for anyone paying attention.

The other claims are interesting. GPT-5 release will be a valuable data point and will allow us to evaluate the claim that this reasoning training was not task specific.

I don’t know if LLMs will become autonomous math researchers, but it seems likely to happen before other kinds of agency, since it has the best feedback loops and perhaps is just best suited to text-based reasoning. Might mean that I’m out of a job. 

The headline result was obviously going to happen, not an update for anyone paying attention.

“Obviously going to happen” is very different from ‘happens at this point in time rather than later or sooner, and with this particular announcement by this particular company’. You should still update off this. Hell, I was pretty confident this would be first done by Google DeepMind, so it's a large update for me (I don't know what for yet, though)!

Your claim “not an update for anyone paying attention” also seems false. I’m sure there are many who are updating off this who were paying attention, for whatever reason, as they likely should.

I generally dislike this turn of phrase as it serves literally no purpose but to denigrate people who are changing their mind in light of evidence, which is just a bad thing to do.

Don't have the link, but it seems DeepMind researchers on X have tacitly confirmed they had already reached gold. What we don't know is whether it was done with a general LLM like OAI or a narrower one.

2cdt
I think it was reasonable to expect GDM to achieve gold with an AlphaProof-like system. Achieving gold with a general LLM-reasoning system from GDM would be something else and it is important for discussion around this to not confuse one forecast for another. (Not saying you are, but that in general it is hard to tell which claim people are putting forward.)
2Garrett Baker
It seems your forecast here was wrong
3cdt
I don't believe anyone was forecasting this result, no. EDIT: Clarifying - many forecasts made no distinction whether an AI model had a major formal method component like AlphaProof or not. I'm drawing attention to the fact that the two situations are distinct and require distinct updates. What those are, I'm not sure yet.
2Garrett Baker
Oh I see, yeah that makes sense.
2Cole Wyeth
Well, fair enough, but I did specify that the surrounding context was an update. 
5Garrett Baker
You said "The other claims are interesting", which maybe could include "this particular announcement", but not "at this point in time rather than later or sooner" or "by this particular company". I also object on the grounds that the "headline result" is not "not an update for anyone paying attention". As evidence, see this manifold market, which before the release of this model was at like 40%.
2Cole Wyeth
So the market was previously around 85%, and then it went down as we got further through the year. I guess this proves that many people didn't expect it to happen in the next few months. The question wasn't really load bearing for my models, and you're right that I am not particularly interested that it happened at this point in time or by this particular company. 

GPT-5 release will be a valuable data point

Doesn't seem like it'll be very informative about this, given the OP's: "Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don’t plan to release anything with this level of math capability for several months."

-2Cole Wyeth
I'm not going to take that too seriously unless GPT-5 performs well. 
4the gears to ascension
it means you'd better figure out how to ask the right questions to a math-oracle LLM to finish your job quickly and be sure you did it right.
5Cole Wyeth
I thought about this sort of thing (adversarial robust augmentation) and decided it would be very hard to do it safely with something smarter than you.  However, there may in fact be a window where LLMs are good at math but not agency and they can be used to massively accelerate agent foundations research. 

agent foundations research is what I'm talking about, yup. what do you ask the AI to make significant progress on agent foundations and be sure you did so correctly? are there questions we can ask where, even if we don't know the entire theorem we want to ask for a proof of, we can show there aren't many ways to fill in the whole theorem that could be of interest, so that we could, eg, ask an AI to enumerate what theorems could have a combination of agency-relevant properties? something like that. I've been procrastinating on making a whole post pitching this because I myself am not sure the idea has merit, but maybe there's something to be done here, and if there is it seems like it could be a huge deal. it might be possible to ask for significantly more complicated math to be solved, so maybe if you can frame it as something where you're looking for plausible compressions, or simplifications or generalizations of an expression, or something.

2Raemon
I agree with this comment but am kinda surprised you are the one saying it. I realize this isn't that strong an argument for "LLMs are actually good", but it happening about now, as opposed to like 6 months later, seems like more evidence for them eventually being able to reliably do novel intellectual work.
6Cole Wyeth
By the way, I actually bet that the IMO gold would fall in 2025 and made a small (but very high percentage!) profit: https://manifold.markets/Austin/will-an-ai-get-gold-on-any-internat# #537 among "top traders," made 106 mana.  So, at least for me, this was apparently priced in - worth mentioning since your comment seems to possibly imply I should not have expected this a priori.   (To be fair, it must have been a pretty good deal when I bought, something like 20%)
2Raemon
Well that is a cool thing to have on record. I believe you. :) At the time did you hold mostly the same "it's going to hit some kind of creativity / innovation wall eventually" beliefs? (or, however you'd summarize your take, I'm not 100% clear on it)
2Cole Wyeth
It’s the type of problem I expected LLMs to be able to solve: challenging proofs with routine techniques, probably no novel math concepts invented. If it's in the envelope of the achievable with current techniques, the time to get there seems more a function of prioritization and developer skill, and not evidence about the limits of the paradigm. I guess it is a small update though; these longer proofs may require some agency.

Terence Tao has been on mathstodon in response:

https://mathstodon.xyz/@tao/114881418225852441

He seems to have moved from "AI is decent but not that special" to "you can't compare AI to humans", i.e. from appreciation to denial 2 in the AI cycle, which tends to go:

Intrigue -> Denial -> Appreciation -> Denial 2 -> Fear

Tao's reaction sounds like it might have something to do with this:

According to a friend, the IMO asked AI companies not to steal the spotlight from kids and to wait a week after the closing ceremony to announce results. OpenAI announced the results BEFORE the closing ceremony.

According to a Coordinator on Problem 6, the one problem OpenAI couldn't solve, "the general sense of the IMO Jury and Coordinators is that it was rude and inappropriate" for OpenAI to do this.

OpenAI wasn't one of the AI companies that cooperated with the IMO on testing their models, so unlike the likely upcoming Google DeepMind results, we can't even be sure OpenAI's "gold medal" is legit. Still, the IMO organizers directly asked OpenAI not to announce their results immediately after the olympiad.

Sadly, OpenAI desires hype and clout a lot more than it cares about letting these incredibly smart kids celebrate their achievement, and so they announced the results yesterday.

He's saying that you can compare AI to humans, but to do that to a professional research standard, you need to meet the following criteria:

  • Explicit about your methodology in advance
  • Specific about the comparison you are making
  • Justify your interpretation of your results

OpenAI is currently not attempting to meet this standard. They are hyping their product. That is their prerogative, but Tao is under no obligation to participate in the hype, and he's being clear about the conditions under which he would comment.

8mishka
I am sure he is not in denial. He knows that the AI systems are on the trajectory to the top and beyond. But, on one hand, he is saying that proper methodology is important and expects it to be in place for next year's competition: https://mathstodon.xyz/@tao/114877789298562646.

On the other hand: it's great progress, but let's not be hypnotized by the word "gold". The model made it to the bottom border of the "gold medal" tier, the largest yellow bar on the histogram here: https://www.imo-official.org/year_individual_r.aspx?year=2025. It's top 11% of the participants, so it's great progress, but not some "exclusive and exceptional win".

Also, the shape of that histogram strongly suggests that the IMO scoring process is weird and probably adversarial (given that team leads advocate for their participants during the scoring). The fact that we see this huge peak on the histogram at 35, and that we also see local maxima at the bottoms of the other two medal tiers, is suggestive of a process which is not plain "impartial and blind grading" (perhaps that part of the IMO methodology could also use some improvement).
6J Bostock
If it resembles the International Chemistry Olympiad (which, like most I[X]Os, is based on the IMO) then yeah, it's weird and adversarial. But the threshold for gold here is exactly 5/6 questions fully correct, which is also a natural breakpoint. This happens since generally you either have a proof or you don't, and getting n-1 out of n points usually means something like missing a single case in a proof by exhaustion, which is much less common than just failing to produce a proof. Most of the people who got 35/42 did so with scores of 7, 7, 7, 7, 7, 0. So there's that factor as well.
2mishka
Ah, yes, you are right. And the silver medal threshold is 28=4*7. So this is much more natural, and mostly comes from how the competition is structured (the scoring factor still looks somewhat noticeable to my eye, but is much less of a problem than I thought).
4Nick_Tarleton
But most of his specific methodological issues are inapplicable here, unless OpenAI is lying: they didn't rewrite the questions, provide tools, intervene during the run, or hand-select answers. I don't have a theory of Tao's motivations, but if the post I linked is interpreted as a response to OpenAI's result (he didn't say it was, but he didn't say it wasn't and the timing makes it an obvious interpretation) raising those issues is bizarre.
4mishka
First of all, we would like to see pre-registration, so that we don’t end up learning only about successes (and generally cherry-picking good results, while omitting negative results). He is trying to steer the field towards generally better practices. I don’t think this is specifically a criticism of this particular OpenAI result, but more an attempt to change the standards. Although he is likely to have some degree of solidarity with the IMO viewpoint and to share some of their annoyance with timing of all this, e.g. https://www.reddit.com/r/math/comments/1m3uqi0/comment/n40qbe9/
1J Bostock
Ok, denial is too strong a word. I don't exactly know how to describe the mental motion he's doing though. By volume, his post thread is mostly discussions of ways in which this isn't a fair comparison, whereas the correct epistemic update is more like "OK so competition maths is solved, what does this mean next?". It's a level of garymarcusing where he doesn't disagree with any facts on the ground but the overall vibe of the piece totally misses the wood for the trees in a particular and consistent direction. Terry's opinions on maths AI (which one would hope to be a useful data point) are being relegated to a lagging indicator by this mental motion.
2mishka
I would not say it is solved :-) I am sure we’ll see an easy and consistent 42 score from the models sooner rather than later, and we’ll see much more than that in the adjacent areas, but not yet :-) (Someone who got a bronze in the late 1960s tells me that this idea of giving gold medals to 10+% of the participants is relatively recent, and that when they were competing back in the '60s there would have been exactly 5 gold medals with this table of results.)
4gjm
My recollection from the late 1980s when I was doing IMOs is that the proportions were supposed to be something like 6:3:2:1 nothing:bronze:silver:gold, so about 8% gold medals. I don't think I ever actually verified this by talking to senior officials or checking the numbers. (As for Terry Tao, I agree with you that he is clearly not in denial, he's just cross at OpenAI for preferring PR over (1) good science and (2) politeness.)
2mishka
Yeah, I actually looked at the early years today, and in 1969 only the three perfect scores won gold, and in 1970 this was relaxed a little bit, and the overall trend looked to me like there were multiple reforms with gradual relaxation of the standards for gold (although I did not do more than superficial sampling from several time points). I think the official goal is still approximately 6:3:2:1, but this year those fuzzy boundaries resulted in 67 gold medals out of 630 participants (slightly above 10.6%).
3deepitreal
Just me, or is he making these 2 wrong assumptions?

1. He thinks this was a system (of models) like AlphaProof.
2. That this model had internet access.

Surprising that no one on mathstodon has mentioned this. I wonder what he would say if he knew it was a single LLM without internet access.
2Thane Ruthenis
Honestly, that thread did initially sound kind of copium-y to me too, which I was surprised by, since his AI takes are usually pretty good[1] and level-headed. But it makes much more sense under the interpretation that this isn't him being in denial about AI performance, but him undermining OpenAI in response to them defecting against IMO. That's why he's pushing the "this isn't a fair human-AI comparison" line.

[1] Edit: For someone who doesn't "feel the ASI", I mean.
5Amalthea
I would not characterize Tao's usual takes on AI as particularly good (unless you compare with a relatively low baseline). He's been overall pretty conservative and mostly stuck to reasonable claims about current AI. So there's not much to criticize in particular, but it has come at the cost of him not appreciating the possible/likely trajectories of where things are going, which I think misses the forest for the trees. 
3Thane Ruthenis
Oh, yeah, he's not superintelligence-pilled or anything. I was implicitly comparing with a relatively low baseline, yes.
1Anders Lindström
Your "Intrigue -> Denial -> Appreciation -> Denial 2 -> Fear" cycle really hits the spot. I will filter future comments from AI pundits through this lens.

"We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling"

My reactions

  • It's probably good if companies are optimizing for IMO since I doubt this generalizes to anything dangerous.
  • This tweet thread is trying super hard to sound casual & cheerful but it is entirely an artificial persona, none of this is natural. It is extremely cringe.