In early 2022 Paul Christiano and Eliezer Yudkowsky publicly stated a bet: Paul put the probability of an AI getting an IMO gold medal by 2025 at 8%, and Eliezer at >16%. Paul said "If Eliezer wins, he gets 1 bit of epistemic credit." I'm not sure how to do the exact calculation, but based on the current market price of 69%, Eliezer is probably expected to win over half a bit of epistemic credit.
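One rough way to make that concrete (my own back-of-the-envelope sketch, taking Paul at exactly 8% and Eliezer at exactly 16%, and scoring each of them by log probability):

$$
\mathbb{E}[\text{credit for Eliezer}] \approx 0.69 \cdot \log_2\frac{0.16}{0.08} \;-\; 0.31 \cdot \log_2\frac{0.92}{0.84} \approx 0.69 - 0.04 \approx 0.65 \text{ bits},
$$

i.e. a bit over half a bit in expectation, consistent with the claim above.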
If the news holds up, we should update partway toward the takeaways Paul virtuously laid out in the post, to the extent we haven't already.
How I'd update
The informative:
- I think the IMO challenge would be significant direct evidence that powerful AI would be sooner, or at least would be technologically possible sooner. I think this would be fairly significant evidence, perhaps pushing my 2040 TAI probability up from 25% to 40% or something like that.
- I think this would be significant evidence that takeoff will be limited by sociological facts and engineering effort rather than a slow march of smooth ML scaling. Maybe I'd move from a 30% chance of hard takeoff to a 50% chance of hard takeoff.
- If Eliezer wins, he gets 1 bit of epistemic credit.[2][3] These kinds of updates are slow going, and it would be better if we had a bigger portfolio of bets, but I'll take what we can get.
- This would be some update for Eliezer's view that "the future is hard to predict." I think we have clear enough pictures of the future that we have the right to be surprised by an IMO challenge win; if I'm wrong about that then it's general evidence my error bars are too narrow.
[Edit: this comment is probably retracted, although I'm still confused; see discussion below.]
I'd like clarification from Paul and Eliezer on how the bet would resolve, if it were about whether an AI could get IMO silver by 2024.
Besides not fitting in the time constraints (which I think is kind of a cop-out because the process seems pretty parallelizable), I think the main reason that such a bet would resolve no is that problems 1, 2, and 6 had the form "find the right answer and prove it right", whereas the DeepMind AI was given the right answer and merely had to prove it right. Often, finding the right answer is a decent part of the challenge of solving an Olympiad problem. Quoting more extensively from Manifold commenter Balasar:
The "translations" to Lean do some pretty substantial work on behalf of the model. For example, in the theorem for problem 6, the Lean translation that the model is asked to prove includes an answer that was not given in the original IMO problem.
```lean
theorem imo_2024_p6 (IsAquaesulian : (ℚ → ℚ) → Prop)
    (IsAquaesulian_def : ∀ f, IsAquaesulian f ↔
      ∀ x y, f (x + f y) = f x + y ∨ f (f x + y) = x + f y) :
    IsLeast {(c : ℤ) | ∀ f, IsAquaesulian f →
      {(f r + f (-r)) | (r : ℚ)}.Finite ∧ {(f r + f (-r)) | (r : ℚ)}.ncard ≤ c} 2
```
The model is supposed to prove that "there exists an integer c such that for any aquaesulian function f there are at most c different rational numbers of the form f(r) + f(−r) for some rational number r, and find the smallest possible value of c".

The original IMO problem does not include that the smallest possible value of c is 2, but the theorem that AlphaProof was given to solve has the number 2 right there in the theorem statement. Part of the problem is to figure out what 2 is.

Link: https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/imo-2024-solutions/P6/index.html
I'm pretty sure what's going on is that AlphaProof had to find the value 2 itself, and the published Lean statement with the 2 filled in is just the final form of what it ended up proving, rather than something it was handed.
It would be nice to have this more explicitly from the AlphaProof people, though.
[EDITED to add:] Actually, as per the tweet from W T Gowers quoted by "O O" elsewhere in this thread, we do have it explicitly, not from the AlphaProof people but from one of the mathematicians the AlphaProof people engaged to evaluate their solutions.
https://x.com/wtgowers/status/1816839783034843630
It wasn't told what to prove. To get round that difficulty, it generated several hundred guesses (many of which were equivalent to each other). Then it ruled out lots of them by finding simple counterexamples, before ending up with a small shortlist that it then worked on.
That comment doesn’t seem to be correct.
But, in this tweet https://x.com/wtgowers/status/1816509813440082014 he says:
Another qualification is that the problems were manually translated into the proof assistant Lean, and only then did the program get to work. But the essential mathematics was done by the program: just the autoformalization part was done by humans.
So maybe e.g. the (not very auto-) autoformalization part produced a theorem-statement template with some sort of placeholder where the relevant constant value goes, and AlphaProof knew it needed to find a suitable value to put in the gap.
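For concreteness, such a template might look something like the following. This is purely my guess at the shape (nothing from DeepMind); the point is just that the constant is left as something the system must exhibit, rather than being baked in as 2:

```lean
-- Hypothetical "template" version of the P6 statement: the constant is not
-- hard-coded as 2 but left as a value the solver has to exhibit itself.
theorem imo_2024_p6_template (IsAquaesulian : (ℚ → ℚ) → Prop)
    (IsAquaesulian_def : ∀ f, IsAquaesulian f ↔
      ∀ x y, f (x + f y) = f x + y ∨ f (f x + y) = x + f y) :
    ∃ answer : ℤ, IsLeast
      {(c : ℤ) | ∀ f, IsAquaesulian f →
        {(f r + f (-r)) | (r : ℚ)}.Finite ∧ {(f r + f (-r)) | (r : ℚ)}.ncard ≤ c}
      answer := by
  sorry
```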
This part seems to just be about not letting an LLM translation get the problem slightly wrong and mess up the score as a result.
It would be a shame for your once-a-year attempt to have even a 2% chance of being messed up by an LLM hallucination.
I think it would still be correct for it to resolve no for time reasons, even if it feels like it could be parallelisable (if it were actually parallelised, I would feel differently).
In early 2023 I bet $500 on AI winning an IMO gold medal by 2026. This was a 1:1 bet against Michael Vassar, meaning I attributed >50% probability to this. It now seems very likely that I'm going to win.
To me, this was to be expected as a straightforward application of AlphaZero-like self-play amplification and distillation. The missing piece was the analogous policy network, which for the AlphaZero board games was a convolutional neural network. Once it became quite clear that existing LLMs were capable of being smart enough to generate good heuristics for this (with enough data), it seemed quite obvious to me that self-play guided by an LLM-policy heuristic would work.
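To spell out what I mean by "self-play guided by an LLM-policy heuristic", here is a minimal sketch of the generic expert-iteration recipe, under my own assumptions: `llm_propose_tactics`, `lean_check`, and `finetune` are hypothetical stand-ins, not anything DeepMind has published.

```python
# Sketch of expert iteration for theorem proving: an LLM policy proposes
# candidate tactics, a formal checker (e.g. Lean) verifies them, and found
# proofs are distilled back into the policy. All helpers are hypothetical.

def search(theorem, policy, max_steps=64, beam=8):
    """Best-first proof search, using the policy to propose and rank steps."""
    frontier = [(theorem.initial_state(), [])]          # (proof state, tactic trace)
    for _ in range(max_steps):
        if not frontier:
            break
        state, trace = frontier.pop(0)
        for tactic in llm_propose_tactics(policy, state, n=beam):
            ok, next_state = lean_check(state, tactic)   # checker rejects invalid steps
            if not ok:
                continue
            if next_state.is_closed():                   # no goals left: proof found
                return trace + [tactic]
            frontier.append((next_state, trace + [tactic]))
        frontier.sort(key=lambda item: policy.score(item[0]))  # heuristic ordering
    return None

def expert_iteration(policy, problems, rounds=10):
    """Amplification (search) alternating with distillation (training on successes)."""
    for _ in range(rounds):
        proofs = []
        for problem in problems:
            proof = search(problem, policy)
            if proof is not None:
                proofs.append((problem, proof))
        policy = finetune(policy, proofs)                # distil found proofs into the policy
    return policy
```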
AlphaProof works in the "obvious" way: an LLM generates candidate next steps which are checked using a formal proof-checking system, in this case Lean.
I don't think this is true, actually. Look at the diagram of AlphaProof’s reinforcement learning training loop. The proof part ("solver network") seems to be purely RL based; it even uses the same algorithm as AlphaZero. (The article actually describes AlphaProof as "a new reinforcement-learning based system".)
The contribution of the LLM to AlphaProof seems to be only in providing the left part of the diagram (the "formalizer network"). Namely translating, and somehow expanding, one million human written proofs into 100 million formal Lean proofs. I'm not sure how the LLM increases the number of proofs by 100x, but I assume for each successfully formalized human-written proof it also generates 100 (simple?) synthetic variations. This interpretation is also corroborated by this statement:
We established a bridge between these two complementary spheres by fine-tuning a Gemini model to automatically translate natural language problem statements into formal statements, creating a large library of formal problems of varying difficulty.
So AlphaProof is actually a rather narrow RL system, not unlike the original AlphaGo. The latter was also bootstrapped on human data (expert Go games), similar to how AlphaProof uses (formalized) human proofs for initial training. (Unlike AlphaGo and AlphaProof, AlphaGo Zero and the original AlphaZero did not rely on any non-synthetic / human training data.)
So, if LLMs like GPT-4 and Gemini are considered to be fairly general AI systems (some even consider them AGI), while systems like AlphaGo or AlphaZero are regarded as narrow AI, then AlphaProof counts as narrow AI. This contradicts views like the one of Dwarkesh Patel, who thought solving IMO math problems may be AGI-complete, and confirms the opinion of people like Grant Sanderson (3Blue1Brown), who thought that training on and solving formal math proofs is formally very similar to AlphaGo solving board games.
That being said, unlike AlphaProof, AlphaGeometry 2, which solved the one geometry problem, is not RL-based. It does indeed use an LLM (a Gemini-like model trained on large amounts of synthetic training data) when coming up with proofs, though it works in tandem with a non-ML symbolic deduction engine. Such systems are largely based on brute-force search, similar to how Deep Blue played chess. So AlphaGeometry is simultaneously more general (LLM) and less general (symbolic deduction engine) than AlphaProof (an RL system).
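In pipeline form, my reading of the AlphaProof side of the diagram is roughly the following sketch. Note the 100x expansion via variations is my guess above, not something DeepMind has confirmed, and every helper name here (`formalize`, `generate_variations`, `attempt_proof`, `reinforce`) is a hypothetical placeholder:

```python
# Sketch of my reading of the AlphaProof training pipeline: a Gemini-based
# "formalizer" turns natural-language problems into Lean statements (plus
# synthetic variations), and an AlphaZero-style "solver" is then trained by
# RL, with Lean verification as the reward signal. All helpers are hypothetical.
import random

def build_formal_problems(nl_problems, formalizer, variations_per_problem=100):
    formal = []
    for problem in nl_problems:
        statement = formalize(formalizer, problem)        # natural language -> Lean statement
        formal.append(statement)
        formal.extend(generate_variations(statement, n=variations_per_problem))
    return formal

def train_solver(solver, formal_problems, iterations=1_000_000):
    for _ in range(iterations):
        statement = random.choice(formal_problems)
        proof, solved = attempt_proof(solver, statement)  # search guided by the solver network
        reward = 1.0 if solved else 0.0                   # proof checker provides the reward
        solver = reinforce(solver, statement, proof, reward)
    return solver
```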
Namely translating, and somehow expanding, one million human written proofs into 100 million formal Lean proofs.
We obviously should wait for the paper and more details, but I am certain this is incorrect. Both your quote and the diagram are clear that it is one million problems, not proofs.
Well, what kind of training data do you think will help with learning how to prove things, if not proofs? AlphaGo & co. learned to play games by being trained on games. And AlphaProof uses the AlphaZero algorithm.
The plain interpretation is that only statements to be proved (or disproved) were sourced from human data, without any actual proof steps. In the Go analogy, it is like being given Go board positions without the next moves.
It makes a lot of sense that this is needed and helpful, because winning a game of Go from the empty board is a different and easier problem than playing the best moves from arbitrary Go positions. The Igo Hatsuyoron problem mentioned in the original post is a good example: additional training was needed because such positions never come up in actual games.
Imagine AlphaZero trained from randomly sampled Go positions, each intersection being black/white/empty with uniform probability. It would play a much worse game of Go. Fortunately, how to sample "relevant" Go positions is an easy problem: you just play the game, with the initial N moves sampled at higher temperature for diversity.
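In toy code, something like this (a sketch only; `empty_board`, `legal_moves`, `policy_logits`, and `play` are hypothetical stand-ins for a Go engine's interface):

```python
import math
import random

# Toy sketch of the "easy" way to sample relevant Go positions: play the first
# N moves of ordinary self-play at a higher softmax temperature, so openings
# are diverse but still game-like.

def softmax(logits, temperature):
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_relevant_position(engine, policy, n_moves=20, temperature=2.0):
    position = engine.empty_board()
    for _ in range(n_moves):
        moves = engine.legal_moves(position)
        probs = softmax(policy.policy_logits(position, moves), temperature)
        move = random.choices(moves, weights=probs)[0]
        position = engine.play(position, move)
    return position   # a diverse but "relevant" starting point for training games
```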
In comparison, how to sample relevant math positions is unclear. Being good at finding proofs in arbitrary formal systems from an arbitrary set of axioms is actually quite different from being good at math. Using human data sidesteps this problem.
I don't understand what you mean here. How would merely seeing conjectures help with proving them? You arguably need to see many example proofs of example conjectures. Otherwise it would be like expecting a child who has never seen a proof to learn to prove conjectures merely by being shown a lot of conjectures.
I don't think this [sc. that AlphaProof uses an LLM to generate candidate next steps] is true, actually.
Hmm, maybe you're right. I thought I'd seen something that said it did that, but perhaps I hallucinated it. (What they've written isn't specific enough to make it clear that it doesn't do that either, at least to me. They say "AlphaProof generates solution candidates", but nothing about how it generates them. I get the impression that it's something at least kinda LLM-like, but could be wrong.)
The diagram actually says it uses the AlphaZero algorithm. Which obviously doesn't involve an LLM.
The AlphaZero algorithm doesn't obviously not involve an LLM. It has a "policy network" to propose moves, and I don't know what that looks like in the case of AlphaProof. If I had to guess blindly I would guess it's an LLM, but maybe they've got something else instead.
Honestly, I think people are overestimating this. Some quick thoughts:
They already claimed once to be at a 1200 Elo level in competitive programming on Codeforces, but in real competition settings it only reached a level of, if I remember correctly, around ~500, as people found the corresponding account they used for testing. Also, their solution to one very hard problem was exactly the same as one from a human competitor, and they likely had this exact problem already in their training data, so their training data was likely contaminated.
I'm not really surprised that this type of algorithm will perform well on ad-hoc & constructive type problems. Generally, in a lot of cases, you break the problem down into small cases, find a pattern (mathematical model), and generalize from it. These kinds of math olympiad problems are quite similar in nature. So current models have a very good sense of what a "pattern" is and, of course, have many more FLOPS to test their hypotheses in small cases. Personally, I find solving ad-hoc problems pretty boring, because mostly you're taking some patterns as hypotheses and just brute-forcing your way through to see what works and what doesn't -> machines can do that faster.
So I'm not really sure they can solve their definition of "creativity" by adding more and more compute, since all the Alpha-style algorithms (RL in AlphaZero, AlphaGo, etc.) currently try to make those search heuristics as good as possible while reducing branching in order to save compute. I don't know how far you can get by throwing more and more compute at it and brute-forcing with certain heuristics until a solution is found.
Summing up: these are similar to methods used by researchers decades ago, but we have a lot more FLOPS now.
The LLM's only use was to translate problems to a formal language, in this case Lean, in order to generate training data for their RL algorithm used in all of the Alpha-type algos.
That's also why I think they started with these types of problems: you can approximate the compute needed. The generative part involved in geometry, ad-hoc, etc., is much smaller than in algebra, combinatorics, etc., since it's much more mechanical in nature. You will be able to solve a lot of these types of problems just by scaling compute.
AlphaProof seems to be augmented with an LLM trained on symbolic chains of the form statement -> examples -> conjecture -> proof/disproof of the conjecture, maybe even with multiple conjecture nodes. On the other hand, an approach working within natural language, which can come up with conjectures from a next-token-prediction perspective, could also work.
The system could comprise a "generator" model creating exhaustive examples from the statement, an "observer" model studying these examples to find and intersect their properties, and AlphaProof as the "prover" to prove or disprove conjectures. Whether using LLMs or traditional models is up for experimentation, but each component must be formally verified.
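Schematically, the split I'm proposing would look something like this (a sketch of the idea only, not an existing system; `generate_examples`, `conjecture_from_examples`, `alphaproof_prove` are hypothetical names):

```python
# Schematic of the proposed generator / observer / prover split.

def solve(statement):
    examples = generate_examples(statement)                      # "generator": exhaustive small cases
    conjectures = conjecture_from_examples(statement, examples)  # "observer": intersect observed properties
    for conjecture in conjectures:
        verdict, certificate = alphaproof_prove(conjecture)      # "prover": formally checked outcome
        if verdict in ("proved", "disproved"):
            yield conjecture, verdict, certificate               # every result is formally verified
```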
I think you can see that at work in the solution of problem 2, which is correct but feels very unintuitive; there are much more elegant solutions to this problem. (I specifically only looked at this one for now; I may update this in an edit for the other problems.)
These kinds of math olympiad problems are quite similar in nature. So current models have a very good sense of what a "pattern" is
I saw someone claiming the opposite:
IMO problems are specifically selected to be non-standard. For the previous 10 years, I served as the national coach of the USA International Math Olympiad team ( https://www.quantamagazine.org/po-shen-loh-led-the-u-s-math-team-back-to-first-place-20210216 ). During the IMO itself, the national coaches meet to select the problems that will appear on the exam paper. One of the most important tasks of that group is to avoid problems that have similarity to problems that have appeared anywhere before. During those meetings, national coaches would often dig up an old obscure math competition, on which a similar problem had appeared, and show it to the group, after which the proposed problem would be struck down.
So, this AI breakthrough is totally different from #GPT being able to do standardized tests through pattern-matching. It strikes at the heart of discovery. It's very common for students to hit a wall the first time they try IMO-style problems, because they are accustomed to learning from example, remembering, and executing similar steps.
The methods used to solve these "new" problems are, as I've already stated, highly amenable to brute-force approaches. It's more of a computation problem ([Bitter Lesson](http://www.incompleteideas.net/IncIdeas/BitterLesson.html)). I'm again not surprised these kinds of problems got solved.
While these problems may be new, they employ very similar methods to those definitely used in previous competitive programming and Math Olympiad problems. I don't think the author has really looked into the specifics of how AlphaGeometry and AlphaProof have come up with these solutions. It's honestly disappointing to see that they were able to mislead such people (if he truly hasn't looked into the specifics). It seems more like he wants to use his status to push a particular narrative.
I would bet a lot that this system will fail on almost any combinatorics problem at the moment...
Since this is a point that seemingly causes a lot of confusion and misunderstanding, I'll try to find some time to write down my reasoning and thoughts in a more exhaustive post.
I would bet a lot that this system will fail on almost any combinatorics problem at the moment...
I don't know if you guessed this prior to reading the post, but if so, good guess:
AlphaProof solved two algebra problems and one number theory problem by determining the answer and proving it was correct. This included the hardest problem in the competition, solved by only five contestants at this year’s IMO. AlphaGeometry 2 proved the geometry problem, while the two combinatorics problems remained unsolved.
I've been wondering about this ever since I saw that sentence, so now I'm curious to see your post explaining your reasoning.
ETA: I just saw sunwillrise's great comment on this and am now wondering how your reasoning compares
They already claimed once to be at a 1200 Elo level in competitive programming on Codeforces, but in real competition settings it only reached a level of, if I remember correctly, around ~500, as people found the corresponding account they used for testing.
I'd be interested in reading more about this. Could you provide a link?
It feels to me like it shouldn't be so hard to teach an LLM to convert IMO problems into Lean or whatever
To the contrary, this used to be very hard. Of course, an LLM can learn to translate "real number" to "ℝ". But that's only possible because ℝ is formalized in Lean/Mathlib! Formalizing the real numbers is a research-level problem, which historically occupied much of 19th-century mathematics.
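To illustrate what Mathlib buys you, here is a toy example (the exact lemma name is from memory and may differ in current Mathlib):

```lean
import Mathlib

-- Even this toy statement about real numbers is a one-liner only because ℝ,
-- ≤, ^ and Real.sqrt already exist in Mathlib; in a bare formal system you
-- would first have to construct the reals (Cauchy sequences or Dedekind cuts)
-- before you could even state it.
example : ∀ x : ℝ, 0 ≤ x → ∃ y : ℝ, y ^ 2 = x :=
  fun x hx => ⟨Real.sqrt x, Real.sq_sqrt hx⟩
```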
Recently I came across a paper, Codification, Technology Absorption, and the Globalization of the Industrial Revolution, which discusses the role of translation and dictionaries in the industrialization of Japan. The following quote is illustrative.
The second stylized fact is that the Japanese language is unique in starting at a low base of codified knowledge in 1870 and catching up with the West by 1887. By 1890, there were more technical books in the Japanese National Diet Library (NDL) than in either Deutsche Nationalbibliotek or in Italian, as reported by WorldCat. By 1910, there were more technical books written in Japanese in our sample than in any other language in our sample except French.
How did Japan achieve such a remarkable growth in the supply of technical books? We show that the Japanese government was instrumental in overcoming a complex public goods problem, which enabled Japanese speakers to achieve technical literacy in the 1880s. We document that Japanese publishers, translators, and entrepreneurs initially could not translate Western scientific works because Japanese words describing the technologies of the Industrial Revolution did not exist. The Japanese government solved the problem by creating a large dictionary that contained Japanese jargon for many technical words. Indeed, we find that new word coinage in the Japanese language grew suddenly after a massive government effort to subsidize translations produced technical dictionaries and, subsequently, a large number of translations of technical books.
Just as, say, translating The Wealth of Nations into Japanese was of entirely different difficulty in the 19th century versus the 20th century (19th-century Japanese translators started by debating how to translate "society"), formalizing IMO problems in Lean is only workable thanks to Mathlib. It would not be workable in other formal systems lacking a similarly developed math library, and formalizing research mathematics in Lean is similarly unworkable at the moment, until Mathlib is further developed to cover the necessary definitions and background theorems. In the past, ambitious formalization projects usually spent half their time formalizing the definitions and background results they needed.
Anyone have a good intuition for why Combinatorics is harder than Algebra, and/or why Algebra is harder than Geometry? (For AIs). Why is it different than for humans?
I'm not sure it is different than for humans, honestly. First, I should give a standard disclaimer that different students have different strengths and weaknesses in terms of mathematical problem-solving ability, as well as different aesthetic preferences for what types of problems they like to work on, so any overview like the one I am about to give is necessarily reductive and doesn't capture the full range of opinions on this matter.
As I recall from my own Math olympiad days (and, admittedly, it has been quite a while), Combinatorics problems were generally considered to be the most difficult/tricky/annoying for the majority of contestants. This was because, unlike the vast majority of other fields, Combinatorics problems required a certain set of intuitions that could be described as "combinatorial taste", and solving them required you to understand the particular problem in front of you deeply, on an S1 level. You generally had to "play around" with the set-up for a fair bit, developing a better and better picture of what's going on, before certain insights just clicked in your head and guided you on the path to a solution.[1]
Geometry, by contrast, was much more straightforward.[2] Not in the sense that the problems were necessarily easier from an "objective" standpoint,[3] but because you had an essentially fixed set of techniques and configurations you needed to understand (and mostly memorize) that would solve the vast majority of problems. As Evan Chen has said:
Very roughly, there are three different ways I try to make progress on a geometry problem.
- (I) The standard synthetic techniques; angle chasing, cyclic quadrilaterals, homothety, radical axis / power of a point, etc. My own personal arsenal contains some weapons not known to many contestants as well, most notably inversion, harmonic bundles and quadrilaterals, and spiral similarity / Miquel points.
  For this part, it's highly advantageous to be well-versed with "standard" configurations and tricks. To give an extreme example: to solve Iran TST 2009, Problem 9 one essentially needs only recognize two configurations: a lemma about the midpoint of an altitude (2002 G7) and another lemma about the line (USAJMO 2014/6). Not knowing either of these makes it more difficult to solve the problem synthetically in the time limit. As a reference, Yufei Zhao's lemmas handout contains a fairly comprehensive list of these configurations.
You also often had the opportunity to turn a geometry problem into an algebra bash, through complex numbers, barycentric coordinates, Cartesian coordinates, trigonometry, etc. These were skills you developed naturally by practice, and in many cases, if you were an experienced and well-prepared contestant, whenever you were given a Geometry problem, you would be able to instantly remember related problems and ideas and figure out the rough plan of how you needed to go about solving the problem. This is completely the opposite of Combinatorics, where you did indeed sometimes have recurring patterns or broad themes,[4] but you mostly had to take each (worthwhile, IMO-level) problem on its own terms, instead of falling back on previously-memorized ideas.[5]
Rather interestingly, this phenomenon (the distinction between the aesthetic of Combinatorics being centered on "deeply understanding and solving one problem" and the rest of the fields as "applying general techniques that expand the understanding of the entirety of Geometry, Algebra, etc.") was also noted at the research level by Timothy Gowers in his excellent essay on "The Two Cultures of Mathematics."[6]
So, from this perspective, I suppose it would make a lot of sense for a rather narrow AI system to be better at areas of olympiad Math that are more easily "systematizable" (such as Algebra and Geometry) and worse at the areas that seem to require a sort of deep understanding and problem-solving (like Combinatorics) that could plausibly get it closer to general intelligence.
[1] I think the standard example to illustrate this is problem 2 of the 2011 IMO (nicknamed the "windmill problem"), which had an average score of 1.5/7, which was roughly half of what the organizers generally aim for (meaning ~ 3/7, as in 2012), mostly because they underestimated how difficult it would be for the contestants to understand the (combinatorial) set-up properly given the time constraints.
[2] And so was Algebra, but to a lesser extent (although contestants working on inequalities, for example, also benefitted from a boatload of books filled to the brim with dozens of techniques and proof methods specifically geared for this).
[3] I'm not sure how such a thing could even be defined, anyway.
[4] If there actually was no structure to Combinatorics problems, then human minds would not be able to predictably perform better at solving them over time as they practiced more and more, which is false.
[5] To be fair, I have heard through the grapevine that this is getting less and less true as the years pass, because trainers are getting better and better at identifying and compiling books/documents, etc., for training specific combinatorial problem-solving techniques.
[6] Of course, there is a lot of nuance here that I am not getting into, and you shouldn't come away with the impression that Combinatorics is all just one-time tricks and the rest of math is just memorizing proofs that worked on previous problems. Reality is far more complex and multi-dimensional than that; there are parts of Combinatorics that work on the basis of deep theories (such as when using generating functions or the probabilistic method or when making connections with Statistical Mechanics, etc.), and also non-Combinatorial results that seem to have arisen from an almost purely individualized assessment of the problem (David Corfield mentions arithmetic progressions among primes as an example).
I searched for it and found nothing. The Twitter conversation also seems to imply that there has not been a paper / technical report out yet.
No, but these tweets describe the basic concept: extension of AlphaZero to train on theorems in Lean automatically converted from natural-language proofs.
This looks pretty relevant to https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challenge-bet-with-eliezer and the related manifold bet: https://manifold.markets/Austin/will-an-ai-get-gold-on-any-internat
The market is surprised.
I don't understand why people aren't freaking out from this news. Waiting for the paper I guess.
I'm guessing many people assumed an IMO solver would be AGI. However this is actually a narrow math solver. But it's probably useful on the road to AGI nonetheless.
We don't know how narrow it is yet. If they did for algebra and number theory something like what they did for geometry in AlphaGeometry (v1), providing it with a well-chosen set of operations, then I'll be more inclined to agree.
People generally expected math AI to progress pretty fast already. I was angry about machine-assisted math being a neglected area before this and my anger levels aren't substantially increased by the news.
It’s funny to me that the one part of the problem the AI cannot solve is translating the problem statements to Lean. I guess it’s the only part that the computer has no way to check.
Does anyone know if "translating the problem statements" includes providing the solution (e.g. "an even integer" for P1), and the AI just needs to prove the solution correct? It's not clear to me what's human-written and what's AI-written, and the solution is part of the "theorem" part, which I'd guess is human-written.
Answer: it was not given the solution. https://x.com/wtgowers/status/1816839783034843630?s=46&t=UlLg1ou4o7odVYEppVUWoQ
Google DeepMind reports on a system for solving mathematical problems that allegedly is able to give complete solutions to four of the six problems on the 2024 IMO, putting it near the top of the silver-medal category.
Well, actually, two systems for solving mathematical problems: AlphaProof, which is more general-purpose, and AlphaGeometry, which is specifically for geometry problems. (This is AlphaGeometry 2; they reported earlier this year on a previous version of AlphaGeometry.)
AlphaProof works in the "obvious" way: an LLM generates candidate next steps which are checked using a formal proof-checking system, in this case Lean. One not-so-obvious thing, though: "The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found."
[EDITED to add:] Or maybe it doesn't work in the "obvious" way. As cubefox points out in the comments below, DeepMind's description doesn't explicitly say that it does that, and they're infuriatingly vague about how "AlphaProof generates solution candidates". The previous paragraph at least insinuates that it does it with an LLM, but it's very unclear.
(That last bit is reminiscent of something from the world of computer go: a couple of years ago someone trained a custom version of KataGo specifically to solve the infamous Igo Hatsuyoron problem 120, starting with ordinary KataGo and feeding it training data containing positions reachable from the problem's starting position. They claim to have laid that problem to rest at last.)
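If I'm reading the "training loop applied during the contest" bit right, the test-time part would look roughly like this. A very rough sketch under my own assumptions, not DeepMind's actual method; `attempt_proof`, `generate_variation`, and `reinforce` are hypothetical placeholders:

```python
# Rough sketch of "reinforcing proofs of self-generated variations of the
# contest problems until a full solution could be found": keep attempting the
# target problem, and in between generate related variations, training on
# whichever of those get solved.

def solve_at_test_time(solver, contest_problem, budget=10_000):
    for _ in range(budget):
        proof, solved = attempt_proof(solver, contest_problem)
        if solved:
            return proof                                   # full, formally checked solution
        variant = generate_variation(contest_problem)      # self-generated related statement
        variant_proof, ok = attempt_proof(solver, variant)
        if ok:
            solver = reinforce(solver, variant, variant_proof)  # learn from the easier variant
    return None
```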
AlphaGeometry is similar but uses something specialized for (I think) Euclidean planar geometry problems in place of Lean. The previous version of AlphaGeometry allegedly already performed at gold-medal IMO standard; they don't say anything about whether that version was already able to solve the 2024 IMO problem that was solved using AlphaGeometry 2.
AlphaProof was able to solve questions 1, 2, and 6 on this year's IMO (two algebra, one number theory). It produces Lean-formalized proofs. AlphaGeometry 2 was able to solve question 4 (plane geometry). It produces proofs in its own notation.
The solutions found by the Alpha... systems are at https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/imo-2024-solutions/index.html. (There are links in the top-of-page navbar to solutions to the individual problems.)
(If you're curious about the IMO questions or want to try them yourself before looking at the machine-generated proofs, you can find them -- and those for previous years -- at https://www.imo-official.org/problems.aspx.)
One caveat (note: an earlier version of what I wrote failed to notice this and quite wrongly explicitly claimed something different): "First, the problems were manually translated into formal mathematical language for our systems to understand." It feels to me like it shouldn't be so hard to teach an LLM to convert IMO problems into Lean or whatever, but apparently they aren't doing that yet. [EDITED to add:] But it seems they are doing this to generate training data. Perhaps the fact that they didn't do it for the IMO problem statements themselves reveals something about the limitations of their English-to-Lean translator?
Another caveat: "Our systems solved one problem within minutes and took up to three days to solve the others." Later on they say that AlphaGeometry 2 solved the geometry question within 19 seconds, so I guess that was also the one that was done "within minutes". Three days is a lot longer than human IMO contestants get given, but this feels to me like the sort of thing that will predictably improve pretty rapidly.