Epistemic status: I have only skimmed OpenAI's blogpost and paper, and I do not fully understand the details.
From the blogpost
We built a neural theorem prover for Lean that learned to solve a variety of challenging high-school olympiad problems, including problems from the AMC12 and AIME competitions, as well as two problems adapted from the IMO.
[...]
The prover uses a language model to find proofs of formal statements. Each time we find a new proof, we use it as new training data, which improves the neural network and enables it to iteratively find solutions to harder and harder statements.
From the paper
We explore the use of expert iteration in the context of language modeling applied to formal mathematics. We show that at same compute budget, expert iteration, by which we mean proof search interleaved with learning, dramatically outperforms proof search only. We also observe that when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs. Finally, by applying this expert iteration to a manually curated set of problem statements, we achieve state-of-the-art on the miniF2F benchmark, automatically solving multiple challenging problems drawn from high school olympiads.
Method
- Uses the Lean formal environment instead of Metamath, which was used in GPT-f.
- Uses "decoder-only Transformers similar to GPT-3" with 774M trainable parameters
- Pre-trained "successively on GPT-3’s postprocessed version of CommonCrawl (for 300B tokens) and an updated version of WebMath (for 72B tokens)"
- "proof search interleaved with learning"
The two IMO-adapted problems
Problem 1: Suppose a, b, c are the sides of a triangle. Prove that a^2(b + c − a) + b^2(c + a − b) + c^2(a + b − c) ≤ 3abc.
Problem 2: For reals a, b, c, prove that (a^2 + ab + b^2)(b^2 + bc + c^2)(c^2 + ca + a^2) ≥ (ab + bc + ca)^3.
Both solutions to those problems use "nlinarith" applied to the right arguments. As far as I understand, nlinarith is a mathlib tactic for solving nonlinear arithmetic goals, and its arguments add extra facts to the context of the solver. (source)
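As a toy illustration of how these hint arguments work (my own example, assuming Lean 3 with mathlib, not taken from the paper):

```lean
import data.real.basic  -- the reals
import tactic           -- nlinarith, among others

-- Seeding nlinarith with sq_nonneg (x - y), i.e. 0 ≤ (x - y)^2,
-- lets it expand the square and finish by linear arithmetic.
example (x y : ℝ) : x * y ≤ (x^2 + y^2) / 2 :=
by nlinarith [sq_nonneg (x - y)]
```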
The right arguments for the first problem are said in the blogpost to come (informally) from Schur's inequality, which in its t = 1 form states a(a − b)(a − c) + b(b − a)(b − c) + c(c − a)(c − b) ≥ 0 and rearranges to exactly the inequality to be proved. This gives
```lean
nlinarith [sq_nonneg (b - a),
           sq_nonneg (c - b),
           sq_nonneg (c - a)]
```
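Putting the pieces together, the first problem's statement and proof in Lean 3 look roughly like this (my reconstruction from the blogpost; I believe this is the miniF2F formalization of IMO 1964 Problem 2, but the name and hypothesis layout may differ slightly):

```lean
import data.real.basic
import tactic

-- Problem 1: the triangle-side conditions become hypotheses, and
-- the model's entire proof is the single nlinarith call.
theorem imo_1964_p2 (a b c : ℝ)
  (h₀ : 0 < a ∧ 0 < b ∧ 0 < c)
  (h₁ : c < a + b) (h₂ : b < a + c) (h₃ : a < b + c) :
  a^2 * (b + c - a) + b^2 * (c + a - b) + c^2 * (a + b - c) ≤ 3 * a * b * c :=
begin
  nlinarith [sq_nonneg (b - a), sq_nonneg (c - b), sq_nonneg (c - a)]
end
```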
The second problem is solved by applying Cauchy-Schwarz multiple times, then using some inequality it "invented", and ends with the same nlinarith expression as above.
Related bets and forecasts
- On Metaculus, the question AI Wins IMO Gold Medal has a community prediction of Dec 26, 2032, and a Metaculus prediction (which uses a different weighting) of Apr 3, 2035.
- In the comments of Yudkowsky and Christiano discuss takeoff speeds, Christiano ends up with an 8% chance of "For the 2022, 2023, 2024, or 2025 IMO an AI built before the IMO is able to [get gold]" (see also this comment).
Most of the heavy lifting in these proofs seems to be done by the Lean tactics. The comment "arguments to nlinarith are fully invented by our model" above a proof that is literally the single line nlinarith [sq_nonneg (b - a), sq_nonneg (c - b), sq_nonneg (c - a)] makes me feel like they're trying too hard to convince me this is impressive. The other proof, which involves multiple steps, is more impressive, but this still feels like a testament to the power of "traditional" search methods for proving algebraic inequalities rather than an impressive AI milestone. People on Twitter have claimed that some of the other problems are also one-liners using existing proof assistant strategies; I find this claim totally plausible.
I would be much more impressed with an AI-generated proof of a combinatorics or number theory IMO problem (e.g. problem 1 or 5 from 2021). Someone with more experience in proof assistants probably has a better intuition for which problems are hard to solve with "dumb" searching like nlinarith, but this is my guess.

I think we're basically in agreement here (and I think your summary of the results is fair; I was mostly pushing back on a too-hyped tone coming from OpenAI, not from you).