The rate of progress on the MATH dataset is incredible and faster than I expected.
The MATH dataset consists of competition math problems for high school students and was introduced in 2021. According to a blog post by Jacob Steinhardt (one of the dataset's authors), 2021 models such as GPT-3 solved ~7% of questions, a Berkeley PhD student solved ~75%, and an IMO gold medalist solved ~90%.
The blog post predicted that ML models would achieve ~50% accuracy on the MATH dataset by June 30, 2025, and ~80% accuracy by 2028.
But recently (September 2024), OpenAI released its new o1 model, which achieved ~95% on the MATH dataset.
So it seems we're already getting 2028-level performance on the MATH dataset in 2024.
Quote from the blog post:
"If I imagine an ML system getting more than half of these questions right, I would be pretty impressed. If they got 80% right, I would be super-impressed. The forecasts themselves predict accelerating progress through 2025 (21% in 2023, then 31% in 2024 and 52% in 2025), so 80% by 2028 or so is consistent with the predicted trend. This still just seems wild to me and I'm really curious how the forecasters are reasoning about this."
I don't know if we can be confident in the exact 95% figure, but o1 does consistently perform at a roughly similar level across a variety of math benchmarks (e.g., AIME), and others have found strong performance on math tasks that are unlikely to have been in the training corpus.
I would like to note that this dataset is not as hard as it might look. Humans performed relatively poorly because there was a strict time limit (I don't remember exactly, but it was something like 1 hour for 25 tasks), and IIRC the medalist only made arithmetic errors. I am pretty sure any IMO gold medalist would typically score 100% given, say, 3 hours.
Nevertheless, it's very impressive, and AIMO results are even more impressive in my opinion.
In 2021, I predicted that math would be basically solved by 2023 (using the kind of reinforcement learning on formally checkable proofs that DeepMind is using). Progress has been slower than I expected, and I wouldn't have guessed that a less formal setting like o1's would work so well, but since then I just nod along to these kinds of results.
(Not sure what to think of that claimed 95% number though - wouldn't that kind of imply they'd blown past the IMO Grand Challenge? EDIT: There were significant time limits on the human participants, see Qumeric's comment.)
A nice test might be the 2024 IMO (from July). I'm curious to see if it's reached gold medal performance on that.
The IMO Grand Challenge might be harder; I don't know how Lean works in detail, but formal Lean proofs are probably harder to write than human-readable LaTeX.
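To make the contrast concrete, here is a minimal sketch (assuming Lean 4 with a standard mathlib setup; `Nat.exists_infinite_primes` is an existing mathlib lemma). In LaTeX you just typeset the claim and a human referee checks the prose proof; in Lean, the statement itself must typecheck and every step must be machine-verified:

```lean
import Mathlib

-- The LaTeX version is a single typeset line:
--   \forall n \in \mathbb{N},\ \exists p \geq n \text{ such that } p \text{ is prime.}
-- The Lean version must be stated in a formal language, and the proof has to
-- be accepted by the kernel. Here we cite mathlib's existing lemma rather
-- than formalizing Euclid's argument from scratch.
theorem infinitely_many_primes (n : ℕ) : ∃ p, n ≤ p ∧ p.Prime :=
  Nat.exists_infinite_primes n
```

For an actual IMO problem there is usually no ready-made lemma to cite, so the solver has to produce the entire formal proof, which is a much higher bar than writing a convincing natural-language solution.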
The rate of progress is surprising even to experts pushing the frontier... Another example: https://x.com/polynoamial/status/998902692050362375