OpenAI reports that o3-mini with high reasoning effort and a Python tool scores 32% on FrontierMath. However, Epoch's official evaluation[1] measured only 11%.
There are a few reasons to trust Epoch's score over OpenAI's:
- Epoch built the benchmark and has better incentives.
- OpenAI reported a 28% score on the hardest of the three problem tiers - suspiciously close to their overall score.
- Epoch has published quite a bit of information about its testing infrastructure and data, whereas OpenAI has published close to none.
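To make the second bullet concrete: an overall score is a weighted average of the per-tier scores, so the reported 28% (hardest tier) and 32% (overall) together pin down what the model would have had to score on the two easier tiers combined. Here is a rough sketch of that arithmetic. The tier weights below are hypothetical placeholders, not figures published by OpenAI or Epoch.

```python
def implied_easier_tier_score(overall, hardest, hardest_weight):
    """Score on the non-hardest tiers implied by a weighted average.

    overall = hardest_weight * hardest + (1 - hardest_weight) * easier
    Solving for `easier` gives the expression below.
    """
    return (overall - hardest_weight * hardest) / (1.0 - hardest_weight)


# Reported figures from the post: 32% overall, 28% on the hardest tier.
OVERALL = 0.32
HARDEST = 0.28

# Hypothetical shares of the benchmark made up by the hardest tier
# (assumed values for illustration only).
for weight in (0.10, 0.25, 0.50):
    easier = implied_easier_tier_score(OVERALL, HARDEST, weight)
    print(f"hardest-tier share {weight:.0%}: easier tiers ~ {easier:.1%}")
```

Under any of these assumed weights, the implied score on the easier tiers is only a few points above the hardest-tier score, which is what makes the 28% figure look odd.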
Edited in Addendum:
Epoch has this to say in their FAQ:
The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time compute, or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private).
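As a sanity check on whether the gap could be plain sampling noise, here is a back-of-the-envelope sketch. It assumes (my assumption, not something either party has confirmed) that the 32% was measured on the 180-problem set and the 11% on the 290-problem set, and it treats each problem as an independent pass/fail trial.

```python
import math


def standard_error(p, n):
    """Binomial standard error of a pass rate p measured on n problems."""
    return math.sqrt(p * (1.0 - p) / n)


def gap_in_standard_errors(p1, n1, p2, n2):
    """Difference between two pass rates, in units of its standard error."""
    combined = math.sqrt(standard_error(p1, n1) ** 2 + standard_error(p2, n2) ** 2)
    return (p1 - p2) / combined


# Assumed pairing of scores and problem counts (see the note above).
z = gap_in_standard_errors(0.32, 180, 0.11, 290)
print(f"gap is about {z:.1f} standard errors")
```

A gap of several standard errors would mean noise alone is an unlikely explanation. Note that this says nothing about whether the two subsets differ in difficulty, which is a separate possibility the FAQ raises.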
[1] Which had Python access.
My strong guess is that OpenAI's results are real; it would really surprise me if they were literally cheating on the benchmarks. It looks like they are just using much more inference-time compute than is available to any outside user, along with a clever scaffold that lets the model use the extra inference time productively. Elliot Glazer (creator of FrontierMath) says in a comment on my recent post on FrontierMath: