OpenAI reports that o3-mini with high reasoning and a Python tool scores 32% on FrontierMath. However, Epoch's official evaluation[1] found a score of only 11%.
There are a few reasons to trust Epoch's score over OpenAI's:
- Epoch built the benchmark and has better incentives.
- OpenAI reported a 28% score on the hardest of the three problem tiers - suspiciously close to their overall score.
- Epoch has published quite a bit of information about its testing infrastructure and data, whereas OpenAI has published close to none.
Edited in Addendum:
Epoch has this to say in their FAQ:
The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time compute, or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private).
[1] Which had Python access.
Fixed the link.
IMO that's plausible, but it would be pretty misleading, since they described it as "o3-mini with high reasoning" and had "o3-mini (high)" in the chart, and o3-mini high is what they call a specific option in ChatGPT.