OpenAI reports that o3-mini with high reasoning and a Python tool scored 32% on FrontierMath. However, Epoch's official evaluation[1] found only 11%.
There are a few reasons to trust Epoch's score over OpenAI's:
- Epoch built the benchmark and has better incentives.
- OpenAI reported a 28% score on the hardest of the three problem tiers, suspiciously close to its overall 32%.
- Epoch has published substantial information about its testing infrastructure and data, whereas OpenAI has published almost none.
Edited-in addendum:
Epoch has this to say in their FAQ:
> The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time compute, or because those results were run on a different subset of FrontierMath (the 180 problems in `frontiermath-2024-11-26` vs the 290 problems in `frontiermath-2025-02-28-private`).
[1] Which had Python access.
I'm confused about the following: o3-mini-2025-01-31-high scores 11% on FrontierMath-2025-02-28-Private (290 questions), but 40% on FrontierMath-2025-02-28-Public (10 questions). The latter score is higher than OpenAI's reported 32% on FrontierMath-2024-11-26 (180 questions), which is surprising considering that OpenAI probably has better elicitation strategies and is willing to throw more compute at the task. Is this because:
a) the public dataset is only 10 questions, so there is some sampling bias going on (a quick check of this is sketched below), or
b) the dataset from 2024-11-26 is somehow significantly harder?
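On (a), here is a rough sanity check, under the strong (and possibly wrong) assumption that the 10 public problems are a random draw from the same difficulty distribution as the 290 private ones. It just asks: if the model's underlying accuracy really were the 11% measured on the private set, how often would a random 10-problem subset give a score of 4/10 (40%) or better?

```python
from math import comb

def binom_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed exactly from the pmf."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# If the "true" accuracy were the 11% measured on the 290-problem private set,
# how often would a random 10-problem subset yield 4 or more correct answers?
print(f"P(score >= 4/10 | p = 0.11) = {binom_tail(4, 10, 0.11):.3f}")  # ≈ 0.018
```

So under that assumption, sampling noise alone makes the 40% public score fairly surprising, though with only 10 problems any estimate is very noisy; and if the public problems were chosen non-randomly (e.g. as illustrative examples rather than a random sample), the comparison tells us even less.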