(As an employee of the European AI Office, it's important for me to emphasize this point: The views and opinions of the author expressed herein are personal and do not necessarily reflect those of the European Commission or other EU institutions.)

When OpenAI first announced that o3 achieved 25% on FrontierMath, I was really freaked out. The next day, I asked Elliot Glazer, EpochAI's lead mathematician and the main developer of FrontierMath, what he thought. He said he was also shocked, and based on that result he expected o3 to "crush the IMO" and get an easy gold.

In retrospect, it really looks like we over-updated. While the public can't easily try o3 yet, we do have access to o3-mini (high), which achieves 20% on FrontierMath given 8 tries, and 32% using a Python tool. This seems pretty close to o3's result. We don't know how much extra affordance o3 had while solving the problems, but based on OpenAI's communication, it was plausibly similar to what o3-mini (high) had when using the Python tool.

In spite of its great scores on FrontierMath, o3-mini (high) is nowhere close to "crushing the IMO". To the best of my knowledge, it can't solve a single IMO problem from recent years, and in my experiments it does somewhat worse than I did in 9th grade on the high school competitions I participated in back then.[1][2][3] Other mathematicians and people with a competitive math background whom I asked report similar experiences.

That's still impressive from an LLM, and the pace of progress is admittedly very fast.[4] Nonetheless, it's not what we originally expected when 25% on FrontierMath was announced. What explains the discrepancy?

Part of the story might be that OpenAI elicits the model's capabilities better than I do. It looks like they give it more inference time, they give it tools, and in some experiments they give the AI more than one try. In contrast, I only experimented in the normal o3-mini (high) chat interface, only gave at most a few tries per problem, and while I experimented with a few different prompts, I might have missed some clever elicitation prompt.

Differences in elicitation certainly seem to matter. OpenAI claims that "On FrontierMath, when prompted to use a Python tool, o3‑mini with high reasoning effort solves over 32% of problems on the first attempt, including more than 28% of the challenging (T3) problems", while EpochAI's own agent scaffolding only got o3-mini (high) to solve 11% of the problems, and they "find that o3-mini (medium) has a much higher solve rate on Tier 1 problems than on Tier 2 and Tier 3", which is not in line with OpenAI's very similar reported scores for T3 and the full dataset. EpochAI comments: "Our results differ from OpenAI's reported o3-mini numbers, possibly due to OpenAI using a more powerful internal scaffold, more test-time compute, or because our eval is on the full FM set. We lack full insight into their methods."

It's possible that the elicitation techniques that got OpenAI to 32% on FrontierMath would also make the models much better on my 9th grade competition problems, in which case many of the points I'm making in this post would be less true. I'm skeptical, though, that OpenAI's elicitation would make that big a difference on my problems: it really looks like the models are going in circles, giving them more inference time stops helping after a point, and I can't imagine how Python tool use would help produce a good proof. I think it's also suggestive that OpenAI didn't brag about IMO results, which makes it more likely that they can't get good performance there either. But still, take everything I say from now on with a grain of salt, in case OpenAI's elicitation actually makes a big difference.

(Edit: Elliot Glazer says in the comments that OpenAI's results are achieved with drastically longer reasoning time than is available to users, and, contrary to my expectations, the model is often pretty good at leveraging the longer inference time to catch and correct its mistakes. This is an interesting update; read the rest of the post with this in mind, though I think many of my points, especially in the later sections, still stand.)

I suspect that the main cause of the discrepancy between my results and the FrontierMath scores is not elicitation, but the different nature of the problems. For ease of grading, FrontierMath only uses problems where the answer is an integer or another easily describable concrete expression, like a matrix. There are competitions like this for human students too, a notable example being AIME, a test that o3-mini also excels at. I find that competitions like this, where the answer needs to be an integer, usually require pretty different skills than "normal" competitions that require proofs, or than the job of mathematicians. In particular, when I read through the AIME problems, I find that the main idea is almost always obvious once you've seen enough similar problems, and the difficulty comes from correctly doing the calculation in the 12 minutes you can spend on a problem on average. It is no surprise that AIs do better on AIME than on proof-based problems, which usually need more complex ideas to solve.

Why can't we make problems where the solution is a number but that still require the same creativity we usually see in problems with proofs? In theory, I don't see a clear obstacle. In practice, however, I helped with problem selection for some math competitions that needed numbers as solutions, and I found that it's surprisingly hard to make creative math problems in this format. I heard the same from others working on creating math problems.

I heard that the creators of FrontierMath tried very hard to create problems that require similar reasoning and creative abilities as normal problems with proofs. I applaud their effort, and I'm sure they managed to create many such interesting problems. Unfortunately, even if they made many of their problems interesting and creative, I suspect that many of the others fall into uninteresting patterns similar to the AIME problems, or to the problems I tried to create for other competitions that needed a number as their solution. At least, that's the only way I can explain how o3-mini (high) (with tool assistance) gets 32% on FrontierMath while still falling short of me in 9th grade on competition problems with proofs.

It's also notable that in OpenAI's tests, o3-mini (high) solves 28% of the supposedly challenging T3 problems on FrontierMath, barely below its overall score. I think this strengthens my belief that FrontierMath doesn't give a really good feel for "how good a mathematician the AI is". If it gets similar scores on IMO-level and research-level problems, then is it as good as a smart high schooler or as a researcher? My guess is neither: a similar percentage of problems in the two tiers just turned out to be unexpectedly idea-less and simple for the AI.[5]

All of this makes it very hard to interpret FrontierMath scores for the public, who can't see the dataset. If o4 came out tomorrow and achieved a 60% score on FrontierMath, how should I update based on that? Did it just improve from 8th grade to 9th grade level in proof-requiring math? Or did o3-mini already burn through most of the problems that turned out to be unexpectedly simple, so the jump from 30% to 60% means that o4 started doing much smarter things and is now at researcher level in proofs? I have no way of knowing, and the only thing I could do is wait for the new model to be released so I can test it on a few problems myself.

Given that even the creators of FrontierMath were caught by surprise by the models getting 32% on FrontierMath while still being very bad at the IMO, I don't think anyone can reliably predict what the transfer will be between higher scores on FrontierMath and better performance on problems with proofs.

What is the purpose of benchmarks?

The reason the AI safety community is interested in creating benchmarks is usually something like this:

There is an important type of capability, and the AI safety community wants to inform the world[6] about how AIs are progressing in it. The capability needs to be quantified to be measurable, so they create a benchmark intended to represent it as closely as possible. Then people measure new AIs against the benchmark and hope that this new information will lead to positive outcomes.

Maybe decision-makers will find it scary when they see how strong the AIs are, and will take AI risk more seriously. If they don't find it scary, at least they might have better awareness of where AIs are now, which hopefully helps them make better decisions. Maybe there will be red lines and if-then commitments conditioned on reaching certain capability levels. Maybe if we draw a line through the data points we have, we can make a reasonable forecast of when AIs will be capable enough that certain actions are required.

The story for how the existence of benchmarks leads to good outcomes is more straightforward for dangerous capability evaluations, though some [7] are skeptical even there that it has much of a positive effect.

It's less clear why it's good from the perspective of AI safety to create a benchmark for a general capability like mathematics that isn't strongly tied to any risk model. It probably accelerates AI improvement by providing a metric to iterate against, while it's unclear in which direction a cool result on a math benchmark moves the public discourse.

Still, there is something to be said for the position that being better informed about the world is just good in general, and we should just measure and honestly report all sorts of capabilities. Selfishly, I certainly find it useful to look at AI capabilities on math, as this is the domain I'm most comfortable in, so this is what gives me the most intuitive "feel" of how smart the AIs currently are. And even more selfishly, I'd like to know when I should alert my math contest organizer friends that cheating is becoming easier, and when I should alert my mathematician friends that they are going to lose their jobs.

Unfortunately, even if we grant that it's good to be well-informed about how good AIs are at math, FrontierMath basically doesn't help with that. I think I am more or less the perfect target audience for FrontierMath results, and as I said above, I would have no idea how to update on the AIs' math abilities if it came out tomorrow that they are getting 60% on FrontierMath.

Unlike with earlier, simpler benchmarks, I can't just look at a few example problems and get a feel for how impressive it is to get X% on the benchmark. Most of the example problems involve too much background knowledge that I lack, so I can't assess how much reasoning ability it takes to solve them once someone knows the relevant field.

I was already skeptical of the theory of change of "Mathematicians look at the example problems, get a feel for how hard they are, then tell the world how impressive an X% score is". But I further updated downward on this when I noticed that the very first public FrontierMath example problem (Artin primitive root conjecture) is just nonsense as stated,[8][9] and apparently no one reported this to the authors before I did a few days ago. Spotting that the problem doesn't make sense took 5 minutes, and the example problems have been publicly available for 4 months now, which puts an upper bound on how many mathematicians have scrutinized the public example problems closely enough to get a good feel for their nature and difficulty.

This means that I don't have a good intuition for the difficulty of FrontierMath problems, and apparently no one else in the public really does either, so we can't really tell what it means for an AI to do well on the benchmark. I can't tell my friends how useful o3-mini is for cheating on the competitions they organize based on FrontierMath results, and I certainly can't tell them how much I expect the AIs to help them in their jobs as mathematicians. The transfer between FrontierMath scores and math involving proofs (the only type of math that mathematicians actually care about and have some intuitive feeling for)[10] is just too weak and uncertain for that.

This means that I really don't know what to do with FrontierMath results. I guess they're useful for comparing existing models: o3-mini (high) getting a higher score than Claude 3.7 probably reflects a real difference in math capabilities, and it can be useful strategic information to know which companies are currently on top. Still, it's kind of sad if the only use the AI safety community gets out of a benchmark is as a contest among companies.

How can a benchmark be more informative?

I don't want to single out FrontierMath as unusually bad in this regard. I only write about it because I was originally unusually excited about this benchmark, assembled with great work and care, focusing on a domain that's especially close to me. In the case of some other benchmarks, like Humanity's Last Exam, I have an even harder time seeing what they are supposed to inform me about, or how I could update on their results.[11]

How could we make benchmarks that are more informative to people who don't have an intimate feel for the dataset and are not extremely knowledgeable about the subject domain?

While I have no personal experience creating benchmarks, and I'm fully aware that these things are always easier said than done, here are a few framings I would suggest:

  1. Focus on capabilities which have a clear path to danger or dramatic consequences. If you can demonstrate how much progress AIs make towards making bioweapons or automating the job of OpenAI's engineers, it's pretty straightforward to explain why people should care about these capabilities.
  2. Try to anchor to existing jobs and competitions. People (at least if they are somewhat socially close to these jobs and competitions) have some intuitive feel for how impressive it is to be an IMO medalist or a Google programmer. So if you can measure the AIs' performance on prestigious competitions, or on real-life tasks from a company's workflow, that can give some idea of how good the AI is and how fast it's progressing.[12]
  3. Report easily understandable metrics, even if the definitions need to be a little handwave-y.
    1. I really like work that tries to determine the longest task time horizon at which an AI is competitive with humans. Even for people who don't have domain expertise or an intuitive feel for the hardness of the example problems, it's easy to look at the trend line and see the pace of progress; indeed, that's how I usually explain AI progress to my friends. (See the sketch after this list.)
    2. Analogies with age and academic level often sloppily conflate knowledge with intelligence, but I still believe they're a reasonable framing of AI development that people can easily understand. That's how I mentally track the AIs' improvements in math: o1 and o3-mini (high) feel about as good as I was in 8th grade, o1-preview was 7th grade, etc.
    3. Human speed-ups are a great metric, though they vary a lot between people and tasks. Still, I think AIs are good enough that they can accelerate most professionals in most domains at least a little, so we can start measuring how this acceleration increases and fit trend lines to it.
  4. If possible, test reasoning and other interesting capabilities with problems requiring minimal background knowledge. That way, smart amateurs have an easier time evaluating how impressive the example problems from the dataset are.
  5. In mathematics, use proofs instead of numerical solutions. It's harder to grade and a little less objective, but I think it should be doable, and it gives better information on the AIs' progress in the type of math that mathematicians actually care about.
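
To make the time-horizon framing in point 3.1 a bit more concrete, here is a minimal sketch of what "fit a trend line and read off the pace of progress" could look like. All the numbers below (release dates and task horizons) are made up purely for illustration, and the exponential fit is just one simple modelling choice, not a claim about how such studies are actually run.

```python
import math
from datetime import date

# (release date, longest task horizon in minutes at which the model is roughly
# competitive with humans) -- purely hypothetical numbers for illustration.
observations = [
    (date(2023, 3, 1), 2),
    (date(2023, 11, 1), 5),
    (date(2024, 6, 1), 12),
    (date(2024, 12, 1), 30),
    (date(2025, 2, 1), 45),
]

# Convert dates to years since the first observation and fit a straight line to
# log(horizon) by ordinary least squares, i.e. assume an exponential trend.
t0 = observations[0][0]
xs = [(d - t0).days / 365.25 for d, _ in observations]
ys = [math.log(h) for _, h in observations]

n = len(xs)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
    (x - x_mean) ** 2 for x in xs
)
intercept = y_mean - slope * x_mean

# Under the exponential assumption, the doubling time is ln(2)/slope years.
print(f"Doubling time: {12 * math.log(2) / slope:.1f} months")

# Extrapolate to an 8-hour (480-minute) task horizon.
years_to_480 = (math.log(480) - intercept) / slope
print(f"Trend reaches 8-hour tasks ~{years_to_480:.1f} years after {t0.isoformat()}")
```

The nice property of this kind of metric is that a single, easily explained number (a doubling time in months) falls out of the data, which is exactly the sort of thing that's easy to communicate to non-experts.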

PS: I want to reiterate that I really like Epoch's work in general, and they did a cool job with FrontierMath too. I'm just expressing my sadness that it's hard to gain useful information from the results, due to issues that unfortunately affect most other existing benchmarks too.

 

  1. ^

    For calibration, I later became an IMO silver medalist, so I was pretty good, though it's unlikely I would have made the IMO cut if I lived in the US instead of the much smaller Hungary.

  2. ^

    I mostly tested the AIs on problems from the Hungarian high school competition, KöMaL, which has a great collection of problems of various difficulty levels, with new ones coming out every month. (I recommend this resource to everyone interested in competitive math; every problem is available in English now.) In my experiments, o3-mini (high) does about as well as the best student of Hungary would do in 8th grade. In contest B, the one that's the appropriate difficulty for current AIs, students have one month to solve 6 problems out of the 8 given that month. The AIs currently solve about 40% of the 5-point problems and very few of the hardest 6-point problems. I think in 9th grade I usually spent around an hour on average on a 5-point problem, but almost always managed to solve it.

  3. ^

    An example of a 5-point problem from KöMaL from this December: "Jumpy the grasshopper is jumping around on the positive integers of the number line, visiting each exactly once. Is it possible that the lengths of his jumps produce every positive integer exactly once?" I couldn't make o3-mini (high) solve this problem, however many times I tried; in fact, it consistently got even the final answer wrong. In contrast, four 9th graders and one 8th grader from Hungary solved the problem in December.

  4. ^

    Though interestingly, in my experiments o3-mini (high) wasn't noticeably smarter than o1. However, both were significantly smarter than o1-preview, and much smarter than GPT-4o.

  5. ^

    A necessary caveat is that these scores come from OpenAI's tests with unknown methodology, and in EpochAI's tests, the models get significantly worse scores on T3 than on T1.

  6. ^

    And the AI safety community itself.

  7. ^

    I remember reading a quite well-articulated post making the point that evals are useless because we learned that the public and politicians don't care, but I can't find it anywhere. If someone finds it, I will link to it. 

  8. ^

    By the current definition, $S_x$ is probably an infinite set, which makes the problem nonsense. If you assume they meant $S_x$ to consist only of primes below $x$, the problem starts to have a meaning, but then the truncation at $x$ in the formula becomes superfluous and it reduces to $ord_{p,x}(a)=ord_p(a)$, which is not what they intended. When I contacted Olli, he told me that the mistake is that the definition of $d_x$ was supposed to be 

    This is really different from what's currently in the text:

    and I think it's hard to figure out what the intended definition was based on the current problem statement; I certainly couldn't.

  9. ^

    I don't want to criticize the authors too much here; I know well enough from having been a problem selector and creator for math competitions that it's easy to make mistakes in the text of a problem.

  10. ^

    Of course, other engineering-heavy fields care a lot about mathematics that only needs to get the result right and doesn't involve proofs. Math skills relevant to engineering are also interesting to measure, though my impression is that the AIs already excel at all the math that's needed for engineering. But I certainly wouldn't try to measure those skills with problems involving Artin's primitive root conjecture. FrontierMath aims to evaluate capabilities relevant to mathematicians, so it should be assessed from the mathematicians' perspective.

  11. ^

    A friend told me that the point could be that even though it's hard to have a feeling of what HLE is measuring and what the scores mean, we can at least tell the world once it's saturated that "We tried really hard to make a very hard benchmark and the AI got a very high score even on that, the age of benchmarks really is over". Which is fair enough; this can be a useful message to communicate.

  12. ^

    Even this is tricky though. If you didn't do competitive math yourself, would you know that you should be way, way more impressed by an AI solving IMO level combinatorics problems than IMO level geometry problems? I think it's just really hard to make the correct updates on AI progress based on results in a field you are not intimately familiar with.

Comments

Great post, thanks for writing it; I agree with the broad point.

I think I am more or less the perfect target audience for FrontierMath results, and as I said above, I would have no idea how to update on the AIs' math abilities if it came out tomorrow that they are getting 60% on FrontierMath. 

This describes my position well, too: I was surprised by how well the o3 models performed on FM, and also surprised by how hard it is to map this onto how good they are at math in common-sense terms.

I have slight additional information from contributing problems to FM, and it seems to me that the problems vary greatly in guessability. E.g. Daniel Litt writes that he didn't fully internalize the requirement of guess-proofness, whereas for me this was a critical design constraint I actively tracked when crafting problems. The problems also vary greatly in the depth vs. breadth of skills they require (another aspect Litt highlights). This heterogeneity makes it hard to get a sense of what 30% or 60% or 85% performance means.

I find your example in footnote 3 striking: I do think this problem is easy and also very standard. (Funnily enough, I have written training material that illustrates this particular method[1], and I've certainly seen it written up elsewhere as well.) Which again illustrates just how hard it is to make advance predictions about which problems the models will or won't be able to solve - even "routine application of a standard-ish math competition method" doesn't imply that o3-mini will solve it.

 

I also feel exhausted by how hard it is to get an answer to the literal question of "how well does model X perform on FrontierMath?" As you write, OpenAI reports 32%, whereas Epoch AI reports 11%. A twenty-one percentage point difference, a 3x ratio in success rate!? Man, I understand that capability elicitation is hard, but this is Not Great.[2]

That OpenAI is likely (at least indirectly) hill-climbing on FM doesn't help matters either[3], and the exclusivity of the deal presumably rules out possibilities like "publish problems once all frontier models are able to solve them so people can see what sort of problems they can reliably solve".

 

I was already skeptical of the theory of change of "Mathematicians look at the example problems, get a feel for how hard they are, then tell the world how impressive an X% score is". But I further updated downward on this when I noticed that the very first public FrontierMath example problem (Artin primitive root conjecture) is just nonsense as stated,[8][9] and apparently no one reported this to the authors before I did a few days ago.

(I'm the author of the mentioned problem.)

There indeed was a formula in the problem statement that was just nonsense, which I'm grateful David pointed out (and which is now fixed on Epoch AI's website). I think flagging the problem itself as just nonsense is too strong, though. I've heard that models have tried approaches that give approximately correct answers, so it seems that they basically understood from context what I intended to write.

That said, this doesn't undermine the point David was making about information (not) propagating via mathematicians.

  1. ^

    In Finnish, Tehtävä 22.3 here.

  2. ^

    Added on March 15th: This difference is probably largely from OpenAI reporting scores for the best internal version they have and Epoch AI reporting for the publicly available model, and that one just can't get the 32% level performance with the public version - see Elliot's comment below.

  3. ^

    There's been talk of Epoch AI having a subset they keep private from OpenAI, but evaluation results for that set don't seem to be public. (I initially got the opposite impression, but the confusingly-named FrontierMath-2025-02-28-Private isn't it.)

A quick comment: the o3 and o3-mini announcements each have two significantly different scores, one <= 10%, the other >= 25%. Our own eval of o3-mini (high) got a score of 11% (it's on Epoch's Benchmarking Hub). We don't actually know what the higher scores mean; they could come from some combination of extreme compute, tool use, scaffolding, majority vote, etc., but we're pretty sure there is no publicly accessible way to get that level of performance out of the model, and certainly not performance capable of "crushing IMO problems."

I do have the reasoning traces from the high-scoring o3-mini run. They're extremely long, and one of the ways it leverages the higher resources is to engage in an internal dialogue where it does a pretty good job of catching its own errors/hallucinations and backtracking until it finds a path to a solution it's confident in. I'm still writing up my analysis of the traces and surveying the authors for their opinions on them, and will also update e.g. my IMO predictions with what I've learned.

Thanks a lot for the answer, I put in an edit linking to it. I think it's a very interesting update that the models get significantly better at catching and correcting their mistakes in OpenAI's scaffold with longer inference time. I am surprised by this, given how much it feels like the models can't distinguish their plausible fake reasoning from good proofs at all. But I assume there is still a small signal in the right direction, and that can be amplified if the model thinks the question through a lot of times (and does something like majority voting within its chain of thought?). I think this is an interesting update towards the viability of inference-time scaling.

I think many of my other points still stand, however: I still don't know how capable I should expect the internally scaffolded model to be given that it got 32% on FrontierMath, and I would much rather have them report results on the IMO or a similar competition than on a benchmark I can't see and whose difficulty I can't easily assess.

Yes, the privacy constraints make the implications of these improvements less legible to the public. We have multiple plans for how to disseminate info within this constraint, such as publishing author survey comments regarding the reasoning traces and our competition at the end of the month to establish a sort of human baseline.

Still, I don't know that the privacy of FrontierMath is worth all the roundabout efforts we must engage in to explain it. For future projects, I would be interested in other approaches to balancing two goals: preventing models from training on public discussion of problems, and being able to clearly show the world what the models are tackling. Maybe it would be feasible to do IMO-style releases? "Here's 30 new problems we collected this month. We will immediately test all the models and then make the problems public."

I like the idea of IMO-style releases, always collecting new problems, testing the AIs on them, then releasing them to the public. How important do you think it is to only have problems with numerical solutions? If you can test the AIs on problems with proofs, then there are already many competitions that regularly release high-quality problems. (I'm shilling KöMaL again as one that's especially close to my heart, but there are many good monthly competitions around the world.) I think if we instruct the AI to present its solution in one page at the end, it's not that hard to get an experienced competition grader to read the solution and score it the way normal competition entries are scored, so the result won't be much less objective than with purely numerical answers. If you want to stick to problems with numerical solutions, I'm worried that you will have a hard time regularly assembling high-quality numerical problems again and again, and even if the problems are released publicly, people will have a harder time evaluating them than if they actually came from a competition where we can compare to the natural human baseline of the competing students.
