A quick comment: the o3 and o3-mini announcements each have two significantly different scores, one <= 10% and the other >= 25%. Our own eval of o3-mini (high) got a score of 11% (it's on Epoch's Benchmarking Hub). We don't actually know what the higher scores mean; they could reflect some combination of extreme compute, tool use, scaffolding, majority vote, etc. But we're pretty sure there is no publicly accessible way to get that level of performance out of the model, and certainly not performance capable of "crushing IMO problems."
I do have the reasoning trac...
Yes, the privacy constraints make the implications of these improvements less legible to the public. We have multiple plans for disseminating information within this constraint, such as publishing author survey comments on the reasoning traces and running our competition at the end of the month to establish a sort of human baseline.
Still, I'm not sure the privacy of FrontierMath is worth all the roundabout effort we must engage in to explain it. For future projects, I would be interested in other approaches to balancing preventing models from training o...