People aren't properly calibrated on FrontierMath
As most of you know, openAI has recently showcased o3's SOTA results on various benchmarks. In my opinion FrontierMath was the hardest of the bunch, and it was reflected in model performance as the SOTA was 2% before Friday. It also seems to be the benchmark with the least visibility....
I don't believe this is true, actually! What do you mean by "resolve the conjecture"? If you mean write up with a proof of it, then of course you can write a turing machine that will write... (read more)