cakubilo

People aren't properly calibrated on FrontierMath

As most of you know, OpenAI recently showcased o3's SOTA results on various benchmarks. In my opinion, FrontierMath was the hardest of the bunch, which was reflected in model performance: the SOTA was 2% before Friday. It also seems to be the benchmark with the least visibility...

Dec 23, 2024

LESSWRONG