(disclaimer: I work on evaluation at OpenAI, run the o3 evaluations, etc.)
I think you're saying "bullshit" when you mean "narrow". The evidence for large capability improvements in math and tightly scoped coding since 4o is overwhelming; see e.g. AIME 2025, Gemini on USAMO, or copy-pasting a recent Codeforces problem.
The public evidence for improvement on broad/fuzzy tasks is weaker, though the o1 MMLU boosts and various vibes evals (Tao) do show it.
How much these large narrow improvements generalize is a very important question, and I try to approach it humbly.
Hopefully new models improve on your benchmark — do share if so!
Is there an o3 update yet?