(disclaimer: I work on evaluation at OpenAI, run the o3 evaluations, etc.)
I think you're saying "bullshit" when you mean "narrow". The evidence for large capability improvements in math and tightly scoped coding since 4o is overwhelming; see e.g. AIME 2025, Gemini on USAMO, or copy-pasting a recent Codeforces problem.
The public evidence for improvement on broad/fuzzy tasks is weaker, though the o1 MMLU boosts and various vibes evals (Tao) do show it.
How much these large narrow improvements generalize is a very important question, and I try to approach it humbly.
Hopefully new models improve on your benchmark — do share if so!
Is there an o3 update yet?