Change in 18 latent capabilities between GPT-3 and o1, from Zhou et al (2025)
This is the third annual review of what’s going on in technical AI safety. You could stop reading here and instead explore the data on the shallow review website.
It’s shallow in the sense that 1) we are not specialists in almost any of it and 2) we only spent about two hours on each entry. Still, among other things, we processed every arXiv paper on alignment, all Alignment Forum posts, and a year’s worth of Twitter.
It is substantially a list of lists structuring 800 links. The point is to produce stylised...
This really resonates with me. I don't work in AppSec, but I've seen how benchmark gains often fail to show up when you're doing something non-trivial with the model. It seems that current benchmarks have low ecological validity. That said, I wouldn't be quick to blame the labs for cheating; they may or may not be, but it might just be that we're bad at designing evaluations that track real-world usefulness.
When you think about it, even university exams don't really predict job performance, and those are benchmarks we've had centuries to refine. Measuring ability is hard; measuring reliability and adaptability alongside it is even harder. For agentic systems, real tasks involve multiple iterations over longer periods, reviews, and some form of multi-tasking that spans contexts. That is far beyond what current benchmarks test: the model isn't just solving a neatly scoped problem and calling it a day.
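To make that concrete, here's a rough sketch of the contrast I have in mind. Everything in it (the `ask` interface, `Episode`, `score_episode`, the toy grading) is made up for illustration, not any real benchmark's API: a single-shot item grades one answer to one prompt, while an agentic-style eval has to grade a whole trajectory, with earlier outputs carried forward and interruptions interleaved.

```python
from collections.abc import Callable, Sequence
from dataclasses import dataclass, field

# `ask` stands in for whatever model interface you have: prompt in, text out.
Ask = Callable[[str], str]


def score_single_shot(ask: Ask, prompt: str, expected: str) -> bool:
    """Typical benchmark item: one prompt, one answer, exact-match grading."""
    return ask(prompt).strip() == expected


@dataclass
class Episode:
    """A longer-horizon task: dependent steps, graded over the whole trajectory."""
    steps: Sequence[str]                  # instructions issued over time
    check: Callable[[list[str]], float]   # grades the transcript, not one answer
    context: list[str] = field(default_factory=list)


def score_episode(ask: Ask, episode: Episode,
                  interruptions: Sequence[str] = ()) -> float:
    """The model works through the steps in order, its earlier replies stay in
    context, and unrelated interruptions are interleaved to test whether it
    can resume the original thread (reliability, not just ability)."""
    transcript: list[str] = []
    for i, step in enumerate(episode.steps):
        prompt = "\n".join([*episode.context, *transcript, step])
        transcript.append(ask(prompt))
        if i < len(interruptions):        # mid-task context switch
            transcript.append(ask(interruptions[i]))
    return episode.check(transcript)
```

Even a toy episode like "keep a running total across three steps while I ask you about something unrelated in the middle" probes things the single-shot scorer can't see: whether the model stays on task after a context switch, and whether it's still correct at step N conditioned on its own earlier outputs.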