This really resonates with me. I don't work in AppSec, but I've seen how benchmark gains often fail to show up when you're doing something non-trivial with the model. It seems that current benchmarks have low ecological validity. That said, I wouldn't be quick to blame labs for cheating. They may or may not be, but it might just be that we're bad at designing evaluations that track real-world usefulness.

When you think about it, even university exams don't really predict job performance either, and those are benchmarks we've had centuries to refine. Measuring ability is hard. Measuring reliability and adaptability alongside it is even harder. For agentic systems, tasks would mostly involve multiple iterations over longer periods of time, reviews, and some form of multi-tasking that spans contexts. That's far beyond what current benchmarks test for: an agent isn't just solving a neatly scoped problem and calling it a day.