Reasons time horizon is overrated and misinterpreted:
In the 9 months since the METR time horizon paper came out (during which AI time horizons have increased by ~6x), it has generated lots of attention as well as various criticism on LW and elsewhere. As one of the main authors, I think much of the criticism is a valid response to misinterpretations, and I want to list my beliefs about the limitations of our methodology and of time horizon more broadly. This is not a complete list, but rather whatever I thought of in a few hours.
* Time horizon is not the length of time AIs can work independently
* Rather, it’s the amount of serial human labor they can replace with a 50% success rate. When AIs solve tasks, they’re usually much faster than humans. (A sketch of how this is estimated from per-task data follows this bullet.)
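To make the definition concrete, here is a minimal sketch of how a 50% time horizon could be estimated from per-task data, assuming you have each task’s human completion time and a binary success outcome for the model. The logistic-fit-on-log-time approach follows the general shape of the paper’s methodology, but the function name, toy data, and details are illustrative, not METR’s actual code.

```python
# Hypothetical sketch: estimate a 50% time horizon from (human time, success) pairs.
# Assumes success probability is modeled as a logistic function of log2(task length),
# and the horizon is the task length where the fitted curve crosses 50%.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fifty_percent_horizon(human_minutes, successes):
    """Return the task length (minutes) at which predicted success probability is 50%."""
    X = np.log2(np.asarray(human_minutes, dtype=float)).reshape(-1, 1)
    y = np.asarray(successes)
    model = LogisticRegression().fit(X, y)
    # P(success) = 0.5 where the linear predictor is zero: w * log2(t) + b = 0
    log2_t50 = -model.intercept_[0] / model.coef_[0][0]
    return 2.0 ** log2_t50

# Toy data: the model tends to succeed on short tasks and fail on long ones.
minutes = [2, 5, 10, 30, 60, 120, 240, 480]
success = [1, 1, 1, 1, 0, 1, 0, 0]
print(f"50% time horizon ≈ {fifty_percent_horizon(minutes, success):.0f} minutes")
```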
* Time horizon is not precise
* When METR says “Claude Opus 4.5 has a 50%-time horizon of around 4 hrs 49 mins (95% confidence interval of 1 hr 49 mins to 20 hrs 25 mins)”, we mean those error bars. They were generated by bootstrapping over tasks, so a random resample that happens to draw harder tasks makes our code spit out a number below 1h 49m about 2.5% of the time (see the sketch after this bullet). I really have no idea whether Claude’s “true” time horizon is 3.5h or 6.5h.
* Error bars have historically been a factor of ~2 in each direction, and are worse for current models like Opus 4.5 as our benchmark begins to saturate.
* Because model performance is correlated, error bars for relative comparisons between models are a bit smaller. But it still makes little sense to care about whether a model is just below frontier, 10% above the previous best model, or 20% above.
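And here is a minimal sketch of the kind of bootstrap that produces those error bars, reusing the hypothetical fifty_percent_horizon() helper and toy data from the snippet above: resample tasks with replacement, recompute the horizon for each resample, and take percentiles of the resulting estimates. A resample that happens to draw mostly harder tasks yields a low estimate, which is what the 2.5th-percentile figure reflects. Again, this is illustrative; METR’s pipeline has more structure (multiple runs per task, task weighting, etc.).

```python
# Hypothetical sketch: bootstrap a 95% confidence interval for the 50% time horizon
# by resampling tasks with replacement and re-estimating the horizon each time.
import numpy as np

def bootstrap_horizon_ci(human_minutes, successes, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    mins = np.asarray(human_minutes, dtype=float)
    succ = np.asarray(successes)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(mins), size=len(mins))  # resample task indices
        try:
            estimates.append(fifty_percent_horizon(mins[idx], succ[idx]))
        except ValueError:
            continue  # skip degenerate resamples where every outcome is the same
    lo, hi = np.percentile(estimates, [2.5, 97.5])
    return lo, hi

lo, hi = bootstrap_horizon_ci(minutes, success)
print(f"95% bootstrap CI: {lo:.0f} to {hi:.0f} minutes")
```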
* Time horizon differs between domains by orders of magnitude
* The original paper measured it on mostly software and research tasks. A follow-up applying the same methodology found that time horizons are fairly similar for math, but 40-100x lower for visual computer-use tasks, due to e.g. poor perception.
* Claude 4.5 Sonnet’s real-world coffee-making time horizon is only ~2 minutes
* Time horizon does not apply to every task distribution