Thanks for writing this. I disagree with some of the claims in this post but, as one of the authors on the original paper, I've been meaning to write a post on ways time horizon is overrated/misinterpreted. Limited number and distribution of tasks is definitely near the top of the list.
Lower R^2 is a natural consequence of the x-axis spanning a shorter range, in an experimental setup with both signal and noise. In the extreme, imagine that we only benchmarked models on 47-minute tasks. AIs will naturally do better at some 47-minute tasks than others. Since R^2 is the variance explained by the correlation with x, and x is always 47 minutes, all the variance in task success rate will be due to other factors and the R^2 will be zero.
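Here's a toy simulation of that effect (mine, with made-up numbers, not the paper's data): the same noisy relationship between log task length and success, but R^2 collapses when every task is roughly the same length.

```python
# Toy illustration (not METR data): the same underlying trend between log task
# length and success yields a much lower R^2 when the tasks span a narrow range
# of lengths, because little x-variance is left to explain relative to the noise.
import numpy as np

rng = np.random.default_rng(0)

def r_squared(x, y):
    """R^2 of a simple least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1 - resid.var() / y.var()

def simulate(task_minutes):
    """Toy success score = linear trend in log2(length) + task-specific noise."""
    x = np.log2(task_minutes)
    y = 1.0 - 0.15 * x + rng.normal(0, 0.1, size=len(x))
    return r_squared(x, y)

wide = np.exp(rng.uniform(np.log(1), np.log(960), 200))    # 1 min to 16 h
narrow = 47.0 * rng.uniform(0.9, 1.1, 200)                 # all ~47-minute tasks

print("wide range of lengths:  R^2 ≈", round(simulate(wide), 2))   # high
print("all ~47-minute tasks:   R^2 ≈", round(simulate(narrow), 2)) # near zero
```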
In fact, this is why we constructed SWAA: to increase the R^2 in worlds where the trend is real. In a world where models were not actually better at tasks that take humans less time, the R^2 would be lower when you add more data, so it's a very scientifically normal thing to do.
What would you have done, just not mention that the person works at Mechanize? I happen to know someone who works there and had this convo in person, not due to the Twitter mind virus. And I think people should know why the claim in the post is relevant; omitting it would be omitting important context.
Unfortunately, the available benchmark tasks do not allow for 99%+ reliability measurements. Because we don't have 1,000 different one-minute tasks, the best we could do would be something like checking whether GPT-5.1 can do all 40 tasks 25 times each with perfect reliability. Most likely it will succeed at all of them, because we just don't have a task that happens to trip it up.
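A quick back-of-envelope (my own, purely hypothetical numbers) on why repeats don't substitute for distinct tasks:

```python
# Back-of-envelope (not from the paper): why 40 tasks x 25 runs can't establish
# 99%+ reliability across *tasks*. Repeats of the same task are correlated; what
# matters is how many distinct tasks you sample. Suppose (hypothetically) that
# some fraction of one-minute tasks would reliably trip the model up.
p_tricky = 0.01      # hypothetical: 1% of one-minute tasks are failure-inducing
n_tasks = 40

print(f"P(no tricky task among {n_tasks}) ≈ {(1 - p_tricky) ** n_tasks:.2f}")   # ≈ 0.67
# A clean sweep of 40 tasks is likely even if the model fails ~1% of tasks,
# whereas a clean sweep of 1,000 distinct tasks would be strong evidence:
print(f"P(clean sweep of 1,000 tasks) ≈ {(1 - p_tricky) ** 1000:.5f}")          # ≈ 0.00004
```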
As for humans' 99.9%: at a granular enough level, the time horizon would be about 0.2 seconds (typing one keystroke), because few people have better than 99.9% per-keystroke accuracy. But in the context of a larger task, we can correct our typos, so it isn't super relevant.
I knew this guy to be operating in good faith, and the point is a meaningful caveat to the model that the safety/capabilities investment ratio is what matters.
I talked to someone at an AI company [1] who thought safety would be net unaffected by timelines. The argument was:
However, even if 1-3 are true, I claim 4 doesn't follow.
4 only holds if the time lag decreases in lockstep with increasing investment. However, this seems false because most ways capabilities would speed up involve more parallel labor/compute or faster algorithmic progress, not faster feedback loops from the real world that would transfer to safety.
By analogy, suppose that the speed of cars increased from 20mph to 100mph in one year in 1910. Even if the same technology sped up the invention of airbags, ABS, etc. to 2010 levels, more people would die in car crashes, because those features couldn't be installed until the next model year [2]. The alternative would be for companies to proactively do real-world testing and delay the release of the super-fast car until such safety features could be invented and integrated, which would be a huge competitive disadvantage. Likewise, if cyberattacks/power-seeking/etc. are solvable but require real-world data, and fixes only make it into the next model release, then immediately getting a 2035-era superintelligence plus 10 years of safety research will result in way more cyberattacks and power-seeking.
[1] Mechanize, specifically
[2] Retooling Ford factories was actually super expensive; I read somewhere that it required shutting down factories for six months.
We can use the number of mistakes to get a very noisy estimate of Claude 4.5 Sonnet's coffee time horizon. By my count, Claude made three unrecoverable mistakes that required human assistance:
Now, this was a "try until success" task rather than a success/failure task. But if we try to apply the same standards as the METR benchmark, the task needs to be economically valuable (so it includes adding milk/sugar), and any mistake that would make it non-viable to automate should count as a failure. I think any robot butler that typically made one of these mistakes would be unemployable.
I'd guess an experienced human would take about 7 minutes to make coffee in an unfamiliar house if they get the milk and sugar ready while the kettle is boiling, so we get a rate of one failure every 2.3 human-minutes, which means a 50% chance of success would occur at around ln(2) * 2.3 ≈ 1.6 minutes. Of course, this is just one task, but we already know its coffee time horizon isn't something like 20 minutes: the probability of three failures in 7 minutes from a Poisson process with rate ln(2)/20 per minute is only about 0.2%. Claude says the 95% confidence interval is (33 seconds, 8 minutes).
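For transparency, here's a sketch of the back-of-envelope calculation above (my own reconstruction, not METR methodology), assuming failures arrive as a constant-rate Poisson process over the ~7 human-minutes of the task:

```python
# Coffee time-horizon estimate from 3 unrecoverable failures in ~7 human-minutes,
# under a constant-hazard (Poisson) failure model.
import numpy as np
from scipy import stats

task_minutes = 7          # assumed human completion time
failures = 3              # unrecoverable mistakes observed

# Point estimate: failure rate and 50%-success time horizon
rate = failures / task_minutes                # failures per human-minute
horizon = np.log(2) / rate                    # ≈ 1.6 minutes

# How surprising would 3+ failures be if the true horizon were 20 minutes?
rate_20 = np.log(2) / 20
p_three_plus = stats.poisson.sf(failures - 1, mu=rate_20 * task_minutes)  # ≈ 0.2%

# Exact (Garwood) 95% CI on the rate, translated into a horizon CI
lo_events = stats.chi2.ppf(0.025, 2 * failures) / 2
hi_events = stats.chi2.ppf(0.975, 2 * (failures + 1)) / 2
horizon_hi = np.log(2) * task_minutes / lo_events    # ≈ 8 minutes
horizon_lo = np.log(2) * task_minutes / hi_events    # ≈ 33 seconds

print(f"horizon ≈ {horizon:.1f} min, 95% CI ≈ ({horizon_lo*60:.0f} s, {horizon_hi:.1f} min)")
print(f"P(3+ failures | 20-min horizon) ≈ {p_three_plus:.3f}")
```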
This is below trend for RLBench, though the data is extremely noisy. If I speculate anyway: maybe real-world tasks like making coffee are harder than RLBench or OSWorld tasks; coffee certainly requires much more planning than 5-20 second simulated robotics tasks. Or maybe Claude just hasn't been trained for the real world.
METR could probably use a methodology like this if we had more long tasks and labeling were free, so it may be worth looking into approaches like having smarter agents unblock dumber agents wherever we can automate things.
Tagging @Megan Kinniment, who has also thought about recoverable and unrecoverable failures.
I'm spending about 1/4 of my time thinking about how best to get data on this and predict whether we're heading for a software intelligence explosion. For now, one thought is that the inference scaling curve is more likely to be a power law, because a power law is scale-free and consistent with a world where AIs are prone to get stuck when doing harder tasks, but get stuck less and less as their capability increases.
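To spell out the scale-free point (my framing, with $f(C)$ standing for whatever metric the inference scaling curve tracks as a function of compute $C$): a power law $f(C) \propto C^{\alpha}$ satisfies

$$\frac{f(kC)}{f(C)} = k^{\alpha} \quad \text{for every } C,$$

so multiplying compute by a fixed factor buys the same multiplicative gain no matter where you start, whereas something like an exponential $f(C) \propto 2^{C/C_0}$ builds in a characteristic compute scale $C_0$.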
My current guess is still something like the independent-steps model, which yields a power law.
If you want it to be the default, LW should enable it by default with a checkbox for "Hide score".
If you don't emotionally believe in enough uncertainty to use normal reasoning methods like "what else has to go right for the future to go well, and how likely does that feel" or "what level of superintelligence can this handle before we need a better plan", and you want to think about the end-to-end result of an action, and you don't want to use explicit math or language, I think you're stuck. I'm not aware of anyone who has successfully used the dignity frame (maybe habryka?). It seems to replace estimating EV with something much more poorly defined which, depending on your attitude towards it, may or may not be positively correlated with what you care about. I also think doing this inner sim end-to-end adds a lot more noise than just thinking about whether the action accomplishes some proximal goal.