Given that humans are our only existing example of decent agents, I think one obvious sanity check for proposed measures of AI agency is whether they are helpful for characterizing variation in human agency.
This seems like an obvious and apt question to ask, but I don't think it's an obvious sanity check, in the sense that "if a measure doesn't pass this check, that's a strong sign that it's not capturing what we care about."
AI minds are different from human minds! I think it's not at all surprising if they have different limiting constraints and therefore very different "capability profiles."
Like, for humans, working memory is an important constraint on many of the complicated intellectual operations that we do. And working memory correlates with overall cognitive ability.
When you try to measure human intelligence and figure out what it is made of, working memory is one of the major factors that falls out of the factor analysis.
But imagine aliens that have vastly larger working memories (or "context windows") than humans. These aliens might still vary in working memory capacity, but it might be close to irrelevant for predicting their overall cognitive performance, because the bottlenecks on their cognitive ability are something else entirely.
I think that's exactly the situation we're in with the AIs. Their minds are of a quite different shape than ours, and so good proxy metrics for human capability won't generalize to AIs, or vice versa.
Overall, great post.
My interpretation of the METR results is as an empirical observation of a trend that seems robust, in the same way scaling laws are. You could write the same post about why there's no robust first-principles reason that "cross-entropy loss decreases with scale in a way that correlates in an important, predictably useful way with an absurdly wide range of downstream tasks".
The METR paper itself almost entirely justifies the empirical prediction aspect, rather than making a first-principles argument for the approach from a theoretical perspective. I think the robustness of this analysis is why the paper had the impact it did. Are there specifics of the statistical analysis they did for the stuff around:
Since our tasks do not perfectly represent the average segment of intellectual labor by researchers and software engineers, this raises the question of external validity (Section 6): whether the exponential trend holds on real-world tasks. We include results from three supplementary external validity experiments.
That you think are sufficient to meaningfully change how valid people should interpret the overall predicted trend?
That you think are sufficient to meaningfully change how valid people should interpret the overall predicted trend?
I'm not Adam, but my response is "No", based on the description Megan copied in thread and skimming some of the paper. It's good that the paper includes those experiments, but they don't really speak to the concerns Adam is discussing. Those concerns, as I see it (I could be misunderstanding):
Do the experiments in Sec 6 deal with this?
We rated HCAST and RE-Bench tasks on 16 properties that we expected to be 1) representative of how real world tasks might be systematically harder than our tasks and 2) relevant to AI agent performance. Some example factors include whether the task involved a novel situation, was constrained by a finite resource, involved real-time coordination, or was sourced from a real-world context. We labeled RE-bench and HCAST tasks on the presence or absence of these 16 messiness factors, then summed these to obtain a "messiness score" ranging from 0 to 16. Factor definitions can be found in Appendix D.4. The mean messiness score amongst HCAST and RE-Bench tasks is 3.2/16. None of these tasks have a messiness score above 8/16. For comparison, a task like 'write a good research paper' would score between 9/16 and 15/16, depending on the specifics of the task.
On HCAST tasks, AI agents do perform worse on messier tasks than would be predicted from the task's length alone (b=-0.081, R2 = 0.251) ... However, trends in AI agent performance over time are similar for lower and higher messiness subsets of our tasks.
This seems like very weak evidence in favor of the hypothesis that Benchmark Bias is a big deal. But they just don't have very messy tasks.
c. SWE-Bench Verified: doesn't speak to 1 or 2.
d. Internal PR experiments: Maybe speaks a little to 1 and 2, because these are more real-world, closer-to-the-thing-we-care-about tasks, but not much, as they're still clearly verifiable and still software engineering.
I do think Thomas and Vincent's follow-up work here on time horizons for other domains is useful evidence pointing a little against the conceptual coherence objection. But only a little.
I guess my understanding is more that the conceptual coherence objection isn’t an objection to the predictive accuracy of the trend, which is why I had brought up the scaling law / pretraining loss / downstream task analogy.
As far as I understand, the messiness analysis bears on the Benchmark Bias objection insofar as it concerns predicting performance at any given point in time, but not the actual trend, given that the trend was similar for lower and higher messiness tasks.
Is your intuition that the trend itself is significantly wrong as well (like by more than their CI)? Or just the performance prediction at a given point in time? Or is the question ill-formed / undefined?
We care about the performance prediction at a given point in time for skills like "take over the world", "invent new science", and "do RSI" (and "automate AI R&D", which I think the benchmark does speak to). We would like to know when those skills will be developed.
In the frame of this benchmark, and Thomas and Vincent's follow up work, it seems like we're facing down at least three problems:
So my overall take is that I think the current work I'm aware of tells us
I think there is more empirical evidence of robust scaling laws than of robust horizon length trends, but broadly I agree—I think it's also quite unclear how scaling laws should constrain our expectations about timelines.
(Not sure I understand what you mean about the statistical analyses, but fwiw they focused only on very narrow checks for external validity—mostly just on whether solutions were possible to brute force).
fwiw they focused only on very narrow checks for external validity—mostly just on whether solutions were possible to brute force
This seems inaccurate to me. Here's the introduction to the external validity and robustness section of the paper:
To investigate the applicability of our results to other benchmarks, and to real task distributions, we performed four supplementary experiments. First, we check whether the 2023–2025 trend without the SWAA dataset retrodicts the trend since 2019, and find that the trends agree. Second, we label each of our tasks on 16 "messiness" factors—factors that we expect to (1) be representative of how real-world tasks may systematically differ from our tasks and (2) be relevant to AI agent performance. Third, we calculate AI agent horizon lengths from SWE-bench Verified tasks. We find a similar exponential trend, although with a shorter doubling time. However, we believe this shorter doubling time to be a result of SWE-bench Verified time annotations differentially underestimating the difficulty of easier SWE-bench tasks. Finally, we collect and baseline a small set of uncontaminated issues from internal METR repositories. We find that our contracted human baseliners take much longer to complete these tasks than repository maintainers. We also find that AI agent performance is worse than would be predicted by maintainer time-to-complete but is consistent with contractor time-to-complete, given the AI agent success curves from HCAST + SWAA + RE-Bench tasks shown in Figure 5.
(For transparency, I am an author on the paper)
Sorry, looking again at the messiness factors, fewer are about brute force than I remembered; will edit.
But they do indeed all strike me as quite narrow external validity checks, given that the validity in question is whether the benchmark predicts when AI will gain world-transforming capabilities.
“messiness” factors—factors that we expect to (1) be representative of how real-world tasks may systematically differ from our tasks
I felt very confused reading this claim in the paper. Why do you think they are representative? It seems to me that real-world problems obviously differ systematically from these factors, too—e.g., solving them often requires having novel thoughts.
I think the benchmark is intended to measure performance on an even narrower proxy than this—roughly, the sort of tasks involved in ordinary, everyday software engineering.
Note that METR has also published a subsequent attempt to broaden the class of activities, and has some suggestive results that the qualitative exponentially increasing time horizon phenomenon is somewhat robust, but the growth rate varies between domains.
Right. Task Difficulty is a hard thing to get a handle on.
* You don't know how hard a problem is until you've solved it, so any difficulty metric has to be computed retrospectively, from how the problem was actually solved.
* Intuitively, we might want some metric that depends on the "stack trace" of the solution, i.e. what sorts of mental moves had to happen for the person to solve the problem. Incidentally, this means that all such metrics are sometimes over-estimates (maybe there's an easy way to solve the problem that the person you watched solving it missed). Human wall-clock time is in some sense the simplest question one could ask about the stack trace of a human solving the problem.
* The difficulty of a problem is often agent-relative. There are plenty of contest math problems that are rendered easy if you have the correct tool in your toolkit and really hard if you don't. Crystallized intelligence often passes for fluid intelligence, and the two blend into each other.
Some other potential metrics (loose brainstorm):
* Hint length - In some of Eliezer's earlier posts, intelligence got measured as optimization pressure in bits (intuitively: how many times do you have to double the size of the target for it to fill the entire dart-board. Of course you need a measure space for your dart board in order for this to work.) Loosely inspired by this, we might pick some model that's capable of following chains of logic but not very smart (whatever it knows how to do is a stand-in for what's obvious). Then ask how long a hint string you have to hand it before it can solve the problem. (Of course finding the shortest hint string is hard; you'd need to poke around heuristically to find a relatively short one.)
* Elo-score-type metrics - You treat each of your puzzles and agents (which can be either AIs or humans) as players of a two-player game. If the agent solves a puzzle the agent wins; otherwise the puzzle wins. Then we calculate Elo scores. The nice thing about this is that we effectively get to punt the problem of defining a difficulty metric, by saying that each agent has a latent intelligence variable and each problem has a latent difficulty variable, and we can figure out what both of them are together by looking at who was able to solve which problem. (A rough sketch of this fit is included below, after the caveats.)
Caveats: Of course, like human wall-clock time, this assumes intelligence and difficulty are one-dimensional, though if you can say what you'd like to assume instead, you can make statistical models more sophisticated than the one Elo scoring implicitly assumes. Also, this still doesn't help for measuring the difficulty of problems way outside the eval set (the "Paleolithic canoeing records to forecast when humans will reach the moon" obstacle): if everybody loses against puzzle X, that doesn't put much of an upper bound on how hard it is.
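Here is a rough, self-contained sketch of what that Elo-style fit could look like: a Bradley-Terry/Rasch-type model with one latent ability per agent and one latent difficulty per puzzle, fit by gradient ascent on the log-likelihood. All data and names below are invented for illustration, not taken from any real eval.

```python
# Toy Elo-style fit: each (agent, puzzle) attempt is a two-player game.
import numpy as np

# Made-up attempts: (agent_index, puzzle_index, solved?)
attempts = [
    (0, 0, 1), (0, 1, 1), (0, 2, 0),
    (1, 0, 1), (1, 1, 0), (1, 2, 0),
    (2, 0, 0), (2, 1, 1), (2, 2, 1),
]
n_agents, n_puzzles = 3, 3

ability = np.zeros(n_agents)      # latent "intelligence" per agent
difficulty = np.zeros(n_puzzles)  # latent difficulty per puzzle

def p_solve(a, d):
    """Probability the agent beats the puzzle, Elo/logistic style."""
    return 1.0 / (1.0 + np.exp(-(a - d)))

lr = 0.1
for _ in range(1000):
    for i, j, solved in attempts:
        p = p_solve(ability[i], difficulty[j])
        grad = solved - p          # gradient of the log-likelihood
        ability[i] += lr * grad
        difficulty[j] -= lr * grad
    # Pin the scale: the latent scores are only identified up to a shared shift,
    # exactly as with ordinary Elo ratings.
    shift = ability.mean()
    ability -= shift
    difficulty -= shift

print("abilities:  ", np.round(ability, 2))
print("difficulties:", np.round(difficulty, 2))
```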
Curated. This helped me a lot in thinking about what the paper really means. It's also a paper that's affecting a lot of people's thinking about AI, so it's worth highlighting disagreements.
I do agree that METR's horizon work is definitely overrelied on (there are only a few datapoints, and there are reasons to believe the benchmark is biased towards tasks that require little context or memory, among other issues). But I do think exponential growth in AI capabilities is very plausible a priori, and I wrote up a post on why this should generally be expected (though a caveat is that doubling times can differ dramatically, so we do need to make sure we aren't overextrapolating from a narrow selection of tasks). So I think METR's observation of exponential growth is likely to generalize to messy tasks; it's just that the time horizons and doubling factors will be different.
Having worked at METR for some months last year, I just want to chime in to add that they have indeed seen the skulls. This post does a great service to the broader public by going into many important points at length. But these issues and others are also very much top of mind at METR, which is one of the reasons why they caveat results extensively in their publications.
If you haven't been in touch or visited them already, I highly recommend it. They're pretty awesome and love to discuss this sort of stuff!
Paleolithic canoeing records to forecast when humans will reach the moon
Not disagreeing with your main point, but Robin Hanson has tried this.
I think your claim that rudimentary abilities arrive before transformational ones cannot be applied to AI the way it applies to human intelligence. While humans might have taken millennia to go from cave paintings to our current ability to produce artistic images, it is clear that AI became transformational very quickly in that particular field. You see the same transformational abilities in text writing, music, and video too, and software development is getting there.
Some of the more artistic of these abilities don't have a clear benchmark, but even with fuzzier criteria for success, AIs already outcompete most humans.
Some of the building blocks of AI are fundamentally different from ours, which is why they have difficulty with some tasks. Their metacognitive, learning, and memory abilities have improved significantly over the last couple of years, but they are still a pale shadow of what we are capable of. And for some of the transformational tasks, these abilities are essential.
Horizon length is an imperfect measure of the absence of some of these abilities.
A version of the argument I've heard:
AI can do longer and longer coding tasks. That makes it easier for AI builders to run different experiments that might let them build AGI. So either it's the case that both (a) the long-horizon coding AI won't help with experiment selection at all and (b) the experiments will saturate the available compute resources before they're helpful; or, long-horizon coding AI will make strong AI come quickly.
I think it's not too hard to believe (a) & (b), fwiw. Randomly run experiments might not lead to anyone figuring out the idea they need to build strong AI.
AI can do longer and longer coding tasks.
But this is not a good category; it contains both [the type of long coding task that involves having to creatively figure out several points] and also other long coding tasks. So the category does not support the inference. It makes it easier for AI builders to run... some funny subset of "long coding tasks".
Yup. The missing assumption is that setting up and running experiments is inside the funny subset, perhaps because it's fairly routine.
I agree it seems plausible that AI could accelerate progress by freeing up researcher time, but I think the case for horizon length predicting AI timelines is even weaker in such worlds. Overall I expect the benchmark would still mostly have the same problems—e.g., that the difficulty of tasks (even simple ones) is poorly described as a function of time cost; that benchmarkable proxies differ critically from their non-benchmarkable targets; that labs probably often use these benchmarks as explicit training targets, etc.—but also the additional (imo major) source of uncertainty about how much freeing up researcher time would accelerate progress.
The way METR time horizons tie into AI 2027 is very narrow: as a trend not even necessarily on coding/software engineering skills in general, but on machine learning engineering specifically. I think that is hard to attack except by claiming that the trend will taper off. AI 2027 does not require unrealistic generalisation.
The reason I think time horizons are much more solid evidence of AI progress than earlier benchmarks is that the calculated time horizons explain the trends in AI-assisted coding over the last few years very well. For example, it's not by chance that "vibe coding" became a thing when it became a thing.
I have computed time horizon trends for more general software engineering tasks (i.e. with a bigger context) and my preliminary results point towards a logistic trend, i.e. the exponential is already tapering off. However, I am still pretty uncertain about that.
I have computed time horizon trends for more general software engineering tasks (i.e. with a bigger context) and my preliminary results point towards a logistic trend, i.e. the exponential is already tapering off. However, I am still pretty uncertain about that.
I predict this is basically due to noise, or at best is a very short-lived trend, similarly to the purported faster trend of RL scaling allowing a doubling time of 4 months on certain tasks, which is basically driven by good scaffolding (which is what RL-on-CoTs was mostly shown to be) and not the creation of new capabilities.
Very possible.
I plan to watch this a bit longer and also analyse how the trend changes with repo size.
I strongly suspect that the maximal possible time horizon is proportional to a power of the compute invested, multiplied by a factor for architectural tweaks: the compute spent has scaled exponentially, yielding the exponential trend. If you don't believe that anyone will ever train a model on, say, 1e29 or more FLOP, then this scaling relation, combined with that maximal compute estimate, might be enough to exclude the possibility of obtaining CoT-based superhuman AIs, which the Slowdown Ending of the AI 2027 forecast relies upon in order to solve alignment.
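To spell out the suspected relationship (my notation; the constants $k$, $\alpha$, and the compute growth rate $g$ are unknowns to be fit, not values anyone has estimated here): if the maximal time horizon scales as

$$H_{\max}(C) \approx k \cdot C^{\alpha},$$

and training compute grows exponentially in time, $C(t) = C_0 e^{g t}$, then

$$H_{\max}(t) \approx k\, C_0^{\alpha}\, e^{\alpha g t},$$

i.e. an exponential time-horizon trend, while a hard ceiling $C \le C_{\max}$ (say $10^{29}$ FLOP) would cap the achievable horizon at roughly $k\, C_{\max}^{\alpha}$.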
Current AI models are strange. They can speak—often coherently, sometimes even eloquently—which is wild. They can predict the structure of proteins, beat the best humans at many games, recall more facts in most domains than human experts; yet they also struggle to perform simple tasks, like using computer cursors, maintaining basic logical consistency, or explaining what they know without wholesale fabrication.
Perhaps someday we will discover a deep science of intelligence, and this will teach us how to properly describe such strangeness. But for now we have nothing of the sort, so we are left merely gesturing in vague, heuristic terms; lately people have started referring to this odd mixture of impressiveness and idiocy as “spikiness,” for example, though there isn’t much agreement about the nature of the spikes.
Of course it would be nice to measure AI progress anyway, at least in some sense sufficient to help us predict when it might become capable of murdering everyone. But how can we, given only this crude, informal understanding? When AI minds seem so different in kind from animal minds—the only sort we’ve had a chance to interact with, until now—that even our folk concepts barely suffice?
Predicting the future is tricky in the average case, and this case seems far more cursed than average. Given its importance, I feel grateful that some have tried hard to measure and predict AI progress anyway, despite the profundity of our ignorance and the bleakness of the task. But I do think our best forecasts so far have had much more success at becoming widely discussed than at reducing this ignorance, and I worry that this has caused the discourse about AI timelines to become even more confused, muddled by widely shared yet largely unwarranted confidence.
Take “horizon length,” for example, a benchmark introduced by METR earlier this year as a sort of “Moore’s law for AI agents.” This benchmark received substantial attention as the main input to the AI 2027 timelines forecast, which has been read—or watched, or heard—by millions of people, including the Vice President of the United States.
The basic idea of the benchmark is to rank the difficulty of various tasks according to the amount of time they take humans, and then to rank AI models according to the “difficulty” (in this sense) of the tasks they can complete. So if a given model has a “50% time horizon of 4 minutes,” for example, that means it succeeded half the time at accomplishing some set of tasks that typically take humans 4 minutes.
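To make the definition concrete, here is a toy sketch of how such a 50% horizon could be read off: fit a curve of success probability against log human time for a given model, and find where it crosses 50%. Every number below is invented, and the fit is a plain least-squares logistic, so this is only a simplified stand-in for METR's actual estimation procedure.

```python
# Simplified sketch of computing a "50% time horizon" for one model (toy data).
import numpy as np
from scipy.optimize import curve_fit

# Each task: how long it takes a human (minutes), and whether the model succeeded.
human_minutes = np.array([0.5, 1, 2, 4, 8, 15, 30, 60, 120, 240])
model_success = np.array([1,   1, 1, 1, 0,  1,  0,  0,   0,   0])

def success_curve(log_t, log_h50, slope):
    """P(success) as a function of log human time; log_h50 is the 50% crossing."""
    return 1.0 / (1.0 + np.exp(slope * (log_t - log_h50)))

params, _ = curve_fit(success_curve, np.log(human_minutes), model_success,
                      p0=[np.log(10), 1.0])
print(f"50% time horizon ≈ {np.exp(params[0]):.1f} human-minutes")
```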
As I understand it, METR’s hope is that this measure can serve as something like an “omnibenchmark”—a way to measure the performance of roughly any sort of model, across roughly any sort of task, in common units of “how long they take humans to do.” And indeed performance on this benchmark is steadily improving over time, as one might expect if it reflected predictable growth in AI capabilities:
So while GPT-2 could only complete tasks that take humans mere seconds, current models can complete tasks that take humans over an hour. METR's proposal is that we extrapolate from this data to predict when AI will gain the kind of capabilities we would strongly prefer to have advance warning about—like substantially automating AI R&D (which METR suggests may require a horizon length of tens of hours), or catastrophically harming society (of one month).
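For concreteness, the extrapolation step amounts to fitting a straight line to log horizon versus model release date and projecting it forward to a threshold of interest. A toy version follows (every number is invented rather than METR's data, and the 40-hour threshold is just an example of a "tens of hours" target):

```python
# Toy illustration of the extrapolation: exponential trend = straight line in log-space.
import numpy as np

years   = np.array([2019.5, 2021.0, 2022.5, 2023.5, 2024.5, 2025.0])  # release dates (toy)
horizon = np.array([0.05,    0.2,    1.0,    4.0,    15.0,   60.0])   # minutes (toy)

slope, intercept = np.polyfit(years, np.log2(horizon), 1)  # log2(horizon) vs. year
doubling_time_months = 12.0 / slope

target_minutes = 40 * 60  # e.g. a 40-hour horizon
year_reached = (np.log2(target_minutes) - intercept) / slope
print(f"doubling time ≈ {doubling_time_months:.1f} months; "
      f"{target_minutes / 60:.0f}h horizon reached ≈ {year_reached:.1f}")
```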
Personally, I feel quite skeptical that this extrapolation will hold.
Given that humans are our only existing example of decent agents, I think one obvious sanity check for proposed measures of AI agency is whether they are helpful for characterizing variation in human agency. Is horizon length? Is there some meaningful sense in which, say, the unusual scientific or economic productivity of Isaac Newton or James Watt can be described in terms of the “time horizon” of their minds? If there is, I at least have failed to imagine it.
One basic problem with this measure, from my perspective, is that the difficulty of tasks is not in general well-described as a function of the time needed to complete them. Consider that it took Claude Shannon ~5 years to discover information theory, and Roald Amundsen ~3 years to traverse the Northwest Passage—is there some coherent sense in which Amundsen’s achievement was “⅗ as hard”?
Certainly the difficulty of many tasks varies with their time cost all else equal, but I think all else is rarely equal since tasks can be difficult in a wide variety of ways. It would be thermodynamically difficult to personally shovel a canal across Mexico; computationally difficult to factor the first trillion digits of π; interpersonally difficult to convince Vladimir Putin to end the war in Ukraine; scientifically difficult to discover the laws of electromagnetism...
... and personally, I feel skeptical that all such difficulties can be sensibly described in common, commensurate units of time cost. And so I doubt that “horizon length” is well-suited for assessing and comparing AI performance across a wide range of domains.
Of course the benchmark might still be useful, even if it fails to suffice as a general, Moore’s law-style measure of AI agency—perhaps it can help us track progress toward some particular capabilities, even if not progress toward all of them.
As I understand it, METR’s hope—and similarly, AI 2027's hope in relying on the benchmark for their forecast—is that horizon length might be particularly predictive of progress at AI R&D, and hence of when AI might gain the ability to recursively self-improve. As such, the benchmark is designed to directly measure AI ability only in the narrower domain of “coding” or “computer use” tasks.
But these too strike me as strange concepts. Computers being Turing-complete, the space of possible “computer use” tasks is of course large, encompassing (among much else) all cognition performable by brains. So the set of possible computer use skills, at least, does not seem much narrower than the set “all possible skills.”
In practice I think the benchmark is intended to measure performance on an even narrower proxy than this—roughly, the sort of tasks involved in ordinary, everyday software engineering. But "software engineering" also involves a large and heterogeneous set of skills, ranging from e.g. “making a webpage” to “inventing transformers.” And in my view, it seems quite unclear that the acquisition of simple skills like the former reflects knowable amounts of progress toward transformative skills.
Unfortunately, I think the case for "horizon length" predicting transformative AI is weak even if one does assume everyday software engineering skills are the sort of thing needed to create it, since the tasks the benchmark measures are unrepresentative even of those.
The "horizon length" benchmark measures performance on three sets of tasks:
I think these tasks probably differ in many ways from tasks like "conquering humanity" or "discovering how to become as powerful as physics permits." They are mostly very simple,[1] for example, and none require models to think novel thoughts.
But one especially glaring difference, by my lights, is that the benchmark consists exclusively of precisely-specified, automatically-checkable tasks. This is typical of AI benchmarks, since it is easy to measure performance on such tasks, and hence easy to create benchmarks based on them; it just comes at the price, I suspect, of these proxies differing wildly from the capabilities they are meant to predict.
At the risk of belaboring the obvious, note that many problems are unlike this, in that the very reason we consider them problems is because we do not already know how to solve them. So the kind of problem for which it is possible to design precisely-specified, automatically-checkable tests—for brevity, let's call these benchmarkable problems—have at minimum the unusual property that their precise solution criteria are already known, and often also the property that their progress criteria are known (i.e., that it is possible to measure relative progress toward finding the solution).
It seems to me that all else equal, problems that are benchmarkable tend to be easier than problems that are not, since solutions whose precise criteria are already known tend to be inferentially closer to existing knowledge, and so easier to discover. There are certainly many exceptions to this, including some famous open problems in science and mathematics.[2] But in general, I think the larger the required inferential leap, the harder it tends to be to learn the precise progress or solution criteria in advance.
I suspect that by focusing on such tasks, AI benchmarks suffer not just from a bias toward measuring trivial skills, but also from a bias toward measuring the particular sorts of skills that current AI systems most often have. That is, I think current AI models tend to perform well on tasks roughly insofar as they are benchmarkable, since if the solution criteria are known—and especially if the progress criteria are also known—then it is often possible to train on those criteria until decent performance is observed.
(I presume this is why AI companies consider it worth paying for better benchmarks, and inventing their own in-house—they are directly useful as a training target).
So I expect there is a fairly general benchmark bias, affecting not just "horizon length" but all benchmarks, since the tasks on which it is easy to measure AI performance tend to be those which AI can be trained to perform unusually well.[3] If so, benchmark scores may systematically overestimate AI capabilities.
The value of "horizon length" for predicting transformative AI depends on how much progress on the proxy tasks it measures correlates with progress toward abilities like autonomously generating large amounts of wealth or power, inventing better ML architectures, or destroying civilization. Insofar as it does, we can extrapolate from this progress to estimate the time we have left on ancient Earth.
I do not know what skills current AI lacks, that transformative AI would require. But personally, I am skeptical that we learn much from progress on tasks as simple as those measured by this benchmark. To me, this seems a bit like trying to use Paleolithic canoeing records to forecast when humans will reach the moon, or skill at grocery shopping as a proxy for skill at discovering novel mathematics.[4]
Of course all else equal I expect rudimentary abilities to arrive earlier than transformational ones, and so I do think benchmarks like this can provide useful evidence about what AI capabilities already exist—if e.g. current models routinely fail at tasks because they can't figure out how to use computer cursors, it seems reasonable to me to guess that they probably also can't yet figure out how to recursively self-improve.
But it seems much less clear to me how this evidence should constrain our expectations about when future abilities will arrive. Sure, AI models seem likely to figure out computer cursors before figuring out how to foom, like how humans figured out how to build canoes before spaceships—but how much does the arrival date of the former teach us about when the latter will arrive?
One obvious reason it might teach us a lot, actually, is if these simple skills lay on some shared, coherent skill continuum with transformative skills, such that progress on the former was meaningfully the same "type" of thing as progress toward the latter. In other words, if there were in fact some joint-carvey cluster in the territory like "horizon length," then even small improvements might teach us a lot, since they would reflect some knowable amount of progress toward transformative AI.
I do not see much reason to privilege the hypothesis that "horizon length" is such a cluster, and so I doubt it can work as a general measure of AI agency. But this does not rule out that it might nonetheless have predictive value—measures do not need to reflect core underlying features of the territory to be useful, but just to vary in some predictably correlated fashion with the object of inquiry. Sometimes even strange, seemingly-distant proxies (like e.g. Raven's Matrices) turn out to correlate enough to be useful.
Perhaps "horizon length" will prove similarly useful, despite its dubious coherence as a concept and the triviality of its tests. For all I know, the fact that the benchmark measures something related at all to the time cost of tasks, or even just something related at all to what AI systems can do, is enough for it to have predictive value.
But personally, I think the case for this value is weak. And so I feel very nervous about the prospect of using such benchmarks to "form the foundation for responsible AI governance and risk mitigation," as METR suggests, or as the basis for detailed, year-by-year forecasts of AI progress like AI 2027.
AI failures are often similarly simple. E.g., one common reason current models fail is because they can't figure out how to use computer cursors well enough to begin the task.
Perhaps there is some meaningful "agency" skill continuum in principle, on which "ability to use a mouse" and "ability to conquer humanity" both lie, such that evidence of the former milestone being reached should notably constrain our estimate of the latter. But if there is, I claim it is at least not yet known, and so cannot yet help reduce our uncertainty much.
I suspect it's often this unusual operationalizability itself, rather than importance, that contributes most to these problems' fame, since they're more likely to feature in famous lists of problems (like e.g. Hilbert's problems) or have famous prizes (like e.g. the Millennium Prize Problems).
Relatedly, all else equal I expect to feel less impressed by AI solving problems whose solution and progress criteria were known, than those whose solution criteria only was known, and most impressed if neither were (as e.g. with many open problems in physics, or the alignment problem).
(I would guess this bias is further exacerbated by AI companies sometimes deliberately training on benchmarks, to ensure their models score well on the only legible, common knowledge metrics we have for assessing their products).
I have had the good fortune of getting to know several mathematicians well, and hence of learning how uncorrelated such skills can be.