Given that humans are our only existing example of decent agents, I think one obvious sanity check for proposed measures of AI agency is whether they are helpful for characterizing variation in human agency.
This seems like an obvious and apt question to ask, but I don't think it's an obvious sanity check, in the sense that "if a measure doesn't pass this check, that's a strong sign that it's not capturing what we care about."
AI minds are different from human minds! I think it's not at all surprising if they have different limiting constraints and therefore very different "capability profiles."
Like, for humans, working memory is an important constraint on many of the complicated intellectual operations that we do. And working memory correlates with overall cognitive ability.
When you try to measure human intelligence and figure out what it is made of, working memory is one of the major factors that falls out of the factor analysis.
But imagine aliens that have vastly larger working memories (or "context windows") than humans. These aliens might still vary in working memory capacity, but it might be close to irrelevant for predicting their overall cognitive performance, because the bottlenecks on their cognitive ability are something else entirely.
I think that's exactly the situation we're in with the AIs. Their minds are of a quite different shape than ours, and so good proxy metrics for human capability won't generalize to AIs, or vice versa.
Overall, great post.
My interpretation of the METR results is as an empirical observation of a trend that seems robust, in the same way scaling laws are. You could write the same post about why there's no robust first-principles reason that "cross-entropy loss decreases with scale in a way that correlates in an important, predictably useful way with an absurdly wide range of downstream tasks".
The METR paper itself almost entirely justifies the empirical prediction aspect, rather than making a first-principles argument for the approach from a theoretical perspective. I think the robustness of this analysis is why the paper had the impact it did. Are there specifics of the statistical analysis they did for the stuff around:
Since our tasks do not perfectly represent the average segment of intellectual labor by researchers and software engineers, this raises the question of external validity (Section 6): whether the exponential trend holds on real-world tasks. We include results from three supplementary external validity experiments.
That you think are sufficient to meaningfully change how valid people should interpret the overall predicted trend?
That you think are sufficient to meaningfully change how valid people should interpret the overall predicted trend?
I'm not Adam, but my response is "No", based on the description Megan copied in thread and skimming some of the paper. It's good that the paper includes those experiments, but they don't really speak to the concerns Adam is discussing. Those concerns, as I see it (I could be misunderstanding):
Do the experiments in Sec 6 deal with this?
We rated HCAST and RE-Bench tasks on 16 properties that we expected to be 1) representative of how real world tasks might be systematically harder than our tasks and 2) relevant to AI agent performance. Some example factors include whether the task involved a novel situation, was constrained by a finite resource, involved real-time coordination, or was sourced from a real-world context. We labeled RE-bench and HCAST tasks on the presence or absence of these 16 messiness factors, then summed these to obtain a "messiness score" ranging from 0 to 16. Factor definitions can be found in Appendix D.4. The mean messiness score amongst HCAST and RE-Bench tasks is 3.2/16. None of these tasks have a messiness score above 8/16. For comparison, a task like 'write a good research paper' would score between 9/16 and 15/16, depending on the specifics of the task.
On HCAST tasks, AI agents do perform worse on messier tasks than would be predicted from the task's length alone (b=-0.081, R2 = 0.251) ... However, trends in AI agent performance over time are similar for lower and higher messiness subsets of our tasks.
This seems like very weak evidence in favor of the hypothesis that Benchmark Bias is a big deal. But they just don't have very messy tasks.
c. SWE-Bench Verified: doesn't speak to 1 or 2.
d. Internal PR experiments: Maybe speaks a little to 1 and 2, because these are more real-world, closer-to-the-thing-we-care-about tasks, but not much, as they're still clearly verifiable and still software engineering.
I do think Thomas and Vincent's follow-up work here on time horizons for other domains is useful evidence pointing a little against the conceptual coherence objection. But only a little.
I guess my understanding is more that the conceptual coherence objection isn’t an objection to the predictive accuracy of the trend, which is why I had brought up the scaling law / pretraining loss / downstream task analogy.
As far as I understand, the messiness analysis bears on the Benchmark Bias objection insofar as it concerns predicting performance at any given point in time, but not the actual trend, given that the trend was similar for lower and higher messiness tasks.
Is your intuition that the trend itself is significantly wrong as well (like by more than their CI)? Or just the performance prediction at a given point in time? Or is the question ill-formed / undefined?
We care about the performance prediction at a given point in time for skills like "take over the world", "invent new science", and "do RSI" (and "automate AI R&D", which I think the benchmark does speak to). We would like to know when those skills will be developed.
In the frame of this benchmark, and Thomas and Vincent's follow up work, it seems like we're facing down at least three problems:
So my overall take is that I think the current work I'm aware of tells us
I think there is more empirical evidence of robust scaling laws than of robust horizon length trends, but broadly I agree—I think it's also quite unclear how scaling laws should constrain our expectations about timelines.
(Not sure I understand what you mean about the statistical analyses, but fwiw they focused only on very narrow checks for external validity—mostly just on whether solutions were possible to brute force).
fwiw they focused only on very narrow checks for external validity—mostly just on whether solutions were possible to brute force
This seems inaccurate to me. Here's the introduction to the external validity and robustness section of the paper:
To investigate the applicability of our results to other benchmarks, and to real task distributions, we performed four supplementary experiments. First, we check whether the 2023–2025 trend without the SWAA dataset retrodicts the trend since 2019, and find that the trends agree. Second, we label each of our tasks on 16 "messiness" factors—factors that we expect to (1) be representative of how real-world tasks may systematically differ from our tasks and (2) be relevant to AI agent performance. Third, we calculate AI agent horizon lengths from SWE-bench Verified tasks. We find a similar exponential trend, although with a shorter doubling time. However, we believe this shorter doubling time to be a result of SWE-bench Verified time annotations differentially underestimating the difficulty of easier SWE-bench tasks. Finally, we collect and baseline a small set of uncontaminated issues from internal METR repositories. We find that our contracted human baseliners take much longer to complete these tasks than repository maintainers. We also find that AI agent performance is worse than would be predicted by maintainer time-to-complete but is consistent with contractor time-to-complete, given the AI agent success curves from HCAST + SWAA + RE-Bench tasks shown in Figure 5.
(For transparency, I am an author on the paper)
Sorry, looking again at the messiness factors, fewer are about brute force than I remembered; will edit.
But they do indeed all strike me as quite narrow external validity checks, given that the validity in question is whether the benchmark predicts when AI will gain world-transforming capabilities.
“messiness” factors—factors that we expect to (1) be representative of how real-world tasks may systematically differ from our tasks
I felt very confused reading this claim in the paper. Why do you think they are representative? It seems to me that real-world problems obviously differ systematically from these factors, too—e.g., solving them often requires having novel thoughts.
I think the benchmark is intended to measure performance on an even narrower proxy than this—roughly, the sort of tasks involved in ordinary, everyday software engineering.
Note that METR has also published a subsequent attempt to broaden the class of activities, and has some suggestive results that the qualitative exponentially increasing time horizon phenomenon is somewhat robust, but the growth rate varies between domains.
Right. Task Difficulty is a hard thing to get a handle on.
* You don't know how hard a problem is until you've solved it, so any difficulty metric has to be computed retrospectively, from how the problem was actually solved.
* Intuitively, we might want some metric that depends on the "stack trace" of the solution, i.e. what sorts of mental moves had to happen for the person to solve the problem. Incidentally, this means that all such metrics are sometimes over-estimates (maybe there's an easy way to solve the problem that the person you watched solving it missed). Human wall-clock time is in some sense the simplest question one could ask about the stack trace of a human solving the problem.
* The difficulty of a problem is often agent-relative. There are plenty of contest math problems that are rendered easy if you have the correct tool in your toolkit and really hard if you don't. Crystallized intelligence often passes for fluid intelligence, and the two blend into each other.
Some other potential metrics (loose brainstorm):
* Hint length - In some of Eliezer's earlier posts, intelligence got measured as optimization pressure in bits (intuitively: how many times do you have to double the size of the target for it to fill the entire dart-board. Of course you need a measure space for your dart board in order for this to work.) Loosely inspired by this, we might pick some model that's capable of following chains of logic but not very smart (whatever it knows how to do is a stand-in for what's obvious). Then ask how long a hint string you have to hand it before it can solve the problem. (Of course finding the shortest hint string is hard; you'd need to poke around heuristically to find a relatively short one.)
* Elo-score-type metrics - You treat each of your puzzles and agents (which can be either AIs or humans) as players of a two-player game. If the agent solves a puzzle the agent wins; otherwise the puzzle wins. Then we calculate Elo scores. The nice thing about this is that we effectively get to punt the problem of defining a difficulty metric, by saying that each agent has a latent intelligence variable and each problem has a latent difficulty variable, and we can figure out what both of them are together by looking at who was able to solve which problem. (A rough sketch of this fit is included below, after the caveats.)
Caveats: Of course, like human wall-clock time, this assumes intelligence and difficulty are one-dimensional, though if you can say what you'd like to assume instead, you can make statistical models more sophisticated than the one Elo scoring implicitly assumes. Also, this still doesn't help for measuring the difficulty of problems way outside the eval set (the "Paleolithic canoeing records to forecast when humans will reach the moon" obstacle): if everybody loses against puzzle X, that doesn't put much of an upper bound on how hard it is.
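Here is a rough, self-contained sketch of what that Elo-style fit could look like: a Bradley-Terry/Rasch-type model with one latent ability per agent and one latent difficulty per puzzle, fit by gradient ascent on the log-likelihood. All data and names below are invented for illustration, not taken from any real eval.

```python
# Toy Elo-style fit: each (agent, puzzle) attempt is a two-player game.
import numpy as np

# Made-up attempts: (agent_index, puzzle_index, solved?)
attempts = [
    (0, 0, 1), (0, 1, 1), (0, 2, 0),
    (1, 0, 1), (1, 1, 0), (1, 2, 0),
    (2, 0, 0), (2, 1, 1), (2, 2, 1),
]
n_agents, n_puzzles = 3, 3

ability = np.zeros(n_agents)      # latent "intelligence" per agent
difficulty = np.zeros(n_puzzles)  # latent difficulty per puzzle

def p_solve(a, d):
    """Probability the agent beats the puzzle, Elo/logistic style."""
    return 1.0 / (1.0 + np.exp(-(a - d)))

lr = 0.1
for _ in range(1000):
    for i, j, solved in attempts:
        p = p_solve(ability[i], difficulty[j])
        grad = solved - p          # gradient of the log-likelihood
        ability[i] += lr * grad
        difficulty[j] -= lr * grad
    # Pin the scale: the latent scores are only identified up to a shared shift,
    # exactly as with ordinary Elo ratings.
    shift = ability.mean()
    ability -= shift
    difficulty -= shift

print("abilities:  ", np.round(ability, 2))
print("difficulties:", np.round(difficulty, 2))
```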
Curated. This helped me a lot in thinking about what the paper really means. It's also a paper that's affecting a lot of people's thinking about AI, so it's worth highlighting disagreements.
I do agree that METR's horizon work is definitely overrelied on (there are only a few datapoints, and there are reasons to believe the benchmark is biased towards tasks that require little context or memory, among other issues). But I do think exponential growth in AI capabilities is very plausible a priori, and I wrote up a post on why this should generally be expected (though a caveat is that doubling times can differ dramatically, so we do need to make sure we aren't overextrapolating from a narrow selection of tasks). So I think METR's observation of exponential growth is likely to generalize to messy tasks; it's just that the time horizons and doubling factors will be different.
Having worked at METR for some months last year, I just want to chime in to add that they have indeed seen the skulls. This post does a great service to the broader public by going into many important points at length. But these issues and others are also very much top of mind at METR, which is one of the reasons why they caveat results extensively in their publications.
If you haven't been in touch or visited them already, I highly recommend it. They're pretty awesome and love to discuss this sort of stuff!
Paleolithic canoeing records to forecast when humans will reach the moon
Not disagreeing with your main point, but Robin Hanson has tried this.
I think your claim that rudimentary abilities arrive before transformational ones cannot be applied to AI the way it applies to human intelligence. While humans might have taken millennia to go from cave paintings to our current ability to produce artistic images, it is clear that AI became transformational very quickly in that particular field. You see the same transformational abilities in text writing, music, and video too, and software development is getting there.
Some of the more artistic of these abilities don't have a clear benchmark, but even with fuzzier criteria for success, AIs already outcompete most humans.
Some of the building blocks of AI are fundamentally different from ours, which is why they have difficulty with some tasks. Their metacognitive, learning, and memory abilities have improved significantly over the last couple of years, but they are still a pale shadow of what we are capable of. And for some of the transformational tasks, these abilities are essential.
Horizon length is an imperfect measure of the absence of some of these abilities.
A version of the argument I've heard:
AI can do longer and longer coding tasks. That makes it easier for AI builders to run different experiments that might let them build AGI. So either it's the case that both (a) the long-horizon coding AI won't help with experiment selection at all and (b) the experiments will saturate the available compute resources before they're helpful; or, long-horizon coding AI will make strong AI come quickly.
I think it's not too hard to believe (a) & (b), fwiw. Randomly run experiments might not lead to anyone figuring out the idea they need to build strong AI.
AI can do longer and longer coding tasks.
But this is not a good category; it contains both [the type of long coding task that involves having to creatively figure out several points] and also other long coding tasks. So the category does not support the inference. It makes it easier for AI builders to run... some funny subset of "long coding tasks".
Yup. The missing assumption is that setting up and running experiments is inside the funny subset, perhaps because it's fairly routine.
I agree it seems plausible that AI could accelerate progress by freeing up researcher time, but I think the case for horizon length predicting AI timelines is even weaker in such worlds. Overall I expect the benchmark would still mostly have the same problems—e.g., that the difficulty of tasks (even simple ones) is poorly described as a function of time cost; that benchmarkable proxies differ critically from their non-benchmarkable targets; that labs probably often use these benchmarks as explicit training targets, etc.—but also the additional (imo major) source of uncertainty about how much freeing up researcher time would accelerate progress.
The way METR time horizons tie into AI 2027 is very narrow: as a trend not even necessarily on coding/software engineering skills in general, but on machine learning engineering specifically. I think that is hard to attack except by claiming that the trend will taper off. AI 2027 does not require unrealistic generalisation.
The reason I think time horizons are much more solid evidence of AI progress than earlier benchmarks is that the calculated time horizons explain the trends in AI-assisted coding over the last few years very well. For example, it's not by chance that "vibe coding" became a thing when it became a thing.
I have computed time horizon trends for more general software engineering tasks (i.e. with a bigger context) and my preliminary results point towards a logistic trend, i.e. the exponential is already tapering off. However, I am still pretty uncertain about that.
I have computed time horizon trends for more general software engineering tasks (i.e. with a bigger context) and my preliminary results point towards a logistic trend, i.e. the exponential is already tapering off. However, I am still pretty uncertain about that.
I predict this is basically due to noise, or at best is a very short-lived trend, similarly to the purported faster trend of RL scaling allowing a doubling time of 4 months on certain tasks, which is basically driven by good scaffolding (which is what RL-on-CoTs was mostly shown to be) and not the creation of new capabilities.
Very possible.
I plan to watch this a bit longer and also analyse how the trend changes with repo size.
I strongly suspect that the maximal possible time horizon is proportional to a power of the compute invested, multiplied by a factor for architectural tweaks: the compute spent has scaled exponentially, yielding the exponential trend. If you don't believe that anyone will ever train a model on, say, 1e29 or more FLOP, then this scaling relation, combined with that maximal compute estimate, might be enough to exclude the possibility of obtaining CoT-based superhuman AIs, which the Slowdown Ending of the AI 2027 forecast relies upon in order to solve alignment.
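To spell out the suspected relationship (my notation; the constants $k$, $\alpha$, and the compute growth rate $g$ are unknowns to be fit, not values anyone has estimated here): if the maximal time horizon scales as

$$H_{\max}(C) \approx k \cdot C^{\alpha},$$

and training compute grows exponentially in time, $C(t) = C_0 e^{g t}$, then

$$H_{\max}(t) \approx k\, C_0^{\alpha}\, e^{\alpha g t},$$

i.e. an exponential time-horizon trend, while a hard ceiling $C \le C_{\max}$ (say $10^{29}$ FLOP) would cap the achievable horizon at roughly $k\, C_{\max}^{\alpha}$.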
Current AI models are strange. They can speak—often coherently, sometimes even eloquently—which is wild. They can predict the structure of proteins, beat the best humans at many games, recall more facts in most domains than human experts; yet they also struggle to perform simple tasks, like using computer cursors, maintaining basic logical consistency, or explaining what they know without wholesale fabrication.
Perhaps someday we will discover a deep science of intelligence, and this will teach us how to properly describe such strangeness. But for now we have nothing of the sort, so we are left merely gesturing in vague, heuristic terms; lately people have started referring to this odd mixture of impressiveness and idiocy as “spikiness,” for example, though there isn’t much agreement about the nature of the spikes.
Of course it would be nice to measure AI progress anyway, at least in some sense sufficient to help us predict when it might become capable of murdering everyone. But how can we, given only this crude, informal understanding? When AI minds seem so different in kind from animal minds—the only sort we’ve had a chance to interact with, until now—that even our folk concepts barely suffice?
Predicting the future is tricky in the average case, and this case seems far more cursed than average. Given its importance, I feel grateful that some have tried hard to measure and predict AI progress anyway, despite the profundity of our ignorance and the bleakness of the task. But I do think our best forecasts so far have had much more success at becoming widely discussed than at reducing this ignorance, and I worry that this has caused the discourse about AI timelines to become even more confused, muddled by widely shared yet largely unwarranted confidence.
Take “horizon length,” for example, a benchmark introduced by METR earlier this year as a sort of “Moore’s law for AI agents.” This benchmark received substantial attention as the main input to the AI 2027 timelines forecast, which has been read—or watched, or heard—by millions of people, including the Vice President of the United States.
The basic idea of the benchmark is to rank the difficulty of various tasks according to the amount of time they take humans, and then to rank AI models according to the “difficulty” (in this sense) of the tasks they can complete. So if a given model has a “50% time horizon of 4 minutes,” for example, that means it succeeded half the time at accomplishing some set of tasks that typically take humans 4 minutes.
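To make the definition concrete, here is a toy sketch of how such a 50% horizon could be read off: fit a curve of success probability against log human time for a given model, and find where it crosses 50%. Every number below is invented, and the fit is a plain least-squares logistic, so this is only a simplified stand-in for METR's actual estimation procedure.

```python
# Simplified sketch of computing a "50% time horizon" for one model (toy data).
import numpy as np
from scipy.optimize import curve_fit

# Each task: how long it takes a human (minutes), and whether the model succeeded.
human_minutes = np.array([0.5, 1, 2, 4, 8, 15, 30, 60, 120, 240])
model_success = np.array([1,   1, 1, 1, 0,  1,  0,  0,   0,   0])

def success_curve(log_t, log_h50, slope):
    """P(success) as a function of log human time; log_h50 is the 50% crossing."""
    return 1.0 / (1.0 + np.exp(slope * (log_t - log_h50)))

params, _ = curve_fit(success_curve, np.log(human_minutes), model_success,
                      p0=[np.log(10), 1.0])
print(f"50% time horizon ≈ {np.exp(params[0]):.1f} human-minutes")
```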
As I understand it, METR’s hope is that this measure can serve as something like an “omnibenchmark”—a way to measure the performance of roughly any sort of model, across roughly any sort of task, in common units of “how long they take humans to do.” And indeed performance on this benchmark is steadily improving over time, as one might expect if it reflected predictable growth in AI capabilities:
So while GPT-2 could only complete tasks that take humans mere seconds, current models can complete tasks that take humans over an hour. METR's proposal is that we extrapolate from this data to predict when AI will gain the kind of capabilities we would strongly prefer to have advance warning about—like substantially automating AI R&D (which METR suggests may require a horizon length of tens of hours), or catastrophically harming society (of one month).
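For concreteness, the extrapolation step amounts to fitting a straight line to log horizon versus model release date and projecting it forward to a threshold of interest. A toy version follows (every number is invented rather than METR's data, and the 40-hour threshold is just an example of a "tens of hours" target):

```python
# Toy illustration of the extrapolation: exponential trend = straight line in log-space.
import numpy as np

years   = np.array([2019.5, 2021.0, 2022.5, 2023.5, 2024.5, 2025.0])  # release dates (toy)
horizon = np.array([0.05,    0.2,    1.0,    4.0,    15.0,   60.0])   # minutes (toy)

slope, intercept = np.polyfit(years, np.log2(horizon), 1)  # log2(horizon) vs. year
doubling_time_months = 12.0 / slope

target_minutes = 40 * 60  # e.g. a 40-hour horizon
year_reached = (np.log2(target_minutes) - intercept) / slope
print(f"doubling time ≈ {doubling_time_months:.1f} months; "
      f"{target_minutes / 60:.0f}h horizon reached ≈ {year_reached:.1f}")
```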
Personally, I feel quite skeptical that this extrapolation will hold.
Given that humans are our only existing example of decent agents, I think one obvious sanity check for proposed measures of AI agency is whether they are helpful for characterizing variation in human agency. Is horizon length? Is there some meaningful sense in which, say, the unusual scientific or economic productivity of Isaac Newton or James Watt can be described in terms of the “time horizon” of their minds? If there is, I at least have failed to imagine it.
One basic problem with this measure, from my perspective, is that the difficulty of tasks is not in general well-described as a function of the time needed to complete them. Consider that it took Claude Shannon ~5 years to discover information theory, and Roald Amundsen ~3 years to traverse the Northwest Passage—is there some coherent sense in which Amundsen’s achievement was “⅗ as hard”?
Certainly the difficulty of many tasks varies with their time cost all else equal, but I think all else is rarely equal since tasks can be difficult in a wide variety of ways. It would be thermodynamically difficult to personally shovel a canal across Mexico; computationally difficult to factor the first trillion digits of π; interpersonally difficult to convince Vladimir Putin to end the war in Ukraine; scientifically difficult to discover the laws of electromagnetism...
... and personally, I feel skeptical that all such difficulties can be sensibly described in common, commensurate units of time cost. And so I doubt that “horizon length” is well-suited for assessing and comparing AI performance across a wide range of domains.
Of course the benchmark might still be useful, even if it fails to suffice as a general, Moore’s law-style measure of AI agency—perhaps it can help us track progress toward some particular capabilities, even if not progress toward all of them.
As I understand it, METR’s hope—and similarly, AI 2027's hope in relying on the benchmark for their forecast—is that horizon length might be particularly predictive of progress at AI R&D, and hence of when AI might gain the ability to recursively self-improve. As such, the benchmark is designed to directly measure AI ability only in the narrower domain of “coding” or “computer use” tasks.
But these too strike me as strange concepts. Computers being Turing-complete, the space of possible “computer use” tasks is of course large, encompassing (among much else) all cognition performable by brains. So the set of possible computer use skills, at least, does not seem much narrower than the set “all possible skills.”
In practice I think the benchmark is intended to measure performance on an even narrower proxy than this—roughly, the sort of tasks involved in ordinary, everyday software engineering. But "software engineering" also involves a large and heterogeneous set of skills, ranging from e.g. “making a webpage” to “inventing transformers.” And in my view, it seems quite unclear that the acquisition of simple skills like the former reflects knowable amounts of progress toward transformative skills.
Unfortunately, I think the case for "horizon length" predicting transformative AI is weak even if one does assume everyday software engineering skills are the sort of thing needed to create it, since the tasks the benchmark measures are unrepresentative even of those.
The "horizon length" benchmark measures performance on three sets of tasks:
I think these tasks probably differ in many ways from tasks like "conquering humanity" or "discovering how to become as powerful as physics permits." They are mostly very simple,[1] for example, and none require models to think novel thoughts.
But one especially glaring difference, by my lights, is that the benchmark consists exclusively of precisely-specified, automatically-checkable tasks. This is typical of AI benchmarks, since it is easy to measure performance on such tasks, and hence easy to create benchmarks based on them; it just comes at the price, I suspect, of these proxies differing wildly from the capabilities they are meant to predict.
At the risk of belaboring the obvious, note that many problems are unlike this, in that the very reason we consider them problems is because we do not already know how to solve them. So the kind of problem for which it is possible to design precisely-specified, automatically-checkable tests—for brevity, let's call these benchmarkable problems—have at minimum the unusual property that their precise solution criteria are already known, and often also the property that their progress criteria are known (i.e., that it is possible to measure relative progress toward finding the solution).
It seems to me that all else equal, problems that are benchmarkable tend to be easier than problems that are not, since solutions whose precise criteria are already known tend to be inferentially closer to existing knowledge, and so easier to discover. There are certainly many exceptions to this, including some famous open problems in science and mathematics.[2] But in general, I think the larger the required inferential leap, the harder it tends to be to learn the precise progress or solution criteria in advance.
I suspect that by focusing on such tasks, AI benchmarks suffer not just from a bias toward measuring trivial skills, but also from a bias toward measuring the particular sorts of skills that current AI systems most often have. That is, I think current AI models tend to perform well on tasks roughly insofar as they are benchmarkable, since if the solution criteria are known—and especially if the progress criteria are also known—then it is often possible to train on those criteria until decent performance is observed.
(I presume this is why AI companies consider it worth paying for better benchmarks, and inventing their own in-house—they are directly useful as a training target).
So I expect there is a fairly general benchmark bias, affecting not just "horizon length" but all benchmarks, since the tasks on which it is easy to measure AI performance tend to be those which AI can be trained to perform unusually well.[3] If so, benchmark scores may systematically overestimate AI capabilities.
The value of "horizon length" for predicting transformative AI depends on how much progress on the proxy tasks it measures correlates with progress toward abilities like autonomously generating large amounts of wealth or power, inventing better ML architectures, or destroying civilization. Insofar as it does, we can extrapolate from this progress to estimate the time we have left on ancient Earth.
I do not know what skills current AI lacks, that transformative AI would require. But personally, I am skeptical that we learn much from progress on tasks as simple as those measured by this benchmark. To me, this seems a bit like trying to use Paleolithic canoeing records to forecast when humans will reach the moon, or skill at grocery shopping as a proxy for skill at discovering novel mathematics.[4]
Of course all else equal I expect rudimentary abilities to arrive earlier than transformational ones, and so I do think benchmarks like this can provide useful evidence about what AI capabilities already exist—if e.g. current models routinely fail at tasks because they can't figure out how to use computer cursors, it seems reasonable to me to guess that they probably also can't yet figure out how to recursively self-improve.
But it seems much less clear to me how this evidence should constrain our expectations about when future abilities will arrive. Sure, AI models seem likely to figure out computer cursors before figuring out how to foom, like how humans figured out how to build canoes before spaceships—but how much does the arrival date of the former teach us about when the latter will arrive?
One obvious reason it might teach us a lot, actually, is if these simple skills lay on some shared, coherent skill continuum with transformative skills, such that progress on the former was meaningfully the same "type" of thing as progress toward the latter. In other words, if there were in fact some joint-carvey cluster in the territory like "horizon length," then even small improvements might teach us a lot, since they would reflect some knowable amount of progress toward transformative AI.
I do not see much reason to privilege the hypothesis that "horizon length" is such a cluster, and so I doubt it can work as a general measure of AI agency. But this does not rule out that it might nonetheless have predictive value—measures do not need to reflect core underlying features of the territory to be useful, but just to vary in some predictably correlated fashion with the object of inquiry. Sometimes even strange, seemingly-distant proxies (like e.g. Raven's Matrices) turn out to correlate enough to be useful.
Perhaps "horizon length" will prove similarly useful, despite its dubious coherence as a concept and the triviality of its tests. For all I know, the fact that the benchmark measures something related at all to the time cost of tasks, or even just something related at all to what AI systems can do, is enough for it to have predictive value.
But personally, I think the case for this value is weak. And so I feel very nervous about the prospect of using such benchmarks to "form the foundation for responsible AI governance and risk mitigation," as METR suggests, or as the basis for detailed, year-by-year forecasts of AI progress like AI 2027.
AI failures are often similarly simple. E.g., one common reason current models fail is because they can't figure out how to use computer cursors well enough to begin the task.
Perhaps there is some meaningful "agency" skill continuum in principle, on which "ability to use a mouse" and "ability to conquer humanity" both lie, such that evidence of the former milestone being reached should notably constrain our estimate of the latter. But if there is, I claim it is at least not yet known, and so cannot yet help reduce our uncertainty much.
I suspect it's often this unusual operationalizability itself, rather than importance, that contributes most to these problems' fame, since they're more likely to feature in famous lists of problems (like e.g. Hilbert's problems) or have famous prizes (like e.g. the Millennium Prize Problems).
Relatedly, all else equal I expect to feel less impressed by AI solving problems whose solution and progress criteria were known, than those whose solution criteria only was known, and most impressed if neither were (as e.g. with many open problems in physics, or the alignment problem).
(I would guess this bias is further exacerbated by AI companies sometimes deliberately training on benchmarks, to ensure their models score well on the only legible, common knowledge metrics we have for assessing their products).
I have had the good fortune of getting to know several mathematicians well, and hence of learning how uncorrelated such skills can be.