TL;DR: I predict there will be an initially sharp (and then later smooth) increase in 50%-task-completion time horizon in the near future, significantly above the trend suggested by Figure 1. 

This is primarily due to (i) the prevalence of high-context tasks at longer task durations, and (ii) “context overhang” - information that is not currently placed into the context window but likely soon will be.

Before going into details, I’d like to say:

  • I think this paper is excellent. I applaud the authors for the care and attention they put into its execution, particularly the use of multiple sanity checks for external validity.
  • The authors provide detailed discussions around their use of “low-context” tasks in both the Measuring AI Ability to Complete Long Tasks paper (in Section 8.1) and the HCAST paper (in Section 5).  The choices seem well-justified to me. Consider reading those sections before the rest of my comment.
  • My comments are less of a critique than a description of how I interpret the findings of the paper.
  • I’d love to see a detailed future study comparing human onboarding time with human on-task time, to better predict the behavior of AI agents in the high-context task regime.

On we go. 

My rationale is as follows:
Claim 1: The benefits of onboarding are large for humans on high-context tasks.  
Claim 2: Models load context at very high bandwidth relative to humans, but current AI agents are not given access to the same information as human staff when human staff onboard onto a project (the “context overhang”). 
Claim 3: Once this changes, AI agents will attain (at least) the majority of benefits that human staff derive from onboarding on high-context tasks.
Claim 4: Many long horizon, economically valuable tasks are high-context tasks. 
Claim 5: Since the trend in Figure 1 is derived from low-context tasks, it will be too conservative once extended further up the y-axis if the tasks are to remain representative. 

Background

There are two[1] types of tasks studied in this paper.

Low-context tasks. By design, the paper focuses on evaluations of performance on low-context tasks (SWAA, RE-Bench, HCAST). As @Thomas Kwa notes above, the task suite (excluding the METR repo tasks) focuses on “well-defined, low-context, measurable software tasks that can be done without a GUI.” Low-context tasks are roughly those for which prior onboarding provides limited benefit.

High-context tasks. These are tasks where humans attain benefits from onboarding. The five uncontaminated METR repo issues studied in Section 6.4 are examples of high-context tasks.

The tasks used to fit the curve in Figure 1 are low-context tasks.[2]

Claim 1: The benefits of onboarding are large for humans on high-context tasks.[3]

In Section 6.4, the authors compare the performance of internal contractors to repository maintainers on high-context tasks (internal METR repo issues). Contract baseliners take 5x-18x longer to resolve issues than repository maintainers. Given the small sample size, let’s use a ~10x speed difference as a heuristic representing the benefit of deep familiarity with a specific codebase and its surrounding context.
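
As a quick sanity check on the ~10x heuristic (my own back-of-envelope, not a calculation from the paper), the geometric mean of the reported 5x-18x range lands close to 10x:

```python
import math

# Reported range of contractor-vs-maintainer slowdown (Section 6.4).
low, high = 5, 18

# The geometric mean is the natural average for multiplicative factors.
geo_mean = math.sqrt(low * high)
print(f"Geometric mean slowdown: {geo_mean:.1f}x")  # ~9.5x, i.e. roughly 10x
```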

I interpret this 10x difference as deriving primarily from the amortized cost of onboarding (or “context acquisition”).[4] The repository maintainers have already invested significant time learning the codebase, relevant discussions, historical issues, tooling, etc. This prior investment pays off in faster resolution times for new tasks within that domain. Contractors, lacking this deep context, must spend more time “onboarding” during the task itself.
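
To make the amortization point concrete, here’s a toy model (my own framing, with illustrative numbers I’ve made up, not figures from the paper): the maintainer pays a one-off onboarding cost spread over many tasks, while the contractor pays a chunk of that cost inside every task.

```python
def effective_hours_per_task(onboarding_hours: float,
                             on_task_hours: float,
                             num_tasks: int) -> float:
    """Per-task cost when a one-off onboarding investment is
    amortized across num_tasks tasks (toy model)."""
    return on_task_hours + onboarding_hours / num_tasks

# Illustrative numbers only (not from the paper).
maintainer = effective_hours_per_task(onboarding_hours=200, on_task_hours=2,
                                      num_tasks=500)  # deep, amortized context
contractor = effective_hours_per_task(onboarding_hours=20, on_task_hours=2,
                                      num_tasks=1)    # pays onboarding inside the task

print(f"maintainer: {maintainer:.1f}h/task, contractor: {contractor:.1f}h/task")
# maintainer: 2.4h/task, contractor: 22.0h/task -> roughly a 9x gap,
# in the ballpark of the ~10x heuristic, from amortization alone.
```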

Claim 2: Models load context at very high bandwidth relative to humans, but current AI agents are not given access to the same information as human staff when human staff onboard onto a project. This creates a “context overhang”. 

Examples of information not typically provided to the model with scaffolds such as modular-public (used in this work) include: relevant chat threads with other developers, future plans, priorities, emails, transcripts of in-person conversations, long-running terminal histories from previous tasks on the same server, git diffs, other GitHub issues, similar GitHub repos to the current task, the GitHub repos of all dependency libraries, etc.

Some reasons the onboarding artifacts listed above are not widely used by agents so far include (1) they are stored in formats/channels/media optimized for human consumption, and (2) they pose significant information-security challenges. I’d guess that these barriers will be overcome for economic reasons (many of these artifacts have a high signal-to-noise ratio, as suggested by the fact that humans spend considerable time interacting with them).

For the sake of completeness, context overhang is not the only consideration here. In the RE-Bench evaluations, the best performance for a given time budget was achieved by slicing the budget into n pieces (for n > 1) and performing best-of-n sampling. This reflects a limitation of the tested models rather than context overhang.
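
For readers who haven’t seen the RE-Bench setup, here is a sketch of what I mean by time slicing plus best-of-n (my paraphrase; run_agent and score are placeholder callables, not the paper’s actual harness):

```python
def best_of_n(total_budget_hours, n, run_agent, score):
    """Split a fixed time budget into n independent attempts and keep
    the best-scoring one (sketch; run_agent and score are placeholders)."""
    slice_budget = total_budget_hours / n
    attempts = [run_agent(budget=slice_budget) for _ in range(n)]
    return max(attempts, key=score)

# The observation: for the tested models, some n > 1 beats n = 1 at equal
# total budget, i.e. several short attempts outperform one long attempt.
```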

Claim 3: Once this changes, AI agents will attain the majority (and perhaps more) of the benefits that human staff derive from onboarding on high-context tasks.[5] 

Human staff have low bandwidth relative to LLMs, so they must onboard slowly over time. This makes it quite difficult to catalog their context-acquisition process. But there’s no compelling reason why LLMs should not also be able to leverage this context if it is provided to them. Indeed, there is no reason for AI agent performance to stop at human-level onboarding.
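
To put rough numbers on the bandwidth gap (order-of-magnitude guesses of mine, not figures from the paper):

```python
# Order-of-magnitude assumptions (mine, not measured):
human_tokens_per_min = 300           # ~225 words/min reading, ~1.3 tokens/word
llm_prefill_tokens_per_sec = 5_000   # plausible for modern serving stacks

corpus_tokens = 10_000_000  # e.g. a repo plus its issues and chat history

human_hours = corpus_tokens / human_tokens_per_min / 60
llm_minutes = corpus_tokens / llm_prefill_tokens_per_sec / 60

print(f"human: ~{human_hours:,.0f} hours; LLM: ~{llm_minutes:.0f} minutes")
# human: ~556 hours (months of reading); LLM: ~33 minutes
```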

Using myself as an example human, I observe that humans often fail to use the best tool for a given ML R&D task. For instance, I often stick to suboptimal libraries and coding languages for a given task in order to avoid additional onboarding costs. Superhuman onboarding (e.g. having intimate knowledge of the 100 most relevant public repositories, 1K GitHub issues, 10K emails, 100K chat threads) seems eminently feasible. Such a model should be far more capable of adeptly selecting the best tool for the job.

Once these “high-signal” information streams are tapped, I’d expect diminishing (presumably log-linear) returns, just as with any increase in context window size. This explains my prediction of “an initially sharp (and then later smooth) increase in 50%-task-completion time horizon.”
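
To show the shape I have in mind, here’s a toy curve (the functional form and constants are purely illustrative, chosen by me, not fit to any data): if the horizon multiplier grows log-linearly in the amount of high-signal context tapped, then opening up the context overhang gives a large one-time jump from today’s near-zero baseline, followed by flattening gains.

```python
import math

def horizon_multiplier(context_tokens, baseline_tokens=1e4, k=0.5):
    """Toy model: task-horizon multiplier grows log-linearly in the
    amount of high-signal context above a baseline (illustrative only)."""
    return 1.0 + k * math.log10(max(context_tokens / baseline_tokens, 1.0))

for tokens in [1e4, 1e5, 1e6, 1e7, 1e8]:
    print(f"{tokens:10.0e} tokens -> {horizon_multiplier(tokens):.2f}x")
# Each additional 10x of context buys the same increment: a sharp jump
# when the overhang is first tapped, then diminishing returns.
```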

At some point, economics will justify test-time training, but I expect this to come later.

Claim 4: Many long horizon, economically valuable tasks are high-context tasks. 

Here, I’ll quote the authors from Section 5 of the HCAST paper.

“...most tasks performed by human machine learning or software engineers tend to require referencing prior context that may not be easy to compactly describe, and does not consist of unambiguously specified, modular tasks.”

Claim 5: Since the initial trend in Figure 1 is derived from low-context tasks, it will be too conservative once extended further up the y-axis if the tasks are to remain representative.[6]

By Claim 4, as models progress further up the y-axis, the tasks they are evaluated on will increasingly need to be high-context for the suite to remain representative.

Collectively, I believe these claims support the prediction that the actual horizon curve will rise more sharply than the Figure 1 trend suggests.

 

  1. ^

    I’m binarizing for simplicity - in practice, of course, there’s a spectrum.

  2. ^

    In discussion of their METR repo experiments on page 16 of their paper, the authors note that “...time horizons may have better correspondence to the labor of a low-context human, rather than a high-context human.” As measured on the current suite of tasks, I'd agree with this assessment.

  3. ^

    Since this claim risks being tautological with the definition, I'll note that the emphasis is on the magnitude (the gains are large, e.g., ~10x). 

  4. ^

    There are also many other effects. For instance, METR employees have presumably already passed through substantial selection for performance on such tasks.

  5. ^

    For the sake of being explicit, my arguments in Claim 3 are based on public information. 

  6. ^

    If you interpret the purpose of Figure 1 as being to solely forecast horizons on low-context tasks, then my claim doesn’t hold. I don’t believe this is how the figure is being widely interpreted, but I accept that this is a subjective judgment.