AIs can now often do massive easy-to-verify SWE tasks and I've updated towards shorter timelines

ryan_greenblatt

I've recently updated towards substantially shorter AI timelines and much faster progress in some areas. ^[1] The largest updates I've made are (1) an almost 2x higher probability of full AI R&D automation by EOY 2028 (I'm now a bit below 30% ^[2] while I was previously expecting around 15%; my guesses are pretty reflectively unstable) and (2) I expect much stronger short-term performance on massive and pretty difficult but easy-and-cheap-to-verify software engineering (SWE) tasks that don't require that much novel ideation ^[3] . For instance, I expect that by EOY 2026, AIs will have a 50%-reliability ^[4] time horizon of years to decades on reasonably difficult easy-and-cheap-to-verify SWE tasks that don't require much ideation (while the high reliability—for instance, 90%—time horizon will be much lower, more like hours or days than months, though this will be very sensitive to the task distribution). In this post, I'll explain why I've made these updates, what I now expect, and implications of this update.

I'll refer to "Easy-and-cheap-to-verify SWE tasks" as ES tasks and to "ES tasks that don't require much ideation (as in, don't require 'new' ideas)" as ESNI tasks for brevity.

Here are the main drivers of my update:

Opus 4.5 and Codex 5.2 were both significantly above my expectations (on both benchmarks and other sources of information). This isn't that much of an update by itself, we should expect some variation and some models to be decently large jumps, but then Opus 4.6 (and probably Codex 5.3 and 5.4) were again above my expectation even after Opus 4.5 and Codex 5.2. In 2025 we saw roughly 3.5 month doubling times on METR 50%-reliability time horizon and a big jump (though with an unreliable measurement) right at the start of 2026.
I've seen demonstrations of AIs accomplishing very large and impressive ES tasks given only moderately sophisticated scaffolding. As in, tasks that would take humans months to years (some of these tasks were ones that weren't contaminated by results on the internet, eliminating that explanation). These demonstrations are: various things I've done with a scaffold I wrote, the C compiler that was (almost entirely) autonomously written by Claude, some cyber results I've seen, and some other soon-to-be released results from METR and Epoch AI ^[5] . Due to this, I tentatively believe that (as of March 1st) the well-elicited 50% reliability time-horizon on ESNI tasks (using only publicly available models) is somewhere between a month and several years (supposing that the AI's overall budget for both tokens and experiments corresponds to roughly what a human would cost to do the same work). I think the high reliability (e.g. 90%) time horizon is much lower.
I now expect a substantial training compute scale up in 2026 (probably mostly pretraining) and I expect this to yield large returns.
I've updated towards somewhat larger scaffolding overhang on very large tasks than I previously thought was present (based on observations of AI performance given different types of scaffolding). Thus, I expect significant improvements in usefulness relative to what's currently in widespread public usage from relatively straightforward scaffolding improvements.

I was previously thinking that frontier AI progress in 2026 would be a bit slower than in 2025 ^[6] (as measured in effective compute or something like ECI), but due to these factors, I now expect progress in 2026 to be a decent amount faster than progress in 2025.

It's worth noting that AIs being more useful (for AI R&D) accelerates AI progress (in addition to being an update towards being closer to various other milestones). So, when I update towards being further along in the timeline and towards AI being more useful at a lower level of capability, I also update towards a faster rate of progress this year. ^[7]

A key place where I was wrong in the past is that the 50%-reliability time horizon now seems to be around 20x longer on ESNI tasks than METR's task suite (and similar task distributions)—and well greater than 100x is plausible—but I expected a gap of only about 4x. (This error is pretty clear in my predictions in this post.) (There is also a gap where AIs' time horizon on "randomly selected internal tasks at AI companies" is shorter than on METR's task suite (and similar), but this looks like a factor of 2 or 3 and doesn't currently seem to be rapidly growing.)

What's going on with these easy-and-cheap-to-verify tasks?

What explains this very high performance on ES tasks? The core thing is that you can get the AI to develop a test suite / benchmark set and then it can spend huge amounts of time making forward progress by optimizing its solution against this evaluation set. This is most helpful when incrementally improving/fixing things based on test/benchmark results is generally doable (and it's easy for the AI to see what needs to be fixed), it's not that hard to develop a sufficiently good test suite / benchmark set, and running the test suite / benchmark set isn't that hard. These properties hold for many types of very well-specified fully CLI ^[8] software tasks (and software tasks that are most focused on improving some relatively straightforward metrics).

This type of loop means that even if sometimes the AI gets confused or makes bad calls, there is some correcting factor and mistakes usually aren't critical. You can do things like having multiple different AIs write test sets or getting the AI to incrementally improve the test suite / benchmark set over time to avoid mistakes on the testing yielding overall failures. On many other types of tasks, AIs are limited by having somewhat poor judgment or making kind of dumb mistakes and having a hard time recognizing these mistakes. But, with the ability to just keep iterating, they can do well.

I think we're well into the superexponential progress on 50% reliability time-horizon regime for these ESNI tasks: because sufficient generality and error recovery allows for infinite time horizon (the AI can just keep noticing and recovering from its mistakes), beyond some point each successive doubling of time-horizon will be easier than the prior one. See here for more discussion of superexponentiality. The level of generality needed to enter the superexponential regime for ESNI tasks is lower as it's easier to spot and recover from mistakes.

A core thing I wasn't properly pricing in is that a task being easy-and-cheap-to-verify helps at two levels: it's both easier for AI companies to optimize (both directly in RL and as an "outer loop" metric) and it's easier for AIs themselves to just keep applying labor at runtime.

Thus, we can imagine a hierarchy of tasks:

ES tasks
Tasks that can be readily checked for training/evaluation but the AI can't easily check itself
Harder-to-check tasks

It seems as though the gap between (1) and (2) is much larger than the gap between (2) and (3).

A separate dimension is how much the task requires ideation. The more that having somewhat clever ideas is important, the less the AI can operate very iteratively. More generally, tasks vary in how much they are best done with incremental iteration. Some types of software like distributed/concurrent systems and algorithms-heavy software are substantially harder to build iteratively. And lots of software is more schlep-heavy and is just a large number of different things that need to get done, making incremental progress more viable. (A core question is how much it's important to carefully understand the broader complex whole and think of a good way to do/structure things vs. you can just iterate on smaller components.)

Some evidence against shorter timelines I've gotten in the same period

One thing we might wonder is if METR's task suite and similar evaluations were just underelicited and better scaffolding (that e.g. gets the AIs to write tests and then optimize against these tests) would make a big difference. I currently think certain types of better scaffolding might make a moderately big difference on METR's task suite, but that this isn't the main driver of the time-horizon gap between ESNI tasks and METR's task suite. Most of that gap is about the task distribution (checkability, iterability, the remaining unsolved tasks not being central SWE tasks) with AIs actually being bottlenecked by real capability limitations on their current task suite (though because of the task distribution, these capability limitations don't strongly preclude large acceleration of AI R&D). That said, I think scaffolding is increasingly becoming a big deal and will matter more for next-generation models. (In short: I think scaffolding is quite important for current and near future AIs when the task is sufficiently large in scope that completing the task would naturally take up a large fraction of the model's context window, like at least 1/3.)

I think AIs have quite bad "taste" and "judgment" in many domains (generally more so stuff that's harder to RL on) and that this is improving substantially slower than general agentic capabilities. By "taste" and "judgment", I mean something like "making reasonable/good calls in cases that aren't totally straightforward and having good instincts". This includes something like SWE taste which is often the main bottleneck in my experience on somewhat less well-specified SWE tasks and seems to be a major bottleneck on code quality even on very well-specified SWE tasks.

One story here is that taste is mostly driven by pretraining progress or RL on the domain in question (taste doesn't currently generalize that well between domains I think) so outside of heavily RL'd on domains the progress comes mostly from pretraining. And pretraining progress is maybe 2-3x slower than overall AI progress.

However, I do think we might see especially fast pretraining progress in 2026. Thus, I think it's possible these blockers will rapidly improve.

I've seen AIs do a lot of stupid stuff in the course of trying to automate various empirical research projects (though I think some of this stuff that looks like stupidity might be better explained by misalignment / poor RL incentives).

Why does high performance on ESNI tasks shorten my timelines?

The main reasons are:

General capabilities update: I previously didn't think the AIs would be able to do this by now and these tasks are intuitively difficult. This updates me upwards on the overall capability of AIs and on the efficacy of RL. More generally, I just update based on "things have gone faster than I expected".
Superexponentiality: ESNI time-horizon progress seems significantly superexponential, so we've now seen an example of superexponentiality in the wild in one moderately representative domain and it seems like this yielded very fast doubling times. This superexponentiality also kicked in somewhat earlier than my median (in terms of 50% reliability time horizon and qualitative capabilities) for when this would become a big deal. ^[9]
AI R&D acceleration: I think it's pretty plausible that very strong performance on ESNI tasks (especially extremely, extremely strong performance) will allow AIs to substantially speed up AI R&D. As I'll discuss in the next section, I think it's unclear how large of a speed up this will be, but it could be pretty big especially if AIs get better at very ideation-bottlenecked tasks. Additionally, very high performance on ESNI tasks makes it more plausible that relatively small capability improvements greatly improve performance on tasks which have pretty good progress metrics (metrics that can be gamed or don't perfectly capture quality, but where doing better on the metric generally means doing better) but which aren't totally ES tasks (e.g., tasks where verification is expensive or requires a decent amount of judgment).
Scaffolding and prompting underelicitation: While the required scaffold to mostly unlock these capabilities isn't that complex, it is the case that relatively basic scaffolds don't suffice and my understanding is that performance can probably be greatly improved on ES tasks with better (general purpose) prompting and scaffolding. I also think this generally applies for large scope tasks at the limit of what AIs can currently do. This makes me think there is more underelicitation than I was previously thinking. I also think that AIs could be better adapted to these big scaffolds and get better instincts about how to operate in these scaffolds (e.g. how to write instructions for other AIs) which would further boost performance.

How much does extremely high performance on ESNI tasks help with AI R&D?

By default, not that much of currently done AI R&D is straightforwardly an ESNI task. ML research at AI companies typically either requires expensive (potentially very expensive) verification/evaluation or it requires a decent amount of taste and judgment to come up with the idea, set up the experiments, or interpret the results. Building infrastructure or doing efficiency optimization is much more ESNI-like but typically isn't fully ESNI.

What parts of AI R&D are ESNI?

Implementing optimized versions of experiments or architectures given a precise spec for the architecture/experiment. (Allowing for e.g. comparing behavior at small scale to unoptimized known correct implementations.) This could be pretty helpful and makes using more complex and infrastructurally difficult architectures more viable. It also makes heterogeneous compute more viable. (Optimizing many parts of full scale training runs isn't an ES task because verifying correctness and efficiency requires expensive experiments, and some optimizations aren't purely behavior-preserving — e.g., how much does increased asynchrony affect performance?)
Building or optimizing straightforward/well-specified internal tools/infrastructure used for research.
Some types of ML experiments where the results are cheap to verify, most notably some prompting and scaffolding experiments where we have a good (and cheap) benchmark. There might also be valuable very small scale ML experiments (though getting lots of value from these experiments may be bottlenecked on ideation).
Optimizing some applications of AI (either inside the company or to increase revenue).

Here are some things that might or might not be ESNI tasks:

Building RL environments. It's not super easy to verify if an RL environment is reasonable and it's unclear how bad it is for a reasonably large subset of RL environments to be quite flawed.
Collecting and operating on data.

So naively, we'd expect very high performance on just ESNI tasks to be a moderate speed up that results in AI companies quickly getting bottlenecked on something else. Of course, current AIs are also somewhat helpful on other tasks and can generally accelerate lots of engineering.

I don't feel very confident in my picture of how much of AI R&D is an ESNI task, and AI companies might figure out better ways to leverage AIs doing something ESNI-like.

I do think that if AIs were wildly, wildly superhuman on ESNI tasks (or especially if they were wildly superhuman on the broader category of ES tasks), they could potentially massively accelerate AI R&D via (e.g.) massive improvements through just small scale experiments. As a wildly extreme hypothetical, if AIs could generally complete ES tasks a trillion times cheaper and faster than humans (but were somehow just as capable as current AIs on other tasks), I think AI R&D progress would massively accelerate via some mechanism (probably a very different mechanism than what drives current progress).

A big limiting factor is the "cheap to verify" aspect. If AIs could use expensive resources more sample efficiently than humans (while still only being very good at straightforward-but-potentially-expensive-to-verify domains), then the AI R&D speed up would be massive and depending on details this might yield full automation of AI R&D. But, using expensive resources more sample efficiently than top human experts effectively means having research taste matching (or exceeding) top human experts, which seems at least several years away at the current rate of progress on this capability. However, AIs might not need to utilize resources that effectively to yield large speed ups. My understanding is that lots of progress (currently) is from relatively uninteresting (and not that ideation bottlenecked) research. As in, engineers whose taste isn't that great but are very fast would yield large speed ups. With moderate improvements in taste, AIs might resemble such engineers. This usage of AIs would still require humans to be providing ideas and some taste, but AIs could autonomously run with large parts of the project (doing potentially months of work autonomously).

Some aspects of poor resource utilization feel pretty easy to solve (e.g., Opus 4.5 was a bit too miserly while I tend to find that Opus 4.6 is a bit too profligate) but ultimately my best guess is that this requires reasonably good taste and judgment which will be somewhat difficult to achieve. Notably, doing RL at the same resource usage scale as deployment usage of AI won't generally be viable. That said, transfer from smaller resource usage tasks might not be that hard, and some of the RL can involve the AI using resources that are many times more expensive than the RL rollouts themselves (matching a decent fraction of deployment resource usage scale). ^[10]

My experience trying to automate safety research with current models

I've recently been working on trying to automate empirical AI safety projects with AIs both because this would be materially useful (at least while being careful about capabilities externalities) and because this seems useful for better understanding blockers to future automation of safety and safely deferring to AIs. As part of this I wrote an agent orchestrator and various other things to try to make AIs better at this.

Early on, one of my main blockers was that Opus 4.5 would consistently fail to complete the full task with anywhere near the desired level of thoroughness (often skipping large parts of the task). I was able to patch around these instruction following issues in various ways and resolve some other issues through better prompting and scaffolding/orchestration. But, I do still see serious productivity hits from (mundane) misalignment; I have a forthcoming post on misalignment in current models and why I think it's problematic.

Currently, the biggest blocker I have on the (small) projects I'm trying to automate is poor taste/judgment where AIs make somewhat bad choices or consider something good when it actually isn't. I've been able to successfully get this overall system to do reasonably big chunks (for an unaccelerated human maybe the equivalent of a day to a few weeks) of relatively straightforward projects, often compensating for poor taste by getting the AI to complete the project much more thoroughly, effectively doing more low value work.

Overall, I think automating many weeks of mostly pretty-well-specified safety research will soon often work. (This presumably transfers to capabilities research that doesn't require compute heavy experiments, but my experience is most relevant for well-specified safety research.)

My experience seeing if my setup can automate massive ES tasks

I also ran the above setup on trying to ~fully autonomously complete:

2 massive (e.g. would take 3-30 person years) easy-and-cheap-to-verify SWE tasks
A hard easy-and-cheap-to-verify AI R&D task
A few (somewhat esoteric) number-go-up optimization tasks that ARC theory was interested in
Some of METR's harder and less well specified tasks (these weren't fully ES)
Vulnerability finding and end-to-end (cyber) exploitation tasks on relatively hardened targets

I did the first two of these mostly to get a better understanding of AI capabilities. I don't want to say what the exact tasks were in this document for a few reasons. I've generally found AIs able to make quite a bit of autonomous progress, in line with results others have seen, while the amount of progress depends a lot on the details of the task.

(I've done multiple different runs with somewhat different prompts and different scaffolding settings, generally the results are surprisingly invariant to this.)

SWE tasks

I found that the AI successfully completed what looks like many months (3-12 months) of useful work in the SWE projects. In one of these projects it looks like the AIs have beaten or are close to beating a large and moderately complex piece of closed source software in some respects while failing to match it in other respects (and while having various bugs and unimplemented features). For the other, it looks like they may produce something that's pretty impressive but mostly worse than the current best open source project of that type. The code quality is low, but I've since developed approaches that would probably have made the code quality mostly OK (but still not great and likely with some places of very low code quality).

I should say that I did start these projects with a reasonable amount of guidance on what to pursue and what metrics to use and what infrastructure etc. to use which took me around 1-2 hours to write but much of this is amortized over multiple tasks (I've reused this guidance for other tasks) and an hour or two is not that long.

The AI reasonably often makes somewhat bad prioritization choices and has an inclination to consider itself done before trying hard/serious types of improvement. I've had to remind it of metric prioritization, nudge it to keep going, and remind it to periodically clean up its code (even though this was included in instructions). But, by just continuing to iterate, these poor choices aren't catastrophic. I also notice a reasonable amount of misalignment where the AI fails to fully complete various tasks and doesn't keep working, but my scaffolding mostly compensates for this.

I expect that there are pretty large returns to having a human spend 15 minutes giving the AI tips every so often (e.g. every day of calendar time). Often the AIs make mistakes that are pretty obvious to a human who doesn't even have that much state on the project (e.g. due to losing sight of the bigger picture). But the AIs aren't amazing at incorporating advice from what I've seen.

AIs seem especially good at software replication tasks—as in, make a drop-in replacement for this piece of closed source (or open source) software that has some advantage (e.g. speed, security, some feature, etc.). METR and Epoch AI have some forthcoming results on this and I think the performance is even stronger with better scaffolding and prompting.

AI R&D task

The AI R&D task I tested involves improving on something that's already well optimized, so it's pretty hard for the AI to make progress. I tentatively believe the AI made somewhere between a few days and a bit over a week of progress on this task relative to a strong human professional. In practice, it was mostly limited by the AI not being very good at finding good ideas, deciding which ideas to investigate, and allocating time/effort for each idea. It seemed to spend most time trying to eke out gains with tweaking rather than making material improvements. The AI also was pretty resource inefficient and not very good at getting more work done in a limited amount of time due to spending lots of time waiting on many runs. Some of this resource usage could be easily improved with better prompting.

Cyber

AIs are quite good at autonomous cyber, especially with a moderate amount of scaffolding. In part, this is due to having a lot of domain-specific knowledge. I don't want to comment on the exact results I've found (on Opus 4.6) at this time, but this talk by Nicholas Carlini is relevant.

Appendix: Somewhat more detailed updated timelines

I thought it might be helpful to also include some updated timelines in this post.

I'll use the notion of parity from Six Milestones for AI Automation (Cotra):

Parity in a domain: the point when you would be better off firing all humans working in that domain than reverting to 2020-era AI. Humans may still add value in places, and firing them all would still slow things down somewhat, but AIs collectively are more valuable than humans collectively.

I'll also forecast Automated Coder (AC) and Top-Expert-Dominating AI (TEDAI) ^[11] for easy comparison to views at AI Futures Project.

I forecast the following milestones:

AI R&D parity. Parity applied to AI R&D at the leading AI company (where "humans" means everyone who knows how to program and/or has done ML research).
AI stack + conflict parity. Parity applied to all activities relevant to (a) maintaining and improving the AI stack and (b) winning wars (broadly construed). This includes: manufacturing, construction, mining, and other physical industrial tasks; R&D in energy, materials, hardware, biotech, robotics, cyberoffense/defense, etc.; and squishy skills like strategy, tactics, and logistics. Note that this requires the ability to do fully autonomous manufacturing. (I'm allowing for a brief period of adaptation without further AI progress to allow for repurposing robots and manufacturing capacity.)
AC
TEDAI

Note that (1) differs from the operationalization of "Full AI R&D automation" I've used historically; it's a bit weaker. (So my probabilities are correspondingly a bit higher.)

My forecasts (mostly? probably?) don't take into account aggressive policy responses to slow down AI development, but do include "business as usual" regulatory blockers. Possibly this is a mistake.

Date	1. AI R&D parity	2. AI stack + conflict parity	3. AC	4. TEDAI
EOY 2026	7%	3%	11%	4%
EOY 2027	19%	9%	27%	12%
EOY 2028	30%	17%	39%	19%
EOY 2029	40%	25%	48%	26%
EOY 2030	48%	32%	56%	32%
EOY 2031	54%	37%	62%	37%
EOY 2032	58%	42%	66%	42%
EOY 2033	61%	47%	69%	46%
EOY 2034	63%	51%	71%	50%
EOY 2038	70%	61%	77%	58%

My forecasts

AC and TEDAI comparison with AI Futures Project (Daniel Kokotajlo, Eli Lifland)

For comparison, Cotra's median for AI Research Parity (comparable to my AI R&D parity) is early 2030 (slightly before my median of early 2031), and her median for AI Production Parity (comparable to my AI stack + conflict parity, though mine also includes conflict) is mid 2032 (before my median of late 2034).

While I give precise numbers, my views aren't that reflectively stable (e.g. I updated a moderate amount over the last week towards longer timelines after thinking about it a bit!). ^[12]

Note: my median time from some milestone A to some later milestone B is significantly smaller than the difference between my medians for A and B. This is because for right-skewed distributions, the median of a sum is greater than the sum of the medians. Intuitively: each milestone has some chance of taking a very long time (heavy right tail), and these right tails compound when you add delays together, pulling the median of the total time further right than you'd get by just adding the individual medians. So median(B) - median(A) > median(B - A), i.e., the difference between medians overstates the median of the actual time between milestones. ^[13] For instance, I estimate my median time from AI R&D parity (conditioning on this happening before 2035) to TEDAI is maybe around 1.75 years, while the difference between my medians is around 3.5 years. ^[14]

I mostly updated in February 2026 and refined my thinking a bit more in March. ↩︎
I'm at 30% for AI R&D parity (you'd be better off firing all humans working in AI R&D than reverting to using 2020-era AI), but a bit lower for full automation (firing all humans would only slow things down by ~5%), perhaps 26%. ↩︎
As in, don't require coming up with ideas that aren't already on the internet. A key part of the task being discovering somewhat hard-to-find ideas that someone knows but aren't public also makes the task quite a bit harder for models. ↩︎
By "50%-reliability time horizon" I mean: if you randomly sample tasks from the relevant task distribution, this is the time horizon at which the AI has a 50% chance of success. Note that in practice this is mostly driven by variation between tasks (some tasks are harder for the AI than others) than by the model randomly failing on a given task. Thus, it's a bit unnatural to call this reliability. I use the term "reliability" because that's what METR uses and it reads nicely (e.g. "50%-reliability"), though "success rate" might be more accurate. ↩︎
This previously just said METR which was a mistake. ↩︎
My prior view was based on thinking that 2025 was especially fast due to some low-hanging fruit in RL and some of this progress was from increasing cost as a fraction of human cost. I think these factors still hold to a moderate extent, I just expect them to be less important than other factors. ↩︎
We shouldn't double count this: my view is just that AI progress speeds up as you get closer to full automation of AI R&D all else equal and this was already priced into my timelines. (In practice, I expect that all else isn't equal and I expect compute scaling to start slowing within 3 years or so due to production capacity limits or investment slowing as it reaches more extreme levels. I think we're already seeing some signs of hitting compute production issues with DRAM/HBM, though it's worth taking into account adaptation and there being some lag.) ↩︎
By fully CLI, I mean that the task doesn't require vision, computer use, or non-trivial hard-to-programmatically-automate interaction. ↩︎
Note that just updating towards my median for superexponentiality kicking in would have also shortened my timelines; the situation isn't symmetric. The basic reason for this is that my timelines are substantially longer due to a slower tail on many different factors. ↩︎
You can also do online RL, but this has some downsides. ↩︎
I worry their operationalization is a bit weaker than they intended. In addition to their remote work operationalization, I also intend the definition to include beating human experts at any reasonably important R&D domain when doing that work purely remotely. Like, you would certainly prefer hiring the AI over hiring the top human expert in any reasonably important R&D, putting aside physical manipulation. ↩︎
Given this instability, why have so much precision? I tentatively think my precision is actually indicative of very slightly better guesses; e.g. I expect I would do a little worse at forecasting if forced to round to the nearest 5% or 10% while I'm also pretty likely to adjust my guesses a bunch on further reflection and these can be true at the same time. Also, it's nice to have a smooth curve. ↩︎
Here, A and B are random variables that correspond to the year in which some event happens. ↩︎
Also, if A and B-A are correlated (as I think is true for the milestones I discuss here, shorter timelines are correlated with faster takeoff), then conditioning on A having been reached earlier also shrinks the expected remaining time to B. So, if we reach AI R&D parity in mid 2028, then I'd expect a smaller gap to TEDAI. ↩︎

I've seen demonstrations of AIs accomplishing [...] tasks that would take humans months to years. Due to this, I tentatively believe that (as of March 1st) the well-elicited 50% reliability time-horizon on ESNI tasks (using only publicly available models) is somewhere between a month and several years

Once you are talking about scaling to arbitrary inference costs, I think the relevant notion of time horizon is closer to: "For what T could you solve this task using an arbitrary supply of humans each of whom only works on the project for T hours?"

On this framing I think that lots of easily-checkable tasks have much shorter horizons than "amount of time they would take a human to do." That is, I think that you could successfully solve them by delegating to a giant team of low-context humans each of whom makes marginal improvements, and the work from those humans would look quite similar to the AI work.

This is particularly true for reimplementation tasks with predefined tests + a reference implementation. In that setting a human can pick a specific failing test, use the reference implementation to understand the desired behavior, and then implement it while using tests to make sure they don't break anything else. For the projects you are describing it seems plausible that a human with a few days could use that workflow to reliably make forward progress and get the project done in a moderate number of steps.

(I would expect a giant human team using that methodology to make a huge number of mistakes for anything not covered by the tests, and to write code that is low quality and hard to maintain. But I think we're currently seeing roughly that pattern of failures from AI reimplementation.)

It's also very true of vulnerability discovery, since different humans can look in different places and then iteratively zoom in on whatever leads are most promising.

So I'm not sure how much of the phenomenon you're observe is "there is a way longer horizon for these tasks" rather than "a more careful definition of horizon is more important for these tasks." Probably some of both, but It seems quite possible that for the tasks you are describing the horizon length is really more like a few days or a week than months or a year.

I think it's helpful to try to pull these effects apart, particularly when reasoning about the likely effects of continuing RL schlep: I think that you are seeing good performance on these tasks partly because they are easy to decompose and delegate, and not entirely because AI developers are able to do RL on them. For the very "long horizon" tasks you cite, I think that's the dominant effect. So I wouldn't expect performance to be as impressive on tasks where it would be hard to delegate them to a giant team of low-context humans.

I still agree that recent public development seem like evidence for shorter timelines, and I'm not saying my bottom line numbers differ much from yours. Even a more conservative extrapolation of time horizons would suggest more like 2-4 years until you can strictly outperform humans for virtual projects (with human parity somewhat before that).

This is a good point and I broadly agree with what you're saying. Some possible disagreements:

Once you are talking about scaling to arbitrary inference costs, I think the relevant notion of time horizon is closer to: "For what T could you solve this task using an arbitrary supply of humans each of whom only works on the project for T hours?"

Maybe I should have made this clearer, but I'm not talking about scaling to arbitrary inference costs. Instead, I'm talking about scaling to inference costs that are a moderate fraction of the human task completion cost. (E.g., 1%-100% depending on the task.) I think you'd want to compare the AI performance at some inference budget to human labor with some limit in supply.

So I'm not sure how much of the phenomenon you're observe is "there is a way longer horizon for these tasks" rather than "a more careful definition of horizon is more important for these tasks." Probably some of both, but It seems quite possible that for the tasks you are describing the horizon length is really more like a few days or a week than months or a year.

Yep, it seems reasonably likely that on this alternative notion of time horizon, the horizon length is more like a few days. However, under this alternative definition, relatively small time-horizon values may correspond to much larger (real-world) impact. For example, an AI with a decomposed-time-horizon of ~1 week could potentially speed up AI R&D a lot, while under the original time-horizon notion, a 1-week 50%-reliability time horizon is much less of a big deal. (To be more precise about the decomposed-time-horizon metric: this would correspond to AI having a 50% chance of matching a team of humans where each human gets 1 week before being replaced, with total labor-hours capped at ~100x what the task would normally require to avoid degeneracies from truly arbitrary inference spend.) So the key question becomes: how much can you accomplish if the task must be heavily decomposed but you can apply much more labor? I'm pretty uncertain about this for the tasks we care about.

Further, AIs may soon end up being especially good (superhuman?) at working in heavily decomposed contexts—e.g., good at leaving notes to other instances, good at quickly picking up project state from limited context. This makes the correspondence to doing task decomposition with humans more fraught, even though AIs will still do relatively better at tasks that are easier to do incrementally or otherwise decompose. I think it's already the case that AIs are extremely good relative to humans at loading up complex state so long as that state is written down (reasonably clearly).

I was already pretty uncertain about how to translate time horizon into downstream impacts, and while this alternative metric might be more consistent across different groups of tasks, it makes translating to downstream impact harder. An alternative is just explicitly thinking about the time horizon for different buckets of tasks using the original notion.

Regardless, I agree that AI capabilities may be better understood using this alternative time-horizon notion, and this does closely correspond to how AIs are actually being used. (My scaffold is mostly just doing task decomposition, and something like this is effectively required for good performance given current AI properties.)

Maybe I should have made this clearer, but I'm not talking about scaling to arbitrary inference costs. Instead, I'm talking about scaling to inference costs that are a moderate fraction of the human task completion cost. (E.g., 1%-100% depending on the task.) I think you'd want to compare the AI performance at some inference budget to human labor with some limit in supply.

I agree. In the more realistic regime you are talking about you have some more complicated quantitative question around how large are the slowdowns from task decomposition into what scale.

My main point was that for the tasks we are talking about here, the slowdowns seem like they might not be that large even for modest human time horizons. (In contrast with some of the crazy factored cognition stuff we have sometimes talked about, which involves much shorter horizons, much harder-to-decompose tasks, and much larger slowdowns.)

However, under this alternative definition, relatively small time-horizon values may correspond to much larger (real-world) impact.

I agree that this could lead to large impacts with relatively short horizons (perhaps even today's horizons, with an appropriately broadened training distribution and a bunch of schlep). That does imply a different picture of AI strengths and weaknesses (e.g. weaker on-the-job learning with performance mostly limited to domains near the training distribution; differential speedup for tasks that are easily decomposed), with a more schlep-heavy singularity, a greater role for tight human involvement later in the process, and probably less alignment concern earlier in the trajectory.

I think it's plausible current AI mostly accelerates the picking of low-hanging fruit downstream of having access to some level of compute. The things that could be achieved in 2 years with current compute are achieved faster, but the things that could be achieved in 15 years won't happen meaningfully faster. And so the effects of gaining more compute are more important overall than the details of how quickly current AI expresses them as novel AI capabilities (while compute still keeps scaling rapidly, until 2028-2030 or so).

That is, if the next milestone that qualitatively improves AI (which might include becoming immediately RSI capable) would be achieved in 2029 without AI assistance (on the R&D side), maybe it's achieved in 2028 with AI assistance, or in early 2028 with superexponential progress or whatever. But if it would instead only be achieved in 2035 without AI assistance, it's still only achieved in 2034-2035 with AI assistance, because it's bottlenecked by things current AI is unable to meaningfully help with, which aren't low-hanging fruit that becomes available with a new level of compute.

The surprising capabilities of current AI translate to a surprising TAM, and so a surprising amount of compute in the near future. It's still only maybe 10x higher than otherwise (by 2031-2035), which is not that much considering the likely 1500x growth over 2022-2028, where it would otherwise mostly stop if revenues plateaued at about $100bn per AI company. This somewhat increases the chance that the next milestones of qualitatively better AI are reached in the near future rather than in a more distant future (a hard-to-predict but plausibly nontrivial number of years after the scaling of compute mostly stops and the low-hanging fruit is all picked). But the involvement of AI in this change from observed surprising capabilities (according to the model I'm sketching) is more importantly financial rather than through AI itself helping with AI R&D.

The things that could be achieved in 2 years with current compute are achieved faster, but the things that could be achieved in 15 years won't happen meaningfully faster

If you haven't seen Tom Cunningham's new blog post on what I think is a very related thesis, you may be interested

I don't disagree overall (Victories continue to come closer to home), but to whackamole a specific datapoint, how much critique of the Claude C Compiler did you read? https://harshanu.space/en/tech/ccc-vs-gcc/ was on top of HN for a day and I believe Zvi reported on it so I'd imagine you have.

But to quickly summarize, CCC only implemented the "easy" parts, directly copied/transpiled the hard parts of that easier part, and most importantly is over 100,000 times slower than an actual compiler! I feel this generally smells of the types of distorted truth that you should expect to omnipresently manifest when literal trillions of capital are at stake.

I think the original Anthropic post is reasonably accurate about the limitations of the project and I don't see how this blog post contradicts claims in the original post?

I think that this C compiler represents roughly a year of work for someone and that the issues here aren't a crux for this. TBC, by 'CCC only implemented the "easy" parts' you just mean that it only build a non-optimizing compiler and didn't build the linker or assembler? This was explicit in the original post.

(FWIW, my view is the the hardest part is optimization (while still being correct), then the compiler itself (made harder if you also have good error messages etc which CCC certainly doesn't have), then the linker / assembler. I don't buy that the linker/assembler are harder than a compiler that can correctly compile projects like the linux kernel.)

most importantly is over 100,000 times slower than an actual compiler

Yes, this isn't surprising as it's a non-optimizing compiler? (It's also quite a bit slower even at -O0 from my understanding.)

I'm a bit confused by what your argument here is. I don't think anyone was claiming that CCC was anywhere near a full drop in replacement for gcc/clang? TBC, I do think it would represent a much stronger result if it was a full drop in replacement for gcc/clang, like this is presumably >>10x as much human effort.

Hmm, well yes I suppose my main point is that I personally don’t believe it represents a year of work, but worry the pro-marketing slant of the Anthropic blog post pushes this perspective moreso than a cynic’s breakdown, which I wanted to make sure you were aware of. (There are ofc other analyses than the HN discussed one, see 1, 2)

But I certainly believe it replaced a month+, so as long as it’s now confirmed we are operating under similar information it becomes just a minor quibble over a singular OOM in the interpretation, which seems fine to me.

(Assuming this is true) One would wish for LW to be one of the earliest places for Truth to master pants, but this hasn't happened quite yet...

While I can't exactly say I'm excited about AI doing massive ES tasks, it could be good news for interpretability research that depends on exhaustively searching for verified explanations of an AI's behaviors or features. This post on automated SAE research comes to mind.

I'm particularly interested in scaffold optimization for interpretability. E.g. given a task domain, a trusted model T (small, or hasn't been RLed on the domain), and an untrusted model U (large, or has been RLed on the domain), can we "distill" U into interpretable code that explains to T how to act like U or achieve higher reward? We can verify the scaffold by running it against a validation set and measuring reward or KL divergence from U (easy; may or may not be cheap).

Why is TEDAI (operationalised as remote-work-only if I understand) so similar to AI stack + conflict (includes full supply chain self-sufficiency, warfighting, real-time operational tactics) for you? Is this just superexponentials eating up all the gaps? How come TEDAI is less likely later on (but more likely sooner)?

AI stack + conflict parity requires lots of robots (or crazy novel tech) but doesn't require AIs as capable as TEDAI. TEDAI is a very high capabilities bar. So, in worlds without a software only singularity and especially with slower takeoff, I think you may reach AI stack + conflict parity prior to TEDAI. (It's certainly possible to have great military robots and robot industrial capacity with AIs that are well within the human range on key skills.) TEDAI probably follows reasonably quickly, because economic doubling times are so fast in such a world. In the absence of strong coordination, I think you reach AI stack + conflict parity pretty shortly after TEDAI given how capable these AIs would be. So, my medians for these milestones are similar.

Shorter timelines to TEDAI correlate with faster takeoff, making it more likely you get the AI capabilities before the robots. Longer timelines correlate with slower takeoff making it more likely you get lots of robots + industrial capacity before reaching TEDAI.

I don't feel very stable in my picture of the industrial explosion because not that much serious analysis has been done. So, my views on these relative milestones might shift a bunch on further reflection.

The recent improvements in practical utility of AI have decreased my probability of near-term recursive self-improvement. I'm updating towards a Chollet-like view. I'm increasingly convinced that LLMs really are a collection of fairly dumb heuristics (surface-level pattern matching) that are intelligent through breadth, and that there's a qualitative difference from human intelligence. We use fewer heuristics but each one is smarter (more generally applicable, abstract, robust).

In the past it was hard to distinguish whether LLMs were developing smarter heuristics with each improvement or just deploying more and better-selected dumb heuristics. "Needs smart heuristics to solve" and "really hard to do" are highly correlated properties of tasks, so every capabilities gain looks like evidence for model general smartness. Now that we see really hard tasks being done we can differentiate the two properties, so I revised my prior capabilities-based timeline updates down. I worry people are now making the same conflation between "Needs smart heuristics to solve" and "unverifiable."

Problems that are bottlenecked by deploying smart heuristics won't be accelerated as much as people expect. AI research is one of them, as evidenced by labs paying extreme premiums for research taste rather than massively expanding headcount.

I'm wondering if we need to sharpen the "story" of the ESNI distribution to fit your observation that the 50% and 90% horizons are so far apart. From a first read, my mental model was that an ESNI task is "a bunch of easy steps, each verifiable against tests," with a low per-step failure rate, with failures generally being catchable by tests, leading to retries. Under this model, I'd expect subsets of the task to be approximately independent, so I'd expect the 50% time horizon to be roughly log(0.5)/log(0.9) ≈ 6.58 times the 90% time horizon. You are claiming a much larger ratio (at least years to days, possibly decades to hours), so something else is going on here.

The simplest underlying model I can come up with is that in your actual ESNI distribution, any given task has some probability of having a "trap": something that requires ideation or taste (and therefore maybe makes it not really an ESNI task per the original definition), which causes the LLM to fail. And the probability a supposedly-ESNI task contains a trap is very slowly growing (e.g., logarithmic) in the time it would take a human to do the task?

Yeah, my view is that a subset of the ESNI task distribution I'm considering have some part that AIs tend to have a hard time with and this makes the fraction that the AIs succeed on lower.

something that requires ideation or taste (and therefore maybe makes it not really an ESNI task per the original definition)

Ultimately, this is going to be quantitative and some ideation/taste is basically always required. So, part of the variation might be the extent to which the task required these and the AI was bad at that particular taste/ideation.

This isn't relevant to the thrust of your post, but I haven't seen anyone operationalize military capabilities so clearly in their AI timelines. It drives the point home about x-risk a little harder when the entry after automated AI R&D parity is a thing like "8 years till 50-50 odds we have Skynet."

I've seen demonstrations of AIs accomplishing [...] tasks that would take humans months to years. Due to this, I tentatively believe that (as of March 1st) the well-elicited 50% reliability time-horizon on ESNI tasks (using only publicly available models) is somewhere between a month and several years

It's also very true of vulnerability discovery, since different humans can look in different places and then iteratively zoom in on whatever leads are most promising.

This is a good point and I broadly agree with what you're saying. Some possible disagreements:

Once you are talking about scaling to arbitrary inference costs, I think the relevant notion of time horizon is closer to: "For what T could you solve this task using an arbitrary supply of humans each of whom only works on the project for T hours?"

So I'm not sure how much of the phenomenon you're observe is "there is a way longer horizon for these tasks" rather than "a more careful definition of horizon is more important for these tasks." Probably some of both, but It seems quite possible that for the tasks you are describing the horizon length is really more like a few days or a week than months or a year.

Maybe I should have made this clearer, but I'm not talking about scaling to arbitrary inference costs. Instead, I'm talking about scaling to inference costs that are a moderate fraction of the human task completion cost. (E.g., 1%-100% depending on the task.) I think you'd want to compare the AI performance at some inference budget to human labor with some limit in supply.

I agree. In the more realistic regime you are talking about you have some more complicated quantitative question around how large are the slowdowns from task decomposition into what scale.

However, under this alternative definition, relatively small time-horizon values may correspond to much larger (real-world) impact.

The things that could be achieved in 2 years with current compute are achieved faster, but the things that could be achieved in 15 years won't happen meaningfully faster

If you haven't seen Tom Cunningham's new blog post on what I think is a very related thesis, you may be interested

I think the original Anthropic post is reasonably accurate about the limitations of the project and I don't see how this blog post contradicts claims in the original post?

most importantly is over 100,000 times slower than an actual compiler

Yes, this isn't surprising as it's a non-optimizing compiler? (It's also quite a bit slower even at -O0 from my understanding.)

(Assuming this is true) One would wish for LW to be one of the earliest places for Truth to master pants, but this hasn't happened quite yet...

Yeah, my view is that a subset of the ESNI task distribution I'm considering have some part that AIs tend to have a hard time with and this makes the fraction that the AIs succeed on lower.

something that requires ideation or taste (and therefore maybe makes it not really an ESNI task per the original definition)

200

AIs can now often do massive easy-to-verify SWE tasks and I've updated towards shorter timelines

200

Ω 65

What's going on with these easy-and-cheap-to-verify tasks?

Some evidence against shorter timelines I've gotten in the same period

Why does high performance on ESNI tasks shorten my timelines?

How much does extremely high performance on ESNI tasks help with AI R&D?

My experience trying to automate safety research with current models

My experience seeing if my setup can automate massive ES tasks

SWE tasks

AI R&D task

Cyber

Appendix: Somewhat more detailed updated timelines

200

Ω 65

200

Ω 65