Answer by Davidmanheim, Mar 20, 2025

8 hours of clock time for an expert seems likely to be enough to do anything humans can do; people rarely work productively in longer chunks than that. As long as we assume models are capable of task breakdown and planning (which seems like a nontrivial issue, but an easier one than the scaling itself), that should allow them to parallelize and serialize chunks to do larger human-type tasks.

But it's unclear alignment can be solved by humans at all, and even if it can, of course, there is no reason to think these capabilities would scale as well or better for alignment than for capabilities and self-improvement, so this is not at all reassuring to me.


[ Question ]

How far along Metr's law can AI start automating or helping with alignment research?

by Christopher King

20th Mar 2025


In Measuring AI Ability to Complete Long Tasks, METR found a Moore's-law-like trend relating (model release date) to (time needed for a human to do a task the model can do).

Here is their rationale for plotting this.

Current frontier AIs are vastly better than humans at text prediction and knowledge tasks. They outperform experts on most exam-style problems for a fraction of the cost. With some task-specific adaptation, they can also serve as useful tools in many applications. And yet the best AI agents are not currently able to carry out substantive projects by themselves or directly substitute for human labor. They are unable to reliably handle even relatively low-skill, computer-based work like remote executive assistance. It is clear that capabilities are increasing very rapidly in some sense, but it is unclear how this corresponds to real-world impact.

It seems that AI alignment research falls into this. The LLMs clearly have enough "expertise" at this point, but doing any kind of good research takes an expert a lot of time, even when it is purely on paper.

It seems therefore that we could use Metr's law to predict when AI will be capable of alignment research. Or at least when it could substantially help.

My question is: at what time horizon t does "automatically do tasks that humans can do in t" let us do enough research to solve the alignment problem?

(Even if you're not a fan of automating alignment, if we do make it to that point we might as well give it a shot!)

Tags: AI Evaluations, AI Timelines, AI-Assisted Alignment, Intellectual Progress (Society-Level), METR (org), Research Agendas, AI


6 Answers sorted by
top scoring

Jacob Pfau

Mar 21, 2025

To apply METR's law we should distinguish conceptual alignment work from well-defined alignment work (including empirics and theory on existing conjectures). The METR plot doesn't tell us anything quantitative about the former.

As for the latter, let's take interpretability as an example: We can model uncertainty as a distribution over the time-horizon needed for interpretability research e.g. ranging over 40-1000 hours. Then, I get 66% CI of 2027-2030 for open-ended interp research automation--colab here. I've written up more details on this in a post here.
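A rough sketch of this kind of calculation, not a reproduction of the linked colab. It assumes a log-uniform distribution over the 40-1000 hour range, a current horizon of 1 hour, a fixed 7-month doubling time (the figures other commenters use in this thread), and a start date of roughly March 2025. All function names are made up for illustration, and a model with uncertainty over the doubling time would give a wider interval.

```python
import math

START_YEAR = 2025.2  # assumption: roughly March 2025, when the thread was posted


def horizon_quantile(q, low_hours=40.0, high_hours=1000.0):
    """Quantile q of an (assumed) log-uniform distribution over the needed horizon."""
    log_low, log_high = math.log2(low_hours), math.log2(high_hours)
    return 2 ** (log_low + q * (log_high - log_low))


def years_until(target_hours, current_hours=1.0, doubling_months=7.0):
    """Years for the horizon to grow from current_hours to target_hours."""
    doublings = math.log2(target_hours / current_hours)
    return doublings * doubling_months / 12.0


def arrival_interval(coverage=0.66):
    """Central interval (in calendar years) for when the needed horizon is reached."""
    tail = (1.0 - coverage) / 2.0
    earliest = START_YEAR + years_until(horizon_quantile(tail))
    latest = START_YEAR + years_until(horizon_quantile(1.0 - tail))
    return earliest, latest


if __name__ == "__main__":
    earliest, latest = arrival_interval()
    print(f"66% interval: {earliest:.1f} to {latest:.1f}")
```

Because the only uncertainty here is in the required horizon, the interval depends entirely on the assumed 40-1000 hour range and the fixed doubling time.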

Thomas Kwa

Mar 21, 2025

Author here. My best guess is that by around the 1-month point, AIs will be automating large parts of both AI capabilities and empirical alignment research. Inferring anything more depends on many other beliefs.

Currently no one knows how hard the alignment problem is or what exactly good alignment research means: it is the furthest-looking, least well-defined and least tractable of the subfields of AI existential safety. This means we don't know the equivalent task length of the alignment problem. Even more importantly, we only measured the AIs at software tasks and don't know what the trend is for other domains like math or law, it could be wildly different.

With that said, my current guess is that alignment will be sped up by AI slightly less than capabilities will be, and that success looks like building deferrable AI. Whether we succeed depends more on whether the world dedicates more than X% [1] of AI research resources to relevant safety research than on the exact software time horizon of the AIs involved, which is not directly applicable.

[1] X is some unknown number probably between 0% and 65%

Garrett Baker (4d)

Even more importantly, we only measured the AIs at software tasks and don't know what the trend is for other domains like math or law, it could be wildly different.

You probably mention this somewhere, but I'll ask here: are you currently researching whether these results hold for those other domains? I'm personally more interested in math than law.

Thomas Kwa (4d)

It's expensive to construct and baseline novel tasks for this (we spent well over $100k on human baselines), so what we are able to measure in the future depends on whether we can harvest realistic tasks that naturally have human data. You could do a rough analysis on math contest problems, say assigning GSM8K and AIME questions lengths based on a guess of how long expert humans take, but the external validity concerns are worse than for software. For one thing, AIME has much harder topics than GSM8K (we tried to make SWAA not be artificially easier or harder than HCAST); for another, neither is particularly close to the average few minutes of a research mathematician's job.

Rafael Harth

Mar 20, 2025

2-1

Not answerable because METR is a flawed measure, imho.

Alice Blair

Mar 21, 2025

This seems very related to what the Benchmarks and Gaps investigation is trying to answer, and it goes into quite a bit more detail and nuance than I'm able to get into here. I don't think there's a publicly accessible full version yet (but I think there will be at some later point).

It is aimed much more at the question "when will we have AIs that can automate work at AGI companies?", which I realize is not really your pointed question. I don't have a good answer to your specific question because I don't know how hard alignment is or whether humans can realistically solve it on any time horizon without intelligence enhancement.

However, I tentatively expect safety research speedups to look mostly similar to capabilities research speedups, barring AIs being strategically deceptive and harming safety research.

I median-expect time horizons somewhere on the scale of a month (e.g. seeing an involved research project through from start to finish) to lead to very substantial research automation at AGI companies (maybe 90% research automation?), and we could nonetheless see startling macro-scale speedup effects at the scale of 1-day researchers. At 1-year researchers, things are very likely moving quite fast. I think this translates somewhat faithfully to safety orgs doing any kind of work that can be accelerated by AI agents.

Davidmanheim

Mar 20, 2025

1-8

tailcalled

Mar 20, 2025

0-7

Life on Earth started 3.5 billion years ago. log_2(3.5 billion years / 1 hour) ≈ 45 doublings. With one doubling every 7 months, that takes about 26 years, i.e. until 2051.

(Obviously this model underestimates the difficulty of getting superalignment to work. But extrapolating the METR trend for 45 doublings is also dubious, in an unknown direction. So whatever.)
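The arithmetic in this answer can be checked with a few lines. This is only a sketch of the back-of-the-envelope calculation: it assumes calendar hours (24 × 365.25 per year) and a 2025 start year, and the variable names are illustrative.

```python
import math

HOURS_PER_YEAR = 24 * 365.25       # assumption: calendar hours, not working hours
TARGET_YEARS = 3.5e9               # time since life on Earth started
CURRENT_HORIZON_HOURS = 1.0        # current task horizon used in the answer
DOUBLING_MONTHS = 7.0              # one doubling every 7 months
START_YEAR = 2025                  # assumption: year the thread was written

# Number of doublings needed to go from a 1-hour to a 3.5-billion-year horizon.
doublings = math.log2(TARGET_YEARS * HOURS_PER_YEAR / CURRENT_HORIZON_HOURS)

# Calendar time those doublings take at the assumed doubling rate.
years_needed = doublings * DOUBLING_MONTHS / 12.0
arrival_year = START_YEAR + years_needed

print(f"{doublings:.1f} doublings, {years_needed:.1f} years, around {arrival_year:.0f}")
```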

Christopher King (4d)

You're saying that if you assigned 1 human contractor the task of solving superalignment, they would succeed after ~3.5 billion years of work? 🤔 I think you misunderstood what the y-axis on the graph is measuring.

tailcalled (4d)

I mean I don't really believe the premises of the question. But I took "Even if you're not a fan of automating alignment, if we do make it to that point we might as well give it a shot!" to imply that even in such a circumstance, you still want me to come up with some sort of answer.

11 comments, sorted by

top scoring


johnswentworth (4d)

Fun point! If I just pull some numbers out of my ass and naively plug them into the equation:

Superalignment-relevant research tasks take time on the order of a year to a decade.
Taking 1 hr as the current task-time models can handle, we'd need ~13-16 doublings.
... so on METR's model, that's about 8-10 years from now.
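The three steps above can be sketched as a small calculation. This is a minimal sketch under stated assumptions: a 7-month doubling time (the figure quoted elsewhere in this thread), task lengths counted in calendar hours, and hypothetical function names.

```python
import math

HOURS_PER_YEAR = 24 * 365.25  # assumption: calendar hours per year


def doublings_needed(target_hours, current_hours=1.0):
    """How many times the task horizon must double to reach target_hours."""
    return math.log2(target_hours / current_hours)


def years_to_reach(target_hours, current_hours=1.0, doubling_months=7.0):
    """Calendar years until the horizon reaches target_hours, if the trend holds."""
    return doublings_needed(target_hours, current_hours) * doubling_months / 12.0


if __name__ == "__main__":
    for label, task_years in [("1-year task", 1), ("10-year task", 10)]:
        hours = task_years * HOURS_PER_YEAR
        print(
            f"{label}: {doublings_needed(hours):.1f} doublings, "
            f"{years_to_reach(hours):.1f} years from now"
        )
```

Under these assumptions the range comes out at roughly 13 to 16 doublings and roughly 8 to 10 years, matching the estimate above. Counting working hours instead of calendar hours would lower both numbers.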

Of course that's assuming the model holds. A useful heuristic for trends which have just been published for the first time is that they break down immediately after publication. If it holds up for another year or two, then the model will look a lot more plausible.

IC Rainbow (4d)

I don't think they're blocked by an inability to run autonomously. They're blocked by lacking an eye for novelty/interestingness. You can make the slop factory run 24/7 for a year and still not get any closer to solving alignment.

mishka (4d)

Mmm... if we are not talking about full automation, but about being helpful, the ability to do 1-hour software engineering tasks ("train classifier") is already useful.

Moreover, we have seen a recent flood of rather inexpensive fine-tunings of reasoning models for a particular benchmark.

Perhaps, what one can do is to perform a (somewhat more expensive, but still not too difficult) fine-tuning to create a model to help with a particular relatively narrow class of meaningful problems (which would be more general than tuning for particular benchmarks, but still reasonably narrow). So, instead of just using an off-the-shelf assistant, one should be able to upgrade it to a specialized one.

For example, I am sure that it is possible to create a model which would be quite helpful with a lot of mechanistic interpretability research.

So if we are talking about when AIs can start automating or helping with research, the answer is, I think, "now".

Gunnar_Zarncke (4d)

I found this tweet helpful: it runs the same regression on another dataset (chess) and arrives at an absurd conclusion. For me, the upshot is that LLMs may soon be able to handle very big software engineering tasks, but that will likely not generalize to arbitrary tasks. Longer, more general tasks might still follow soon after, but you can't reliably predict this from this single dataset alone.

Rafael Harth (3d)

I don't think I get it. If I read this graph correctly, it seems to say that if you let a human play chess against an engine and want it to achieve equal performance, then the amount of time the human needs to think grows exponentially (as the engine gets stronger). This doesn't make sense if extrapolated downward, but upward it's about what I would expect. You can compensate for skill by applying more brute force, but it becomes exponentially costly, which fits the exponential graph.

It's probably not perfect -- I'd worry a lot about strategic mistakes in the opening -- but it seems pretty good. So I don't get how this is an argument against the metric.

Gunnar_Zarncke (3d)

It is a decent metric for chess but a) it doesn't generalize to other tasks (as people seem to interpret the METR paper), and less importantly, b) I'm quite confident that people wouldn't beat the chess engines by thinking for years.

Christopher King (4d)

What is the absurd conclusion?

Gunnar_Zarncke (4d)

That we would have AIs performing year-long tasks in 2005. Chess is not the same as software engineering but it is still a limited domain.

Christopher King (4d)

I mean, beating a chess engine in 2005 might be a "years-long task" for a human? The time METR is measuring is how long it would hypothetically take a human to do the task, not how long it takes the AI.

Gunnar_Zarncke (3d)

Yes, but it didn't mean that AIs could do all kinds of long tasks in 2005. And that is the conclusion many people seem to draw from the METR paper.

Gunnar_Zarncke (4d)

No? It means you can't beat the chess engine.

And even if it did: they try to argue in the other direction. If it takes the human time X at time T, it will take the AI duration L. That didn't work for chess either.
