Comment Permalink

Daniel Kokotajlo2dΩ31907

This is probably the most important single piece of evidence about AGI timelines right now. Well done! I think the trend should be superexponential, e.g. each doubling takes 10% less calendar time on average. Eli Lifland and I did some calculations yesterday suggesting that this would get to AGI in 2028. Will do more serious investigation soon.

Why do I expect the trend to be superexponential? Well, it seems like it sorta has to go superexponential eventually. Imagine: We've got to AIs that can with ~100% reliability do tasks that take professional humans 10 years. But somehow they can't do tasks that take professional humans 160 years? And it's going to take 4 more doublings to get there? And these 4 doublings are going to take 2 more years to occur? No, at some point you "jump all the way" to AGI, i.e. AI systems that can do any length of task as well as professional humans -- 10 years, 100 years, 1000 years, etc.

Also, zooming in mechanistically on what's going on, insofar as an AI system can do tasks below length X but not above length X, it's gotta be for some reason -- some skill that the AI lacks, which isn't important for tasks below length X but which tends to be crucial for tasks above length X. But there are only a finite number of skills that humans have that AIs lack, and if we were to plot them on a horizon-length graph (where the x-axis is log of horizon length, and each skill is plotted on the x-axis where it starts being important, such that it's not important to have for tasks less than that length) the distribution of skills by horizon length would presumably taper off, with tons of skills necessary for pretty short tasks, a decent amount necessary for medium tasks (but not short), and a long thin tail of skills that are necessary for long tasks (but not medium), a tail that eventually goes to 0, probably around a few years on the x-axis. So assuming AIs learn skills at a constant rate, we should see acceleration rather than a constant exponential. There just aren't that many skills you need to operate for 10 days that you don't also need to operate for 1 day, compared to how many skills you need to operate for 1 hour that you don't also need to operate for 6 minutes.

There are two other factors worth mentioning which aren't part of the above: One, the projected slowdown in capability advances that'll come as compute and data scaling falters due to becoming too expensive. And two, pointing in the other direction, the projected speedup in capability advances that'll come as AI systems start substantially accelerating AI R&D.

Showing 3 of 23 replies (Click to show all)

Daniel Kokotajlo4hΩ444

I don't believe it. I don't believe that overall algorithmic progress is 3x faster. Maaaybe coding is 3x faster but that would maybe increase overall algo progress by like 30% idk. But also I don't think coding is really 3x faster on average for the things that matter.

1anaguma11h

Isn’t the quadratic cost of context length a constraint here? Naively you’d expect that acting coherently over 100 years would require 10x the context, and therefore 100x the compute/memory, than 10 years.

5Thomas Kwa9h

Humans don't need 10x more memory per step nor 100x more compute to do a 10-year project than a 1-year project, so this is proof it isn't a hard constraint. It might need an architecture change but if the Gods of Straight Lines control the trend, AI companies will invent it as part of normal algorithmic progress and we will remain on an exponential / superexponential trend.

See in context

179 METR: Measuring AI Ability to Complete Long Tasks

by Zach Stein-Perlman

19th Mar 2025

AI Alignment Forum

1 min read

179 Ω 61

This is a linkpost for https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

Summary: We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under five years, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks.
The length of tasks (measured by how long they take human professionals) that generalist frontier model agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years. The shaded region represents 95% CI calculated by hierarchical bootstrap over task families, tasks, and task attempts.
Full paper | Github repo

Blogpost; tweet thread.

Frontpage

179 Ω 61

Mentioned in

19How far along Metr's law can AI start automating or helping with alignment research?

New Comment

53 comments, sorted by

top scoring

Click to highlight new comments since: Today at 3:33 AM

[-]Daniel Kokotajlo2dΩ31907

[-]Matt Goldenberg2dΩ5186

I'm not at all convinced it has to be something discrete like "skills" or "achieved general intelligence".

There are many continuous factors that I can imagine that help planning long tasks.

[-]J Bostock2d80

I second this, it could easily be things which we might describe as "amount of information that can be processed at once, including abstractions" which is some combination of residual stream width and context length.

Imagine an AI can do a task that takes 1 hour. To remain coherent over 2 hours, it could either use twice as much working memory, or compress it into a higher level of abstraction. Humans seem to struggle with abstraction in a fairly continuous way (some people get stuck at algebra; some cs students make it all the way to recursion then hit a wall; some physics students can handle first quantization but not second quantization) which sorta implies there's a maximum abstraction stack height which a mind can handle, which varies continuously.

[-]Daniel Kokotajlo2dΩ230

I'm not sure if I understand what you are saying. It sounds like you are accusing me of thinking that skills are binary--either you have them or you don't. I agree, in reality many skills are scalar instead of binary; you can have them to greater or lesser degrees. I don't think that changes the analysis much though.

[-]Matt Goldenberg2d40

length X but not above length X, it's gotta be for some reason -- some skill that the AI lacks, which isn't important for tasks below length X but which tends to be crucial for tasks above length X.

My point is, maybe there are just many skills that are at 50% of human, then go up to 60%, then 70%, etc, and can keep going up linearly to 200% or 300%. It's not like it lacked the skill then suddenly stopped lacking it, it just got better and better at it

[-]Daniel Kokotajlo1d2-1

I agree with that, in fact I think that's the default case. I don't think it changes the bottom line, just makes the argument more complicated.

[-]Matt Goldenberg1d21

I don't see how the original argument goes through if it's by default continuous.

[-]Petropolitan2d150

One of non-obvious but very important skills which all LLM-based SWE agents currently lack is reliably knowing which subtasks of a task you have successfully solved and which you have not. I think https://www.answer.ai/posts/2025-01-08-devin.html is a good case in point.

We have absolutely seen a lot of progress on driving down hallucinations on longer and longer contexts with model scaling, they probably made the charts above possible in the first place. However, recent research (e. g., the NoLiMa benchmark from last month https://arxiv.org/html/2502.05167v1) demonstrates that effective context length falls far short of what is advertised. I assume it's not just my personal experience but common knowledge among the practitioners that hallucinations become worse the more text you feed to an LLM.

If I'm not mistaken even with all the optimizations and "efficient" transformer attempts we are still stuck (since GPT-2 at least) with self-attention + KV-cache (originally known as "past cache" after the tensor name apparently coined by Thomas Wolf for the transformers library in February 2019, see commit ffd6238; its invention has not been described in the literature AFAIK) which scale (at inference) linearly as long as you haven't run out of memory and quadratically afterwards. Sure, MLA have just massively ramped up the context length at which the latter happens but it's not unlimited, you won't be able to cache, say, one day of work (especially since DRAM has not been scaling exponentially for years https://semianalysis.substack.com/p/the-memory-wall).

People certainly will come up with ways to optimize long-context performance further, but it doesn't have to continue scaling in the same way it has since 2019.

[-]jsteinhardt2dΩ611-5

Doesn't the trend line already take into account the effect you are positing? ML research engineers already say they get significant and increasing productivity boosts from AI assistants and have been for some time. I think the argument you are making is double-counting this. (Unless you want to argue that the kink with Claude is the start of the super-exponential, which we would presumably get data on pretty soon).

[-]Daniel Kokotajlo1dΩ7101

I indeed think that AI assistance has been accelerating AI progress. However, so far the effect has been very small, like single-digit percentage points. So it won't be distinguishable in the data from zero. But in the future if trends continue the effect will be large, possibly enough to more than counteract the effect of scaling slowing down, possibly not, we shall see.

[-]jsteinhardt1dΩ36-7

Research engineers I talk to already report >3x speedups from AI assistants. It seems like that has to be enough that it would be showing up in the numbers. My null hypothesis would be that programmer productivity is increasing exponentially and has been for ~2 years, and this is already being taken into account in the curves, and without this effect you would see a slower (though imo not massively slower) exponential.

(This would argue for dropping the pre-2022 models from the graph which I think would give slightly faster doubling times, on the order of 5-6 months if I had to eyeball.).

[-]habryka1dΩ101417

Research engineers I talk to already report >3x speedups from AI assistants

Huh, I would be extremely surprised by this number. I program most days, in domains where AI assistance is particularly useful (frontend programming with relatively high churn), and I am definitely not anywhere near 3x total speedup. Maybe a 1.5x, maybe a 2x on good weeks, but definitely not a 3x. A >3x in any domain would be surprising, and my guess is generalization for research engineer code (as opposed to churn-heavy frontend development) is less.

[-]Ben Pace1dΩ220

I think my front-end productivity might be up 3x? A shoggoth helped me building a stripe shop and do a ton of UI design that I would’ve been hesitant to take on myself (without hiring someone else to work with), as well as quality increase in speed of churning through front-end designs.

(This is going from “wouldn’t take on the project due to low skill” to “can take it on and deliver it in a reasonable amount of time”, which is different from “takes top programmer and speeds them up 3x”.)

[-]elifland11hΩ61012

I agree with habryka that the current speedup is probably substantially less than 3x.

However, worth keeping in mind that if it were 3x for engineering the overall AI progress speedup would be substantially lower, due to (a) non-engineering activities having a lower speedup, (b) compute bottlenecks, (c) half of the default pace of progress coming from compute.

My null hypothesis would be that programmer productivity is increasing exponentially and has been for ~2 years, and this is already being taken into account in the curves, and without this effect you would see a slower (though imo not massively slower) exponential

Exponential growth alone doesn't imply a significant effect here, if the current absolute speedup is low.

[-]Daniel Kokotajlo4hΩ444

[-]osten2d51

Ok, but why do you think that AIs learn skills at a constant rate? Might it be that higher level skills need more time to learn because compute scales exponentially with time but for higher level skills data is exponentially more scarce and needs linearly in task length more context, that is, total data processed scales superexponentially with task level?

[-]Mo Putera2d42

Ben West's remark in the METR blog post seems to suggest you're right that the doubling period is shortening:

... there are reasons to think that recent trends in AI are more predictive of future performance than pre-2024 trends. As shown above, when we fit a similar trend to just the 2024 and 2025 data, this shortens the estimate of when AI can complete month-long tasks with 50% reliability by about 2.5 years.

[-]Siebe2d2-14

One way to operationalize "160 years of human time" is "thing that can be achieved by a 160-person organisation in 1 year", which seems like it would make sense?

[-]ErioirE2d51

Unfortunately, when dealing with tasks such as software development it is nowhere near as linear as that.

The meta-tasks of each additional dev needing to be brought up to speed on the intricacies of the project, as well as lost efficiency from poor communication/waiting on others to finish things means you usually get diminishing (or even inverse) returns from adding more people to the project.
See: The Mythical Man Month

[-]Mo Putera2d10

Not if some critical paths are irreducibly serial.

[-]Rachel Shu2d10

Possibly, but then you have to consider you can spin up possibly arbitrarily many instances of the LLM as well, in which case you might expect the trend to go even faster, as now you’re scaling on 2 axes, and we know parallel compute scales exceptionally well.

Parallel years don’t trade off exactly with years in series, but “20 people given 8 years” might do much more than 160 given one, or 1 given 160, depending on the task.

[-]anaguma11h10

No, at some point you "jump all the way" to AGI, i.e. AI systems that can do any length of task as well as professional humans -- 10 years, 100 years, 1000 years, etc.

[-]Thomas Kwa9h52

[-]satwik2d10

Any slowdown seems implausible given Anthropic timelines, which I consider to be a good reason to be skeptical of data and compute cost-related slowdowns at least until nobel-prize level. Moreover, the argument that we will very quickly get 15 OOMs or whatever of effective compute after the models can improve themselves is also very plausible

[-]GeneSmith2d3813

In the last year it has really hit me at a personal level what graphs like these mean. I'm imagining driving down to Mountain View and a town once filled with people who had "made it" and seeing a ghost town. No more jobs, no more prestige, no more promise of a stable life. As the returns to capital grow exponentially and the returns to labor decline to zero, the gap between the haves and the have-nots will only grow.

If someone can actually get superintelligence to do what they want, then perhaps universal basic income can at the very least prevent actual starvation and maybe even provide a life of abundance.

But I can't help but feeling such a situation is fundamentally unstable. If the government's desires become disconnected from those of the people at any point, by what mechanism can balance be restored?

In the past the government was fundamentally reliant on its citizens for one simple reason; citizens produced taxable revenue.

That will no longer be the case. Every country will become a petro state on steroids.

[-]No77e2d21

I'm imagining driving down to Mountain View and a town once filled with people who had "made it" and seeing a ghost town

I'm guessing that people who "made it" have a bunch of capital that they can use to purchase AI labor under the scenario you outline (i.e., someone gets superintelligence to do what they want).

But I can't help but feeling such a situation is fundamentally unstable. If the government's desires become disconnected from those of the people at any point, by what mechanism can balance be restored?

I'm not sure I'm getting the worry here. Is it that the government (or whoever directs superintelligences) is going to kill the rest because of the same reasons we worry about misaligned superintelligences or that they're going to enrich themselves while the rest starves (but otherwise not consuming all useful resources)? If that's this second scenario you're worrying about, that seems unlikely to me because even as a few parties hit the jackpot, the rest can still deploy the remaining capital they have. Even if they didn't have any capital to purchase AI labor, they would still organize amongst themselves to produce useful things that they need, and they would form a different market until they also get to superintelligence, and in that world, it should happen pretty quickly.

[-]Daphne_W1d10

I'm guessing that people who "made it" have a bunch of capital that they can use to purchase AI labor under the scenario you outline (i.e., someone gets superintelligence to do what they want).

If the superintelligence is willing to deprive people of goods and services because they lack capital, then why would it be empathetic towards those that have capital? The superintelligence would be a monopsony and monopoly, and could charge any amount for someone existing for an arbitrarily short amount of time. Assuming it even respects property law when it is aligned with its creators.

Is it that the government (or whoever directs superintelligences) is going to kill the rest because of the same reasons we worry about misaligned superintelligences

"Kill" is such a dirty word. Just not grant them the means to sustain themselves.

or that they're going to enrich themselves while the rest starves (but otherwise not consuming all useful resources)? If that's this second scenario you're worrying about, that seems unlikely to me because even as a few parties hit the jackpot, the rest can still deploy the remaining capital they have. Even if they didn't have any capital to purchase AI labor, they would still organize amongst themselves to produce useful things that they need, and they would form a different market until they also get to superintelligence, and in that world, it should happen pretty quickly.

Why would capital owners with a superintelligence ever let those without capital build their own superintelligence? That sounds like a recipe for AI war - are the poors really going to program their superintelligence with anything other than the fundamental rejection of the concept of capital ownership in a post-scarcity society?

[-]ErioirE2d0-3

Government is also reliant on its citizens to not violently protest, which would happen if it got to the point you describe.

The idealist in me hopes that eventually those with massive gains in productivity/wealth from automating everything would want to start doing things for the good of humanity™, right?
...Hopefully that point is long before large scale starvation.

[-]otto.barten2d10

Have we eventually solved world hunger by giving 1% of GDP to the global poor?

Also, note it's not obvious that ASI can be aligned.

[-]Nikola Jurkovic2dΩ9236

This has been one of the most important results for my personal timelines to date. It was a big part of the reason why I recently updated from ~3 year median to ~4 year median to AI that can automate >95% of remote jobs from 2022, and why my distribution overall has become more narrow (less probability on really long timelines).

[-]No77e2d30

Naively extrapolating this trend gets you to 50% reliability of 256-hour tasks in 4 years, which is a lot but not years-long reliability (like humans). So, I must be missing something. Is it that you expect most remote jobs not to require more autonomy than that?

[-]Zach Stein-Perlman2d76

I think doing 1-week or 1-month tasks reliably would suffice to mostly automate lots of work.

[-]Nikola Jurkovic2d53

I expect the trend to speed up before 2029 for a few reasons:

AI accelerating AI progress once we reach 10s of hours of time horizon.
The trend might be "inherently" superexponential. It might be that unlocking some planning capability generalizes very well from 1-week to 1-year tasks and we just go through those doublings very quickly.

[-]Daniel Kokotajlo2d42

Indeed I would argue that the trend pretty much has to be inherently superexponential. My argument is still kinda fuzzy, I'd appreciate help in making it more clear. At some point I'll find time to try to improve it.

[-]Thomas Kwa1d42

The trend probably sped up in 2024. If the future trend follows the 2024--2025 trend, we get 50% reliability at 167 hours in 2027.

[-]MichaelDickens2d10

Why do you think this narrows the distribution?

I can see an argument for why, tell me if this is what you're thinking–

The biggest reason why LLM paradigm might never reach AI takeoff is that LLMs can only complete short-term tasks, and can't maintain coherence over longer time scales (e.g. if an LLM writes something long, it will often start contradicting itself). And intuitively it seems that scaling up LLMs hasn't fixed this problem. However, this paper shows that LLMs have been getting better at longer-term tasks, so LLMs probably will scale to AGI.

[-]Rafael Harth2dΩ4105

I really don't think this is a reasonable measure for ability to do long term tasks, but I don't have the time or energy to fight this battle, so I'll just register my prediction that this paper is not going to age well.

[-]Julian Bradshaw2d90

Here's an interesting thread of tweets from one of the paper's authors, Elizabeth Barnes.
Quoting the key sections:

Extrapolating this suggests that within about 5 years we will have generalist AI systems that can autonomously complete ~any software or research engineering task that a human professional could do in a few days, as well as a non-trivial fraction of multi-year projects, with no human assistance or task-specific adaptations required.

However, (...) It’s unclear how to interpret “time needed for humans”, given that this varies wildly between different people, and is highly sensitive to expertise, existing context and experience with similar tasks. For short tasks especially, it makes a big difference whether “time to get set up and familiarized with the problem” is counted as part of the task or not.

(...)

We’ve tried to operationalize the reference human as: a new hire, contractor or consultant; who has no prior knowledge or experience of this particular task/codebase/research question; but has all the relevant background knowledge, and is familiar with any core frameworks / tools / techniques needed.

This hopefully is predictive of agent performance (given that models have likely memorized most of the relevant background information, but won’t have training data on most individual tasks or projects), whilst maintaining an interpretable meaning (it’s hopefully intuitive what a new hire or contractor can do in 10 mins vs 4hrs vs 1 week).

(...)

Some reasons we might be *underestimating* model capabilities include a subtlety around how we calculate human time. In calculating human baseline time, we only use successful baselines. However, a substantial fraction of baseline attempts result in failure. If we use human success rates to estimate the time horizon of our average baseliner, using the same methodology as for models, this comes out to around 1hr - suggesting that current models will soon surpass human performance. (However, we think that baseliner failure rates are artificially high due to our incentive scheme, so this human horizon number is probably significantly too low)

Other reasons include: For tasks that both can complete, models are almost always much cheaper, and much faster in wall-clock time, than humans. This also means that there's a lot of headroom to spend more compute at test time if we have ways to productively use it - e.g. BoK

That bit at the end about "time horizon of our average baseliner" is a little confusing to me, but I understand it to mean "if we used the 50% reliability metric on the humans we had do these tasks, our model would say humans can't reliably perform tasks that take longer than an hour". Which is a pretty interesting point.

[-]Thomas Kwa1d42

That bit at the end about "time horizon of our average baseliner" is a little confusing to me, but I understand it to mean "if we used the 50% reliability metric on the humans we had do these tasks, our model would say humans can't reliably perform tasks that take longer than an hour". Which is a pretty interesting point.

That's basically correct. To give a little more context for why we don't really believe this number, during data collection we were not really trying to measure the human success rate, just get successful human runs and measure their time. It was very common for baseliners to realize that finishing the task would take too long, give up, and try to collect speed bonuses on other tasks. This is somewhat concerning for biasing the human time-to-complete estimates, but much more concerning for this human time horizon measurement. So we don't claim the human time horizon as a result.

[-]Jonas Hallgren2d82

Looking at the METR paper's analysis, there might be an important consideration about how they're extrapolating capabilities to longer time horizons. The data shows a steep exponential decay in model success rates as task duration increases. I might be wrong here but it seems weird to be taking an arbitrary cutoff of 50% and doing a linear extrapolation from that?

The logistic curves used to estimate time horizons assume a consistent relationship between task duration and difficulty across all time scales. However, it's plausible that tasks requiring hours or days involve fundamentally different cognitive processes than shorter tasks. From both probabilistic machine learning and neuroscience perspectives, there's reason to expect that autoregressive models (like current LLMs) would struggle disproportionately with extended time horizons compared to systems with more robust memory and online learning capabilities. This is similar to the bear case from Thane Ruthenis and I still feel this isn't addressed?

More speculative:

The model is in short that: humans are iterative learners, and being that helps them form self-other boundaries, this allows them to plan with themselves in mind because they know what the consistent parts of the world is and can thus account for them in the future. For long term planning, this drastically reduces the computational costs of knowing what to do, autoregression doesn't do this directly but rather indirectly. Without heuristic learning in your world model, computational costs goes up by quite a lot. If you're not trained on heuristic learning, I don't see how it will naturally arise in the deeper parts of the models. Cognition development is stochastic.

I think this is an algorithmic speedbump that will take 3-4 years extra to go around, especially since people are still bullish on the LLM scaling approach. I don't know what weird stuff will arise when people start figuring out online learning with RL but that's another question.

[-]Thomas Kwa1d102

All models since at least GPT-3 have had this steep exponential decay [1], and the whole logistic curve has kept shifting to the right. The 80% success rate horizon has basically the same 7-month doubling time as the 50% horizon so it's not just an artifact of picking 50% as a threshold.

Claude 3.7 isn't doing better on >2 hour tasks than o1, so it might be that the curve is compressing, but this might also just be noise or imperfect elicitation.

Regarding the idea that autoregressive models would plateau at hours or days, it's plausible, and one point of evidence is that models are not really coherent over hundreds of steps (generations + uses of the Python tool) yet-- they do 1-2 hour tasks with ~10 actions, see section 5 of HCAST paper. On the other hand, current LLMs can learn a lot in-context and it's not clear there are limits to this. In our qualitative analysis we found evidence of increasing coherence, where o1 fails tasks due to repeating failed actions 6x less than GPT-4 1106.

Maybe this could be tested by extracting ~1 hour tasks out of the hours to days long projects that we think are heavy in self-modeling, like planning. But we will see whether there's a plateau at the hours range in the next year or two anyway.

[1] we don't have easy enough tasks that GPT-2 can do them with >50% success, so can't check the shape

[-]gwern11h168

One possible interpretation here is going back to the inner-monologue interpretations as being multi-step processes with an error rate per step where only complete success is useful, which is just an exponential; as the number of steps increase from 1 to n, you get a sigmoid from ceiling performance to floor performance at chance. So you can tell the same story about these more extended tasks, which after all, are just the same sort of thing - just more so. We also see this sort of sigmoid in searching with a fixed model, in settings like AlphaZero in Hex, which makes sense if we assume that these LLMs are doing a lot of retries and backtracking, which constitute a 'search' process as a whole, even if they never explicitly represent or model a decision/game tree, and have error rates stemming from their blindspots and biases. And you can tell a similar story there about error rates and exponentials: all the critical steps have to be right (omitting ones which don't do anything, ones which get undone or reset, etc), and the final result is either right or wrong as you do the task or not.

(And on a more detailed mechanistic level, you can tell a story where NNs learn 'atoms' of skills over scaling, power-law distributed in random naturalistic data, which are recombined to solve each 'new' inner-monologue problem, and if you have 'memorized' enough atoms, you can solve every task which is just a reconfiguration of known atoms, and that is just what 'learning' and 'generalization' are.)

But of course, the interesting thing here is that the human baselines do not seem to hit this sigmoid wall. It's not the case that if a human can't do a task in 4 hours there's basically zero chance of them doing it in 48 hours and definitely zero chance of them doing it in 96 hours etc. Instead, human success rates seem to gradually flatline or increase over time, especially if we look at individual steps: the more time that passes, the higher the success rates become, and often the human will wind up solving the task eventually, no matter how unprepossessing the early steps seemed. In fact, we will often observe that a step that a human failed on earlier in the episode, implying some low % rate, will be repeated many times and quickly approach 100% success rates! And this is true despite earlier successes often being millions of vision+text+audio+sensorimotor tokens in the past (and interrupted by other episodes or tasks themselves equivalent to millions of tokens), raising questions about whether self-attention over a context window can possibly explain it. Some people will go so far as to anthropomorphize human agents and call this 'learning', and so I will refer to these temporal correlations as learning too.

Why the difference between machine and human learning? Well, you might ask, given this sigmoid wall, how did we get so much higher performance from GPT-2 to Claude-3.7? How did o1-style models go from flailing about to far higher performance on coding/reasoning tasks even at the same size model? And how did we go from below amateur Go AI (AlphaZero at the start of training) to strongly superhuman Go AI (AlphaZero at the end of training), with the same size model? The shocking but true answer is... we trained better neural networks. (And larger too, of course, but that was not strictly necessary.) We didn't prompt them or do brute-force best-of-n samples search or even MCTS search a (randomly initialized) model or use a really really large context window on GPT-2. But we trained them, so they could learn new and better stuff. (Another way one could make the point: if self-attention really is a perfect substitute for gradient descent on the weights, and there is no crossover point, why do we not just 'train' models using purely linear self-attention on trillions of tokens, and use that instead? Why does anyone still bother with, say, finetuning instead of putting that dataset into the context and caching it?)

Incidentally, what do GPT-2, GPT-4, and Claude-3.7 all share in common, that is not just untrue, but nearly impossible for a human doing a task? They have frozen weights which do no learning at runtime.

So I would suggest that the sigmoid we see here is mostly what we would expect from using a frozen non-learning model to do search over a difficult game/task, and that if the LLMs were able to properly learn using finetuning (or an online equivalent like dynamic evaluation), you would see different and more human-like temporal scaling: where the success rate declines more gradually and plateaus at a higher asymptote, as within-episode, it observes poorly-modeled environment dynamics and improves its predictions of those, observes its errors and avoids repeating them in favor of new things, knows what it has and hasn't done without having to reason over the entire history (filled with false starts and errors), and can explicitly reason about things and incorporate the results of the reasoning directly into the weights computing everything else.