Software engineer and repeat startup founder; best known for Writely (aka Google Docs). Now starting https://www.aisoup.org to foster constructive expert conversations about open questions in AI and AI policy, and posting at https://amistrongeryet.substack.com and https://x.com/snewmanpv.
quantity of useful environments that AI companies have
Meaning, the number of distinct types of environments they've built (e.g. one to train on coding tasks, one on math tasks, etc.)? Or the number of instances of those environments they can run (e.g. how much coding data they can generate)?
GPT-4.5 is going to be quickly deprecated
It's still a data point saying that OpenAI chose to do a large training run, though, right? Even if they're currently not planning to make sustained use of the resulting model in deployment. (Also, my shaky understanding is that expectations are for a GPT-5 to be released in the coming months and that it may be a distilled + post-trained derivative of GPT-4.5, meaning GPT-5 would be downstream of a large-compute-budget training process?)
Oops, I forgot to account for the gap from a 50% success rate to an 80% success rate (and actually I'd argue that the target success rate should be higher than 80%).
There are also potential factors for "task messiness" and the 5-18x context penalty, though as you've pointed out elsewhere, the latter should arguably be discounted.
Agreed that we should expect the performance difference between high- and low-context human engineers to diminish as task sizes increase. Also agreed that the right way to account for that might be to simply discount the 5-18x multiplier when projecting forwards, but I'm not entirely sure.

I did think about this before writing the post, and I kept coming back to the view that when we measure Claude 3.7 as having a 50% success rate at 50-minute tasks, or o3 at 1.5-hour tasks, we should substantially discount those timings.

On reflection, I suppose the counterargument is that this makes the measured doubling times look more impressive, because (plausibly) if we look at a pair of tasks that take low-context people 10 and 20 minutes respectively, the time ratio for realistically high-context people might be more than 2x. But I could imagine this playing out in other ways as well (e.g. maybe we aren't yet looking at task sizes where people have time to absorb a significant amount of context, and so as the models climb from 1 to 4 to 16 to 64 minute tasks, the humans they're being compared against aren't yet benefiting from context-learning effects).
One always wishes for more data – in this case, more measurements of human task completion times with high and low context, on more problem types and a wider range of time horizons...
Surely if AIs were completing 1-month-long, self-contained software engineering tasks (e.g. what a smart intern might do in the first month), that would be a big update toward the plausibility of AGI within a few years!
Agreed. But that means time from today to AGI is the sum of (1) the time for AI task horizons to grow from today's level to one-month tasks, and (2) the time from one-month-coder to AGI.
If we take the midpoint of Thomas Kwa's "3-4 months" guess for subsequent doubling time, we get 23.8 months for (1). If we take "a few years" to be 2 years, we're in 2029, which is farther out than "the most aggressive forecasts" (e.g. various statements by Dario Amodei, or the left side of the probability distribution in AI 2027).
And given the starting assumptions, those are fairly aggressive numbers. Thomas' guess that "capability on more realistic tasks will follow the long-term 7-month doubling time" would push this out another two years, and one could propose longer timelines from one-month-coder to AGI.
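For what it's worth, here's the arithmetic behind those figures as a quick sketch. The inputs are my own assumptions for illustration: a current 50% time horizon of roughly 1.5 hours (o3, per the above) and roughly 167 work hours in a month.

```python
import math

# Rough reproduction of the figures above. Assumptions (mine, for illustration):
# current 50% time horizon ~1.5 hours (o3), a "one-month task" ~167 work hours
# (40 hours/week * ~4.2 weeks), and a constant doubling time throughout.
current_horizon_hours = 1.5
target_horizon_hours = 167.0

doublings = math.log2(target_horizon_hours / current_horizon_hours)  # ~6.8

for doubling_time_months in (3.5, 7.0):
    months = doublings * doubling_time_months
    print(f"{doubling_time_months}-month doubling time: "
          f"~{months:.1f} months until one-month tasks")

# Output: ~23.8 months at a 3.5-month doubling time, ~47.6 months at a 7-month
# doubling time, i.e. roughly two additional years, matching the figures above.
```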
Of course this is not proof of anything – for instance, task horizon doubling times could continue to accelerate, as envisioned in AI 2027 (IIRC), and one could also propose shorter timelines from one-month-coder to AGI. But I think the original statement is fair: even if we use 3-4 months as the doubling time, isn't this an update away from "the most aggressive forecasts"?
(When I wrote this, I was primarily thinking about Dario projecting imminent geniuses-in-a-data-center, and similar claims that AGI is coming within the next couple of years or even is already here.)
To be clear, I agree that the bad interpretations were not coming from METR.
Thanks, I've edited the post to note this.
Sure – I was presenting these as "human-only, software-only" estimates:
Here are the median estimates of the "human-only, software-only" time needed to reach each milestone:
So it doesn't seem like there's a problem here?
I added up the median "Predictions for gap size" in the "How fast can the task difficulty gaps be crossed?" table, summing each set of predictions separately ("Eli", "Nikola", "FutureSearch") to get three numbers ranging from 30 to 75.
Does this table cover the time between now and superhuman coder? I thought it started at RE-Bench, because:
But it sounds like I was misinterpreting?
Thanks. This is helpful, but my intuition is substantially coming from the idea that there might be other factors involved (activities / processes involved in improving models that aren't "thinking about algorithms", "writing code", or "analyzing data"). In other words, I have a fair amount of model uncertainty, especially when thinking about very large speedups.