snewman

Software engineer and repeat startup founder; best known for Writely (aka Google Docs). Now starting https://www.aisoup.org to foster constructive expert conversations about open questions in AI and AI policy, and posting at https://amistrongeryet.substack.com and https://x.com/snewmanpv.

Comments
snewman

Thanks. This is helpful, but my intuition comes substantially from the idea that there might be other factors at play (activities or processes involved in improving models that aren't "thinking about algorithms", "writing code", or "analyzing data"). In other words, I have a fair amount of model uncertainty, especially when thinking about very large speedups.

snewman

> quantity of useful environments that AI companies have

Meaning, the number of distinct types of environments they've built (e.g. one to train on coding tasks, one on math tasks, etc.)? Or the number of instances of those environments they can run (e.g. how much coding data they can generate)?

snewman

> GPT-4.5 is going to be quickly deprecated

It's still a data point saying that OpenAI chose to do a large training run, though, right? Even if they're currently not planning to make sustained use of the resulting model in deployment. (Also, my shaky understanding is that GPT-5 is expected to be released in the coming months, and that it may be a distilled + post-trained derivative of GPT-4.5, meaning GPT-5 would be downstream of a large-compute-budget training process?)

snewman

Oops, I forgot to account for the gap from a 50% success rate to an 80% success rate (and actually I'd argue that the target success rate should be higher than 80%).

There are also potential adjustment factors for "task messiness" and the 5-18x context penalty, though as you've pointed out elsewhere, the latter should arguably be discounted.
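For a rough sense of scale (my arithmetic, not from the thread): under a simple horizon-extrapolation model, a k-fold penalty on measured horizons pushes the forecast out by log2(k) doubling periods. A minimal sketch, taking a 3.5-month doubling time purely as an illustrative assumption:

```python
from math import log2

# Sketch: a k-fold penalty on measured task horizons delays a
# horizon-extrapolation forecast by log2(k) doubling periods.
# The 3.5-month doubling time is an illustrative assumption.
doubling_time_months = 3.5
for k in (5, 18):  # the 5-18x context-penalty range
    delay_months = log2(k) * doubling_time_months
    print(f"{k}x penalty -> ~{delay_months:.1f} extra months")
```

Taken at face value, the 5-18x penalty would correspond to roughly 8-15 extra months on the forecast.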

snewman

Agreed that we should expect the performance difference between high- and low-context human engineers to diminish as task sizes increase. Also agreed that the right way to account for that might be to simply discount the 5-18x multiplier when projecting forwards, but I'm not entirely sure. I did think about this before writing the post, and I kept coming back to the view that when we measure Claude 3.7 as having a 50% success rate at 50-minute tasks, or o3 at 1.5-hour tasks, we should substantially discount those timings.

On reflection, I suppose the counterargument is that this makes the measured doubling times look more impressive, because (plausibly) if we look at a pair of tasks that take low-context people 10 and 20 minutes respectively, the time ratio for realistically high-context people might be more than 2x. But I could imagine this playing out in other ways as well (e.g. maybe we aren't yet looking at task sizes where people have time to absorb a significant amount of context, and so as the models climb from 1- to 4- to 16- to 64-minute tasks, the humans they're being compared against aren't yet benefiting from context-learning effects).
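To make that counterargument concrete, here's a toy example; the specific penalty values (18x at 10 minutes, 15x at 20 minutes) are invented for illustration, and only the 5-18x range comes from this discussion:

```python
# Toy example: if the low-context / high-context time ratio shrinks with
# task length, then a 2x step in low-context time is a >2x step in
# high-context time, making measured doubling times look more impressive.
penalties = {10: 18.0, 20: 15.0}  # low-context minutes -> assumed penalty
high_context_minutes = {t: t / k for t, k in penalties.items()}
ratio = high_context_minutes[20] / high_context_minutes[10]
print(f"high-context time ratio: {ratio:.2f}x")  # ~2.40x, i.e. more than 2x
```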

One always wishes for more data – in this case, more measurements of human task completion times with high and low context, on more problem types and a wider range of time horizons...

snewman

> Surely if AIs were completing 1-month-long self-contained software engineering tasks (e.g. what a smart intern might do in the first month), that would be a big update toward the plausibility of AGI within a few years!

Agreed. But that means the time from today to AGI is the sum of:

  1. Time for task horizons to increase from 1.5 hours (the preliminary o3 result) to 1 month.
  2. Plausibly "a few years" to progress from 1-month-coder to AGI.

If we take the midpoint of Thomas Kwa's "3-4 months" guess for the subsequent doubling time (i.e. 3.5 months), we get 23.8 months for (1): roughly 6.8 doublings from 1.5 hours to a one-month horizon (arithmetic sketched below). If we take "a few years" to be 2 years, we're in 2029, which is farther out than "the most aggressive forecasts" (e.g. various statements by Dario Amodei, or the left side of the probability distribution in AI 2027).

And given the starting assumptions, those are fairly aggressive numbers. Thomas' guess that "capability on more realistic tasks will follow the long-term 7-month doubling time" would push this out another two years, and one could propose longer timelines from one-month-coder to AGI.
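For transparency, here's the arithmetic behind those figures, as a sketch under one assumption of mine: that a "1 month" horizon means roughly 167 working hours (one FTE-month), rather than a calendar month of wall-clock hours:

```python
from math import log2

# Doublings needed to go from a 1.5-hour horizon (the preliminary o3
# result) to a 1-month horizon, assuming 1 month ~= 167 working hours.
doublings = log2(167 / 1.5)  # ~6.8

for doubling_time in (3.5, 7.0):  # midpoint of "3-4 months"; 7-month trend
    print(f"{doubling_time}-month doubling time -> "
          f"~{doublings * doubling_time:.1f} months to 1-month tasks")
```

This gives ~23.8 months at the 3.5-month doubling time and ~47.6 months at the 7-month doubling time, i.e. roughly two years longer.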

Of course this is not proof of anything – for instance, task horizon doubling times could continue to accelerate, as envisioned in AI 2027 (IIRC), and one could also propose shorter timelines from one-month-coder to AGI. But I think the original statement is fair: even if we use 3-4 months as the doubling time, isn't this an update away from "the most aggressive forecasts"?

(When I wrote this, I was primarily thinking about Dario projecting imminent geniuses-in-a-data-center, and similar claims that AGI is coming within the next couple of years or is even already here.)

snewman

To be clear, I agree that the bad interpretations were not coming from METR.

snewman

Thanks, I've edited the post to note this.

snewman

Sure – I was presenting these as "human-only, software-only" estimates:

> Here are the median estimates of the "human-only, software-only" time needed to reach each milestone:
>
>   • Saturating RE-Bench → Superhuman coder: three sets of estimates are presented, with medians summing to between 30 and 75 months[6]. The reasoning is presented here.

So it doesn't seem like there's a problem here?

snewman

I added up the median "Predictions for gap size" in the "How fast can the task difficulty gaps be crossed?" table, summing each set of predictions separately ("Eli", "Nikola", "FutureSearch") to get three numbers ranging from 30 to 75 months.

Does this table cover the time between now and superhuman coder? I thought it started at RE-Bench, because:

  • I took all of this to be in the context of the phrase, about one page back, "For each gap after RE-Bench saturation".
  • The earlier explanation that Method 2 is "a more complex model starting from a forecast saturation of an AI R&D benchmark (RE-Bench), and then how long it will take to go from that system to one that can handle real-world tasks at the best AGI company" [emphasis added].
  • The first entry in the table ("Time horizon: Achieving tasks that take humans lots of time") sounds more difficult than saturating RE-Bench.
  • Earlier, there's a separate discussion forecasting time to RE-Bench saturation.

But it sounds like I was misinterpreting?

Load More