Software engineer and repeat startup founder; best known for Writely (aka Google Docs). Now starting https://www.aisoup.org to foster constructive expert conversations about open questions in AI and AI policy, and posting at https://amistrongeryet.substack.com and https://x.com/snewmanpv.
quantity of useful environments that AI companies have
Meaning, the number of distinct types of environments they've built (e.g. one to train on coding tasks, one on math tasks, etc.)? Or the number of instances of those environments they can run (e.g. how much coding data they can generate)?
GPT-4.5 is going to be quickly deprecated
It's still a data point saying that OpenAI chose to do a large training run, though, right? Even if they're currently not planning to make sustained use of the resulting model in deployment. (Also, my shaky understanding is that expectations are for a GPT-5 to be released in the coming months and that it may be a distilled + post-trained derivative of GPT-4.5, meaning GPT-5 would be downstream of a large-compute-budget training process?)
Oops, I forgot to account for the gap from a 50% success rate to an 80% success rate (and actually I'd argue that the target success rate should be higher than 80%).
There are also potential factors for "task messiness" and the 5-18x context penalty, though as you've pointed out elsewhere, the latter should arguably be discounted.
Agreed that we should expect the performance difference between high- and low-context human engineers to diminish as task sizes increase. Also agreed that the right way to account for that might be to simply discount the 5-18x multiplier when projecting forwards, but I'm not entirely sure.

I did think about this before writing the post, and I kept coming back to the view that when we measure Claude 3.7 as having a 50% success rate at 50-minute tasks, or o3 at 1.5-hour tasks, we should substantially discount those timings.

On reflection, I suppose the counterargument is that this makes the measured doubling times look more impressive, because (plausibly) if we look at a pair of tasks that take low-context people 10 and 20 minutes respectively, the time ratio for realistically high-context people might be more than 2x. But I could imagine this playing out in other ways as well (e.g. maybe we aren't yet looking at task sizes where people have time to absorb a significant amount of context, and so as the models climb from 1 to 4 to 16 to 64 minute tasks, the humans they're being compared against aren't yet benefiting from context-learning effects).
One always wishes for more data – in this case, more measurements of human task completion times with high and low context, on more problem types and a wider range of time horizons...
Surely if AIs were completing 1-month-long, self-contained software engineering tasks (e.g. what a smart intern might do in the first month), that would be a big update toward the plausibility of AGI within a few years!
Agreed. But that means time from today to AGI is the sum of (1) the time for AI task horizons to grow from today's level to one-month tasks, and (2) the time from one-month-coder to AGI.
If we take the midpoint of Thomas Kwa's "3-4 months" guess for subsequent doubling time, we get 23.8 months for (1). If we take "a few years" to be 2 years, we're in 2029, which is farther out than "the most aggressive forecasts" (e.g. various statements by Dario Amodei, or the left side of the probability distribution in AI 2027).
And given the starting assumptions, those are fairly aggressive numbers. Thomas' guess that "capability on more realistic tasks will follow the long-term 7-month doubling time" would push this out another two years, and one could propose longer timelines from one-month-coder to AGI.
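For what it's worth, here's the arithmetic behind those figures as a quick sketch. The inputs are my own assumptions for illustration: a current 50% time horizon of roughly 1.5 hours (o3, per the above) and roughly 167 work hours in a month.

```python
import math

# Rough reproduction of the figures above. Assumptions (mine, for illustration):
# current 50% time horizon ~1.5 hours (o3), a "one-month task" ~167 work hours
# (40 hours/week * ~4.2 weeks), and a constant doubling time throughout.
current_horizon_hours = 1.5
target_horizon_hours = 167.0

doublings = math.log2(target_horizon_hours / current_horizon_hours)  # ~6.8

for doubling_time_months in (3.5, 7.0):
    months = doublings * doubling_time_months
    print(f"{doubling_time_months}-month doubling time: "
          f"~{months:.1f} months until one-month tasks")

# Output: ~23.8 months at a 3.5-month doubling time, ~47.6 months at a 7-month
# doubling time, i.e. roughly two additional years, matching the figures above.
```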
Of course this is not proof of anything – for instance, task horizon doubling times could continue to accelerate, as envisioned in AI 2027 (IIRC), and one could also propose shorter timelines from one-month-coder to AGI. But I think the original statement is fair: even if we use 3-4 months as the doubling time, isn't this an update away from "the most aggressive forecasts"?
(When I wrote this, I was primarily thinking about Dario projecting imminent geniuses-in-a-data-center, and similar claims that AGI is coming within the next couple of years or even is already here.)
To be clear, I agree that the bad interpretations were not coming from METR.
Thanks, I've edited the post to note this.
Sure – I was presenting these as "human-only, software-only" estimates:
Here are the median estimates of the "human-only, software-only" time needed to reach each milestone:
So it doesn't seem like there's a problem here?
I added up the median "Predictions for gap size" in the "How fast can the task difficulty gaps be crossed?" table, summing each set of predictions separately ("Eli", "Nikola", "FutureSearch") to get three numbers ranging from 30 to 75.
Does this table cover the time between now and superhuman coder? I thought it started at RE-Bench, because:
But it sounds like I was misinterpreting?
Thanks. This is helpful, but my intuition is substantially coming from the idea that there might be other factors involved (activities / processes involved in improving models that aren't "thinking about algorithms", "writing code", or "analyzing data"). In other words, I have a fair amount of model uncertainty, especially when thinking about very large speedups.