Noticing progress in long reasoning models like o3 creates a different blind spot than the one in popular reporting on pretraining scaling stalling out. It can appear that long reasoning models reconcile that central claim (pretraining is stalling) with AI progress still moving fast. But the plausible success of reasoning models instead suggests that pretraining will keep scaling even further[1] than could previously be expected.
Training systems were already on track to go from 50 MW (training current models for up to 1e26 FLOPs) to 150 MW in late 2024, and then to 1 GW by the end of 2025, training models for up to 5e27 FLOPs in 2026, which is 250x the compute of the original GPT-4. But with o3, it now seems more plausible that $150bn training systems will be built in 2026-2027, training models for up to 5e28 FLOPs in 2027-2028, which is 500x the compute of the currently deployed 1e26 FLOPs models, or 2500x the compute of the original GPT-4.
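To make those multiples concrete, here is the implied arithmetic as a quick sketch; the ~2e25 FLOPs figure for the original GPT-4 is the estimate implied by the 250x and 2500x ratios above, not an official number.

```python
# Rough arithmetic behind the compute multiples above.
# All figures are order-of-magnitude estimates, not official numbers.
GPT4_FLOPS = 2e25       # implied estimate for the original GPT-4 training run
CURRENT_FLOPS = 1e26    # current models, trained on ~50 MW systems
FLOPS_2026 = 5e27       # projected for ~1 GW training systems
FLOPS_2027_28 = 5e28    # projected for ~$150bn training systems

print(f"2026 run vs. original GPT-4:    {FLOPS_2026 / GPT4_FLOPS:,.0f}x")        # ~250x
print(f"2027-28 run vs. current models: {FLOPS_2027_28 / CURRENT_FLOPS:,.0f}x")  # ~500x
print(f"2027-28 run vs. original GPT-4: {FLOPS_2027_28 / GPT4_FLOPS:,.0f}x")     # ~2,500x
```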
Scaling of pretraining is not stalling out, even without the new long reasoning paradigm. It might begin stalling out in 2026 at the earliest, but now more likely only in 2028. The issue is that the scale of training systems is not directly visible: there is a 1-2 year lag between the decision to build them and the resulting, observable AI progress.
Reporting on scaling stalling out might have a point about returns on scale getting worse than expected. But if scale keeps increasing despite that, additional scale will still produce new capabilities. Scaling compute by 10x might do very little, and that is compatible with scaling by 500x bringing a qualitative change. ↩︎
This is the full text of a post from "The Obsolete Newsletter," a Substack that I write about the intersection of capitalism, geopolitics, and artificial intelligence. I’m a freelance journalist and the author of a forthcoming book called Obsolete: Power, Profit, and the Race to Build Machine Superintelligence. Consider subscribing to stay up to date with my work.
Depending on who you follow, you might think AI is hitting a wall or that it's moving faster than ever.
I was open to the former and even predicted back in June that the jump to the next generation of language models, like GPT-5, would disappoint. But I now think the evidence points toward progress continuing and maybe even accelerating.
This is primarily thanks to the advent of new "reasoning" models like OpenAI's o-series and those from DeepSeek, a Chinese lab whose open-weight models are nipping at the heels of the American frontier. In essence, these models spend more time and compute on inference, "thinking" through harder prompts instead of just spitting out an answer.
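OpenAI hasn't published how the o-series allocates that extra inference compute, but a minimal sketch of the general idea, using the well-known self-consistency trick of sampling several answers and taking a majority vote (with a toy stand-in for the model), looks something like this:

```python
import random
from collections import Counter
from typing import Callable

def answer_with_more_thinking(prompt: str,
                              sample_answer: Callable[[str], str],
                              n_samples: int = 16) -> str:
    """Spend extra inference compute: sample several independent answers
    and return the most common one (a simple self-consistency vote)."""
    finals = [sample_answer(prompt) for _ in range(n_samples)]
    return Counter(finals).most_common(1)[0][0]

# Toy stand-in for a model that gets a hard question right only 40% of the
# time per sample; a single call often fails, but a 16-sample vote usually
# lands on the right answer because the errors are scattered.
def toy_model(prompt: str) -> str:
    return "42" if random.random() < 0.4 else random.choice(["17", "no idea"])

print(answer_with_more_thinking("hard question", toy_model))  # usually prints "42"
```

The specific voting scheme isn't the point; the point is that accuracy can be bought with more sampling and longer reasoning at inference time, rather than only with a bigger pretrained model.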
In my June prediction, I wrote that "We haven't seen anything more than marginal improvements in the year+ since GPT-4." But I now think I was wrong.
Instead, there's a widening gap between AI's public face and its true capabilities. I wrote about this in a TIME Ideas essay that published yesterday. I hit on many of its key points here, but there's a lot more in the full piece, and I encourage you to check it out!
AI progress isn't stalling — it's becoming increasingly illegible
I argued that while everyday users still encounter chatbots that can't count the "Rs" in "strawberry" and the media declares an AI slowdown, behind the scenes, AI is rapidly advancing in technical domains that may end up turbo-charging everything else.
For example, in ~1 year, AI went from barely beating random chance to surpassing experts on PhD-level science questions. OpenAI says that its latest model, o3, now beats human experts answering questions in their own fields by nearly 20 percentage points.
However, as I wrote in TIME:
[Chart. Source: Epoch AI]
In just months, models went from 2% to 25% on FrontierMath, possibly the hardest AI math benchmark in existence.
And perhaps most significantly, AI systems are getting way better at programming. From the TIME essay:
[Chart. Sources: SWE-Bench, OpenAI; chart by Garrison Lovely]
There have been similarly dramatic improvements on other benchmarks for programming, math, and machine learning research. But unless you follow the industry closely, it's very hard to figure out what this actually means.
Here's an attempt to spell that out from the TIME essay:
(I'd expect OpenAI's latest model, o3, to do significantly better on the METR evaluation based on its other scores.)
[Chart. Source: METR]
But I can't put the significance of this research better than METR researcher Chris Painter did. He asks us to imagine a dashboard monitoring various AI risks — from bioweapons to political manipulation. If any one capability starts rising, we can work to mitigate it. But what happens when the dashboard shows progress in AI's ability to improve itself? Chris writes:
For more on how this kind of recursive self-improvement could occur and what might happen next, check out this section of my Jacobin cover story.
Smarter models can scheme better
Meanwhile, there's a bunch of new research finding that smarter models are more capable of scheming, deception, sabotage, etc.
In TIME, I spelled out my fear that this mostly invisible progress will leave us dangerously unprepared for what's coming. I worried that politicians and the public will ignore this AI progress, because they can't see the improvements first-hand. All the while, AI companies will continue to advance toward their goal of automating AI research, bootstrapping the automation of everything else.
Right now, the industry is mostly self-regulating, and, at least in the US, that looks unlikely to change anytime soon — unless there's some kind of "warning shot" that motivates action.
Of course, there may be no warning shots, or we may ignore them. Given that many of the leading figures in the field say "no one currently knows how to reliably align AI behavior with complex values," this is cause for serious concern. And the stakes are high, as I wrote in TIME:
o3 and the media
I had already written the first draft of the TIME essay when o3 was announced by OpenAI on December 20th. My timeline was freaking out about the shocking gains made on many of these same extremely tough technical benchmarks.
I thought, 'oh man, I need to rework this essay because of how much o3 undermines the thesis that AI progress is stalling out!' But the mainstream media dramatically under-covered the announcement, with most big news sites making no mention of it at all.
In fact, the day after the o3 announcement, you could find headlines in the NYT, WIRED, WSJ, and Bloomberg suggesting AI progress was slowing down!
This is not a knock on these individual stories, which contain important reporting, analysis, causes for skepticism, etc. But collectively, the mainstream media is painting a misleading picture of the state of AI that makes it more likely we'll be unprepared for what's coming. (Shakeel Hashim of Transformer had a great, relevant piece on journalism and AGI for Nieman Lab in December.)
Just as one deep learning paradigm might be stalling out, a new one is emerging and iterating faster than ever. There were almost 3 years between GPT-3 and GPT-4, but o3 was announced just ~3.5 months after its predecessor, with huge benchmark gains. There are many reasons this pace might not continue, but to say AI is slowing down seems premature at best.
The gap between AI's public face and its true capabilities is widening by the month. The real question isn't whether AI is hitting a wall — it's whether we'll see what's coming before it's too late.
If you enjoyed this post, please subscribe to The Obsolete Newsletter.