This isn't really a "timeline", as such – I don't know the timings – but this is my current, fairly optimistic take on where we're heading.
I'm not fully committed to this model yet: I'm still watching for further agent releases and inference-time-scaling advances later this year. But Deep Research, Claude 3.7, Claude Code, Grok 3, and GPT-4.5 have turned out largely in line with these expectations[1], and this is my current baseline prediction.
The Current Paradigm: I'm Tucking In to Sleep
I expect that none of the currently known avenues of capability advancement are sufficient to get us to AGI[2].
- I don't want to say that pretraining will "plateau", as such; I do expect continued progress. But the dimensions along which that progress happens are going to decouple from the intuitive "getting generally smarter" metric, and will face steep diminishing returns.
- Grok 3 and GPT-4.5 seem to confirm this.
- Grok 3's main claim to fame was "pretty good: it managed to dethrone Claude Sonnet 3.5.1 for some people!". That was damning with faint praise.
- GPT-4.5 is subtly better than GPT-4, particularly at writing/EQ. That's likewise a faint-praise damnation: it's not much better. Indeed, it reportedly came out below expectations for OpenAI as well, and they certainly weren't in a rush to release it. (It was intended as a new flashy frontier model, not the delayed, half-embarrassed "here it is I guess, hope you'll find something you like here".)
- GPT-5 will be even less of an improvement on GPT-4.5 than GPT-4.5 was on GPT-4. The pattern will continue for GPT-5.5 and GPT-6, the ~1000x and 10000x models they may train by 2029 (if they still have the money by then). Subtle quality-of-life improvements and meaningless benchmark jumps, but nothing paradigm-shifting.
- (Not to be a scaling-law denier. I believe in them, I do! But they measure perplexity, not general intelligence/real-world usefulness, and Goodhart's Law is no-one's ally.)
- OpenAI seem to expect this, what with them apparently planning to slap the "GPT-5" label on the Frankenstein's monster made out of their current offerings instead of on, well, 100x'd GPT-4. They know they can't cause another hype moment without this kind of trickery.
- Test-time compute/RL on LLMs:
- It will not meaningfully generalize beyond domains with easy verification. Some trickery like RLAIF and longer CoTs might provide some benefits, but those would be fixed-size improvements. It will not cause a hard-takeoff self-improvement loop in "soft" domains.
- RL will be good enough to turn LLMs into reliable tools for some fixed environments/tasks. They will reliably fall flat on their faces if moved outside those environments/tasks.
- Scaling CoTs to e. g. millions of tokens, or to effectively indefinite-size context windows (if that even works), may or may not lead to math being solved. I expect it won't.
- It may not work at all: the real-world returns on investment may end up linear while the costs of pretraining grow exponentially. I mostly expect FrontierMath to be beaten by EOY 2025 (it's not that difficult), but maybe it won't be beaten for years.[3]
- Even if it "technically" works to speed up conjecture verification, I'm skeptical on this producing paradigm shifts even in "hard" domains. That task is not actually an easily verifiable one.
- (If math is solved, though, I don't know how to estimate the consequences, and it might invalidate the rest of my predictions.)
- "But the models feel increasingly smarter!":
- It seems to me that "vibe checks" for how smart a model feels are easily gameable by making it have a better personality.
- My guess is that it's most of the reason Sonnet 3.5.1 was so beloved. Its personality was made much more appealing, compared to e. g. OpenAI's corporate drones.
- The recent upgrade to GPT-4o seems to confirm this. They seem to have merely given it a better personality, and people were reporting that it "feels much smarter".
- Deep Research was this for me, at first. Some of its summaries were just pleasant to read, they felt so information-dense and intelligent! Not like typical AI slop at all! But then it turned out most of it was just AI slop underneath anyway, and now my slop-recognition function has adjusted and the effect is gone.
- What LLMs are good at: eisegesis-friendly problems and in-distribution problems.
- Eisegesis is "the process of interpreting text in such a way as to introduce one's own presuppositions, agendas or biases". LLMs feel very smart when you do the work of making them sound smart on your own end: when the interpretation of their output has a free parameter which you can mentally set to some value which makes it sensible/useful to you.
- This includes e. g. philosophical babbling or brainstorming. You do the work of picking good interpretations/directions to explore; you impute a coherent personality to the LLM. You inject very few bits of steering by doing so, but those bits are load-bearing. If left to their own devices, LLMs won't pick those obviously correct ideas any more often than chance.
- See R1's CoTs, where it often does... that.
- This also covers stuff like Deep Research's outputs. They're great specifically as high-level overviews of a field, when you're not relying on them to be comprehensive or precisely on-target or for any given detail to be correct.
- It feels like this issue is easy to fix. LLMs already have ~all of the needed pieces; they just need to learn to recognize good ideas! Very few steering-bits to inject!
- This issue has felt easy to fix ever since GPT-3.5, or perhaps GPT-2.
- This issue is not easy to fix.
- In-distribution problems:
- One of the core features of the current AIs is the "jagged frontier" of capabilities.
- This jaggedness is often defended by "ha, as if humans don't have domains in which they're laughably bad/as if humans don't have consistent cognitive errors!". I believe that counterargument is invalid.
- LLMs are not good in some domains and bad in others. Rather, they are incredibly good at some specific tasks and bad at other tasks. Even if both tasks are in the same domain, even if tasks A and B are very similar, even if any human that can do A will be able to do B.
- This is consistent with the constant complaints about LLMs and LLM-based agents being unreliable and their competencies being impossible to predict (example).
- That is: it seems the space of LLM competence shouldn't be thought of as some short-description-length connected manifold or slice through the space of problems, whose shape we're simply too ignorant to understand yet. (If it were, "LLMs are genuinely intelligent, just in a way orthogonal to how humans are genuinely intelligent" would be a valid take.)
- Rather, it seems to be a set of individual points in the problem-space, plus these points' immediate neighbourhoods... Which is to say, the set of problems the solutions to which are present in their training data.[4]
- The impression that they generalize outside it is based on us having a very poor grasp of which problems' solutions are present in their training data.
- And yes, there's some generalization. But it's dramatically less than the impressions people have of it.
- Eisegesis is "the process of interpreting text in such a way as to introduce one's own presuppositions, agendas or biases". LLMs feel very smart when you do the work of making them sound smart on your own end: when the interpretation of their output has a free parameter which you can mentally set to some value which makes it sensible/useful to you.
- Agency:
- Genuine agency, by contrast, requires remaining on-target across long inferential distances: even after your task's representation becomes very complex in terms of the templates which you had memorized at the start.
- LLMs still seem as terrible at this as they'd been in the GPT-3.5 age. Software agents break down once the codebase becomes complex enough, game-playing agents get stuck in loops out of which they break out only by accident, etc.
- They just have bigger sets of templates now, which lets them fool people for longer and makes them useful for marginally more tasks. But the scaling on that seems pretty bad, and this certainly won't suffice for autonomously crossing the astronomical inferential distances required to usher in the Singularity.
- "But the benchmarks!"
- I dunno, I think they're just not measuring what people think they're measuring. See the point about in-distribution problems above, plus the possibility of undetected performance-gaming, plus some reporting that is subtly, yet crucially, misleading (if unintentionally so).
Case study: Prior to looking at METR's benchmark, I'd expected that it, too, was (unintentionally!) doing some shenanigans that meant it wasn't actually measuring LLMs' real-world problem-solving skills. Maybe the problems were secretly in the training data, or there was a selection effect towards simplicity, or the prompts strongly hinted at what the models were supposed to do, or the environment was set up in an unrealistically "clean" way that minimizes room for error and makes solving the problem correctly the path of least resistance (in contrast to messy real-world realities), et cetera.
As it turned out, yes, it's that last one: see the "systematic differences from the real world" here. Consider what this means in the light of the previous discussion about inferential distances/complexity-from-messiness.
As I'd said, I'm not 100% sure of that model. Further advancements might surprise me, there's an explicit carve-out for ??? consequences if math is solved, etc.
But the above is my baseline prediction, at this point, and I expect the probability mass on other models to evaporate by this year's end.
Real-World Predictions
- I dare not make the prediction that the LLM bubble will burst in 2025, or 2026, or in any given year in the near future. The AGI labs have a lot of money nowadays, they're managed by smart people, they have some real products, they're willing to produce propaganda, and they're buying their own propaganda (therefore it will appear authentic). They can keep the hype up for a very long time, if they want.
- And they do want to. They need it, so as to keep the investments going. Oceans of compute are the only way to collect on the LLM bet they've made, in the worlds where that bet can pay off, so they will keep maximizing for investment no matter how dubious the bet's odds start looking.
- Because what else are they to do? If they admit to themselves they're not closing their fingers around godhood after all, what will they have left?
- There will be news of various important-looking breakthroughs and advancements, at a glance looking very solid even to us/experts. Digging deeper, or waiting until the practical consequences of these breakthroughs materialize, will reveal that they're 80% hot air/hype-generation.[5]
- At some point there might be massive layoffs due to ostensibly competent AI labor coming onto the scene, perhaps because OpenAI will start heavily propagandizing that these mass layoffs must happen. It will be an overreaction/mistake. The companies that act on that will crash and burn, and will be outcompeted by companies that didn't do the stupid thing.
- Inasmuch as LLMs boost productivity, it will mostly be as tools. There's a subtle but crucial difference between "junior dev = an AI model" and "senior dev + AI models = senior dev + team of junior devs". Both decrease the demand for junior devs (as they exist today, before they re-specialize into LLM whisperers or whatever). But the latter doesn't really require LLMs to be capable of end-to-end autonomous task execution, which is the property required for actual transformative consequences.
- (And even then, all the rumors about LLMs 10x'ing programmer productivity seem greatly overstated.)
- Inasmuch as human-worker replacements will come, they will be surprisingly limited in scope. I dare not make a prediction regarding the exact scope and nature, only regarding the directionality compared to current expectations.
- There will be a ton of innovative applications of Deep Learning, perhaps chiefly in the field of biotech, see GPT-4b and Evo 2. Those are, I must stress, human-made innovative applications of the paradigm of automated continuous program search. Not AI models autonomously producing innovations.
- There will be various disparate reports about AI models autonomously producing innovations, in the vein of this or that or that. They will turn out to be misleading or cherry-picked. E. g., examining those examples:
- In the first case, most of the improvements turned out to be reward-hacking (and not even intentional on the models' part).
- In the second case, the scientists had pre-selected the problem on which the LLM was supposed to produce the innovation, on the basis of already knowing that there was low-hanging fruit to be picked there. That's like 90% of the work. They then further picked the correct hypothesis from the set it generated, i. e., did eisegesis. And there might be any amount of data contamination from these scientists, or from other groups, speaking about the research in public over the years they spent working on it.
- In the third case, the AI produces useless slop with steps like "..., Step N: invent the Theory of Everything (left as an exercise for the reader), ...", lacking the recognition function for promising research. GPT-3-level stuff. (The whole setup can also likely be outperformed by taking the adjacency matrix of Wikipedia pages and randomly sampling paths from the corresponding graph, or something like this; see the sketch after this list.)
- I expect that by the 2030s, LLMs will be heavily integrated into the economy and software, and will serve as very useful tools that have found their niches. But just that: tools. Perhaps some narrow jobs will be greatly transformed or annihilated (by being folded into the job of an LLM nanny). But there will not be AGI or broad-scope agents arising from the current paradigm, nor autonomous 10x engineers.
- At some unknown point – probably in the 2030s, possibly tomorrow (but likely not tomorrow) – someone will figure out a different approach to AI. Maybe a slight tweak to the LLM architecture, maybe a completely novel neurosymbolic approach. Maybe it will happen in a major AGI lab, maybe in some new startup. By default, everyone will die in <1 year after that.
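To make that throwaway baseline from the third case concrete, here's a minimal sketch of what "randomly sampling paths from Wikipedia's link graph" might look like. Everything in it (the toy link graph, the page names, the walk length) is made up for illustration; a real run would build the graph from an actual dump of Wikipedia's link structure.

```python
import random

# Toy stand-in for Wikipedia's link graph: page -> pages it links to.
# (Hypothetical entries; a real run would use the actual link dump.)
LINK_GRAPH = {
    "Graph theory": ["Linear algebra", "Network science"],
    "Linear algebra": ["Eigenvalues", "Machine learning"],
    "Network science": ["Epidemiology", "Machine learning"],
    "Eigenvalues": ["Quantum mechanics"],
    "Machine learning": ["Protein folding", "Epidemiology"],
    "Epidemiology": ["Protein folding"],
    "Quantum mechanics": [],
    "Protein folding": [],
}

def random_research_direction(graph, start, steps, rng=random):
    """Sample a random path through the link graph and present it as a 'research direction'."""
    path = [start]
    for _ in range(steps):
        neighbours = graph.get(path[-1], [])
        if not neighbours:
            break
        path.append(rng.choice(neighbours))
    return " -> ".join(path)

if __name__ == "__main__":
    for _ in range(3):
        print(random_research_direction(LINK_GRAPH, "Graph theory", steps=4))
```

The point being: even a generator this dumb produces strings of superficially connected concepts. What's missing, there as in the AI setup, is the recognition function for which of those juxtapositions are actually worth pursuing.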
Closing Thoughts
This might seem like a ton of annoying nitpicking. Here's a simple generator of all of the above observations: some people desperately, desperately want LLMs to be a bigger deal than they are.
They are not evaluating the empirical evidence in front of their eyes with proper precision.[6] Instead, they're vibing, and spending 24/7 inventing contrived ways to fool themselves and/or others.
They often succeed. They will continue doing this for a long time to come.
We, on the other hand, desperately do not want LLMs to be AGI-complete. Since we try to avoid motivated thinking, to avoid deluding ourselves into believing in happier realities, we err on the side of pessimistic interpretations. In this hostile epistemic environment, that effectively leads to us being overly gullible and prone to buying into hype.
Indeed, this environment is essentially optimized for exploiting the virtue of lightness. LLMs are masters at creating the vibe of being generally intelligent. Tons of people are cooperating, playing this vibe up, making tons of subtly-yet-crucially flawed demonstrations. Trying to see through this immense storm of bullshit very much feels like "fighting a rearguard retreat against the evidence".[7]
But this isn't what's happening, in my opinion. On the contrary: it's the LLM believers who are sailing against the winds of evidence.
If LLMs were actually as powerful as they're hyped up to be, there wouldn't be the need for all of these attempts at handholding.
Ever more contrived agency scaffolds that yield ~no improvement. Increasingly more costly RL training procedures that fail to generalize. Hail-mary ideas regarding how to fix that generalization issue. Galaxy-brained ways to elicit knowledge out of LLMs that produce nothing of value. The need for all of this is strong evidence that there's no seed of true autonomy/agency/generality within LLMs. If there were, the most naïve AutoGPT setup circa early 2023 would've elicited it.
People are extending LLMs a hand, hoping to pull them up to our level. But there's nothing reaching back.
And none of the current incremental-scaling approaches will fix the issue. They will increasingly mask it, and some of this masking may be powerful enough to have real-world consequences. But any attempts at the Singularity based on LLMs will stumble well before takeoff.
Thus, I expect AGI labs' AGI timelines have ~nothing to do with what will actually happen. On average, we likely have more time than the AGI labs say. Pretty likely that we have until 2030, maybe well into the 2030s.
By default, we likely don't have much longer than that. Incremental scaling of known LLM-based stuff won't get us there, but I don't think the remaining qualitative insights are many. 5-15 years, at a rough guess.
- ^
For prudence's sake: GPT-4.5 has slightly overshot these expectations.
- ^
If you are really insistent on calling the current crop of SOTA models "AGI", replace this with "autonomous AI" or "transformative AI" or "innovative AI" or "the transcendental trajectory" or something.
- ^
Will o4 really come out on schedule in ~2 weeks, showcasing yet another dramatic jump in mathematical capabilities, just in time to rescue OpenAI from the GPT-4.5 semi-flop? I'll be waiting.
- ^
This metaphor/toy model has been adapted from @Cole Wyeth.
- ^
Pretty sure Deep Research could not in fact "do a single-digit percentage of all economically valuable tasks in the world", except in the caveat-laden sense where you still have a human expert double-checking and rewriting its outputs. And in my personal experience, on the topics at which I am an expert, it would be easier to write the report from scratch than to rewrite DR's output.
It's a useful way to get a high-level overview of some topics, yes. It blows Google out of the water at being Google, and then some. But I don't think it's a 1-to-1 replacement for any extant form of human labor. Rather, it's a useful zero-to-one thing.
- ^
See all the superficially promising "AI innovators" from the previous section, which turn out to be false advertising on closer inspection. Or the whole "10x'd programmer productivity" debacle.
- ^
Indeed, even now, having written all of this, I have nagging doubts that this might be what I'm actually doing here. I will probably keep having those doubts until this whole thing ends, one way or another. It's not pleasant.
I'm never sure if it makes sense to add that clause every time I talk about the future.