I think this is right; LLMs are mostly crystallized intelligence, but do have fluid intelligence that lags humans substantially. And I agree that this is good for safety as long as it lasts - as long as it doesn't create an overhang. Which it well could.
A related perspective is that LLMs lack metacognitive skills. This is only somewhat overlapping with fluid intelligence, but it has the same implications.
Another related issue: the importance of fluid intelligence is multiplied dramatically if you can never remember your solution to a new problem. Thus, adding any sort of effective learning during deployment would reduce the downsides of poor fluid intelligence, though more in efficiency than in total capability.
I'm not sure the situation in which LLMs are not good at fluid intelligence is good for AI Safety. Or at least, it is good in the sense that timelines are delayed by a few years, but it makes our odds of surviving an AGI worse.
A core component of modern technical AI safety is studying LLMs. Major figures in the field seem to be under the impression that their experiments will extrapolate to future models, at least to some extent. As a concrete example, the team at Google DeepMind endorses a safety approach premised on the assumption that "there will not be large discontinuous jumps in general AI capabilities" (p. 3). Authors on this paper include Rohin Shah, Neel Nanda, and Victoria Krakovna.*
My concern is that, if LLMs are terrible at fluid intelligence and fundamentally inefficient learners, then we'll see a temporary slowdown in progress. But this creates a strong economic incentive to explore alternative architectures, and at some point one will be found that learns faster with less data and energy. (This is Yann LeCun's current agenda.)
The end result is we will see a sudden jump in AI capabilities that we are unprepared for.
*This isn't to single out DeepMind's team; their paper just came to mind.
Oof, yeah, seems overconfident.
I wonder if a similar error is why Ants seem so confident in a very fast takeoff -- they assume the models are better at fluid intelligence than they actually are, because their capabilities are strongest in the domain Ants are best at evaluating.
Very good, thank you for articulating this distinction well, along with the crucial question around it and a decent initial stab at some implications and consequences.
Provisionally, I think 'mostly crystallised' is right. Among the things that crystallise (especially from diverse agentic training) are something like crystallised heuristics for runtime exploration and in-context updating. I note that it'd be great to have readier evidence and trends on this!
This is why I think exploration capability and its complement, sample-efficient learning, are totally central to any question of the impacts of AI on R&D (of AI and of other things).
They’re good at ARC-AGI despite presumably not having seen this type of challenge before.
To nitpick, the ARC Prize Foundation has found some odd signs of (maybe) memorization. E.g., Gemini 3's reasoning traces show it thinking:
… Target is Green (3). Pattern is Magenta (6) Solid. Result: Magenta Square on Green … (Gemini 3 Deep Think)
But the JSON it receives as input has no colors! It's clearly pretty familiar with the tests, even if it might not have seen a particular one before.
And while they can solve them, I'm not sure they're "good" (or human-level efficient) just yet.
LLMs typically use a lot of reasoning tokens. An early version of o3 scored 75% on ARC-AGI-1...but it spent $200 per task (!) doing so. That's an extreme outlier, granted. But all LLMs with humanlike scores (~80%) on ARC-AGI-2 are pretty expensive (typically a dollar to several dollars per task). The current best performer, GPT-5.5 on xHigh, costs $1.87/task. That's between 40k and 60k reasoning tokens (half the length of the first Harry Potter book) for every single task, which is quite a lot for puzzles that humans can (mostly) solve at a glance.
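As a rough sanity check on those numbers (the per-token price below is my assumption for illustration, not a published figure), here's the token count that $1.87/task would imply:

```python
# Back-of-envelope check: how many reasoning tokens does $1.87/task imply,
# assuming an output price somewhere around $30-40 per million tokens?
# (The price is an assumption for illustration, not a quoted figure.)

def implied_tokens(cost_per_task: float, price_per_million_tokens: float) -> float:
    """Token count implied by a per-task cost at a given $/1M-token price."""
    return cost_per_task / price_per_million_tokens * 1_000_000

for price in (30.0, 40.0):  # assumed $ per million output tokens
    print(f"${price:.0f}/M tokens -> ~{implied_tokens(1.87, price):,.0f} tokens/task")
# Prints roughly 62,333 and 46,750 -- consistent with the 40k-60k range above.
```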
To me, this indicates some degree of "brute-forcing" is still going on in ARC-AGI.
I broadly agree with the post as a whole.
Hm, I don't feel like I have good intuitions for what a GLUT-of-circuits could or couldn't do, so hard for me to assess if this is a good empirical fit. And I'm not technical enough to have a sense for what kinds of structures training is likely to produce. Doesn't seem like a crazy model.
Summary
LLMs are better at developing crystallized intelligence than fluid intelligence. That is: LLM training is good at building crystallized intelligence by learning patterns from training data, and this is sufficient to make them surprisingly skillful at lots of tasks. But for a given capability level in the areas they’ve trained on, LLMs have very weak fluid intelligence compared to humans. For example, two years ago I thought human-level SAT performance would mean AGI, but turns out LLMs can do great at the SAT while being mediocre at lots of other tasks.
I’m not saying LLMs are just parrots (that’s dumb).[1] There’s a continuity between crystallized and fluid intelligence.
Empirically, it’s unclear how fluid their intelligence is: we see both general reasoning skills and jaggedness.
It’s worth considering: what if fluid intelligence progress is relatively slow, and LLM capabilities mostly grow with relevant training data?
This could imply slower AI progress, especially if general-purpose data runs dry relatively soon. (Epoch estimates 2026-2032.) That means companies will need to prioritize specialized data collection/generation, which will lead to jagged capabilities growth favoring the prioritized areas.
[Epistemic status: I only put like 20% on worlds where this dynamic puts a serious damper on AI progress compared to e.g. the AI Futures Project’s median timelines. It’s important to stay aware of these possibilities, though, and track the relevant evidence.]
Implications for AI futures
This suggests that we shouldn’t naively extrapolate forward from e.g. the METR AI R&D benchmark to real-world AI R&D improvement, for two reasons:
Likewise, this suggests that simply scaling LLM training won’t get us to omni-competence.
But “just scaling LLMs” and “scale LLMs ‘til they’re superhuman AI R&D coders, then use those to build next-gen AI” are the two main stories for how we get to AGI very fast!
We should still expect significant progress on AI R&D. The AI labs are explicitly training for AI R&D, and have clearly hit superhuman capability in some coding-related areas (cybersecurity).
But the shape and speed of the takeoff curve matters. It matters a lot if, say, the METR time horizon hits 1 month, but we actually don’t have anything like a drop-in senior AI R&D researcher, just a really really good team of assistants. The labs still need to spend a bunch of serial time running compute-expensive experiments, and their AI tools can only moderately improve experiment selection. That could mean they get to, say, a 10x speedup over years of grueling effort. That’s much slower than AI2027 expects.
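To make that concrete with a toy Amdahl-style calculation (all fractions and speedups here are numbers I'm inventing purely for illustration): if most research wall-clock time is spent waiting on compute-gated experiments that AI tools only modestly accelerate, the overall speedup stays small no matter how fast the coding gets.

```python
# Toy Amdahl's-law-style sketch with made-up numbers: overall speedup stays
# bounded when serial, compute-gated experiments dominate research time.

def overall_speedup(serial_fraction: float, serial_speedup: float, rest_speedup: float) -> float:
    """Overall speedup when `serial_fraction` of time is accelerated by
    `serial_speedup` and the remaining time by `rest_speedup`."""
    return 1.0 / (serial_fraction / serial_speedup + (1.0 - serial_fraction) / rest_speedup)

# 60% of time waiting on experiments (2x faster via better experiment selection),
# 40% on coding/analysis (50x faster with AI assistants): overall only ~3x.
print(round(overall_speedup(0.6, 2.0, 50.0), 2))  # ~3.25
```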
Crucially, for as long as AIs are great at technical work but mediocre at fluid intelligence, that’s great news for AI safety.
But a major caveat is: I expect at some point we’ll see people devise new paradigms that are more data-efficient, and at that point all our safety techniques and assumptions might no longer hold.
We should check if this is true!
I’d be really excited for tests of capabilities like:
Modeling worlds where AI progress is hungry for domain data
Here’s a set of claims, call this the “hungry for domain data” hypothesis:
What types of areas see progress in this model?
I imagine we’ll have a base AI optimized for AI R&D, which gets trained to develop synthetic-data sources for domains that are amenable to simulation and/or to automated evaluation (for RL). Then those data sources are used to train AIs.
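A minimal sketch of that loop, with entirely hypothetical names and structure, just to make the pipeline concrete:

```python
# Hypothetical sketch of the synthetic-data pipeline described above:
# prioritize domains that can be simulated or automatically graded,
# have the base model build data sources for them, then train on the output.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Domain:
    name: str
    has_simulator: bool    # can we cheaply generate task instances?
    has_auto_grader: bool  # can we score attempts without humans (an RL signal)?

def prioritize(domains: List[Domain]) -> List[Domain]:
    """Domains amenable to simulation and/or automated evaluation go first."""
    return [d for d in domains if d.has_simulator or d.has_auto_grader]

def build_data_source(domain: Domain) -> Callable[[], Dict]:
    """Stand-in for 'base AI develops a synthetic-data source' for a domain."""
    def sample_task() -> Dict:
        return {"domain": domain.name, "prompt": "...", "auto_graded": domain.has_auto_grader}
    return sample_task

def train_on(sources: List[Callable[[], Dict]]) -> None:
    """Placeholder for the RL / fine-tuning step on the generated tasks."""
    for source in sources:
        batch = [source() for _ in range(4)]
        print(f"training on {len(batch)} synthetic tasks from {batch[0]['domain']}")

domains = [Domain("coding", True, True), Domain("wet-lab biology", False, False)]
train_on([build_data_source(d) for d in prioritize(domains)])
```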
Domains will see more progress if:
There are also stories for how advanced AIs could route around data bottlenecks:
Which concrete domains see progress?
Implications for AI takeoff
While it lasts, weak fluid intelligence is great news for reducing alignment risk
Successful scheming might rely on very good general reasoning. Otherwise, you could do some combo of…
A key bifurcation point: can AIs revolutionize AI R&D, or merely speed it up?
Coding is a data-rich domain, and AI companies prioritize generating data on AI R&D tasks, so we should expect AIs to get better at AI R&D over time — as we indeed see.
Case 1: AIs can significantly improve coder productivity and codebase efficiency, but they don’t reach supremacy. R&D progress is gated on compute-hungry research experiments and on expert research taste.
Case 2: AIs are able to intuit key principles of AI R&D. This lets them move smoothly from usefulness to outright supremacy. The best human experts are great at this, but they’re held back by brains with limited working memory and poor native resources for understanding massive-dimensional spaces and inhuman minds.
In both cases, my best guess is that improved AI R&D eventually leads to a paradigm that can scale to superhuman fluid intelligence. And since resources and R&D productivity are scaling so rapidly, “eventually” will probably come pretty soon.
But in case 1 especially, we’re likely to see a period where AI architectures evolve a lot. That has important implications:
Is this the world we live in?
Some evidence against: It seems like the “water level” of LLM capabilities is gradually rising in many areas, and that some of this is probably a generalizable-skills thing.
How can we test this hypothesis?
Places to look for fluid reasoning capabilities in LLMs:
Thanks to K, Adria, John, and Abi for comments.
[1] They’re more like a horde of precocious 12-year-olds, each with a different hyperfixation.