I basically agree with the descriptive model, but don't see that the conclusions follow.
For example:
The token bottleneck is real.
Sure, and so are limits like short term memory for humans. Doesn't stop us.
And the same applies to only using shallow heuristics - humans mostly do the same thing.
I'd guess that humans have a short term memory of much more than ten bits, though. My intuition is that human short term memory would be much more comparable to neuralese vectors than to single tokens or 4 mental "items".
But yeah, the conclusions were much more speculative than the model.
And I think insofar as humans are agentic, they apply some general purpose search instead of semi-randomly picking their heuristics? Maybe?
I'd guess that humans have a short term memory of much more than ten bits, though. My intuition is that human short term memory would be much more comparable to neuralese vectors than to single tokens or 4 mental "items".
Maybe, but I'd guess that's a difference of less than an order of magnitude - and it seems like the relevant question isn't only bits passed between circuits, since LLMs, even without reasoning, are autoregressive, so they can reason sequentially over multiple tokens. (And with reasoning, that's obviously even more true.)
insofar as humans are agentic, they apply some general purpose search instead of semi-randomly picking their heuristics?
To the extent that LLM agents are agents, they definitely do this too! And if we're talking about single-forward-pass reasoning, very few humans intentionally train their system 1 to do something better than semi-randomly follow patterns that worked before. (If you don't know what I'm referring to, see the discussion of firefighters not actually making decisions and the resolution of the debate about system 1 / system 2 in Thinking Fast and Slow.)
Yeah, I'm not sure in which ways the analogy works/doesn't work. (It's curious that for LLMs, tokens function as observations, as actuators, and as short-term/medium-term memory).
My intuition is still that it's a large benefit that humans can maintain some large-ish latent state while reasoning, while an LLM chain of "thought" could be analogized to a row of copies of the same human standing in sequence, each person hearing what the person before them said, thinking for ¿a second? (100 layers at 5ms for neuronal firing), and then repeating what they heard to their successor, appending a single word. Caveats apply from sampling methods, e.g. beam search.
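For concreteness, the relay analogy's timing is just trivial arithmetic; both numbers (100 layers, ~5 ms per serial neuronal step) are the comment's assumptions, not measurements:

```python
layers = 100           # assumed transformer depth (from the comment)
neuron_step_s = 0.005  # ~5 ms per serial neuronal firing (assumption)

# Serial "human-equivalent" thinking time per appended token under
# the relay-of-humans analogy.
serial_thought_per_token = layers * neuron_step_s
print(f"{serial_thought_per_token} s of serial 'thought' per appended token")
# → 0.5 s of serial 'thought' per appended token
```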
When I think about implementing general-purpose search this way, my guess is that the bottleneck is squeezing everything into a single most-informative token, and my intuition is that general-purpose search needs quite deep serial computation on a world model; or maybe it's possible to implement it this way, but the token bottleneck is very annoying.
If I were to argue against myself, I'd say "each token can add ~10 bits of optimization pressure, but each successive forward pass can add another ~10 bits of optimization pressure to the existing context window, so you end up with a lot of bits of optimization pressure anyway". My skeptical answer would then be to question whether bits of optimization pressure can be additive at all, and if so, whether they are in this case.
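As a rough sanity check on these numbers, assuming a GPT-style vocabulary of ~50k tokens and taking the contested additivity assumption at face value:

```python
import math

vocab_size = 50_257  # GPT-2/3-style vocabulary size (assumption)
print(f"max bits per sampled token: {math.log2(vocab_size):.1f}")
# → max bits per sampled token: 15.6

# Under the contested "additive" assumption, ~10 effective bits per
# forward pass over a 1,000-token chain would accumulate to:
tokens = 1_000
bits_per_pass = 10
print(f"total: ~{bits_per_pass * tokens:,} bits of optimization pressure")
# → total: ~10,000 bits of optimization pressure
```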
I'd guess that humans have a short term memory of much more than ten bits, though.
LLMs aren't limited to only tokens as inputs though. They can also attend to internal states as long as they're in previous layers. This has limits to how much useful data can be passed from previous positions but it's way more than 10 bits.
But an LLM's short-term memory between forward passes includes everything accessible via attention, not just the vertical slice at the current position. Treating the single 10-bit token as the full memory misses the vast majority of the inputs at any given layer.
For example, if an LLM makes a decision in an early layer at position n, it can reference that decision directly in any later layer in positions after n, without going through the tokens.
This is limited since there's only O(100) layers to work with, but it's a meaningful amount of memory.
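A back-of-envelope comparison, with purely illustrative numbers, of the ~10-bit token bottleneck versus the state reachable through attention:

```python
d_model = 12_288      # residual stream width, gpt-3-davinci scale (assumption)
layers = 96           # depth, O(100) as the comment says (assumption)
context = 1_000       # positions already in the context window (assumption)
bits_per_dim = 1      # deliberately conservative effective precision

# Residual states stacked above one position, and the whole
# attendable context. Both dwarf the ~10 bits of a sampled token.
per_position_state = layers * d_model * bits_per_dim
whole_context_state = context * per_position_state

print(f"one position's residual states: ~{per_position_state:,} bits")
print(f"whole attendable context:       ~{whole_context_state:,} bits")
```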
I'm not sure what you mean here by "circuit alignment". Shallow circuits in the limit allow almost every possible behavior (imagine that every circuit is literally a gate in a boolean-circuit-complete architecture; you can combine them into arbitrary programs). To the extent that alignment is happening here, I would expect it to be very in-distribution?
Another piece of sad news is that in this scenario we can't partially decipher LLM cognitive algorithms and use them to make legible AI, because here we basically have GOFAI, as in "if only we had enough labor to encode it, and not too much perfectionism fretting about shallow features".
How does your model account for non-trivial features like [here](https://arxiv.org/abs/2502.00873v1)?
I think that in reality there is some deep generalization happening, but by default "neural networks are lazy", so the majority of out-of-distribution work is done by an enormous number of shallow circuits.
Speculation: episodic memory is decluttering of shallow circuits to free space for more generalized learning.
I think that in reality there is some deep generalization happening, but by default "neural networks are lazy", so the majority of out-of-distribution work is done by an enormous number of shallow circuits.
Not sure what you mean by "deep generalization", but in general, I don't see how generalization is incompatible with shallow circuits. I haven't read the Tegmark paper you linked, but if it's something like Neel Nanda's grokking of modular arithmetic work, that circuit was also pretty shallow (one-layer transformer IIRC).
I think it's explained by the fact that modular arithmetic is not very complicated. By deep generalization I mean something like "a semantically rich world model encoded in a relatively small number of circuits". You can have a world model encoded in a large number of shallow circuits, but I think my point about the resulting in-distribution-only alignment probably stands.
I did some theory trying to figure out why this kind of thing might be true. Specifically, I was contrasting how natural selection produced systems with extremely general intelligence (the learning algorithm of the human brain), whereas gradient descent tends to instill shallower and more task-specific circuits.
One reason might be that evolution generates mutations randomly, and then selects for utility across an entire lifetime. Highly general adaptations, like the learning algorithm in the brain, accrue more and more fitness advantage every time they're used. So evolution selects strongly for extremely general-purpose algorithms.
Gradient descent, by contrast, updates a network by tuning each weight in the network to improve performance on each individual training example (and then averaging those together for many examples). Because these updates aren't random, but rather locally optimal, you lose the chance to luck into updates that are less-than-optimal on any given training example, even if they'd prove extremely valuable if tested across a wide range of scenarios (a la an ultra-general learning algorithm). Gradient descent's locally optimal updates, as opposed to random mutations w/ selection over lifetimes, instead bias towards learning local structure.
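A minimal toy contrast of the two update rules being compared here: a locally optimal gradient step versus a random mutation kept only when it helps. The 1-D objective and all constants are made up for illustration; there is no claim that either rule resembles real training or biology.

```python
import random
random.seed(0)

def loss(w):
    """Toy 1-D objective with optimum at w = 3."""
    return (w - 3.0) ** 2

def grad_step(w, lr=0.1):
    """Locally optimal update: follow the gradient on this example."""
    grad = 2 * (w - 3.0)
    return w - lr * grad

def mutate_select(w, sigma=0.5):
    """Random mutation, kept only if it improves fitness."""
    candidate = w + random.gauss(0, sigma)
    return candidate if loss(candidate) < loss(w) else w

w_gd, w_evo = 0.0, 0.0
for _ in range(100):
    w_gd = grad_step(w_gd)
    w_evo = mutate_select(w_evo)

print(f"gradient descent: w = {w_gd:.3f}")   # converges smoothly toward 3
print(f"mutate+select:    w = {w_evo:.3f}")  # also near 3, via accepted lucky jumps
```

Both rules solve this trivial problem; the text's point is about which kinds of structure each rule tends to find in vastly higher-dimensional settings, which a 1-D toy cannot show.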
I don't feel like this is a polished formulation of the theory, but something like this might help explain some differences in the character of evolved general intelligence in humans, and the apparently fragmented bags of heuristics learned by neural networks. (Of course, the stuff humans learn within a lifetime seems to have more of that "fragmented bag of heuristics" character; this is related to the reason lifetime learning is a better analogy to gradient descent than natural selection is.)
natural selection produced systems with an extremely general intelligence (the learning algorithm of the human brain)
&
evolution selects strongly for extremely general-purpose algorithms
Natural selection mostly produces rather narrow intelligences, and in this sense, the brains of humans (and other smart animals) are an outlier. Although I vaguely expect even bacterial cell biology to involve some computation-like stuff that would be difficult to train modern NNs to emulate veridically, especially when it comes to robustness of processes against various sorts of perturbations. A separate thing is the generality of evolution itself, if you frame it as an algorithm.
In general, I think I agree with your framing, but I would emphasize the component of slack / exploration / lack of premature optimization.
@cdt, I don't understand what it is that you don't understand.
For the first 2 billion years of life's existence on Earth, the planet was covered by unicellular slime. It took another ~billion for animals to emerge and a few more hundreds of millions for complex brains capable of more flexible, general behavior, within-lifetime learning, and so on.[1] It seems likely that had Chicxulub not killed off the non-avian dinosaurs (or substitute some other major catastrophic event), human-par general intelligence would not have evolved.
cf. "complex active bodies", i.e., arthropods, cephalopods, and vertebrates, as the lowest bar
Thank you for taking the time to explain further. I had originally interpreted "narrow intelligence" strictly, but based on your lowest bar this would include the majority of animal biomass and the vast majority of its species.
I am not sure the extent to which contemporary species provide evidence for or against the algorithmic properties of evolved behaviour. I am also not sure how much ecological opportunity enables or prevents this. It's a good question and one I have not read in the literature before.
What do you mean by "algorithmic properties of evolved behaviour"?
I am also not sure how much ecological opportunity enables or prevents this.
I think a lot. Given how long it took for humans to evolve and how this seems to have been enabled by a bunch of "random" events, like an asteroid wiping out the dinosaurs.[1]
"Reasonably impressively smart" animals are common and "easy to evolve": cephalopods, corvids, parrots, elephants, primates, raccoons, cetaceans, elephantfish, etc. It seems to me that the three main things you need are (1) the/a right sort of body plan[2]; (2) a nearby niche that benefits from intelligence; (3) time to evolve intelligence ("niche stability"?). For something like humans to emerge, a greater number of unlikely factors have to align.
Although it's possible that the Cretaceous extinction (and maybe some other major or minor extinctions too) was a type of event that is, in general, very likely to happen. Something like: you have "dumb", more environmentally rigid macrofauna and "smart", more environmentally flexible microfauna. At some point, you're going to have a major ecological disruption, so the former mostly dies off, and the latter can fill in its niche. This seems like a plausibly common pattern. But I'm speculating.
I mean "body plan" to include "brain plan".
It is hard to describe evolution as "fast" or "slow" without a yard-stick. Often slow relative to ecological time, perhaps, but I don't understand the idea that 3 billion years to find learning is "slow".
What do you mean by "algorithmic properties of evolved behaviour"?
It is not clear to me that the behavioural properties of contemporary species provide much information as to the capabilities of evolution or how fast it can reach certain behavioural adaptations: strong evidence of possibility, weak evidence of speed, almost no evidence of impossibility. Similarly, the stretch of time it took to produce humanity, and the events it took to get there, are not really evidence of the difficulty of human adaptations.
I can agree that it looks like certain adaptations appear together re: body plan, but "appearance of nearby niches" and "environmental(?) stability" are controlled mostly by ecological factors like distribution and dispersal. So perhaps ecological opportunity controls this more than I anticipated. One thing that struck me while reading your reply was that general learning seems energy-intensive, so it would be dependent on available resource flux from the ecosystem, and this would push the evolution of learning later in time. But again, this is more a claim about ecological factors, and it's not clear to me what that says about "what natural selection produces" or "natural selection vs gradient descent". Thanks for the interesting thoughts.
It is hard to describe evolution as "fast" or "slow" without a yard-stick. Often slow relative to ecological time, perhaps, but I don't understand the idea that 3 billion years to find learning is "slow".
Clock time is a perfectly valid yardstick. An adaptation that takes evolution a few thousand/million years to find can be found much more quickly by a competent team of human biologists.
Another valid yardstick would be something like computational efficiency or even more generally efficient use of resources (other than time, which I just covered). Natural selection proceeds via blind generate-and-test.[1] With something like AlphaFold, you can do better.
I can agree that it looks like certain adaptations appear together re: body plan
It sounds to me like you misinterpreted what I was saying about the body plan. I meant (/ I should have taken time to clarify) that you need some sort of basic pre-adaptation (vaguely analogous to pre-training an LLM) to make use of such an opportunity, and one such pre-adaptation is having a body that is generally agile/adaptive and can evolve into even more adaptive forms capable of exploiting new niches. Compare arthropods vs earthworms.
but "appearance of nearby niches" and "environmental(?) stability" are controlled mostly by ecological factors like distribution and dispersal. So perhaps ecological opportunity controls this more than I anticipated.
Yep.
One thing that struck me while reading your reply was that general learning seems energy-intensive, so it would be dependent on available resource flux from the ecosystem, and this would push the evolution of learning later in time.
Yeah, that (or more generally, intelligence being expensive) is why you sometimes see lineages "reverting" to less intelligent/brainy forms, e.g., proto-molluscs into mussels. I think also sea squirts.
But again, this is more a claim about ecological factors, and it's not clear to me what that says about "what natural selection produces" or "natural selection vs gradient descent".
Yeah, so the claim can be refined to: It's unusual for ecological factors to produce conditions favorable for high intelligence, especially given that it's not even one specific condition, but rather a series of ecological conditions[2] that hand-holdably lead a lineage into a form at which it starts being capable of reshaping its environment into a stable form favorable to its survival.
Although what new genetic variants it can generate at any given point depends on the current genotype and genotypes may tend to get selected for being better at spawning potentially useful mutations (in the context of the rest of the genome being held constant), cf. https://en.wikipedia.org/wiki/Evolvability#Evolution_of_evolvability. So, it is "blind", but, in a sense, it slowly refines its "priors" or "generative biases".
Not necessarily a unique series of ecological conditions. It seems very unlikely that there's only one rough ecological-evolutionary pathway to general intelligence.
evolution generates mutations randomly
Evolution's timescale and learning timescale are fundamentally different. Evolution only programs basic instincts into our brains and sets a mechanism for tweaking hyperparameters (and tweaks our bodies, but that's less important than evolution's effects on our brains). Most other parts of our brains are initialised randomly.
Additionally, within-lifetime learning of the humans and LLMs occurs far differently. Human brains are wildly neuralese and update their weights over the entire lifetime. The LLMs, on the other hand, use only a primitive CoT/memory system to store information related to the task itself and cannot learn anything from the task until it has already been completed.
Additionally, within-lifetime learning of the humans and LLMs occurs far differently. Human brains are wildly neuralese and update their weights over the entire lifetime. The LLMs, on the other hand, use only a primitive CoT/memory system to store information related to the task itself and cannot learn anything from the task until it has already been completed.
Right. This is why, at the end, I compared within-lifetime learning to gradient descent rather than in-context learning.
Evolution's timescale and learning timescale are fundamentally different. Evolution only programs basic instincts into our brains and sets a mechanism for tweaking hyperparameters
This is true, although one of us must be misunderstanding the other somewhere, if you meant this as an objection. Evolution is slow, and selects over an aggregate of many situations (where input is fed into the body and an action is selected). Gradient descent is fast, and selects over behaviors in individual situations.
A rough handle for the differences is that evolution tends to produce coarse-grained adaptations, due to mutations accruing more fitness advantage if they're used in many situations across a lifetime. Gradient descent produces fine-grained adaptations, due to updates being designed to improve performance on individual training examples, rather than randomly generated and then strongly selected if they're useful across many training examples.
"Thanks to Claude 4.5 Sonnet for help and feedback. No part of the text was written by AI models."
Could you describe a bit more how you used Claude, how the ideation took place?
My fear is that you started out with a fuzzy idea of a circuit lookup table, then talked to Claude, and it eventually convinced you that this has massive implications for alignment. I remain highly skeptical of this; there is a high risk of deviating into vibe thinking. I think your arguments at multiple points leave the realm of valid reasoning and draw wide, unsupported conclusions. This is an easy way AI-assisted alignment might fail.
Alignment: My guess is that most or almost all of these circuits are individually aligned through bog-standard RLHF/Constitutional AI. This works because the standard problems of edge instantiation and Goodhart's law don't show up as strongly, because the optimization mainly occurs by either:
For example, I don't concretely see what you are actually saying here: these circuits each supposedly perform some aspect of some task, and each is individually aligned? Aligned as in with the model spec? But a given circuit does something like addition or some fact lookup?
I think this model is mostly correct, and also has implications for capabilities progress/the need to switch to another paradigm/overhaul parts of the current paradigm to reach wildly superhuman capabilities.
Again, here I see an enormous conceptual leap: going from this very vague model to giving a very vague limitation of the current paradigm.
Another such leap (David Mannheim already posted this one):
For example:
The token bottleneck is real.
Sure, and so are limits like short term memory for humans. Doesn't stop us.
I would be careful about this vibe-based thinking. Increasingly, one benefit of humans might be that they are less sycophantic than LLMs, even if they are just as smart, so don't take my critique too harshly here.
My memory is that I was thinking about this purely in my head for about 1½ months (with related thoughts over the past year), with occasional notes in my logseq, and then wrote a draft over the past week. After I'd written that draft, I asked some clarifying questions about details in a Claude session yesterday, and then shared the draft, upon which some editing from my side followed.
I also searched my Claude history for "computation in superposition", which returns unrelated queries.
However:
As for the other points, I hope I can address those (plus the other critical comments in this thread) in the next two days.
I fleshed out a similar idea in (Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need
I agree that what LLMs are doing is something like what you described. But if those types of circuits can produce reasoning, which apparently they at least sort of can, the sum total behavior may be very different in kind from the local behavior that produces it.
In fact, I'd say this is also roughly what is going on with the human brain. Some pretty janky mechanisms produce each thought; but the thoughts work together to get some impressive stuff done. Humans would be nothing without system 2, what you call reasoning and don't address.
the circuits are selected by reinforcement learning, especially RLVR, to be composeable
I predict that you would also find such selection in the base models.
Sorta? My experience in playing around with base models is that the style of text being produced (and the theme-coherence of said text) depends strongly on the sampler, e.g. with DeepSeek-v3.1 base on OpenRouter the output was usually switching every ~20 tokens between different genres of text[1], and often switching back to a previous text genre in the context window (leading to an effect where English text was regularly interleaved with Chinese characters, code, and Cyrillic).
Llama3-405b-base instead drifts more slowly, but even then usually switches genre every ~100 tokens as if the current document had ended, and the genre of the new text is mostly unrelated to the previous content of the context window (maybe owing to the fact that Meta folk concatenated webpages for training?)
All these seem much reduced in RL{HF,AIF,VR}ed models.
After they'd fixed the model going into context collapse ~90% of the time.
I'm surprised you diagnose this as an optimizer issue - we're talking about SGD/Adam/..., right? Sure, I would believe that what text a base model produces is heavily influenced by training details, since a base model is trained on internet text, while RL trains on text the model produces. But whether one circuit produces data that another circuit can read sounds like a question of whether a model is good at the job it was trained for. So, in-distribution, I still expect the circuits that pre-training produced to be composeable.
This is super-interesting, thanks!
I wonder if this explains the observation that LLMs tend to have fragmented world models rather than “holistic ones” (e.g. the Fractured Entangled Representation Hypothesis, https://arxiv.org/abs/2505.11581 and other results in that spirit).
(though this source claims almost five bits per character!)
Did you read this source? The largest entropies mentioned (4.7, 4.76) are for the entropy of random characters, not English text.
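The distinction can be checked in two lines: log2 of the alphabet size gives the random-character figure, while Shannon's classic estimate for actual English prose is around 1 bit per character:

```python
import math

# Entropy of uniformly random characters (the figure the source reports):
print(f"log2(26) = {math.log2(26):.2f} bits")  # 26 letters
print(f"log2(27) = {math.log2(27):.2f} bits")  # 26 letters + space
# → log2(26) = 4.70 bits
# → log2(27) = 4.75 bits

# Real English text is far more predictable; Shannon's experiments put
# it on the order of ~1 bit per character.
```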
On first read, I broadly agree with this model.
composeable, that is each circuit takes inputs of the same type as outputs of other circuits in the network
I think this is kinda enforced by having the residual stream, or specifically, having it as the main information highway flowing through the entire network?
Experiments where swapping nearby layers of a network has very little impact on performance also suggest this.
(Maybe this explains the some of the "sameness"/"slop"-factor of LLM outputs, the "semantic type" has to match?)
Why would you expect different "types" to help?
I think this is kinda enforced by having the residual stream, or specifically, having it as the main information highway flowing through the entire network?
I was talking about the kinds of tokens that are output, more in this comment. I mostly think of one forward pass as being one circuit, but there may be some structure in the internal information flow that I'm not privy to.
Why would you expect different "types" to help?
I think this is mainly a question of whether there's capacity for circuits in the model to handle different kinds of text (like OCR errors in medieval manuscripts, usenet archive formatting details &c), vs. being mode collapsed. I guess more obscure text formats are less connected (correlated?) to circuits that are capable at solving complex problems, and fewer serial steps have to be performed on translating from one "textual ontology" into another one. (This is all counterbalanced by the need to pack as much information as possible into the next token, but my guess is that over time RLVR will add more structure/details/entropy to the Markdown-in-English chains-of-thought, instead of e.g. repurposing something like a circuit responsible for representing little bits of Yi script, which at least Llama3-405b-base can do.)
Early 2026 LLMs in scaffolds, from simple ones such as giving the model access to a scratchpad/"chain of thought" up to MCP servers, skills, and context compaction &c are quite capable. (Obligatory meme link to the METR graph.)
Yet: If someone had told me in 2019 that systems with such capability would exist in 2026, I would have strongly predicted that they would be almost uncontrollable optimizers, ruthlessly & tirelessly pursuing their goals and finding edge instantiations in everything. But they don't seem to be doing that. Current-day LLMs are just not that optimizer-y; they appear to have capable behavior without apparent agent structure.
Discussions from the time either ruled out giant lookup-tables (Altair 2024):
or specified that the optimizer must be in the causal history of such a giant lookup-table (Garrabrant 2019):
The most fitting rejoinder to the observation of capable non-optimizer AIs is probably "Just you wait"—current LLMs are capable, sure, but they're not wildly superhuman to an extent comparable to the original worries about extreme optimization pressure. In this view, they're molting into full agency right now, and we should see the problems of high optimization pressure show up by the end of 2026 or the five years after[1], if they're not hidden from us by deceptive AIs.
Indeed, current LLMs do reward-hack, though the developers have been decent at suppressing the tendency down to a consumer-acceptable level.
But I have a different theory for how LLMs can be capable without being agentic/perniciously optimizing:
LLMs are superlinear-in-network-width lookup-table-like collections of depth-limited, composeable and error-correcting circuits, computed in superposition.
One could call this the GLUT-of-circuits model of LLMs.
To elaborate:
Estimate on the circuit depth of gpt-3-davinci:

- `matmul` of a square matrix with size 12288: at logarithmic circuit depth for matmuls we get 13-14 steps
- `softmax` over an array of length
- `matmul`, 13-14 steps
- `matmul`, 13-14 steps
- `matmul`, 13-14 steps
- `matmul`, 13-14 steps

Inferences from this model:
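The "13-14 steps" figure is just the depth of a balanced reduction tree over the hidden dimension:

```python
import math

d = 12_288  # hidden size of gpt-3-davinci
# Summing d products with a balanced binary tree takes ceil(log2(d))
# serial steps, which is where "13-14 steps" per matmul comes from.
print(math.log2(d))             # → 13.584962500721156
print(math.ceil(math.log2(d)))  # → 14
```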
Circuit selection: This model would imply that circuits are selected mostly by another algorithm with very small serial depth, relying on features of a problem that can be determined by very parallel computations.
That somewhat matches my observations from looking at LLMs trying to tackle problems: It often looks to me like they try one strategy after another, and less often use detailed information from the past failed attempt to form a complicated new strategy.
It also matches what we've seen from LLMs self-preserving/blackmailing/reward-hacking: The actions seem opportunistic, not carefully hidden once they've been performed, not embedded in larger plans; they look mostly like "another strategy to try, oops, I guess that didn't quite work".
Alignment: My guess is that most or almost all of these circuits are individually aligned through bog-standard RLHF/Constitutional AI. This works because the standard problems of edge instantiation and Goodhart's law don't show up as strongly, because the optimization mainly occurs by either:
In this model every circuit is individually "aligned" (insofar such a shallow program can be misaligned at all). Chain of "thought" composes calls to related circuits (though more on CoT below).
If this view is correct, a folk view of alignment as simply "deleting/downweighting the bad parts of a model" would be mostly correct: There would be a large but finite number of circuits embedded in the model, which can be upweighted, downweighted or outright deleted by gradient descent. My extremely speculative guess is that there are fewer than a quadrillion circuits stored in superposition in a trillion-parameter model, which thorough-enough safety training could exhaustively or near-exhaustively check and select. In this view, AI alignment really would be purely bottlenecked on the amount of computation spent on whac-a-moling unaligned circuits.
People might not spend enough compute on alignment training, and that would still be a problem (though a lesser one, since the model wouldn't be actively working against the developers), but the problem of alignment would've been turned from a category I problem into a category II problem.
Chain of thought: The obvious wrinkle in this story is that I haven't talked about chain-of-"thought"/"reasoning" LLMs. It goes without saying that long chains of thought enable vastly more serial computation to be done, and I haven't yet quite teased apart how this impacts the overall picture (besides "it makes it worse").
Still, some guesses at implications from the GLUT-of-circuits models for alignment and chain-of-"thought":
Most of the other things (backtracking in a forward pass is really hard &c) I'd otherwise say here have already been said by others.
Training process: If we see this whole model as being about amortized optimization, maybe it's the training process that takes up all the optimization power? Are LLMs the most dangerous during training, or is it rather the whole training process which is dangerous?
I think this model is mostly correct, and also has implications for capabilities progress/the need to switch to another paradigm/overhaul parts of the current paradigm to reach wildly superhuman capabilities. I think it predicts we'll see some gains from training but that we'll plateau, or trade hard-to-measure capabilities for easily measurable capabilities. I think I want to point to 55% "LLMs are agents" and 45% "LLMs are stochastic parrots", but there's tons of AI capabilities forecast questions I'm not yet able to answer (e.g. the infamous "which concrete capability would you expect an LLM-based system not to have in $YEAR?"). And plausibly the whole thing is just moot because long chains of thought just enable enough chaining together to get the capabilities. Or smth, idk.
(Thanks to Claude 4.5 Sonnet for help and feedback. No part of the text was written by AI models.)
Related/prior work/inspirations:
Someone wrote a long report about this. ↩︎
The details of -sparsity are beyond the purview of this short note. ↩︎
Hänni et al. 2024 prove polynomial scaling; the Johnson-Lindenstrauss lemma, taken too seriously, could imply exponential scaling. Polynomial scaling makes this picture more feasible but less likely, since it's not clear a merely-polynomial number of circuits can deal with the complexity of the world, with its exponentially growing observation-action sequences. ↩︎
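The exponential side of this contrast can be made concrete with a Johnson-Lindenstrauss-style estimate; the constant `c` and the interference tolerance `eps` below are arbitrary illustrative choices, not derived from any model:

```python
import math

d = 12_288  # residual stream width, for illustration
eps = 0.1   # tolerated pairwise interference (arbitrary assumption)
c = 1.0     # unspecified O(1) constant, set to 1 for illustration

# JL-style packing: roughly exp(c * eps^2 * d) nearly-orthogonal
# directions fit into d dimensions.
log10_directions = c * eps**2 * d / math.log(10)
print(f"~10^{log10_directions:.0f} nearly-orthogonal directions")
# → ~10^53 nearly-orthogonal directions
```

Even with these conservative choices, the exponential bound vastly exceeds a quadrillion (10^15), which is what makes the polynomial-vs-exponential question load-bearing for the picture above.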