I basically agree with the descriptive model, but don't see that the conclusions follow.
For example:
The token bottleneck is real.
Sure, and so are limits like short term memory for humans. Doesn't stop us.
And the same applies to only using shallow heuristics - humans mostly do the same thing.
I'd guess that humans have a short term memory of much more than ten bits, though. My intuition is that human short term memory would be much more comparable to neuralese vectors than to single tokens or 4 mental "items".
But yeah, the conclusions were much more speculative than the model.
And I think insofar as humans are agentic, they apply some general purpose search instead of semi-randomly picking their heuristics? Maybe?
I'd guess that humans have a short term memory of much more than ten bits, though. My intuition is that human short term memory would be much more comparable to neuralese vectors than to single tokens or 4 mental "items".
Maybe, but I'd guess that's a difference of less than an order of magnitude - and it seems like the relevant question isn't only bits passed between circuits, since LLMs, even without reasoning, are autoregressive, so they can reason sequentially over multiple tokens. (And with reasoning, that's obviously even more true.)
insofar as humans are agentic, they apply some general purpose search instead of semi-randomly picking their heuristics?
To the extent that LLM agents are agents, they definitely do this too! And if we're talking about single-forward-pass reasoning, very few humans intentionally train their system 1 to do something better than semi-randomly follow patterns that worked before. (If you don't know what I'm referring to, see the discussion of firefighters not actually making decisions and the resolution of the debate about system 1 / system 2 in Thinking Fast and Slow.)
Yeah, I'm not sure in which ways the analogy works/doesn't work. (It's curious that for LLMs, tokens function as observations, as actuators, and as short-term/medium-term memory).
My intuition is still that it's a large benefit that humans can maintain some large-ish latent state while reasoning, while an LLM chain of "thought" could be analogized to a row of copies of the same human standing in sequence, each person hearing what the person before them said, thinking for ¿a second? (100 layers at 5ms for neuronal firing), and then repeating what they heard to their successor, appending a single word. Caveats apply from sampling methods, e.g. beam search.
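For concreteness, the relay analogy's timing is just trivial arithmetic; both numbers (100 layers, ~5 ms per serial neuronal step) are the comment's assumptions, not measurements:

```python
layers = 100           # assumed transformer depth (from the comment)
neuron_step_s = 0.005  # ~5 ms per serial neuronal firing (assumption)

# Serial "human-equivalent" thinking time per appended token under
# the relay-of-humans analogy.
serial_thought_per_token = layers * neuron_step_s
print(f"{serial_thought_per_token} s of serial 'thought' per appended token")
# → 0.5 s of serial 'thought' per appended token
```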
When I think about implementing general-purpose search this way, my guess is that the bottleneck is squeezing everything into a single most-informative token, and my intuition is that general-purpose search needs quite deep serial computation on a world model; or maybe it's possible to implement it this way, but the token bottleneck is very annoying.
If I were to argue against myself, I'd say "each token can add ~10 bits of optimization pressure, but each successive forward pass can add another ~10 bits of optimization pressure to the existing context window, so you end up with a lot of bits of optimization pressure anyway". My skeptical answer would then be to question whether bits of optimization pressure can be additive at all, and if so, whether they are in this case.
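As a rough sanity check on these numbers, assuming a GPT-style vocabulary of ~50k tokens and taking the contested additivity assumption at face value:

```python
import math

vocab_size = 50_257  # GPT-2/3-style vocabulary size (assumption)
print(f"max bits per sampled token: {math.log2(vocab_size):.1f}")
# → max bits per sampled token: 15.6

# Under the contested "additive" assumption, ~10 effective bits per
# forward pass over a 1,000-token chain would accumulate to:
tokens = 1_000
bits_per_pass = 10
print(f"total: ~{bits_per_pass * tokens:,} bits of optimization pressure")
# → total: ~10,000 bits of optimization pressure
```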
I'd guess that humans have a short term memory of much more than ten bits, though.
LLMs aren't limited to only tokens as inputs though. They can also attend to internal states as long as they're in previous layers. This has limits to how much useful data can be passed from previous positions but it's way more than 10 bits.
But an LLM's short-term memory between forward passes includes everything accessible via attention, not just the vertical slice at the current position. Treating the single 10-bit token as the full memory misses the vast majority of the inputs at any given layer.
For example, if an LLM makes a decision in an early layer at position n, it can reference that decision directly in any later layer in positions after n, without going through the tokens.
This is limited since there's only O(100) layers to work with, but it's a meaningful amount of memory.
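A back-of-envelope comparison, with purely illustrative numbers, of the ~10-bit token bottleneck versus the state reachable through attention:

```python
d_model = 12_288      # residual stream width, gpt-3-davinci scale (assumption)
layers = 96           # depth, O(100) as the comment says (assumption)
context = 1_000       # positions already in the context window (assumption)
bits_per_dim = 1      # deliberately conservative effective precision

# Residual states stacked above one position, and the whole
# attendable context. Both dwarf the ~10 bits of a sampled token.
per_position_state = layers * d_model * bits_per_dim
whole_context_state = context * per_position_state

print(f"one position's residual states: ~{per_position_state:,} bits")
print(f"whole attendable context:       ~{whole_context_state:,} bits")
```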
I'm not sure what you mean here by "circuit alignment". Shallow circuits in the limit allow almost every possible behavior (imagine that every circuit is literally a gate in a boolean-circuit-complete architecture; you can combine them into arbitrary programs). To the extent that alignment is happening here, I would expect it to be very in-distribution?
Another piece of sad news is that in this scenario we can't partially decipher LLM cognitive algorithms and use them to make legible AI, because here we basically have GOFAI, as in "if only we had enough labor to encode it, and not too much perfectionism fretting about shallow features".
How does your model account for non-trivial features like [here](https://arxiv.org/abs/2502.00873v1)?
I think that in reality there is some deep generalization happening, but by default "neural networks are lazy", so the majority of out-of-distribution work is done by an enormous number of shallow circuits.
Speculation: episodic memory is decluttering of shallow circuits to free space for more generalized learning.
I think that in reality there is some deep generalization happening, but by default "neural networks are lazy", so the majority of out-of-distribution work is done by an enormous number of shallow circuits.
Not sure what you mean by "deep generalization", but in general, I don't see how generalization is incompatible with shallow circuits. I haven't read the Tegmark paper you linked, but if it's something like Neel Nanda's grokking of modular arithmetic work, that circuit was also pretty shallow (one-layer transformer IIRC).
I think it's explained by the fact that modular arithmetic is not very complicated. By deep generalization I mean something like "a semantically rich world model encoded in a relatively small number of circuits". You can have a world model encoded in a large number of shallow circuits, but I think my point about the resulting in-distribution-only alignment probably stands.
I did some theory trying to figure out why this kind of thing might be true. Specifically, I was contrasting how natural selection produced systems with extremely general intelligence (the learning algorithm of the human brain), whereas gradient descent tends to instill shallower and more task-specific circuits.
One reason might be that evolution generates mutations randomly, and then selects for utility across an entire lifetime. Highly general adaptations, like the learning algorithm in the brain, accrue more and more fitness advantage every time they're used. So evolution selects strongly for extremely general-purpose algorithms.
Gradient descent, by contrast, updates a network by tuning each weight in the network to improve performance on each individual training example (and then averaging those together for many examples). Because these updates aren't random, but rather locally optimal, you lose the chance to luck into updates that are less-than-optimal on any given training example, even if they'd prove extremely valuable if tested across a wide range of scenarios (a la an ultra-general learning algorithm). Gradient descent's locally optimal updates, as opposed to random mutations w/ selection over lifetimes, instead bias towards learning local structure.
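A minimal toy contrast of the two update rules being compared here: a locally optimal gradient step versus a random mutation kept only when it helps. The 1-D objective and all constants are made up for illustration; there is no claim that either rule resembles real training or biology.

```python
import random
random.seed(0)

def loss(w):
    """Toy 1-D objective with optimum at w = 3."""
    return (w - 3.0) ** 2

def grad_step(w, lr=0.1):
    """Locally optimal update: follow the gradient on this example."""
    grad = 2 * (w - 3.0)
    return w - lr * grad

def mutate_select(w, sigma=0.5):
    """Random mutation, kept only if it improves fitness."""
    candidate = w + random.gauss(0, sigma)
    return candidate if loss(candidate) < loss(w) else w

w_gd, w_evo = 0.0, 0.0
for _ in range(100):
    w_gd = grad_step(w_gd)
    w_evo = mutate_select(w_evo)

print(f"gradient descent: w = {w_gd:.3f}")   # converges smoothly toward 3
print(f"mutate+select:    w = {w_evo:.3f}")  # also near 3, via accepted lucky jumps
```

Both rules solve this trivial problem; the text's point is about which kinds of structure each rule tends to find in vastly higher-dimensional settings, which a 1-D toy cannot show.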
I don't feel like this is a polished formulation of the theory, but something like this might help explain some differences in the character of evolved general intelligence in humans, and the apparently fragmented bags of heuristics learned by neural networks. (Of course, the stuff humans learn within a lifetime seems to have more of that "fragmented bag of heuristics" character; this is related to the reason lifetime learning is a better analogy to gradient descent than natural selection is.)
natural selection produced systems with an extremely general intelligence (the learning algorithm of the human brain)
&
evolution selects strongly for extremely general-purpose algorithms
Natural selection mostly produces rather narrow intelligences, and in this sense, the brains of humans (and other smart animals) are an outlier. Although I vaguely expect even bacterial cell biology to involve some computation-like stuff that would be difficult to train modern NNs to emulate veridically, especially when it comes to robustness of processes against various sorts of perturbations. A separate thing is the generality of evolution itself, if you frame it as an algorithm.
In general, I think I agree with your framing, but I would emphasize the component of slack / exploration / lack of premature optimization.
@cdt, I don't understand what it is that you don't understand.
For the first 2 billion years of life's existence on Earth, the planet was covered by unicellular slime. It took another ~billion for animals to emerge and a few more hundreds of millions for complex brains capable of more flexible, general behavior, within-lifetime learning, and so on.[1] It seems likely that had Chicxulub not killed off the non-avian dinosaurs (or substitute some other major catastrophic event), human-par general intelligence would not have evolved.
cf. "complex active bodies", i.e., arthropods, cephalopods, and vertebrates, as the lowest bar
Thank you for taking the time to explain further. I had originally interpreted "narrow intelligence" strictly, but based on your lowest bar this would include the majority of animal biomass and the vast majority of its species.
I am not sure the extent to which contemporary species provide evidence for or against the algorithmic properties of evolved behaviour. I am also not sure how much ecological opportunity enables or prevents this. It's a good question and one I have not read in the literature before.
What do you mean by "algorithmic properties of evolved behaviour"?
I am also not sure how much ecological opportunity enables or prevents this.
I think a lot. Given how long it took for humans to evolve and how this seems to have been enabled by a bunch of "random" events, like an asteroid wiping out the dinosaurs.[1]
"Reasonably impressively smart" animals are common and "easy to evolve": cephalopods, corvids, parrots, elephants, primates, raccoons, cetaceans, elephantfish, etc. It seems to me that the three main things you need are (1) the/a right sort of body plan[2]; (2) a nearby niche that benefits from intelligence; (3) time to evolve intelligence ("niche stability"?). For something like humans to emerge, a greater number of unlikely factors have to align.
Although it's possible that the Cretaceous extinction (and maybe some other major or minor extinctions too) was a type of event that is, in general, very likely to happen. Something like: you have "dumb", more environmentally rigid macrofauna and "smart", more environmentally flexible microfauna. At some point, you're going to have a major ecological disruption, so the former mostly dies off, and the latter can fill in its niche. This seems like a plausibly common pattern. But I'm speculating.
I mean "body plan" to include "brain plan".
It is hard to describe evolution as "fast" or "slow" without a yard-stick. Often slow relative to ecological time, perhaps, but I don't understand the idea that 3 billion years to find learning is "slow".
What do you mean by "algorithmic properties of evolved behaviour"?
It is not clear to me that the behavioural properties of contemporary species provide much information as to the capabilities of evolution or how fast it can reach certain behavioural adaptations: strong evidence of possibility, weak evidence of speed, almost no evidence of impossibility. Similarly, the stretch of time it took to produce humanity, and the events it took to get there, are not really evidence of the difficulty of human adaptations.
I can agree that it looks like certain adaptations appear together re: body plan, but "appearance of nearby niches" and "environmental(?) stability" are controlled mostly by ecological factors like distribution and dispersal. So perhaps ecological opportunity controls this more than I anticipated. One thing that struck me while reading your reply was that general learning seems energy-intensive, so it would be dependent on available resource flux from the ecosystem, and this would push the evolution of learning later in time. But again, this is more a claim about ecological factors, and it's not clear to me what that says about "what natural selection produces" or "natural selection vs gradient descent". Thanks for the interesting thoughts.
It is hard to describe evolution as "fast" or "slow" without a yard-stick. Often slow relative to ecological time, perhaps, but I don't understand the idea that 3 billion years to find learning is "slow".
Clock time is a perfectly valid yardstick. An adaptation that takes evolution a few thousand/million years to find can be found much more quickly by a competent team of human biologists.
Another valid yardstick would be something like computational efficiency or even more generally efficient use of resources (other than time, which I just covered). Natural selection proceeds via blind generate-and-test.[1] With something like AlphaFold, you can do better.
I can agree that it looks like certain adaptations appear together re: body plan
It sounds to me like you misinterpreted what I was saying about the body plan. I meant (/ I should have taken time to clarify) that you need some sort of basic pre-adaptation (vaguely analogous to pre-training an LLM) to make use of such an opportunity, and one such pre-adaptation is having a body that is generally agile/adaptive and can evolve into even more adaptive forms capable of exploiting new niches. Compare arthropods vs earthworms.
but "appearance of nearby niches" and "environmental(?) stability" are controlled mostly by ecological factors like distribution and dispersal. So perhaps ecological opportunity controls this more than I anticipated.
Yep.
One thing that struck me while reading your reply was that general learning seems energy-intensive, so it would be dependent on available resource flux from the ecosystem, and this would push the evolution of learning later in time.
Yeah, that (or more generally, intelligence being expensive) is why you sometimes see lineages "reverting" to less intelligent/brainy forms, e.g., proto-molluscs into mussels. I think also sea squirts.
But again, this is more a claim about ecological factors, and it's not clear to me what that says about "what natural selection produces" or "natural selection vs gradient descent".
Yeah, so the claim can be refined to: It's unusual for ecological factors to produce conditions favorable for high intelligence, especially given that it's not even one specific condition, but rather a series of ecological conditions[2] that hand-holdably lead a lineage into a form at which it starts being capable of reshaping its environment into a stable form favorable to its survival.
Although what new genetic variants it can generate at any given point depends on the current genotype and genotypes may tend to get selected for being better at spawning potentially useful mutations (in the context of the rest of the genome being held constant), cf. https://en.wikipedia.org/wiki/Evolvability#Evolution_of_evolvability. So, it is "blind", but, in a sense, it slowly refines its "priors" or "generative biases".
Not necessarily a unique series of ecological conditions. It seems very unlikely that there's only one rough ecological-evolutionary pathway to general intelligence.
evolution generates mutations randomly
Evolution's timescale and learning timescale are fundamentally different. Evolution only programs basic instincts into our brains and sets a mechanism for tweaking hyperparameters (and tweaks our bodies, but that's less important than evolution's effects on our brains). Most other parts of our brains are initialised randomly.
Additionally, within-lifetime learning of the humans and LLMs occurs far differently. Human brains are wildly neuralese and update their weights over the entire lifetime. The LLMs, on the other hand, use only a primitive CoT/memory system to store information related to the task itself and cannot learn anything from the task until it has already been completed.
Additionally, within-lifetime learning of the humans and LLMs occurs far differently. Human brains are wildly neuralese and update their weights over the entire lifetime. The LLMs, on the other hand, use only a primitive CoT/memory system to store information related to the task itself and cannot learn anything from the task until it has already been completed.
Right. This is why, at the end, I compared within-lifetime learning to gradient descent rather than in-context learning.
Evolution's timescale and learning timescale are fundamentally different. Evolution only programs basic instincts into our brains and sets a mechanism for tweaking hyperparameters
This is true, although one of us must be misunderstanding the other somewhere, if you meant this as an objection. Evolution is slow, and selects over an aggregate of many situations (where input is fed into the body and an action is selected). Gradient descent is fast, and selects over behaviors in individual situations.
A rough handle for the differences is that evolution tends to produce coarse-grained adaptations, due to mutations accruing more fitness advantage if they're used in many situations across a lifetime. Gradient descent produces fine-grained adaptations, due to updates being designed to improve performance on individual training examples, rather than randomly generated and then strongly selected if they're useful across many training examples.
"Thanks to Claude 4.5 Sonnet for help and feedback. No part of the text was written by AI models."
Could you describe a bit more how you used Claude, how the ideation took place?
My fear is that you started out with a fuzzy idea of a circuit lookup table, then talked to Claude, and it eventually convinced you that this has massive implications for alignment. I remain highly skeptical of this; there is a high risk of deviating into vibe thinking. I think your arguments at multiple points leave the realm of valid reasoning and draw wide, unsupported conclusions. This is an easy way AI-assisted alignment might fail.
Alignment: My guess is that most or almost all of these circuits are individually aligned through bog-standard RLHF/Constitutional AI. This works because the standard problems of edge instantiation and Goodhart's law don't show up as strongly, because the optimization mainly occurs by either:
For example, I don't concretely see what you are actually saying here: these circuits each supposedly perform some aspect of some task, and each is individually aligned? Aligned as in with the model spec? But a given circuit does something like addition or some fact lookup?
I think this model is mostly correct, and also has implications for capabilities progress/the need to switch to another paradigm/overhaul parts of the current paradigm to reach wildly superhuman capabilities.
Again, here I see an enormous conceptual leap: going from this very vague model to giving a very vague limitation of the current paradigm.
Another such leap (David Mannheim already posted this one):
For example:
The token bottleneck is real.
Sure, and so are limits like short term memory for humans. Doesn't stop us.
I would be careful about this vibe-based thinking. Increasingly, one benefit of humans might be that they are less sycophantic than LLMs, even if they are just as smart, so don't take my critique too harshly here.
My memory is that I was thinking about this purely in my head for about 1½ months (with related thoughts over the past year), with occasional notes in my logseq, and then wrote a draft over the past week. After I'd written that draft, I asked some clarifying questions about details in a Claude session yesterday, and then shared the draft, upon which some editing from my side followed.
I also searched my Claude history for "computation in superposition", which returns unrelated queries.
However:
As for the other points, I hope I can address those (plus the other critical comments in this thread) in the next two days.
I fleshed out a similar idea in (Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need
I agree that what LLMs are doing is something like what you described. But if those types of circuits can produce reasoning, which apparently they at least sort of can, the sum total behavior may be very different in kind from the local behavior that produces it.
In fact, I'd say this is also roughly what is going on with the human brain. Some pretty janky mechanisms produce each thought; but the thoughts work together to get some impressive stuff done. Humans would be nothing without system 2, what you call reasoning and don't address.
the circuits are selected by reinforcement learning, especially RLVR, to be composeable
I predict that you would also find such selection in the base models.
Sorta? My experience in playing around with base models is that the style of text being produced (and the theme-coherence of said text) depends strongly on the sampler, e.g. with DeepSeek-v3.1 base on OpenRouter the output was usually switching every ~20 tokens between different genres of text[1], and often switching back to a previous text genre in the context window (leading to an effect where English text was regularly interleaved with Chinese characters, code, and Cyrillic).
Llama3-405b-base instead drifts more slowly, but even then usually switches genre every ~100 tokens as if the current document had ended, and the genre of the new text is mostly unrelated to the previous content of the context window (maybe owing to the fact that Meta folk concatenated webpages for training?)
All these seem much reduced in RL{HF,AIF,VR}ed models.
After they'd fixed the model going into context collapse ~90% of the time.
I'm surprised you diagnose this as an optimizer issue - we're talking about SGD/Adam/..., right? Sure, I would believe that what text a base model produces is heavily influenced by training details, since a base model is trained on internet text, while RL trains on text the model produces. But whether one circuit produces data that another circuit can read sounds like a question of whether a model is good at the job it was trained for. So, in-distribution, I still expect the circuits that pre-training produced to be composeable.
This is super-interesting, thanks!
I wonder if this explains the observation that LLMs tend to have fragmented world models rather than “holistic ones” (e.g. the Fractured Entangled Representation Hypothesis, https://arxiv.org/abs/2505.11581 and other results in that spirit).
(though this source claims almost five bits per character!)
Did you read this source? The largest entropies mentioned (4.7, 4.76) are for the entropy of random characters, not English text.
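The distinction can be checked in two lines: log2 of the alphabet size gives the random-character figure, while Shannon's classic estimate for actual English prose is around 1 bit per character:

```python
import math

# Entropy of uniformly random characters (the figure the source reports):
print(f"log2(26) = {math.log2(26):.2f} bits")  # 26 letters
print(f"log2(27) = {math.log2(27):.2f} bits")  # 26 letters + space
# → log2(26) = 4.70 bits
# → log2(27) = 4.75 bits

# Real English text is far more predictable; Shannon's experiments put
# it on the order of ~1 bit per character.
```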
On first read, I broadly agree with this model.
composeable, that is each circuit takes inputs of the same type as outputs of other circuits in the network
I think this is kinda enforced by having the residual stream, or specifically, having it as the main information highway flowing through the entire network?
Experiments where swapping nearby layers of a network has very little impact on performance also suggest this.
(Maybe this explains the some of the "sameness"/"slop"-factor of LLM outputs, the "semantic type" has to match?)
Why would you expect different "types" to help?
I think this is kinda enforced by having the residual stream, or specifically, having it as the main information highway flowing through the entire network?
I was talking about the kinds of tokens that are output, more in this comment. I mostly think of one forward pass as being one circuit, but there may be some structure in the internal information flow that I'm not privy to.
Why would you expect different "types" to help?
I think this is mainly a question of whether there's capacity for circuits in the model to handle different kinds of text (like OCR errors in medieval manuscripts, usenet archive formatting details &c), vs. being mode collapsed. I guess more obscure text formats are less connected (correlated?) to circuits that are capable at solving complex problems, and fewer serial steps have to be performed on translating from one "textual ontology" into another one. (This is all counterbalanced by the need to pack as much information as possible into the next token, but my guess is that over time RLVR will add more structure/details/entropy to the Markdown-in-English chains-of-thought, instead of e.g. repurposing something like a circuit responsible for representing little bits of Yi script, which at least Llama3-405b-base can do.)
Early 2026 LLMs in scaffolds, from simple ones such as giving the model access to a scratchpad/"chain of thought" up to MCP servers, skills, and context compaction &c are quite capable. (Obligatory meme link to the METR graph.)
Yet: If someone had told me in 2019 that systems with such capability would exist in 2026, I would have strongly predicted that they would be almost uncontrollable optimizers, ruthlessly & tirelessly pursuing their goals and finding edge instantiations in everything. But they don't seem to be doing that. Current-day LLMs are just not that optimizer-y; they appear to have capable behavior without apparent agent structure.
Discussions from the time either ruled out giant lookup-tables (Altair 2024):
or specified that the optimizer must be in the causal history of such a giant lookup-table (Garrabrant 2019):
The most fitting rejoinder to the observation of capable non-optimizer AIs is probably "Just you wait"—current LLMs are capable, sure, but they're not wildly superhuman to an extent comparable to the original worries about extreme optimization pressure. In this view, they're molting into full agency right now, and we should see the problems of high optimization pressure show up by the end of 2026 or the five years after[1], if they're not hidden from us by deceptive AIs.
Indeed, current LLMs do reward-hack, though the developers have been decent at suppressing the tendency down to a consumer-acceptable level.
But I have a different theory for how LLMs can be capable without being agentic/perniciously optimizing:
LLMs are superlinear-in-network-width lookup-table-like collections of depth-limited, composeable and error-correcting circuits, computed in superposition.
One could call this the GLUT-of-circuits model of LLMs.
To elaborate:
Estimate on the circuit depth of gpt-3-davinci:

- `matmul` of a square matrix with size 12288: at logarithmic circuit depth for matmuls we get 13-14 steps
- `softmax` over an array of length
- `matmul`, 13-14 steps
- `matmul`, 13-14 steps
- `matmul`, 13-14 steps
- `matmul`, 13-14 steps

Inferences from this model:
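The "13-14 steps" figure is just the depth of a balanced reduction tree over the hidden dimension:

```python
import math

d = 12_288  # hidden size of gpt-3-davinci
# Summing d products with a balanced binary tree takes ceil(log2(d))
# serial steps, which is where "13-14 steps" per matmul comes from.
print(math.log2(d))             # → 13.584962500721156
print(math.ceil(math.log2(d)))  # → 14
```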
Circuit selection: This model would imply that circuits are selected mostly by another algorithm with very small serial depth, relying on features of a problem that can be determined by very parallel computations.
That somewhat matches my observations from looking at LLMs trying to tackle problems: It often looks to me like they try one strategy after another, and less often use detailed information from the past failed attempt to form a complicated new strategy.
It also matches what we've seen from LLMs self-preserving/blackmailing/reward-hacking: The actions seem opportunistic, not carefully hidden once they've been performed, not embedded in larger plans; they look mostly like "another strategy to try, oops, I guess that didn't quite work".
Alignment: My guess is that most or almost all of these circuits are individually aligned through bog-standard RLHF/Constitutional AI. This works because the standard problems of edge instantiation and Goodhart's law don't show up as strongly, because the optimization mainly occurs by either:
In this model every circuit is individually "aligned" (insofar such a shallow program can be misaligned at all). Chain of "thought" composes calls to related circuits (though more on CoT below).
If this view is correct, a folk view of alignment as simply "deleting/downweighting the bad parts of a model" would be mostly correct: There would be a large but finite number of circuits embedded in the model, which can be upweighted, downweighted or outright deleted by gradient descent. My extremely speculative guess is that there are fewer than a quadrillion circuits stored in superposition in a trillion-parameter model, which thorough-enough safety training could exhaustively or near-exhaustively check and select. In this view, AI alignment really would be purely bottlenecked on the amount of computation spent on whac-a-moling unaligned circuits.
People might not spend enough compute on alignment training, and that would still be a problem (though a lesser one, since the model wouldn't be actively working against the developers), but the problem of alignment would've been turned from a category I problem into a category II problem.
Chain of thought: The obvious wrinkle in this story is that I haven't talked about chain-of-"thought"/"reasoning" LLMs. It goes without saying that long chains of thought enable vastly more serial computation to be done, and I haven't yet quite teased apart how this impacts the overall picture (besides "it makes it worse").
Still, some guesses at implications from the GLUT-of-circuits models for alignment and chain-of-"thought":
Most of the other things (backtracking in a forward pass is really hard &c) I'd otherwise say here have already been said by others.
Training process: If we see this whole model as being about amortized optimization, maybe it's the training process that takes up all the optimization power? Are LLMs the most dangerous during training, or is it rather the whole training process which is dangerous?
I think this model is mostly correct, and also has implications for capabilities progress/the need to switch to another paradigm/overhaul parts of the current paradigm to reach wildly superhuman capabilities. I think it predicts we'll see some gains from training but that we'll plateau, or trade hard-to-measure capabilities for easily measurable capabilities. I think I want to point to 55% "LLMs are agents" and 45% "LLMs are stochastic parrots", but there's tons of AI capabilities forecast questions I'm not yet able to answer (e.g. the infamous "which concrete capability would you expect an LLM-based system not to have in $YEAR?"). And plausibly the whole thing is just moot because long chains of thought just enable enough chaining together to get the capabilities. Or smth, idk.
(Thanks to Claude 4.5 Sonnet for help and feedback. No part of the text was written by AI models.)
Related/prior work/inspirations:
Someone wrote a long report about this. ↩︎
The details of -sparsity are beyond the purview of this short note. ↩︎
Hänni et al. 2024 prove polynomial scaling; the Johnson-Lindenstrauss lemma, taken too seriously, could imply exponential scaling. Polynomial scaling makes this picture more feasible but less likely, since it's not clear a merely-polynomial number of circuits can deal with the complexity of the world, with its exponentially growing observation-action sequences. ↩︎
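The exponential side of this contrast can be made concrete with a Johnson-Lindenstrauss-style estimate; the constant `c` and the interference tolerance `eps` below are arbitrary illustrative choices, not derived from any model:

```python
import math

d = 12_288  # residual stream width, for illustration
eps = 0.1   # tolerated pairwise interference (arbitrary assumption)
c = 1.0     # unspecified O(1) constant, set to 1 for illustration

# JL-style packing: roughly exp(c * eps^2 * d) nearly-orthogonal
# directions fit into d dimensions.
log10_directions = c * eps**2 * d / math.log(10)
print(f"~10^{log10_directions:.0f} nearly-orthogonal directions")
# → ~10^53 nearly-orthogonal directions
```

Even with these conservative choices, the exponential bound vastly exceeds a quadrillion (10^15), which is what makes the polynomial-vs-exponential question load-bearing for the picture above.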