Epistemic status: Theorizing on topics I’m not qualified for. Trying my best to be truth-seeking instead of hyping up my idea. Not much here is original, but hopefully the combination is useful. This hypothesis deserves more time and consideration but I’m sharing this minimal version to get some feedback before sinking more time into it. “We believe there’s a lot of value in articulating a strong version of something one may believe to be true, even if it might be false.”

This is a somewhat living document as I come back and add more ideas.

The Heuristics Hypothesis: A Bag of Heuristics is All There Is and a Bag of Heuristics is All You Need

  • A heuristic is a local, interpretable, and simple function (e.g., boolean/arithmetic/lookup functions) learned from the training data. There are multiple heuristics in each layer and their outputs are used in later layers.
    • It would be useful to treat heuristics as the fundamental object of study in interpretability as opposed to features.
  • By “All there is,” I claim that a bag of heuristics is a useful model for neural network computation. Neural networks generalize when they are able to combine learned heuristics in ways not seen in the training data.[1]
    • Note that this doesn’t mean that LLMs aren’t doing some form of search or planning, but rather that it would be useful to think about the search/planning process as being implemented through heuristics.
  • By “All you need” I mean that learning lots of heuristics and how to combine them is all you need to get to AGI and beyond.
    • I’m less confident about the AGI part, but I am fairly confident that we can get more powerful models through scaling, and that scaling is mostly about learning more heuristics and composing them. We can probably get much more powerful models that still mostly rely on heuristics-based computation.
    • If this is true, then we can answer theoretical alignment questions and forecast future capabilities by studying how heuristics are learned and combined as we scale models. 
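To make the picture concrete, here is a minimal sketch of a two-layer "bag of heuristics" in plain Python. Everything in it is an illustrative assumption rather than something recovered from a real model: the feature names, the `is_on` threshold, the `run_layers` helper, and the choice to represent the residual stream as a dictionary of named features that heuristics read from and additively write to.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the residual stream: named features with scalar activations.
Features = Dict[str, float]
Heuristic = Callable[[Features], Features]


def is_on(feats: Features, name: str) -> bool:
    """Treat a feature as active when its activation exceeds 0.5 (an arbitrary threshold)."""
    return feats.get(name, 0.0) > 0.5


def lookup_occupation(feats: Features) -> Features:
    """Lookup heuristic: entity -> occupation."""
    if is_on(feats, "entity:Michael Jordan"):
        return {"occupation:basketball player": 1.0}
    return {}


def lookup_landmark_city(feats: Features) -> Features:
    """Lookup heuristic: landmark -> city."""
    if is_on(feats, "entity:Eiffel Tower"):
        return {"city:Paris": 1.0}
    return {}


def basketball_in_paris(feats: Features) -> Features:
    """Boolean heuristic in a later layer: AND of two earlier outputs."""
    if is_on(feats, "occupation:basketball player") and is_on(feats, "city:Paris"):
        return {"topic:basketball in Paris": 1.0}
    return {}


def run_layers(inputs: Features, layers: List[List[Heuristic]]) -> Features:
    """Apply each layer's heuristics in parallel, then add their outputs to the stream."""
    stream = dict(inputs)
    for layer in layers:
        updates: Features = {}
        # Every heuristic in a layer reads the same incoming stream.
        for heuristic in layer:
            for name, value in heuristic(stream).items():
                updates[name] = updates.get(name, 0.0) + value
        # Writes are additive, so later layers can read earlier outputs.
        for name, value in updates.items():
            stream[name] = stream.get(name, 0.0) + value
    return stream


LAYERS = [[lookup_occupation, lookup_landmark_city], [basketball_in_paris]]
out = run_layers({"entity:Michael Jordan": 1.0, "entity:Eiffel Tower": 1.0}, LAYERS)
print(sorted(name for name in out if is_on(out, name)))
```

In this toy, each lookup could have been learned from separate training examples. The later AND heuristic then fires on an input that combines them, which is the sense of "generalization by composition" used above.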

Why would you want to use the heuristics-based framework when thinking about neural networks?

  • I think it’s probably a good abstraction and accurately captures what the network is doing (see the Empirical Studies related to the hypothesis section).
    • In “using existing interpretability techniques to discover heuristics”, I propose some experiments to test the usefulness of this framework. I would love to hear your feedback on this post, but I'm especially interested in people's ideas on how to test this hypothesis.
  • Heuristics-based explanations are fundamentally algorithmic, which could allow them to sidestep some issues related to causal interpretations (e.g., multiple redundant causes of the same behavior).
  • Treating heuristics as the fundamental unit in interpretability as opposed to features:
    • Focuses us on the functional computation done by the neural network component and ensures that whatever we find is relevant to “how the model is doing this task.”
    • Like features, they are also discrete and independent from the rest of the network. For example, we can investigate the heuristic “if this MLP layer sees ‘Michael Jordan,’ it writes ‘Professional Basketball Player’ to the occupation space of the residual stream” without needing to trace out the entire causal graph/circuit.
    • Intuitively, a single matrix multiplication or MLP layer can't be implementing any algorithm that's super complicated.
    • Counterpoint: the main bottleneck is probably distributed computations. That is, multiple parts of the network work together to complete what to humans feels like a single function. (see e.g., the work on fact finding, and appendix G here)
  • While I might be adding complexity by introducing yet another abstraction, I’ve found it useful in the past to have multiple frameworks to apply to a problem to tease it apart from different angles. I think adding a new idea probably does more good than harm.
  • I believe that breaking down forward passes into functions that fit our vague notion of “heuristics” is likely the easiest path to fulfill our vague high-level goal of “understanding why a neural network did a thing.”
    • This approach can look the same as a circuit-style analysis operationally, but I have some ideas on exactly how to draw these circuits through the lens of the heuristics hypothesis.

Feel free to jump around this post and check out the sections that interest you. Each section is mostly independent of the others.

How can interpretability win if the hypothesis is true?

I want to first clarify something I do not think we need to win: a one-to-one mapping between neural network computation and heuristics. I believe that we can have multiple acceptable heuristics-based explanations for a given forward pass (i.e., a one-to-many map). Any explanation that fits the following criteria—mostly copied from the original IOI paper—would be sufficient for “understanding why the model did what it did.” 

  • Faithful: They correctly represent the underlying computation the model does.
  • Complete: They capture all of the computation used by the model.
  • Minimal: They do not capture any more than the needed heuristics.
  • Comprehensible: We can understand what each heuristic does, and we can understand how all of the heuristics work together (likely with AI help).

I believe a bag of heuristics is the easiest way to fulfill these four criteria on arbitrary inputs. 

Corollary: Understanding neural network computation does not require us to learn “true features” as long as we have some set of faithful, complete, minimal, and comprehensible heuristics

A central way people evaluate sparse autoencoders (SAEs) is whether they find a set of “true” features. Researchers have varying intuitions on what true features should be, but a common theme is that they should be atomic (i.e., not composed of linear combinations of other features). This has led to people worrying that the sparsity term in the SAE loss leads to models combining commonly occurring atomic features into a single one (e.g., a red triangle feature instead of a red feature and a triangle feature, see also the recent work on feature absorption).

While learning intermediate variables in neural networks is a useful subgoal, I’m worried that the pursuit of atomic features—especially given that we can already get some sort of feature decompositions—is not the most productive task we could work on right now.

We should only care about features insofar as they are the inputs and outputs of heuristics/circuits, and we should only care about monosemanticity insofar as it helps us understand the network. If our heuristic decomposition is faithful, complete, and minimal, it doesn’t matter if individual heuristics take non-atomic concepts as inputs as long as we humans can understand the composed concept (likely given AI aid). 

Weak to strong winning

Here are various degrees of winning if the heuristics hypothesis is true. 

Weak victory: We can decompose every forward pass into heuristics composed with each other. That is, we can throw away the rest of the activations and use only the heuristics to reconstruct the input-output relationship to a high fidelity.

  • Perhaps we’ll get something on the order of 10^3-10^6 heuristics[2] per forward pass, which we can use LLMs to disentangle.
  • I believe this is the weakest version of “explain why the network did what it did.”
    • For example, getting a list of heuristics for a specific forward pass doesn’t have to tell us anything about how the model would act if the inputs were different.
  • Still, this feels like an ambitious goal, and even partial successes could be useful for auditing/Mechanistic Anomaly Detection/general science of interpretability.
    • This is especially true if we can learn about how neural networks complete tasks that we currently do not know how to write algorithms for or even for humans to complete themselves (see, e.g., learning chess concepts from AlphaZero, or the artificial artificial neural network in curve circuits)
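One way to put a number on "high fidelity" is to compare the model's output distribution with the distribution produced by the heuristics-only reconstruction. The sketch below uses KL divergence over next-token logits. That metric, and the names `fidelity_gap`, `model_logits`, and `heuristic_logits`, are assumptions for illustration, and the logit values are made up.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; q is clamped to eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def fidelity_gap(model_logits: Sequence[float], heuristic_logits: Sequence[float]) -> float:
    """Lower is better: 0 means the reconstruction matches the model's distribution."""
    return kl_divergence(softmax(model_logits), softmax(heuristic_logits))


# Made-up logits over a three-token vocabulary.
model = [2.0, 0.5, -1.0]
close_reconstruction = [1.9, 0.6, -1.0]
poor_reconstruction = [-1.0, 0.5, 2.0]

print(fidelity_gap(model, close_reconstruction))
print(fidelity_gap(model, poor_reconstruction))
```

A per-forward-pass score like this says nothing about other inputs, which matches the caveat above that a weak victory need not tell us how the model would act on different inputs.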

Medium victory: In addition to individual forward passes, we understand sets of heuristics that a model uses to solve what humans can think of as “tasks.” That is, we understand all heuristics that handle a certain class of inputs (e.g., the IOI circuit).

  • This is equivalent to having causal abstractions[3] for tasks that are robust for all variations of said task.
  • The distinction between weak and medium victory is also discussed in Mueller 2024, who writes:
    • If one’s goal is to understand how a model will generalize, one should also consider at least some local causal dependencies. However, if one’s goal is merely to understand which components will directly affect downstream performance (e.g., when editing or pruning models), it may suffice to only include components that directly affect the output…the Pareto frontier may simply consist of the minimal number of features needed to understand whether a model is making the right decisions in the right way.

Strong victory: We know every heuristic in the model and how they compose, which is analogous to the “reverse engineer a neural network” end goal.

Miscellaneous thoughts on interpretability with heuristics hypothesis

Interpretability with heuristics is not very different from existing circuits analysis. The main ideas I came up with are the focus on heuristics as the key unit of analysis and being explicitly OK with many different potential explanations/levels of abstraction. As a result, it’s not clear if there’s anything major that we need to do differently. Sparse feature circuits, transcoders, and automated circuit discovery techniques already popular in the literature seem to be reasonable ways to proceed even if our end goal is a set of heuristics.

However, given that a weak victory does not require an enumeration of all features/heuristics, it might be worth the time to try to discover more compute efficient ways to understand a single forward pass.

I also haven’t precisely defined what a heuristic is because I’m genuinely not sure what the best level of abstraction would be. Here are some types of simple functions that I would consider a “heuristic”:

  • Any sort of Boolean or arithmetic operation
    • For example, “If I see the dog ears feature and the dog snout feature I will output +5 to the dog feature direction”
  • Any lookup/if-then statement
    • For example, f(Location of the Eiffel Tower) = Paris[5]
  • The embedding/unembedding matrices, which I see as trivially interpretable heuristics that map tokens to activations and activations to token logits.

Let me know if you have any other ideas.
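As a rough illustration of the first two heuristic types listed above, the sketch below writes the "dog ears AND dog snout" example as a single ReLU neuron and the Eiffel Tower example as a table lookup. The weights, the bias of -1, the `FACTS` table, and the function names are all hypothetical. The AND behavior also assumes the input features are binary (0 or 1), which real activations need not be.

```python
from typing import Dict, Tuple


def relu(x: float) -> float:
    return max(0.0, x)


def dog_heuristic(dog_ears: float, dog_snout: float, strength: float = 5.0) -> float:
    """Boolean AND as one ReLU neuron.

    With binary inputs, the bias of -1 keeps the neuron off unless both
    features are on. The return value is the amount written to the
    (hypothetical) dog feature direction.
    """
    return strength * relu(dog_ears + dog_snout - 1.0)


# Lookup heuristic: a (subject, relation) key mapped to an attribute.
FACTS: Dict[Tuple[str, str], str] = {("Eiffel Tower", "location"): "Paris"}


def lookup_heuristic(subject: str, relation: str) -> str:
    """Return the stored attribute, or a placeholder when nothing is stored."""
    return FACTS.get((subject, relation), "<unknown>")


for ears, snout in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]:
    print(ears, snout, dog_heuristic(ears, snout))
print(lookup_heuristic("Eiffel Tower", "location"))
```

The point of the neuron version is only that a Boolean heuristic can plausibly fit inside one MLP unit. Whether trained networks implement it this cleanly is an empirical question.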

A few more thoughts on verifying how correct the heuristics-based explanations are. I think there are two levels, the heuristics level, and the model level. At the heuristics level, we want to make sure that each individual heuristic is faithful to the underlying neural network computation. Ideally this could be done at the weight level, but we can also apply our bag of existing interpretability techniques. 

At the model level, my hope is that we can use our interpretability techniques to discover new algorithms, in the form of composed heuristics, that we don’t know how to write ourselves. One of my first memorable interactions with ChatGPT was when I asked it to help me rephrase some survey questions I was working on, and it was actually really helpful. We currently have no idea how to write down a program to do that! Learning all the heuristics involved for various tasks could be a path towards some form of Microscope AI. And, as is the case with circuits analysis, these algorithms fall out naturally once we construct the heuristics.

What does it mean for alignment theory if the heuristics hypothesis is true?

(I’ve spent orders of magnitude less time and effort on this section compared to the interp section, but I figured I’d mention a few ideas and collect some feedback. If people actually like this hypothesis I’ll spend some more time thinking through this)

I’m not super sure if the heuristics framework alone could make concrete predictions on key aspects of alignment theory. You can approximate any function arbitrarily closely with heuristics. In other words, as systems advance, any sort of high-level behavior could emerge even if it’s all heuristics operating below (see, e.g., Interpretability/Tool-ness/Alignment/Corrigibility are not Composable, which is also a problem when we aggregate heuristics from each layer together). 

However, the more powerful future model you’re worried about won’t just fall out of a coconut tree.[6] We need to understand its learning process and how it became powerful. 

If learning more heuristics is all we need to get to more and more powerful systems, we should understand what types of heuristics and heuristics composition are learned first. Two relevant papers that come to mind are the quantization model of scaling, and work on which concepts are learned first in toy models by Park et al. Work done here could help us understand if we’ll see, for example, capabilities generalization without alignment generalization. Generally though, it would be cool to see how alignment-related concepts are learned and used compared to non-alignment-related concepts.

On the surface, it seems like shard theory is more likely to be correct in the world where the heuristics hypothesis is true, although shards are higher-level abstractions compared to heuristics. I’d want to see some more concrete interpretability findings before making a strong claim though.

One opinion that I hold a bit more strongly after thinking through this post is that we could continue to get very economically useful models that are nonetheless incoherent in other ways. In the heuristics world, there's less reason to believe in discontinuous jumps in performance, and more reason to believe that AIs will get really good at some things while still bad at others (see also Boaz on the shape of AGI). 

Empirical studies related to the heuristics hypothesis (both in support and against)

  • Sanity check: Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks shows that transformers can learn heuristics and compose them in ways not seen in the training data.
  • OthelloGPT learned a bag of heuristics: OthelloGPT predicts legal next moves in the board game using a bag of heuristics.
    • Maybe we would want to just say that OthelloGPT has learned the world model for Othello and abstract away from the heuristics-based implementation. However, the real world is complicated and frontier models would likely have imperfect world models and human preference models, in which case it would be important to look at the heuristics-based implementations to understand the deficiencies.
  • The heuristics hypothesis is consistent with the fact that SAE features have some meaningful geometrical properties. If the model is made up of a bunch of heuristics, we could see specialized geometrical structures for certain classes of heuristics, such as a circular representation for the days of the week, or a linear representation for years that also serve as a timeline for historical events.
  • There is a recent paper on lookahead in LeelaChess, and an ensuing discussion on whether LeelaChess’s policy network can “see” several moves ahead or has merely learned heuristics for multi-move patterns.
    • I’m personally not sure if the model learned generalized notions of look-ahead or specialized multi-move patterns (which would imply that L12H12 only moves information from future and past board states in learned multi-move patterns).
    • Nonetheless, I think we can use a set of heuristics to break down the model’s thought process in either case (assuming that we can find those heuristics in the first place).
    • See also: Planning behavior in a recurrent neural network that plays Sokoban
  • We’ve found literal maps of the world in language models. In other words, networks clearly have some sort of “world model abstraction.”
    • As previously mentioned, I think the world model is probably going to be implemented as a series of heuristics. If we did come up with a way to decompile forward passes into 10^3-10^6 heuristics, we could probably use some LLM agent to identify the subset of those heuristics that build out the world model, and then use that higher-level abstraction to sort of “summarize” that set of heuristics. This is cheating a bit because now it no longer sounds like heuristics is all there is, but to me it’s more like, “we’re not gonna worry about this set of heuristics because we know what they do and even if it’s wrong it’s not safety-relevant.”
    • And if the part of the world model is safety relevant, back into heuristics land we go.
    • See also the globe they found in llama-2
  • This story is also consistent with the quantization model of neural scaling. In our hypothesis, the quanta for language modeling would be either 1. learning a new heuristic or 2. finding a way to compose two unconnected heuristics together. I also think the gradient clustering approach used in the quantization hypothesis is a promising path to uncover heuristics.
  • There is also some earlier work in deep learning that demonstrates how neural networks tend to learn shortcuts for various tasks.

We need to keep in mind that the streetlight effect is certainly contaminating our evidence. That is, simple heuristics are easier for interpretability researchers to recover than complex data structures, and we should expect more evidence for them. 

It’s also cool to think through some other, general neural network phenomena with the heuristics hypothesis in mind. It makes sense for the network to have some sort of redundancies (e.g., backup name mover heads) if there are similar heuristics learned at the same time. It makes sense that you could get the network to output whatever arbitrary text you want with an optimized string, since you can activate a weird set of heuristics and compose them. As heuristics compose from one layer to the next, they would need intermediate variables to communicate their results. Thus it makes sense that activation engineering works well across a wide range of concepts.

(There are some other results that come to mind which make less sense, e.g., the 800 orthogonal code steering vectors, although I think Nina Panickssery’s explanation, if true, would be consistent with the heuristics hypothesis.)

Weaknesses in the Heuristics Hypothesis

Some versions of the hypothesis are unfalsifiable

This theory’s biggest weakness is that we can decompose basically anything into a bag of composing heuristics given an infinite bag size. In other words, the heuristics hypothesis is technically consistent with every single hypothesis of how future systems would behave.

I do feel like this theory “explains too much.” However, the key interpretability-related claim is that heuristics based decompositions will be human-understandable, which is a more falsifiable claim.

The current features-focused research agendas might be the best way to uncover heuristics, and we don’t actually need to do anything different regardless of how true the heuristics hypothesis is

The best way to locate heuristics might start with trying to find the most monosemantic/atomic features, understanding their functional implications, training transcoders, or following something like Lee Sharkey’s sparsify agenda. In other words, the new framing doesn’t add much. It’s also possible that we wouldn’t be able to achieve weak victory on a given task without understanding the whole task family, in which case the idea of a weak victory doesn’t really matter.

Perhaps this is true, but I think it’s worth thinking through this some more. I’m worried that the field has focused on features mostly due to path dependence from the original circuits thread that posited features as “the fundamental unit of neural networks” (although certainly not all researchers are focused on features, see e.g. Geiger’s causal abstractions agenda). Also, training SAEs that catalog all the features of a model is expensive and unnecessary for the weak victory condition I mentioned above. Trying to find cheaper interpretability techniques that are just meant to understand individual forward passes seems like a worthy thing to try.

I’m not super sure if that’s true? I think it’s reasonable to assume that two different sets of heuristics would be active in the case where the model is deceptive versus not deceptive, even conditional on the final token logits looking the same.

Inspirations and related work that I haven’t already mentioned

  • I envisioned most of this post before reading Lewis Smith’s most excellent post, The ‘strong’ feature hypothesis could be wrong, which makes many related points. A few relevant passages include:
    • I worry that there is a conceptual problem here, especially if the focus is on cataloging features and not on what the features do.
    • In other words, the picture implied by the strong [linear representation hypothesis] and monosemanticity is that first features come first, and then circuits, but this might be the wrong order; it’s also possible for circuits to be the primary objects, with features being (sometimes) a residue of underlying tacit computation.
  • Rohin Shah seems to have had (and may still have) a similar view. He wrote in a post four years ago:
    • In particular, current systems trained by RL look like a grab bag of heuristics that correlate well with obtaining high reward. I think that as AI systems become more powerful, the heuristics will become more and more general, but they still won't decompose naturally into an objective function, a world model, and search.
  • Hidenori Tanaka’s group has several really interesting papers on concept learning and generalization, with some experiments on toy models. Their framing of generalization as concept composition was very inspiring
    • I think they are totally underrated.
  • I’ve already mentioned stuff on causal interpretability, but I’ll also cite this survey on the subject. Similar to Tanaka group’s work, I think causal interpretability probably deserves more attention on lesswrong.
  • I just saw this post from October 2022, Help out Redwood Research’s interpretability team by finding heuristics implemented by GPT-2 small. My guess is that they've stopped trying to pursue this project?
    • They focus more on scenarios where the entire network acts as a heuristic and completes a task, as opposed to locating specific heuristics in parts of the network.
  • Learning heuristics could hopefully also help us tackle some of the issues raised in Aaron Mueller’s paper on Missed Causes and Ambiguous Effects such as missing redundant causes.
  • I approached the “understand what models are doing” problem from perhaps a more curiosity-driven perspective than an applications-driven one (see, e.g., Stephen Casper’s excellent engineer’s interpretability sequence). I’d love to spend more time thinking about applications, but figured that I should share the conceptual stuff first.
  • There are also other works on neural network scaling and learning (e.g., A Dynamical Model of Neural Scaling Laws, A Theory for Emergence of Complex Skills in Language Models), but those tend to not lend themselves to interpretability analysis.
  • This project initially began as an attempt to formalize heuristics as the best form of abstraction for neural network computation. I don’t think I made that much progress towards that goal. 

Potential next steps

(Yet another section that I wrote rather quickly in the interests of getting some more feedback. I’m also ~60% sure that my specific research interests will shift in the next six months)

I can see four major directions for further exploration of the heuristics hypothesis:

Deconfusion: What exactly is a heuristic, and what does a heuristics-based explanation look like?

Although this is a fairly fundamental question, I’m not super worried about needing to get this completely right before trying to look for heuristics in novel settings. I think we can make a lot of progress even with imperfect definitions. Still, applying the heuristics perspective to circuits we already understand (e.g., IOI) and trying to formalize what exactly heuristics are and aren’t seems useful.

Creating new interpretability methods that are centered around heuristics as the fundamental unit

This is speculative, but it might be worth spending some time to figure out if there are ways to directly study heuristics as their own unit. Distributed alignment search (DAS) is the closest idea that comes to mind, but (to my understanding, I could be wrong, sorry!) DAS is a supervised method that requires the researcher to have some causal model in mind before trying to find it in the neural network. Transcoders represent another attempt, but those require cataloging all features in the training data.

The worry is that the field got locked into looking for features and feature circuits for mostly path-dependence reasons, and there could be some low hanging fruit if we just thought harder about heuristics, especially given the recent evidence that they might play a big role.

Using existing interpretability tools to discover heuristics

This is a much more tractable option to better understand heuristics, especially given the similarities between heuristics and circuit building. 

We can try to catalog individual heuristics manually by coming up with natural language tasks where we believe that the model would need to execute some heuristic at some point. Studying various individual heuristics could also help inspire specialized interpretability techniques to uncover them en masse. For example,

  • Any sort of task that requires computing an AND or OR between two concepts.
  • Situations where information from some specific token has to be moved to a later token.
  • This is not directly related, but I’m wondering if the MLP layers would be “linear” (i.e., MLP(x + f) = MLP(x) + MLP(f)) in some sense, and if so when.
    • For example, we wouldn’t expect this if the MLP is calculating some boolean function, but if it’s just doing factual retrieval that seems more likely. 
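The linearity question in the last bullet can be checked directly by measuring how far MLP(x + f) is from MLP(x) + MLP(f). Below is a minimal sketch on two hand-built toy MLPs in plain Python. The weights are made up for illustration, `linearity_gap` is a hypothetical helper, and the "retrieval-like" MLP is only a stand-in (each hidden unit reads one input feature with no bias), not a claim about how real factual recall is implemented.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matvec(weights: Matrix, vec: Sequence[float]) -> List[float]:
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def mlp(x: Sequence[float], w_in: Matrix, b_in: Sequence[float], w_out: Matrix) -> List[float]:
    """One hidden ReLU layer, no output bias."""
    pre = [h + b for h, b in zip(matvec(w_in, x), b_in)]
    hidden = [max(0.0, p) for p in pre]
    return matvec(w_out, hidden)


def linearity_gap(x, f, w_in, b_in, w_out) -> float:
    """Largest coordinate of |MLP(x + f) - (MLP(x) + MLP(f))|."""
    joint = mlp([a + b for a, b in zip(x, f)], w_in, b_in, w_out)
    separate = [a + b for a, b in zip(mlp(x, w_in, b_in, w_out), mlp(f, w_in, b_in, w_out))]
    return max(abs(j - s) for j, s in zip(joint, separate))


x, f = [1.0, 0.0], [0.0, 1.0]

# Retrieval-like toy: each hidden unit passes one feature through.
retrieval = dict(w_in=[[1.0, 0.0], [0.0, 1.0]], b_in=[0.0, 0.0], w_out=[[2.0, 0.0], [0.0, 3.0]])
# Boolean toy: a single AND neuron that needs both features to fire.
boolean_and = dict(w_in=[[1.0, 1.0]], b_in=[-1.0], w_out=[[5.0]])

print("retrieval gap:", linearity_gap(x, f, **retrieval))
print("AND gap:", linearity_gap(x, f, **boolean_and))
```

On these toys the retrieval-like MLP is exactly additive for this pair of inputs, while the AND neuron is not, which is the pattern the bullet above conjectures. Running the same measurement on a real model's MLP layers and real feature directions would be the actual experiment.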

We could also leverage existing SAEs and treat features as inputs/outputs of heuristics. In this case, I’m hoping to advance beyond the gradient-based attributions used in studies such as the Sparse Feature Circuits paper. We can perhaps use gradient attribution to narrow down on the nodes and edges that we care about, but then focus on how, operationally, each edge is formed. The gradient attribution gives us only if-then relationships. Is that what’s locally happening with the model?
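One way to see why attribution scores alone may not pin down the operation behind an edge: in the toy sketch below, an AND-style edge and a plain affine edge receive the same gradient-times-activation scores at the point where both upstream features are on, yet they behave differently elsewhere. Gradient-times-activation is used here as a generic stand-in for gradient-based attribution (it is not claimed to be the exact method of the Sparse Feature Circuits paper), gradients are estimated by finite differences, and both edge functions are made up.

```python
from typing import Callable, List, Sequence

EdgeFn = Callable[[Sequence[float]], float]


def and_edge(feats: Sequence[float]) -> float:
    """Downstream feature fires only when both upstream features are on."""
    return 5.0 * max(0.0, feats[0] + feats[1] - 1.0)


def affine_edge(feats: Sequence[float]) -> float:
    """Downstream feature is an affine function of the upstream features."""
    return 5.0 * feats[0] + 5.0 * feats[1] - 5.0


def grad_times_activation(fn: EdgeFn, point: Sequence[float], eps: float = 1e-4) -> List[float]:
    """First-order attribution per input, using central finite differences."""
    scores = []
    for i, value in enumerate(point):
        up, down = list(point), list(point)
        up[i] = value + eps
        down[i] = value - eps
        grad = (fn(up) - fn(down)) / (2.0 * eps)
        scores.append(grad * value)
    return scores


both_on = [1.0, 1.0]
print(grad_times_activation(and_edge, both_on))
print(grad_times_activation(affine_edge, both_on))
# Away from that point, the two edges disagree.
print(and_edge([0.0, 0.0]), affine_edge([0.0, 0.0]))
```

Telling the two apart requires probing other inputs or reading the weights, which is one concrete reading of "focus on how, operationally, each edge is formed."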

Applying the heuristics framework to study theoretical questions in alignment

If we decide that the heuristic model of computation is true/useful, I’d be most excited to use it to study more theoretical topics and perhaps use it to forecast where future capabilities gains could come from. For example, Alex Turner said (two years ago):

I think that interpretability should adjudicate between competing theories of generalization and value formation in AIs (e.g. figure out whether and in what conditions a network learns a universal mesa objective, versus contextually activated objectives)

For example, we could study the dynamics of heuristics learning and composition in real-world models, especially heuristics related to turning base models into assistants. One guess is that RLHF is sample efficient because it mostly changes how heuristics are composed with each other (and maybe boosts existing heuristics to be more active), which might be a lot easier than learning new heuristics.[7] This would build on top of work done on toy models by Hidenori Tanaka’s group, and also maybe the quanta scaling hypothesis.

I’m currently trying to get into the AI safety field and will also be applying to MATS. Let me know if you’re interested in chatting more about any of these topics. Have a low bar for reaching out.

This post benefited from the feedback from Jack Zhang, Joe Campbell, Mat Allen, Tim Kostolansky, Veniamin Veselovsky, and woog. All errors are my own.

  1. ^

    This definition of generalization comes from Okawa et al. (2023)

  2. ^

    Source? I made it up

  3. ^

    I really struggled to understand this paper :( Would be down to go through it with someone.

  4. ^

    [Citation needed]

  5. ^

    Funny story but I almost wrote down Rome. The real rank one model editing is the one they did to my brain.

  6. ^
  7. ^

    Counterpoint: maybe learning new heuristics is easy and frontier models just have a good ability to learn by the time they’re done with pretraining.

17 comments

I've only skimmed this post, but I like it because I think it puts into words a fairly common model (that I disagree with). I've heard "it's all just a stack of heuristics" as an explanation of neural networks and as a claim that all intelligence is this, from several people. (Probably I'm overinterpreting other people's words to some extent, they probably meant a weaker/nuanced version. But like you say, it can be useful to talk about the strong version).

I think you've correctly identified the flaw in this idea (it isn't predictive, it's unfalsifiable, so it isn't actually explaining anything even if it feels like it is). You don't seem to think this is a fatal flaw. Why?

You seem to answer

However, the key interpretability-related claim is that heuristics based decompositions will be human-understandable, which is a more falsifiable claim.

But I don't see why "heuristics based decompositions will be human-understandable" is an implication of the theory. As an extreme counterexample, logic gates are interpretable, but when stacked up into a computer they are ~uninterpretable. It looks to me like you've just tacked an interpretability hypothesis onto a heuristics hypothesis.

Thanks for reading my post! Here's how I think this hypothesis is helpful:

It's possible that we wouldn't be able to understand what's going on even if we had some perfect way to decompose a forward pass into interpretable constituent heuristics. I'm skeptical that this would be the case, mostly because I think (1) we can get a lot of juice out of auto-interp methods and (2) we probably wouldn't need to simultaneously understand that many heuristics at the same time (which is the case for your logic gate example for modern computers). At the minimum, I would argue that the decomposed bag of heuristics is likely to be much more interpretable than the original model itself.

Suppose that the hypothesis is true. Then it at least suggests that interpretability researchers should put more effort into finding and studying individual heuristics/circuits, as opposed to the current more "feature-centric" framework. I don't know how this would manifest itself exactly, but it felt like it's worth saying. I believe that some of the empirical work I cited suggests that we might make more incremental progress if we focused on heuristics more right now.

I think the problem might be that you've given this definition of heuristic:

A heuristic is a local, interpretable, and simple function (e.g., boolean/arithmetic/lookup functions) learned from the training data. There are multiple heuristics in each layer and their outputs are used in later layers.

Taking this definition seriously, it's easy to decompose a forward pass into such functions.

But you have a much more detailed idea of a heuristic in mind. You've pointed toward some properties this might have in your point (2), but haven't put it into specific words.

Some options: A single heuristic is causally dependent on <5 heuristics below and influences <5 heuristics above. The inputs and outputs of heuristics are strong information bottlenecks with a limit of 30 bits. The function of a heuristic can be understood without reference to >4 other heuristics in the same layer. A single heuristic is used in <5 different ways across the data distribution. A model is made up of <50 layers of heuristics. Large arrays of parallel heuristics often output information of the same type.

Some combination of these (or similar properties) would turn the heuristics intuition into a real hypothesis capable of making predictions. 
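To make the options above concrete, here is a minimal sketch of how a few of them could be phrased as checkable constraints on a graph of heuristics. Everything in it is hypothetical: the `Heuristic` record, the `check_constraints` helper, and the idea that a decomposition would hand us such a graph at all are assumptions for illustration. The thresholds simply mirror the "<5" and "<50" figures above.

```python
from dataclasses import dataclass, field


@dataclass
class Heuristic:
    """Hypothetical record for one heuristic in a decomposed forward pass."""
    name: str
    layer: int
    inputs: list = field(default_factory=list)  # names of heuristics it reads from


def check_constraints(heuristics, max_fan_in=4, max_fan_out=4, max_layers=49):
    """Return human-readable violations of the candidate constraints.

    max_fan_in / max_fan_out encode "<5 heuristics below / above";
    max_layers encodes "<50 layers of heuristics".
    """
    # Count how many downstream heuristics read from each heuristic.
    fan_out = {}
    for h in heuristics:
        for src in h.inputs:
            fan_out[src] = fan_out.get(src, 0) + 1

    violations = []
    for h in heuristics:
        if len(h.inputs) > max_fan_in:
            violations.append(
                f"{h.name}: reads from {len(h.inputs)} heuristics (limit {max_fan_in})"
            )
        if fan_out.get(h.name, 0) > max_fan_out:
            violations.append(
                f"{h.name}: feeds {fan_out[h.name]} heuristics (limit {max_fan_out})"
            )

    n_layers = len({h.layer for h in heuristics})
    if n_layers > max_layers:
        violations.append(f"model: {n_layers} layers (limit {max_layers})")
    return violations
```

A version of the hypothesis stated this way would be falsified by any decomposition for which `check_constraints` returns a non-empty list, which is the sense in which such properties make predictions.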

If you don't go into this level of detail, it's easy to trick yourself into thinking that (2) basically kinda follows from your definition of heuristics, when it really really doesn't. And that will lead you to never discover the value of the heuristics intuition, if it is true, and never reject it if it is false.

I agree that if you put more limitations on what heuristics are and how they compose, you end up with a stronger hypothesis. I think it's probably better to leave that out and try to do some more empirical work before making a claim there though (I suppose you could say that the hypothesis isn't actually making a lot of concrete predictions yet at this stage).

I don't think (2) necessarily follows, but I do sympathize with your point that the post is perhaps a more specific version of the hypothesis that "we can understand neural network computation by doing mech interp."

A useful thread on the bag of heuristics versus actual noisy algorithmic reasoning has some interesting results. They show that, at least with CoT added, LLMs aren't just a bag of heuristics and do actual reasoning.

Of course, there is still a pretty significant amount of bag-of-heuristics reasoning, but I do think the literal claim that a bag of heuristics is all there is in LLMs is false.

You've claimed that it would be useful to think about the search/planning process as being implemented through heuristics. I think it is sometimes true that parts of search/planning are implemented through heuristics, but I don't think that's all there is to LLM planning/searching, either now or in the future.

The thread is below:

https://x.com/aksh_555/status/1843326181950828753

The paper "Auto-Regressive Next-Token Predictors are Universal Learners" made me a little more skeptical of attributing general reasoning ability to LLMs. They show that even linear predictive models, basically just linear regression, can technically perform any algorithm when used autoregressively like with chain-of-thought. The results aren't that mind-blowing but it made me wonder whether performing certain algorithms correctly with a scratchpad is as much evidence of intelligence as I thought.

One man's modus ponens is another man's modus tollens, and what I do take away from the result is that intelligence with enough compute is too easy to do, so easy that even linear predictive models can do it in theory.

So the result doesn't show that intelligent/algorithmic reasoning isn't happening in LLMs, but rather that it's too easy to get intelligence/computation by many different methods.

It's similar to the proof that an origami computer can compute every function computable by a Turing Machine, and if in a hypothetical world we were instead using very large origami pieces to build up AIs like AlphaGo, I don't think that there would be a sense in which it's obviously not reasoning about the game of Go.

https://www.quantamagazine.org/how-to-build-an-origami-computer-20240130/

I agree that origami AIs would still be intelligent if implementing the same computations. I was trying to point at LLMs potentially being 'sphexish': having behaviors made of baked if-then patterns linked together that superficially resemble ones designed on-the-fly for a purpose. I think this is related to what the "heuristic hypothesis" is getting at.

I think the heuristic hypothesis is partially right, but "partially" is the keyword: LLMs will have both sphexish heuristics and mostly clean algorithms for solving problems.

I also expect OpenAI to broadly move LLMs from more heuristic-like reasoning to algorithmic-like reasoning, and o1 is slight evidence towards more systematic reasoning in LLMs.

Thanks for the pointer! I skimmed the paper. Unless I'm making a major mistake in interpreting the results, the evidence they provide for "this model reasons" is essentially "the models are better at decoding words encrypted with rot-5 than they are at rot-10." I don't think this empirical fact provides much evidence one way or another.

To summarize, the authors decompose a model's ability to decode shift ciphers (e.g., rot-13 text: "fgnl"; original text: "stay") into three categories: probability, memorization, and noisy reasoning.
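For readers who haven't seen the task, here is a minimal sketch of a shift cipher, assuming lowercase ASCII text. The function names `shift_encode` and `shift_decode` are mine, not the paper's.

```python
import string

ALPHABET = string.ascii_lowercase


def shift_encode(text, shift):
    """Rotate each lowercase letter forward by `shift` positions, wrapping at 'z'."""
    out = []
    for ch in text:
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)  # leave spaces, punctuation, etc. untouched
    return "".join(out)


def shift_decode(text, shift):
    """Undo shift_encode by rotating backward by the same amount."""
    return shift_encode(text, -shift)


print(shift_encode("stay", 13))  # fgnl
print(shift_decode("fgnl", 13))  # stay
```

Decoding rot-k means stepping each letter back k positions, which is why the number of shifts is a natural measure of how many "steps" the task takes.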

Probability just refers to a somewhat unconditional probability that a model assigns to a token (specifically, 'The word is "WORD"'). The model is more likely to decode words that are more likely a priori—this makes sense.

Memorization is defined as how often the type of rotational cipher shows up. Rot-13 is the most common one by far, followed by rot-3. The model is better at decoding rot-13 than any other cipher, which makes sense since there's more of it in the training data, and the model probably has specialized circuitry for rot-13.

What they call "noisy reasoning" is how many rotations are needed to get to the outcome. According to the authors, the fact that GPT-4 does better on shift ciphers with fewer shifts compared to ciphers with more shifts is evidence of this "noisy reasoning."

I don't see how you can jump from this empirical result to make claims about the model's ability to reason. For example, an alternative explanation is that the model has learned some set of heuristics that allows it to shift letters from one position to another, but this set of heuristics can only be combined in a limited manner. 

Generally though, I think what constitutes a "heuristic" is somewhat of a fuzzy concept. However, what constitutes "reasoning" seems even less defined.

True that it isn't much evidence for reasoning directly, as it's only 1 task.

As for how we can jump from the empirical result to claims about its ability to reason: the shift cipher task lets us disentangle commonness from simplicity. A bag of heuristics that has no uniform and compact description would work best on the most common example types, whereas the algorithmic reasoning that I define below would work best on the simplest tasks. The simplest shift cipher is the 1-shift cipher, while the most common is the 13-shift cipher. So the bag-of-heuristics model, on which LLMs are learning shallow heuristics completely or primarily, predicts the best performance on 13-shift ciphers. The paper shows a spike in accuracy on the 13-shift cipher, consistent with LLMs having some heuristics, but it also shows that 1-shift cipher accuracy was much better than expected under the view that LLMs are solely or primarily a bag of heuristics that can't be improved by CoT.
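Here is a toy illustration of why the two views predict different accuracy curves. To be clear, the functional forms and all the numbers are made up by me for illustration and are not taken from the paper: I assume "memorization" gives high accuracy only on the common shifts, and "noisy reasoning" succeeds independently at each single-letter step.

```python
def memorization_accuracy(shift, common_shifts=(13, 3), high=0.6, low=0.05):
    """Toy 'commonness' model: good only on shifts that are frequent in training data."""
    return high if shift in common_shifts else low


def noisy_reasoning_accuracy(shift, per_step_success=0.95):
    """Toy 'simplicity' model: each of the `shift` steps succeeds independently."""
    return per_step_success ** shift


def combined_accuracy(shift):
    """Toy mixture: the model uses whichever route works better for this shift."""
    return max(memorization_accuracy(shift), noisy_reasoning_accuracy(shift))


for k in (1, 2, 12, 13, 14):
    print(k, round(memorization_accuracy(k), 3),
          round(noisy_reasoning_accuracy(k), 3),
          round(combined_accuracy(k), 3))
```

Under pure memorization the 13-shift beats the 1-shift; under pure noisy reasoning the 1-shift beats everything and accuracy decays with the shift. The combined curve has both features: a decay from shift 1 and a bump at 13.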

I'm defining reasoning more formally in the quote below:

So an "algorithm" is a finite description of a fast parallel circuit for every size.

This comment is where I got the quote from:

https://www.lesswrong.com/posts/gcpNuEZnxAPayaKBY/othellogpt-learned-a-bag-of-heuristics-1#Bg5s8ujitFvfXuop8

This thread has an explanation of why we can disentangle noisy reasoning from heuristics, as I'm defining the terms here, so go check that out below:

https://x.com/RTomMcCoy/status/1843325666231755174

I see, I think that second tweet thread actually made a lot more sense, thanks for sharing!
McCoy's definitions of heuristics and reasoning are sensible, although I personally would still avoid "reasoning" as a word since people probably have very different interpretations of what it means. I like the ideas of "memorizing solutions" and "generalizing solutions."

I think where McCoy and I depart is that he's modeling the entire network computation as a heuristic, while I'm modeling the network as compositions of bags of heuristics, which in aggregate would display behaviors he would call "reasoning." 

The explanation I gave above (heuristics that shift the letter forward by one, with limited composing abilities) is still a heuristics-based explanation. Maybe this set of composing heuristics would fit your definition of an "algorithm." I don't think there's anything inherently wrong with that.

However, the heuristics-based explanation gives concrete predictions of what we can look for in the actual network: individual heuristics that increment a to b, b to c, etc., and other parts of the network that compose the outputs.
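As a rough sketch of that picture (a toy, not a claim about what any real network implements): one lookup-table heuristic that increments a single letter, plus a composition step with a hypothetical depth limit `max_compositions` standing in for "limited composing abilities."

```python
import string

ALPHABET = string.ascii_lowercase

# One lookup-style heuristic: a -> b, b -> c, ..., z -> a.
INCREMENT = {ch: ALPHABET[(i + 1) % 26] for i, ch in enumerate(ALPHABET)}


def compose_shift(letter, shift, max_compositions=5):
    """Shift `letter` forward by chaining the increment heuristic `shift` times.

    Returns None when the shift needs more chained applications than the
    (hypothetical) composition budget allows.
    """
    if shift > max_compositions:
        return None
    for _ in range(shift):
        letter = INCREMENT[letter]
    return letter


print(compose_shift("a", 1))   # b
print(compose_shift("a", 3))   # d
print(compose_shift("a", 10))  # None
```

In this toy, small shifts work and large ones fail, not because anything "reasons" less well but because the chain of heuristics runs out of depth. In a real network the corresponding things to look for would be the increment lookups themselves and whatever routes one lookup's output into the next.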

This is what I meant when I said that this could be a useful framework for interpretability :)

Now I understand.

Though I'd still claim that this is evidence towards the view that there is a generalizing solution that is implemented inside of LLMs, and I wanted people to keep that in mind, since people often treat heuristics as meaning that it doesn't generalize at all.

since people often treat heuristics as meaning that it doesn't generalize at all.

Yeah and I think that's a big issue! I feel like what's happening is that once you chain a huge number of heuristics together you can get behaviors that look a lot like complex reasoning. 

some issues related to causal interpretations

Could you refer to the line you are referring to from Marks et al.?

Sorry, I linked to the wrong paper! Oops, just fixed it. I meant to link to Aaron Mueller's Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks.

Sorry, not on topic, but your post title reminds me of the game Milk Inside a Bag of Milk Inside a Bag of Milk.