I've only skimmed this post, but I like it because I think it puts into words a fairly common model (that I disagree with). I've heard "it's all just a stack of heuristics" as an explanation of neural networks and as a claim that all intelligence is this, from several people. (Probably I'm overinterpreting other people's words to some extent, they probably meant a weaker/nuanced version. But like you say, it can be useful to talk about the strong version).
I think you've correctly identified the flaw in this idea (it isn't predictive, it's unfalsifiable, so it isn't actually explaining anything even if it feels like it is). You don't seem to think this is a fatal flaw. Why?
You seem to answer
However, the key interpretability-related claim is that heuristics based decompositions will be human-understandable, which is a more falsifiable claim.
But I don't see why "heuristics based decompositions will be human-understandable" is an implication of the theory. As an extreme counterexample, logic gates are interpretable, but when stacked up into a computer they are ~uninterpretable. It looks to me like you've just tacked an interpretability hypothesis onto a heuristics hypothesis.
Thanks for reading my post! Here's how I think this hypothesis is helpful:
It's possible that we wouldn't be able to understand what's going on even if we had some perfect way to decompose a forward pass into interpretable constituent heuristics. I'm skeptical that this would be the case, mostly because I think (1) we can get a lot of juice out of auto-interp methods and (2) we probably wouldn't need to understand that many heuristics at the same time (unlike in your logic gate example for modern computers). At minimum, I would argue that the decomposed bag of heuristics is likely to be much more interpretable than the original model itself.
If the hypothesis is true, it at least suggests that interpretability researchers should put more effort into finding and studying individual heuristics/circuits, as opposed to the current more "feature-centric" framework. I don't know exactly how this would manifest, but it felt worth saying. I believe that some of the empirical work I cited suggests that we might make more incremental progress if we focused more on heuristics right now.
I think the problem might be that you've given this definition of heuristic:
A heuristic is a local, interpretable, and simple function (e.g., boolean/arithmetic/lookup functions) learned from the training data. There are multiple heuristics in each layer and their outputs are used in later layers.
Taking this definition seriously, it's easy to decompose a forward pass into such functions.
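To make the point concrete, here is a minimal sketch of how trivially a forward pass splits into "local, simple, arithmetic" functions under that definition. The network, its weights, and the helper names (`make_neuron`, `forward_from_heuristics`) are all made up for illustration; this is not a method from the post.

```python
# Hypothetical 2-layer ReLU network with hand-picked weights (illustration only).
W1 = [[1.0, -1.0], [0.5, 2.0]]   # hidden layer: 2 neurons, 2 inputs each
B1 = [0.0, -1.0]
W2 = [[1.0, -2.0]]               # output layer: 1 neuron, 2 inputs
B2 = [0.5]


def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def forward(x):
    """Ordinary forward pass, written layer by layer."""
    h = [relu(sum(w * v for w, v in zip(row, x)) + b) for row, b in zip(W1, B1)]
    return [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(W2, B2)]


def make_neuron(weights, bias, activation):
    """Package one neuron as a standalone arithmetic function of its inputs."""
    def neuron(inputs):
        return activation(sum(w * v for w, v in zip(weights, inputs)) + bias)
    return neuron


# The "bag": one small function per neuron, grouped by layer.
layer1 = [make_neuron(row, b, relu) for row, b in zip(W1, B1)]
layer2 = [make_neuron(row, b, identity) for row, b in zip(W2, B2)]


def forward_from_heuristics(x):
    """Recompose the forward pass using only the per-neuron functions."""
    h = [f(x) for f in layer1]
    return [f(h) for f in layer2]


print(forward([1.0, 2.0]), forward_from_heuristics([1.0, 2.0]))
```

Every neuron here satisfies the letter of the definition, yet the decomposition tells us nothing we didn't already know from the weights.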
But you have a much more detailed idea of a heuristic in mind. You've pointed toward some properties this might have in your point (2), but haven't put it into specific words.
Some options:
A single heuristic is causally dependent on <5 heuristics below and influences <5 heuristics above.
The inputs and outputs of heuristics are strong information bottlenecks with a limit of 30 bits.
The function of a heuristic can be understood without reference to >4 other heuristics in the same layer.
A single heuristic is used in <5 different ways across the data distribution.
A model is made up of <50 layers of heuristics.
Large arrays of parallel heuristics often output information of the same type.
Some combination of these (or similar properties) would turn the heuristics intuition into a real hypothesis capable of making predictions.
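As a rough illustration of what a checkable version could look like, here is a sketch that tests the first option (fewer than 5 dependencies below, fewer than 5 dependents above) against a dependency graph. The graph format, the function name `check_fan_limits`, and the toy graph are assumptions for illustration; how one would actually extract such a graph from a network is left open.

```python
from collections import defaultdict


def check_fan_limits(parents, max_fan_in=5, max_fan_out=5):
    """Return {heuristic: (fan_in, fan_out)} for every heuristic that breaks a bound.

    `parents` maps each heuristic name to the names of the heuristics it
    causally depends on. A heuristic passes only if fan_in < max_fan_in
    and fan_out < max_fan_out.
    """
    fan_out = defaultdict(int)
    for deps in parents.values():
        for dep in set(deps):
            fan_out[dep] += 1

    violations = {}
    for node in set(parents) | set(fan_out):
        n_in = len(set(parents.get(node, [])))
        n_out = fan_out.get(node, 0)
        if n_in >= max_fan_in or n_out >= max_fan_out:
            violations[node] = (n_in, n_out)
    return violations


# Hypothetical toy graph: "x" feeds five heuristics, which all feed "h_out".
toy_graph = {
    "x": [],
    "a": ["x"], "b": ["x"], "c": ["x"], "d": ["x"], "e": ["x"],
    "h_out": ["a", "b", "c", "d", "e"],
}
print(check_fan_limits(toy_graph))
```

On the toy graph both "x" (five dependents) and "h_out" (five dependencies) are flagged, which is the kind of outcome that could count as evidence against this specific version of the hypothesis.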
If you don't go into this level of detail, it's easy to trick yourself into thinking that (2) basically kinda follows from your definition of heuristics, when it really really doesn't. And that will lead you to never discover the value of the heuristics intuition, if it is true, and never reject it if it is false.
I agree that if you put more limitations on what heuristics are and how they compose, you end up with a stronger hypothesis. I think it's probably better to leave that out and try to do some more empirical work before making a claim there though (I suppose you could say that the hypothesis isn't actually making a lot of concrete predictions yet at this stage).
I don't think (2) necessarily follows, but I do sympathize with your point that the post is perhaps a more specific version of the hypothesis that "we can understand neural network computation by doing mech interp."
A useful thread on bag-of-heuristics versus actual noisy algorithmic reasoning has some interesting results: at least with CoT added, LLMs aren't just a bag of heuristics and do perform actual reasoning.
Of course, there is still a pretty significant amount of bag-of-heuristics reasoning, but I do think the literal claim that a bag of heuristics is all there is in LLMs is false.
You've claimed that it would be useful to think about the search/planning process as being implemented through heuristics. I think it's sometimes true that parts of search/planning are implemented through heuristics, but I don't think that's all there is to LLM planning/searching, either now or in the future.
The thread is below:
The paper "Auto-Regressive Next-Token Predictors are Universal Learners" made me a little more skeptical of attributing general reasoning ability to LLMs. They show that even linear predictive models, basically just linear regression, can technically perform any algorithm when used autoregressively like with chain-of-thought. The results aren't that mind-blowing but it made me wonder whether performing certain algorithms correctly with a scratchpad is as much evidence of intelligence as I thought.
One man's modus ponens is another man's modus tollens, and what I do take away from the result is that intelligence with enough compute is too easy to do, so easy that even linear predictive models can do it in theory.
So the result doesn't disprove that intelligent/algorithmic reasoning is happening in LLMs; rather, it shows that intelligence/computation is easy to get by many different methods.
It's similar to the proof that an origami computer can compute every function computable by a Turing Machine, and if in a hypothetical world we were instead using very large origami pieces to build up AIs like AlphaGo, I don't think that there would be a sense in which it's obviously not reasoning about the game of Go.
https://www.quantamagazine.org/how-to-build-an-origami-computer-20240130/
I agree that origami AIs would still be intelligent if implementing the same computations. I was trying to point at LLMs potentially being 'sphexish': having behaviors made of baked if-then patterns linked together that superficially resemble ones designed on-the-fly for a purpose. I think this is related to what the "heuristic hypothesis" is getting at.
IMO, the heuristic hypothesis is partially right, but "partially" is the key word: LLMs will have both sphexish heuristics and mostly clean algorithms for solving problems.
I also expect OpenAI to broadly move LLMs from more heuristic-like reasoning to algorithmic-like reasoning, and o1 is slight evidence towards more systematic reasoning in LLMs.
Thanks for the pointer! I skimmed the paper. Unless I'm making a major mistake in interpreting the results, the evidence they provide for "this model reasons" is essentially "the models are better at decoding words encrypted with rot-5 than they are at rot-10." I don't think this empirical fact provides much evidence one way or another.
To summarize, the authors decompose a model's ability to decode shift ciphers (e.g., Rot-13 text: "fgnl" Original text: "stay") into three categories, probability, memorization, and noisy reasoning.
Probability just refers to a somewhat unconditional probability that a model assigns to a token (specifically, 'The word is "WORD"'). The model is more likely to decode words that are more likely a priori—this makes sense.
Memorization is defined as how often the type of rotational cipher shows up. rot-13 is the most common one by far, followed by rot-3. The model is better at decoding rot-13 than any other cipher, which makes sense since there's more of it in the training data, and the model probably has specialized circuitry for rot-13.
What they call "noisy reasoning" is how many rotations are needed to get to the outcome. According to the authors, the fact that GPT-4 does better on shift ciphers with fewer shifts compared to ciphers with more shifts is evidence of this "noisy reasoning."
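For concreteness, the shift-cipher task summarized above amounts to the following. This is a minimal sketch of the task itself, not the paper's evaluation code; the function names are mine.

```python
import string

ALPHABET = string.ascii_lowercase


def shift_decode(ciphertext, shift):
    """Undo a shift cipher by moving each lowercase letter back `shift` places."""
    decoded = []
    for ch in ciphertext:
        if ch in ALPHABET:
            decoded.append(ALPHABET[(ALPHABET.index(ch) - shift) % 26])
        else:
            decoded.append(ch)  # leave spaces, punctuation, etc. untouched
    return "".join(decoded)


def shift_encode(plaintext, shift):
    """Encoding is decoding with the opposite shift."""
    return shift_decode(plaintext, -shift)


print(shift_decode("fgnl", 13))  # the rot-13 example from the summary
print(shift_encode("stay", 1))   # a 1-shift encoding of the same word
```

The number of positions each letter has to move is the quantity the authors tie to "noisy reasoning": rot-1 needs one step per letter, rot-13 needs thirteen.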
I don't see how you can jump from this empirical result to make claims about the model's ability to reason. For example, an alternative explanation is that the model has learned some set of heuristics that allows it to shift letters from one position to another, but this set of heuristics can only be combined in a limited manner.
Generally though, I think what constitutes a "heuristic" is a somewhat fuzzy concept. However, what constitutes "reasoning" seems even less defined.
True that it isn't much evidence for reasoning directly, as it's only 1 task.
As for how we can jump from the empirical result to claims about its ability to reason: the shift cipher task lets us disentangle commonness and simplicity. A bag of heuristics with no uniform and compact description works best on common example types, whereas algorithmic reasoning, as I define it below, works better on simpler tasks. The simplest shift cipher is the 1-shift cipher. The bag-of-heuristics model, which predicts that LLMs are completely or primarily learning shallow heuristics, would predict the best performance on 13-shift ciphers, as those are the most common. The paper shows a spike in accuracy on the 13-shift cipher, consistent with LLMs having some heuristics, but also that 1-shift cipher accuracy was much better than expected under the view that LLMs are solely or primarily a bag of heuristics that can't be improved by CoT.
I'm defining reasoning more formally in the quote below:
So an "algorithm" is a finite description of a fast parallel circuit for every size.
This comment is where I got the quote from:
This thread has an explanation of why we can disentangle noisy reasoning from heuristics, as I'm defining the terms here, so go check that out below:
I see, I think that second tweet thread actually made a lot more sense, thanks for sharing!
McCoy's definitions of heuristics and reasoning are sensible, although I personally would still avoid "reasoning" as a word since people probably have very different interpretations of what it means. I like the ideas of "memorizing solutions" and "generalizing solutions."
I think where McCoy and I depart is that he's modeling the entire network computation as a heuristic, while I'm modeling the network as compositions of bags of heuristics, which in aggregate would display behaviors he would call "reasoning."
The explanation I gave above (heuristics that shift a letter forward by one, with limited composing abilities) is still a heuristics-based explanation. Maybe this set of composing heuristics would fit your definition of an "algorithm." I don't think there's anything inherently wrong with that.
However, the heuristics-based explanation gives concrete predictions of what we can look for in the actual network: individual heuristics that increment a to b, b to c, etc., and other parts of the network that compose the outputs.
This is what I meant when I said that this could be a useful framework for interpretability :)
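A toy version of the explanation above might look like the sketch below: one lookup-style heuristic per letter, plus a composer that can only chain them a limited number of times. The depth limit of 5 and the names `INCREMENT` and `shift_by_composition` are made up for illustration; nothing here is a claim about what is actually inside a model.

```python
import string

ALPHABET = string.ascii_lowercase

# One "heuristic" per letter: a lookup entry that increments a to b, b to c, ..., z to a.
INCREMENT = {ALPHABET[i]: ALPHABET[(i + 1) % 26] for i in range(26)}


def shift_by_composition(word, steps, max_depth=5):
    """Shift `word` forward by chaining the increment heuristic `steps` times.

    `max_depth` is a hypothetical cap on how many heuristics can be composed;
    beyond it the bag has no answer and returns None.
    """
    if steps > max_depth:
        return None
    for _ in range(steps):
        word = "".join(INCREMENT.get(ch, ch) for ch in word)
    return word


print(shift_by_composition("abc", 1))
print(shift_by_composition("xyz", 3))
print(shift_by_composition("abc", 6))
```

Under this picture, accuracy that falls off with the number of shifts comes from the composition limit rather than from anything one would need to call reasoning, and the lookup entries and the composer are separate things to look for in the network.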
Now I understand.
Though I'd still claim that this is evidence towards the view that there is a generalizing solution that is implemented inside of LLMs, and I wanted people to keep that in mind, since people often treat heuristics as meaning that it doesn't generalize at all.
since people often treat heuristics as meaning that it doesn't generalize at all.
Yeah and I think that's a big issue! I feel like what's happening is that once you chain a huge number of heuristics together you can get behaviors that look a lot like complex reasoning.
some issues related to causal interpretations
Could you point to the line you are referring to from Marks et al.?
Sorry, I linked to the wrong paper! Oops, just fixed it. I meant to link to Aaron Mueller's Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks.
Sorry, not on topic, but your post title reminds me of the game Milk Inside a Bag of Milk Inside a Bag of Milk.
Epistemic status: Theorizing on topics I’m not qualified for. Trying my best to be truth-seeking instead of hyping up my idea. Not much here is original, but hopefully the combination is useful. This hypothesis deserves more time and consideration but I’m sharing this minimal version to get some feedback before sinking more time into it. “We believe there’s a lot of value in articulating a strong version of something one may believe to be true, even if it might be false.”
This is a somewhat living document as I come back and add more ideas.
The Heuristics Hypothesis: A Bag of Heuristics is All There Is and a Bag of Heuristics is All You Need
Why would you want to use the heuristics-based framework when thinking about neural networks?
Feel free to jump around this post and check out the sections that interest you. Each section is mostly independent of the others.
How can interpretability win if the hypothesis is true?
I want to first clarify something I do not think we need to win: a one-to-one mapping between neural network computation and heuristics. I believe that we can have multiple acceptable heuristics-based explanations for a given forward pass (i.e., a one-to-many map). Any explanation that fits the following criteria—mostly copied from the original IOI paper—would be sufficient for “understanding why the model did what it did.”
I believe a bag of heuristics is the easiest way to fulfill these four criteria on arbitrary inputs.
Corollary: Understanding neural network computation does not require us to learn “true features” as long as we have some set of faithful, complete, minimal, and comprehensible heuristics
A central way people evaluate sparse autoencoders (SAEs) is whether they find a set of “true” features. Researchers have varying intuitions on what true features should be, but a common theme is that they should be atomic (i.e., not composed of linear combinations of other features). This has led to people worrying that the sparsity term in the SAE loss leads to models combining commonly occurring atomic features into a single one (e.g., a red triangle feature instead of a red feature and a triangle feature, see also the recent work on feature absorption).
While learning intermediate variables in neural networks is a useful subgoal, I’m worried that the pursuit of atomic features—especially given that we can already get some sort of feature decompositions—is not the most productive task we could work on right now.
We should only care about features insofar as they are the inputs and outputs of heuristics/circuits, and we should only care about monosemanticity insofar as it helps us understand the network. If our heuristic decomposition is faithful, complete, and minimal, it doesn’t matter if individual heuristics take non-atomic concepts as inputs as long as we humans can understand the composed concept (likely given AI aid).
Weak to strong winning
Here are various degrees of winning if the heuristics hypothesis is true.
Weak victory: We can decompose every forward pass into heuristics composed with each other. That is, we can throw away the rest of the activations and use only the heuristics to reconstruct the input-output relationship to a high fidelity.
Medium victory: In addition to individual forward passes, we understand sets of heuristics that a model uses to solve what humans can think of as “tasks.” That is, we understand all heuristics that handle a certain class of inputs (e.g., the IOI circuit).
Strong victory: We know every heuristic in the model and how they compose, which is analogous to the “reverse engineer a neural network” end goal.
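One crude way to put a number on the "high fidelity" in the weak victory condition is to compare the model against a heuristics-only reconstruction on a set of inputs. The sketch below is only that: the agreement metric, the tolerance, and the toy model and reconstruction are all assumptions for illustration, and scalar outputs are assumed for simplicity.

```python
def reconstruction_fidelity(model, reconstruction, inputs, tol=1e-6):
    """Fraction of inputs where the reconstruction matches the model within `tol`."""
    if not inputs:
        raise ValueError("need at least one input")
    matches = sum(1 for x in inputs if abs(model(x) - reconstruction(x)) <= tol)
    return matches / len(inputs)


# Hypothetical stand-ins: the "model" computes |x|, while the recovered
# heuristics only handle the non-negative case.
def toy_model(x):
    return abs(x)


def toy_heuristics(x):
    return x if x >= 0 else 0.0


print(reconstruction_fidelity(toy_model, toy_heuristics, [-2, -1, 0, 1, 2]))
```

A score well below 1 on some class of inputs, as with the negative numbers here, would indicate that the bag is missing heuristics for that class.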
Miscellaneous thoughts on interpretability with heuristics hypothesis
Interpretability with heuristics is not very different from existing circuits analysis. The main ideas I came up with are the focus on heuristics as the key unit of analysis and being explicitly OK with many different potential explanations/levels of abstraction. As a result, it’s not clear if there’s anything major that we need to do differently. Sparse feature circuits, transcoders, and automated circuit discovery techniques already popular in the literature seem to be reasonable ways to proceed even if our end goal is a set of heuristics.
However, given that a weak victory does not require an enumeration of all features/heuristics, it might be worth the time to try to discover more compute efficient ways to understand a single forward pass.
I also haven’t defined what a heuristic is because I’m genuinely not sure what the best level of abstraction would be. Here are some types of simple functions that I would consider a “heuristic”
Let me know if you have any other ideas.
A few more thoughts on verifying how correct the heuristics-based explanations are. I think there are two levels, the heuristics level, and the model level. At the heuristics level, we want to make sure that each individual heuristic is faithful to the underlying neural network computation. Ideally this could be done at the weight level, but we can also apply our bag of existing interpretability techniques.
At the model level, my hope is that we can use our interpretability techniques to discover new algorithms, in the form of composing heuristics, that we don’t know how to write. One of my first memorable interactions with ChatGPT was when I asked it to help me rephrase some survey questions I was working on, and it was actually really helpful. We currently have no idea how to write down a program to do that! Learning all the heuristics involved for various tasks could be a path towards some form of Microscope AI. And, as is the case with circuits analysis, these algorithms fall out naturally once we construct the heuristics.
What does it mean for alignment theory if the heuristics hypothesis is true?
(I’ve spent orders of magnitude less time and effort on this section compared to the interp section, but I figured I’d mention a few ideas and collect some feedback. If people actually like this hypothesis I’ll spend some more time thinking through this)
I’m not super sure if the heuristics framework alone could make concrete predictions on key aspects of alignment theory. You can approximate any function arbitrarily closely with heuristics. In other words, as systems advance, any sort of high-level behavior could emerge even if it’s all heuristics operating below (see, e.g., Interpretability/Tool-ness/Alignment/Corrigibility are not Composable, which is also a problem when we aggregate heuristics from each layer together).
However, the more powerful future model you’re worried about won’t just fall out of a coconut tree.[6] We need to understand its learning process and how it became powerful.
If learning more heuristics is all we need to get to more and more powerful systems, we should understand what types of heuristics and heuristics composition are learned first. Two relevant papers that come to mind are the quantization model of scaling, and work on which concepts are learned first in toy models by Park et al. Work done here could help us understand if we’ll see, for example, capabilities generalization without alignment generalization. Generally though, it would be cool to see how alignment-related concepts are learned and used compared to non-alignment-related concepts.
On the surface, it seems like shard theory is more likely to be correct in the world where the heuristics hypothesis is true, although shards are higher-level abstractions compared to heuristics. I’d want to see some more concrete interpretability findings before making a strong claim though.
One opinion that I hold a bit more strongly after thinking through this post is that we could continue to get very economically useful models that are nonetheless incoherent in other ways. In the heuristics world, there's less reason to believe in discontinuous jumps in performance, and more reason to believe that AIs will get really good at some things while still bad at others (see also Boaz on the shape of AGI).
Empirical studies related to the heuristics hypothesis (both in support and against)
We need to keep in mind that the streetlight effect is certainly contaminating our evidence. That is, simple heuristics are easier for interpretability researchers to recover than complex data structures, and we should expect more evidence for them.
It’s also cool to think through some other, general neural network phenomena with the heuristics hypothesis in mind. It makes sense for the network to have some sort of redundancies (e.g., backup name mover heads) if there are similar heuristics learned at the same time. It makes sense that you could get the network to output whatever arbitrary text you want with an optimized string, since you can activate a weird set of heuristics and compose them. As heuristics compose from one layer to the next, they would need intermediate variables to communicate their results. Thus it makes sense that activation engineering works well across a wide range of concepts.
(There are some other results that come to mind which make less sense, e.g., the 800 orthogonal code steering vectors, although I think Nina Panickssery’s explanation, if true, would be consistent with the heuristics hypothesis)
Weaknesses in the Heuristics Hypothesis
Some versions of the hypothesis are unfalsifiable
This theory’s biggest weakness is that we can decompose basically anything into a bag of composing heuristics given an infinite bag size. In other words, the heuristics hypothesis is technically consistent with every single hypothesis of how future systems would behave.
I do feel like this theory “explains too much.” However, the key interpretability-related claim is that heuristics based decompositions will be human-understandable, which is a more falsifiable claim.
The current features-focused research agendas might be the best way to uncover heuristics, and we don’t actually need to do anything different regardless of how true the heuristics hypothesis is.
The best way to locate heuristics might start with trying to find the most monosemantic/atomic features, understanding their functional implications, training transcoders, or following something like Lee Sharkey’s sparsify agenda. In other words, the new framing doesn’t add much. It’s also possible that we wouldn’t be able to achieve weak victory on a given task without understanding the whole task family, in which case the idea of a weak victory doesn’t really matter.
Perhaps this is true, but I think it’s worth thinking through this some more. I’m worried that the field has focused on features mostly due to path dependence from the original circuits thread that posited features as “the fundamental unit of neural networks” (although certainly not all researchers are focused on features, see e.g. Geiger’s causal abstractions agenda). Also, training SAEs that catalog all the features of a model is expensive and unnecessary for the weak victory condition I mentioned above. Trying to find cheaper interpretability techniques that are just meant to understand individual forward passes seems like a worthy thing to try.
Getting heuristics that are causally related to a specific output does not necessarily help monitor a model’s internal thoughts.
I’m not super sure if that’s true? I think it’s reasonable to assume that two different sets of heuristics would be active in the case where the model is deceptive versus not deceptive, even conditional on the final token logits looking the same.
Inspirations and related work that I haven’t already mentioned
Potential next steps
(Yet another section that I wrote rather quickly in the interests of getting some more feedback. I’m also ~60% sure that my specific research interests will shift in the next six months)
I can see four major directions for further exploration of the heuristics hypothesis
Deconfusion: What exactly is a heuristic, and what does a heuristics-based explanation look like?
Although this is a fairly fundamental question, I’m not super worried about needing to get this completely right before trying to look for heuristics in novel settings. I think we can make a lot of progress even with imperfect definitions. Still, applying the heuristics perspective to circuits we already understand (e.g., IOI) and trying to formalize what exactly heuristics are and aren’t seems useful.
Creating new interpretability methods that are centered around heuristics as the fundamental unit
This is speculative, but it might be worth spending some time to figure out if there are ways to directly study heuristics as their own unit. Distributed alignment search (DAS) is the closest idea that comes to mind, but (to my understanding, I could be wrong, sorry!) DAS is a supervised method that requires the researcher to have some causal model in mind before trying to find it in the neural network. Transcoders represent another attempt, but those require cataloging all features in the training data.
The worry is that the field got locked into looking for features and feature circuits for mostly path-dependence reasons, and there could be some low hanging fruit if we just thought harder about heuristics, especially given the recent evidence that they might play a big role.
Using existing interpretability tools to discover heuristics
This is a much more tractable option to better understand heuristics, especially given the similarities between heuristics and circuit building.
We can try to catalog individual heuristics manually by coming up with natural language tasks where we believe that the model would need to execute some heuristic at one point. By studying various individual heuristics, it could also help inspire specialized interpretability techniques to uncover them en masse. For example,
We could also leverage existing SAEs and treat features as inputs/outputs of heuristics. In this case, I’m hoping to advance beyond the gradient-based attributions used in studies such as the Sparse Feature Circuits paper. We can perhaps use gradient attribution to narrow down on the nodes and edges that we care about, but then focus on how, operationally, each edge is formed. The gradient attribution gives us only if-then relationships. Is that what’s locally happening with the model?
Applying the heuristics-framework to study theoretical questions in alignment.
If we decide that the heuristic model of computation is true/useful, I’d be most excited to use it to study more theoretical topics and perhaps use it to forecast where future capabilities gains could come from. For example, Alex Turner said (two years ago):
For example, we could study the dynamics of heuristics learning and composition in real world models, especially heuristics related to turning base models into assistants. One guess is that RLHF is sample efficient because it mostly changes how heuristics are composed with each other (and maybe boosts existing heuristics to be more active), which might be a lot easier than learning new heuristics.[7] This would build on top of work done on toy models by Hidenori Tanaka’s group, and also maybe the quanta scaling hypothesis.
I’m currently trying to get into the AI safety field and will also be applying to MATS. Let me know if you’re interested in chatting more about any of these topics. Have a low bar for reaching out.
This post benefited from the feedback from Jack Zhang, Joe Campbell, Mat Allen, Tim Kostolansky, Veniamin Veselovsky, and woog. All errors are my own.
This definition of generalization comes from Okawa et al. (2023)
Source? I made it up
I really struggled to understand this paper :( Would be down to go through it with someone.
[Citation needed]
Funny story but I almost wrote down Rome. The real rank one model editing is the one they did to my brain.
See also, AI presidents discuss AI alignment agendas.
Counterpoint: maybe learning new heuristics is easy and frontier models just have a good ability to learn by the time they’re done with pretraining.