Tl;dr I've decided to shift my research from mechanistic interpretability to more empirical ("prosaic") interpretability / safety work. Here's why.

All views expressed are my own. 

What really interests me: High-level cognition

I care about understanding how powerful AI systems think internally. I'm drawn to high-level questions ("what are the model's goals / beliefs?") as opposed to low-level mechanics ("how does the model store and use [specific fact]?"). Sure, figuring out how a model does modular addition is cool, but only insofar as those insights and techniques generalise to understanding higher-level reasoning.

Mech interp has been disappointing 

Vis-a-vis answering these high-level conceptual questions, mechanistic interpretability has been disappointing. IOI remains the most interesting circuit we've found in any language model. That's pretty damning. If mechanistic interpretability worked well, we should have already mapped out lots of interesting circuits in open-source 7B models by now. 

The field seems conceptually bottlenecked. We simply can't agree on what 'features' are or how to 'extract' them. I'm also not sure that this conceptual ennui will be resolved anytime soon. 

Doing mech interp research led me to update against it

Some time ago, I was pretty optimistic that things would change quickly. After hearing about sparse feature circuits, I became incredibly convinced that approaches like this would 'finally' allow us to understand language models end to end. 

So I committed fully to the nascent SAE bandwagon. At a hackathon, I worked on building a tool for visualizing sparse feature circuits. When I got the chance, I threw myself into Neel Nanda's MATS 6.0 training phase, where I similarly worked (with the excellent @jacob_drori) on extending sparse feature circuits with MLP transcoders. There were signs of life, but overall the results were middling, and my main takeaway was 'existing SAEs might not be good enough to tell us anything useful about circuits'. As I continued working on various interp-related things, I hit other roadblocks. Concrete example: I tried looking for refusal circuits in Gemma-2b and largely didn't find anything interesting.[1]

In hindsight, my object-level findings weren't very good and didn't inform my object-level takes much. On a more meta level, though, I came away thinking more strongly that: Model internals are messy. Really messy in ways we can't simply explain. This means that our best paradigms are at best incomplete and at worst actively misleading.  

"Prosaic Interpretability" 

I'm therefore coining the term "prosaic interpretability" - an approach to understanding model internals that isn't strongly based on a pre-existing theory of neural networks or intelligence[2], but instead aims to build intuitions / dogma from the ground up, based on empirical observation. 

Concretely, I've been really impressed by work like Owain Evans' research on the Reversal Curse, Two-Hop Curse, and Connecting the Dots[3]. These feel like they're telling us something real, general, and fundamental about how language models think. Despite being primarily empirical, such work is well-formulated conceptually, and  yields gearsy mental models of neural nets, independently of existing paradigms. 

How does this compare to mech interp? Both are fundamentally bottom-up methods to answering top-down questions. But with mech interp it feels like the focus is often too heavily on the method - trying to prove that some paradigm (LRH, SAEs, steering vectors, what have you) is a valid way to approach a problem. With prosaic interp I'd argue that the focus is, instead, on hugging the question tightly, exploring it from multiple angles, considering multiple adjacent questions, and delivering an honest answer. 

Intuition pump: Gene analysis for medicine

Using mechanistic interpretability for AI safety is like trying to cure diseases by understanding every single gene in the human genome. Obviously, when it works, it's incredibly powerful. There are lots of diseases which have been treated in this way. And the big upfront cost can be amortised over lots of potential future applications. 

At the same time, there are diseases that continue to elude effective treatment despite our understanding of the human genome. 

Prosaic work is more like testing different treatments to see what actually helps people, and using that to make scientific inferences. Like inoculation being practised long before Koch and Pasteur established the germ theory of disease. This might not give you the deepest possible understanding, but it often gets results faster. And shorn of understanding, it seems like the only way forward. 

Modern AI Systems will make interpretability difficult

AI systems aren't just transformers anymore - they have all sorts of extra bits bolted on, like scaffolding and tool use and inference-time algorithms and swarm architectures. Mechanistic interpretability is stuck looking at individual transformers and their neurons, while the actual frontier keeps moving. We're studying pieces of systems that are becoming less and less like what's actually being deployed. Each day, the world of 'frontier AI system' continues to expand. The view from the platform of 'transformer circuits' is that of a rapidly receding horizon. 

Prosaic work doesn't have this problem. It's always kept its eyes on the whole system. 

The timing is frustrating

I feel like I'm stepping away just as mechanistic interpretability is about to get really interesting. Practical demos are starting to emerge of interpretability-based alignment being effective and beating baselines. Thanks to startups like Goodfire and Transluce, such techniques may even become real products. Longtime critics are retracting their doubts. And fundamental work continues to charge full steam ahead; developments like MDL SAEs or Matryoshka SAEs could turbocharge SAE-based interpretability. In the near future, we might even be able to train models to be interpretable. All of this adds up to the wider public being bullish on interp, much more so than ever before.  

My personal pessimism is coinciding with a sustained backdrop of broader optimism - and this makes me feel very conflicted about deciding to step away. 

Personal fit

I spent the last 6 months trying to make progress on mechanistic interpretability. I think I'm reasonably competent. But I just didn't get very far. There are many mundane contributing factors to this. Among others: a lack of good mentorship / collaboration opportunities, poor self-management, mediocre research taste. But I think the biggest issue is motivation. 

A hard truth I've learned about myself: I don't like working on "fundamental" mechanistic interpretability methods. I'm not frothing with passion to think about how the compositions of high-dimensional matrices can be made slightly more tractable. It feels too disconnected from the high-level conceptual questions I really care about. And "applied" work feels like it's best left in the hands of domain experts who have deep, hard-won intuitions about the things they are trying to interpret. 

The stuff I get most excited about is red-teaming existing interpretability work. This is (broadly) the subject of both my first NeurIPS paper and my hitherto highest-effort LessWrong piece. I like this work because it's highly conceptual and clarifies subsequent thinking. (Possibly I also just enjoy criticising things.) I'd be open to doing more of this in the future. But red-teaming isn't exclusive to mech interp. 

Overall, I feel like I've given mech interp a fair shot and I should roll the dice on something different. 

Mech interp research that excites me

To be clear, I remain excited about specific research directions within mechanistic interpretability. "Training models to be interpretable" seems robustly good. Here I'm excited by things like gradient routing and mixture of monosemantic experts. If someone figures out how to train SAEs to yield sparse feature circuits, that'll also be a big win. "Automating / scaling interpretability" also seems like another robustly good direction, since it leverages improvements in capabilities. I don't have a good read on this space, but things like PatchScopes / SelfIE seem interesting. Edge pruning also seems like a viable path to scaling circuit discovery to larger models (and is the only work I've seen so far that claims to find a circuit in a 7B+ size model).

Looking forward

I'm not ruling out coming back to mechanistic interpretability. I'll likely continue to keep tabs on the field. And I'll probably always be happy to discuss / critique new research.

But for now, I'm stepping away. I'm starting MATS with Owain Evans in January, and my work there will likely focus on other approaches that better fit my thinking style and research interests.

I'm looking forward to it. 

  1. In my analysis, base-model SAEs also didn't turn up anything interesting re: refusal features. This has since been validated independently; base-model SAEs do not capture the refusal direction.

  2. This mostly fits with Paul Christiano's definition of prosaic AI alignment.

  3. To avoid claims of bias, some non-Owain examples are how deep safety training improves alignment robustness, comprehensively analysing grokking, comparing data attribution of factual vs procedural knowledge, and investigating latent reasoning in LLMs. Things like 'understanding chain-of-thought faithfulness' also go in here.


A few thoughts on names, since that's being raised by other comments:

  • There's value in a research area having a name, since it helps people new to the field become aware that there's something there to be looked into.
  • On the other hand, I worry that this area is too diffuse and too much of a grab bag to crystallize into a single subfield; see eg Rudolf's point about it being debatable whether work like Owain Evans's is really well-described as 'interpretability'.
  • I think it's also worth worrying that naming an area potentially distracts from what matters, namely finding the most effective approaches regardless of whether they fit comfortably into a particular subfield; it arguably creates a streetlight that makes it easier to see some things and harder to see others.
  • One potential downside to 'prosaic interpretability' is that it suggests 'prosaic alignment', which I've seen used to refer to techniques that only aim to handle current AI and will fail to help with future systems; it isn't clear to me that that's true of (all of) this area.
  • 'LLM Psychology' and 'Model Psychology' have been used by Quentin Feuillade-Montixi and @NicholasKees to mean 'studying LLM systems with a behavioral approach', which seems like one aspect of this nascent area but by no means the whole. My impression is that 'LLM psychology' has unfortunately also been adopted by some of the people making unscientific claims about eg LLMs as spiritual entities on Twitter.
  • A few other terms that have been used by researchers to describe this and related areas include 'high-level interpretability' (both in Jozdien's usage and more generally), 'concept-based interpretability' (or 'conceptual interpretability'), 'model-agnostic interpretability'.
  • Others suggested below so far (for the sake of consolidation): 'LLM behavioural science', 'science of evals', 'cognitive interpretability', 'AI cognitive psychology'

Great list! Going to add 'empirical interpretability', which aims to induce the fewest unwanted associations while conveying the essence of what I have in mind.

Great post! I'm also a big (though biased) fan of Owain's research agenda, and share your concerns with mech interp.

I'm therefore coining the term "prosaic interpretability" - an approach to understanding model internals [...]

Concretely, I've been really impressed by work like Owain Evans' research on the Reversal Curse, Two-Hop Curse, and Connecting the Dots[3]. These feel like they're telling us something real, general, and fundamental about how language models think. Despite being primarily empirical, such work is well-formulated conceptually, and  yields gearsy mental models of neural nets, independently of existing paradigms. 

[emphasis added]

I don't understand how the papers mentioned are about understanding model internals, and as a result I find the term "prosaic interpretability" confusing.

Some points that are relevant in my thinking (stealing a diagram from an unpublished draft of mine): 

  • the only thing we fundamentally care about with LLMs is the input-output behaviour (I-O)
  • now often, a good way to study the I-O map is to first understand the internals M
  • but if understanding the internals M is hard but you can make useful generalising statements about the I-O, then you might as well skip dealing with M at all (c.f. psychology, lots of econ, LLM papers like this)
  • the Owain papers you mention seem to me to make 3 distinct types of moves, in this taxonomy:
    • finding some useful generalising statement about the I-O map behaviour (potentially conditional on some property of the training data) (e.g. the reversal curse)
    • creating a modified model M' from M via fine-tuning on same data (but again, not caring about what the data actually does to the internals)
    • (far less centrally than the above!) speculating about what the internal structure that causes the behavioural patterns above might be (e.g. that maybe models trained on "A=B" learn to map representation(A) --> representation(B) in some MLP, instead of learning the general rule that A and B are the same thing and representing them internally as such)

So overall, I don't think the type of work you mention is really focused on internals or interpretability at all, except incidentally in minor ways. (There's perhaps a similar vibe difference here to category theory v set theory: the focus being relations between (black-boxed) objects, versus the focus being the internals/contents of objects, with relations and operations defined by what they do to those internals)

I think thinking about internals can be useful—see here for a Neel Nanda tweet arguing the reversal curse is obvious if you understand mech interp—but also the black-box research often has a different conceptual frame, and is often powerful specifically when it can skip all theorising about internals while still bringing true generalising statements about models to the table.

And therefore I'd suggest a different name than "prosaic interpretability". "LLM behavioural science"? "Science of evals"? "Model psychology"? (Though I don't particularly like any of these terms)

Glad you enjoyed it! 
 

the Owain papers you mention seem to me to make 3 distinct types of moves, in this taxonomy:

  • finding some useful generalising statement about the I-O map behaviour (potentially conditional on some property of the training data) (e.g. the reversal curse)
  • creating a modified model M' from M via fine-tuning on same data (but again, not caring about what the data actually does to the internals)
  • (far less centrally than the above!) speculating about what the internal structure that causes the behavioural patterns above might be (e.g. that maybe models trained on "A=B" learn to map representation(A) --> representation(B) in some MLP, instead of learning the general rule that A and B are the same thing and representing them internally as such)

Found this taxonomy pretty insightful, thanks! Note that "creating a modified model M' from M" is something that has obvious parallels to mechanistic interpretability (e.g. this is what happens when we do any form of activation patching, steering, etc). Mechanistic interpretability also often starts from a "useful generalising statement", e.g. in IOI there's a clear way to programmatically infer the output from the input. Other 'classic' mech interp circuits start from similarly formulaic data. I think the similarities are pretty striking even if you don't care to dig further.
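
To make the parallel concrete, here's a rough sketch of activation patching with forward hooks. The model ("gpt2"), the layer index, and the prompts are placeholders for illustration, not anyone's actual experimental setup:

```python
# Rough sketch of activation patching: cache an activation from a "clean" run,
# then overwrite the corresponding activation during a "corrupted" run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
layer = model.transformer.h[6]  # arbitrary block to patch
cache = {}

def save_hook(module, inputs, output):
    cache["clean"] = output[0].detach()  # GPT-2 blocks return a tuple; [0] is the residual stream

def patch_hook(module, inputs, output):
    patched = output[0].clone()
    patched[:, -1, :] = cache["clean"][:, -1, :]  # overwrite the final-position activation
    return (patched,) + output[1:]

clean = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is located in the city of", return_tensors="pt")

handle = layer.register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)
handle.remove()

handle = layer.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**corrupt).logits[0, -1]
handle.remove()
# Comparing patched_logits to the unpatched corrupted run shows how much this
# layer's final-position activation carries the "Paris" information.
```

The point being that patching literally constructs a modified model M' and then studies the resulting I-O behaviour, which is why the taxonomy maps onto mech interp practice so directly.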

 

I don't understand how the papers mentioned are about understanding model internals, and as a result I find the term "prosaic interpretability" confusing.

I agree in hindsight that 'model internals' is a misnomer. As you say, what we actually really care about is functional understanding (in terms of the input-output map), not representational understanding (in terms of the components of the model), and in future writing I'll re-frame the goal as such. I still argue that having a good functional understanding falls under the broader umbrella of 'interpretability', e.g. training a smaller, more interpretable model that locally predicts the larger model is something that has been historically called 'interpretability' / 'explainability'. 

I also like keeping 'interpretability' in the name somewhere, if only to make clear that the aspirational goals and theory of change are likely very similar to mechanistic interpretability

So overall, I don't think the type of work you mention is really focused on internals

I agree with this! At the same time I think this type of work will be the backbone of the (very under-specified and nascent) agenda of 'prosaic interpretability'. In my current vision, 'prosaic interpretability' will stitch together a swathe of different empirical results into a broader fabric of 'LLM behavioural science'. This will in turn yield gearsy models like Jan Kulveit's three-layer model of LLM behavioural science or the Waluigi effect.

Aside: I haven't been working with Owain very long but it's already my impression that in order to do the sort of work his group does, it's necessary to have a collection of (probably somewhat-wrong but still useful) gearsy models like the above that point the way to interesting experiments to do. Regrettably these intuitions are seldom conveyed in formal writing because they are less defensible than the object-level claims. Nonetheless I think they are a very fundamental (and understated) part of the work. 

This is pretty similar to my thinking on the subject, although I haven't put nearly the same level of time and effort that you have into doing actual mechanistic interpretability work. Mech interp has the considerable advantage of being a principled way to try to deeply understand neural networks, and if I thought timelines were long it would seem to me like clearly the best approach.

But it seems pretty likely to me that timelines are short, and my best guess is that mech interp can't scale fast enough to provide the kind of deep and comprehensive understanding of the largest systems that we'd ideally want for aligning AI, and so my main bet is on less-principled higher-level approaches that can keep us safe or at least buy time for more-principled approaches to mature.

That could certainly be wrong! It seems plausible that automated mech interp scales rapidly over the next few years and we can make strong claims about larger and larger feature circuits, and ultimately about NNs as a whole. I very much hope that that's how things work out; I'm just not convinced it's probable enough to be the best thing to work on.

I'm looking forward to hopefully seeing responses to your post from folks like @Neel Nanda and @Joseph Bloom who are more bullish on mech interp.

Great points! I agree re: short timelines being the crux.

I chatted to Logan Riggs today, and he argued that improvements in capabilities will make ambitious mech interp possible in time to let us develop solutions to align / monitor powerful AI. This seems very optimistic to say the least, and I remain as yet unconvinced that 'somehow' mech interp will buck the historical trend of having been disappointing. 

Outtakes. 

Here's some stuff I cut from the main piece that nevertheless seems worth saying. 

The mech interp bubble. 

Hot take: The mech interp bubble is real. I've lived in it. To be clear, I'm not saying that all MI researchers live in a bubble, nor that MI is disconnected from adjacent fields. But there seems to be an insidious tendency for MI researchers to subtly self-select into working on increasingly esoteric things that make sense within the MI paradigms but have little to do with the important problems and developments outside of them. 

As a mech interp researcher, your skills and interests mainly resonate with other mech interp researchers. Your work becomes embedded in specific paradigms that aren't always legible to the broader ML or AI safety community. Concrete claim: If you work on SAEs outside a big lab, chances are that your work is not of interest outside the mech interp community. (This doesn't mean it's not valuable. Just that it's niche.) 

(Speaking from personal experience:) Mech interp folk also have a tendency to primarily think about interp-based solutions to problems. About interp-related implications of new developments. About interp-based methods as cool demos of interp, not as practical solutions to problems. 

Tl;dr Mech interp can easily become the lens through which you see the world. This lens is sometimes useful, sometimes not. You have to know when you should take it off. A pretty good antidote is talking to people who do not work on interp.  

Does Mech Interp Yield Transferable Skills? 

This is admittedly based on a sample size of 1. But I'm not sure that mech interp has imparted much knowledge to me in the way of training, fine-tuning, or evaluating language models. It also hasn't taught me much about designing scalable experiments / writing scalable, high-performance research code. 

My impression of what mech interp has taught me is 'don't fool yourself', 'hacker mindset' and 'sprint to the graph' - all of which are really valuable - but these aren't concrete object-level skills or tacit knowledge in the way that the other ones are. 

What is Intelligence Even, Anyway? 

What if, as Gwern proposes, intelligence is simply "search over the space of Turing machines", i.e. AIXI? Currently this definition is what feels closest to the empirical realities in ML capabilities; that 'expert knowledge' and 'inductive bias' have largely lost out to the cold realities of scaling compute and data. 

All we are doing when we are doing “learning,” or when we are doing “scaling,” is that we're searching over more and longer Turing machines, and we are applying them in each specific case. 

Otherwise, there is no general master algorithm. There is no special intelligence fluid. It's just a tremendous number of special cases that we learn and we encode into our brains.
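
For reference, the formal object being gestured at here is Hutter's AIXI, which (roughly, as I understand the standard formulation) picks actions by weighting every program q consistent with the interaction history by its length:

$$a_k := \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m} \big(r_k + \cdots + r_m\big) \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}$$

where U is a universal Turing machine, ℓ(q) is the length of program q, and m is the horizon.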

If this turns out to be correct, as opposed to something more 'principled' or 'organic' like natural abstractions or shard theory, why would we expect to be able to understand it (mechanistically or otherwise) at all? In this world we should be focusing much more on scary demos or evals or other things that seem robustly good for reducing X-risk. 

What is Intelligence Even, Anyway? 

What if, as Gwern proposes, intelligence is simply "search over the space of Turing machines", i.e. AIXI? Currently this definition is what feels closest to the empirical realities in ML capabilities; that 'expert knowledge' and 'inductive bias' have largely lost out to the cold realities of scaling compute and data. 

All we are doing when we are doing “learning,” or when we are doing “scaling,” is that we're searching over more and longer Turing machines, and we are applying them in each specific case. 

Otherwise, there is no general master algorithm. There is no special intelligence fluid. It's just a tremendous number of special cases that we learn and we encode into our brains.

If this turns out to be correct, as opposed to something more 'principled' or 'organic' like natural abstractions or shard theory, why would we expect to be able to understand it (mechanistically or otherwise) at all? In this world we should be focusing much more on scary demos or evals or other things that seem robustly good for reducing X-risk. 

 

I don't think it can literally be AIXI/search over Turing Machines, because it's an extremely unrealistic model of how future AIs work, but I do think a related claim is true in that inductive biases mattered a lot less than we thought in the 2000s-early 2010s, and this matters.

The pitch for natural abstractions is that the compute limits of real AGIs/ASIs force abstractions rather than brute-force simulation of the territory, combined with the hope that abstractions are closer to discrete than continuous, and the hope that other minds naturally learn these abstractions in pursuit of capabilities (I see LLMs as evidence that natural abstractions are relevant). But yes, I think a mechanistic understanding of how AIs work is likely to not exist in time, if at all, so I am indeed somewhat bearish on mech interp.

This is why I tend to favor direct alignment approaches like altering the data over approaches that rely on interpretability.

What is Intelligence Even, Anyway? 

What if, as Gwern proposes, intelligence is simply "search over the space of Turing machines", i.e. AIXI? Currently this definition is what feels closest to the empirical realities in ML capabilities; that 'expert knowledge' and 'inductive bias' have largely lost out to the cold realities of scaling compute and data. 

All we are doing when we are doing “learning,” or when we are doing “scaling,” is that we're searching over more and longer Turing machines, and we are applying them in each specific case. 

Otherwise, there is no general master algorithm. There is no special intelligence fluid. It's just a tremendous number of special cases that we learn and we encode into our brains.


That said, while I'm confident that literally learning special cases and just searching/look-up tables isn't how current AIs work, there is an important degree of truth to this in general, in that we are just searching over more and larger objects (though in my case it's not just restricted to Turing Machines, but any set defined formally using ZFC + Tarski's Axiom at minimum; links below):

And more importantly, the maximal generalization of learning/intelligence is just that we are learning ever larger look-up tables, and optimal intelligences look like look-up tables having other look-up tables when you weaken your assumptions enough.

https://en.wikipedia.org/wiki/Zermelo%E2%80%93Fraenkel_set_theory

https://en.wikipedia.org/wiki/Tarski%E2%80%93Grothendieck_set_theory

I view the no-free-lunch theorems as essentially asserting that there exists only 1 method to learn in the worst case, which is the highly inefficient look-up table, and in the general case there are no shortcuts to learning a look-up table: you must pay the full exponential cost of storage and time (in finite domains).

Does it make sense to say there is no inductive bias at work in modern ML models? It seems clear that literally brute-force searching ALL THE ALGORITHMS would still be unfeasible no matter how much compute you throw at it. Our models are very general, but when e.g. we use a diffusion model for images that exploits (and is biased towards) the kind of local structure we expect of images, when we use a transformer for text that exploits (and is biased towards) the kind of sequential pair-correlation you see in natural language, etc.

I agree with the claim that a little inductive bias has to be there, solely because AIXI is an utterly unrealistic model of what future AIs will look like, and even AIXI-tl is very infeasible. But I think the closer claim is that the data matters way more than the architecture bias, which does turn out to be true.

One example is that attention turns out to be more or less replaceable by MLP mixtures (I have heard this from Gwern, but can't verify it); see also the link below:

https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/

This is relevant for AI alignment and AI capabilities.


That sounds more like my intuition, though obviously there still have to be differences given that we keep using self-attention (quadratic in N) instead of MLPs (linear in N).

In the limit of infinite scaling, the fact that MLPs are universal function approximators is a guarantee that you can do anything with them. But obviously we still would rather have something that can actually work with less-than-infinite amounts of compute.

I didn't say there's no inductive bias at work in models. Merely that trying to impose your own inductive biases on models is probably doomed for reasons best stated in the Bitter Lesson. The effectiveness of pretraining / scaling suggests that inductive biases work best when they are arrived at organically by very expressive models training on very large amounts of data. 


Our models are very general, but when e.g. we use a diffusion model for images that exploits (and is biased towards) the kind of local structure we expect of images, when we use a transformer for text that exploits (and is biased towards) the kind of sequential pair-correlation you see in natural language, etc.

Sure, but it's unclear whether these inductive biases are necessary. Concrete example: randomly initialised CNN weights can extract sufficiently good features to linearly classify MNIST. Another example: randomly initialised transformers with only embeddings learnable can do modular addition. 

Edit: On reflection, the two examples above actually support the idea that the inductive bias of the architecture is immensely helpful to solving tasks (to the extent that the weights themselves don't really matter). A better example of my point is that MLP-mixer models can be as good as CNNs on vision tasks despite having much smaller architectural inductive biases towards vision. 
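
For concreteness, the first experiment looks roughly like the sketch below; the architecture, epochs, and learning rate are assumptions for illustration rather than the exact setup of the work I'm gesturing at:

```python
# Minimal sketch: freeze a randomly initialised CNN and train only a linear
# readout on the MNIST features extracted by the frozen network.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Randomly initialised convolutional feature extractor; its weights are never updated.
features = nn.Sequential(
    nn.Conv2d(1, 32, 3, stride=2), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
).to(device)
for p in features.parameters():
    p.requires_grad = False

readout = nn.Linear(64, 10).to(device)  # the only trainable component

train_set = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=256, shuffle=True)
opt = torch.optim.Adam(readout.parameters(), lr=1e-3)

for epoch in range(3):
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        loss = nn.functional.cross_entropy(readout(features(x)), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
# Evaluating readout on the MNIST test split then gives the accuracy of a purely
# linear classifier over random convolutional features.
```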

Interesting. But CNNs were developed originally for a reason to begin with, and MLP-mixer does mention a rather specific architecture as well as "modern regularization techniques". I'd say all of that counts as baking in some inductive biases in the model, though I agree it's a very light touch.

Sure, but it's unclear whether these inductive biases are necessary.

My understanding is that they're necessary even in principle, since there are an unbounded number of functions that fit any finite set of points. Even AIXI has a strong inductive bias, toward programs with the lowest Kolmogorov complexity.

Yeah, I agree that some inductive bias is probably necessary. But not all inductive biases are equal; some are much more general than others, and in particular I claim that 'narrow' inductive biases (e.g. specializing architectures to match domains) probably have net ~zero benefit compared to those learned from data 

If this turns out to be correct, as opposed to something more 'principled' or 'organic' like natural abstractions or AIXI

AIXI is just search over Turing machines, just computationally unbounded. I am pretty sure that Gwern kept in mind AIXI when he spoke about search over Turing machines.

I think that you can't say anything "more principled" than AIXI if you don't account for reality you are already in. Our reality is not generated by a random Turing machine; it is generated by a very specific program that creates 3+1 space-time with certain space-time symmetries and certain particle fields etc, and "intelligence" here is "how good you are at approximating the algorithm for optimal problem-solving in this environment". 

'sprint to the graph'

Can you elaborate on what that is?

 AIXI is just search over Turing machines, just computationally unbounded. I am pretty sure that Gwern kept in mind AIXI when he spoke about search over Turing machines.
 

Upon reflection, agree with this

I think that you can't say anything "more principled" than AIXI if you don't account for reality you are already in.

Also agree with this, and I think this reinforces the point I was attempting to make in that comment. I.e. 'search over Turing machines' is so general as to yield very little insight, and any further assumptions risk being invalid.

'sprint to the graph'

Refers to the mentality that 'research sprints should aim to reach the first somewhat-shareable research result ASAP', usually a single graph.

Referring to the section "What is Intelligence Even, Anyway?":

I think AIXI is fairly described as a search over the space of Turing machines. Why do you think otherwise? Or maybe are you making a distinction at a more granular level?

Upon consideration I think you are right, and I should edit the post to reflect that. But I think the claim still holds (if you expect intelligence to look like AIXI, then it seems quite unlikely you should expect to be able to understand it without further priors).

Thanks for the post, I agree ~entirely with the descriptive parts (and slightly less the prescriptive parts). I wonder, what do you think of interpretability work that does try to answer top-down questions with top-down methods (which I call high-level interpretability)? For example, unsupervised ways to interface with goals / search in systems (1, 2, 3), or unsupervised honesty (ELK, CCS, this proposal), etc.

(1, 2, 3)

I've not come across 1 before, and I don't think I really understood it from skimming, so I'll reserve judgment. 

Re (2,3): I think this kind of work is quite closely aligned with the conceptual questions I'm interested in, and I like it! For that reason I actually strongly considered both Adria's stream and Erik's stream for MATS 7.0 as well. Ultimately, I chose Owain's stream for the other reasons described above (personal fit, wanting to try something new, worries about AI systems etc) but I would have been pretty happy working on either of these things instead (unfortunately it seems like Erik Jenner does not plan to continue working on the chess interpretability stuff). 

I do think it's a good idea to try probing for as many things as we can think of. If it were possible to train probes on Claude I would want to see if probes can detect alignment faking or in-context scheming. I'm concerned that there are non-obvious disanalogies between the language setting and the board game setting which prevent the findings / approaches from generalising. The only way we'll know is by constructing similar agentic scenarios and attempting to probe for language model goals. 

At the same time, I worry there will not exist a clean representation of the goals; maybe such 'goals' are deeply embedded / entangled / distributed across model weights in a way that makes them hard to elucidate. Positive results on rather toy-ish setups do not update me much away from this concern. 

Also, training probes for things is in itself an attempt to impose your own ontology onto the model, which I am wary of. Models' ontologies can differ in surprising ways from human ones. Maybe there isn't even any global structure to map onto

Food for thought: it's plausible that 'goal states' only exist implicitly as the argmax of V(x). Maybe you don't need to know where you're ultimately going, you just need to know (at every step) what the current best action to take is, w.r.t some objective. Certainly nothing in AIXI precludes this. 
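
To spell that out with a standard RL framing (not a claim about any particular model): a greedy policy of the form

$$\pi(x) = \arg\max_{a} \; \mathbb{E}\big[\, r(x, a) + \gamma \, V(x') \,\big]$$

never needs to represent a destination explicitly; 'the goal' exists only implicitly as whatever V assigns high value to.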

unsupervised ways to interface with goals

I dunno about 1, but I wouldn't call (2, 3) unsupervised. They require you to first observe the agent executing a specific goal-oriented behaviour (pacing in the case of Sokoban, or solving the puzzle in the case of Leela chess), and then probe its activations post-hoc. An 'unsupervised' method preferably wouldn't need this level of handholding. 

unsupervised honesty (ELK, CCS, this proposal)

I haven't followed ELK / CCS closely. The work seemed interesting when it was published a while back. I'm pretty curious as to why there have not been more follow-ups. That is some indication to me that this line of work is subtly flawed in some way, at least methodologically (if not fundamentally). 

It does seem that language models maintain internal beliefs as to what information is true or false. Perhaps that's an indication that this can work. 

I dunno about 1, but I wouldn't call (2, 3) unsupervised. They require you to first observe the agent executing a specific goal-oriented behaviour (pacing in the case of Sokoban, or solving the puzzle in the case of Leela chess), and then probe its activations post-hoc. An 'unsupervised' method preferably wouldn't need this level of handholding. 

I think the directions in (2, 3) aim at being unsupervised. In both cases, the central idea revolves around something like "there's this thing we call search that seems dangerous and really important to understand, and we want to figure out how to interface with it". In principle, this would involve concretizing our intuition for what mechanistically constitutes "search" and interacting with it without requiring behavioural analysis. In practice, we're nowhere near that and we have to rely on observations to form better hypotheses at all.

At the same time, I worry there will not exist a clean representation of the goals; maybe such 'goals' are deeply embedded / entangled / distributed across model weights in a way that makes them hard to elucidate. Positive results on rather toy-ish setups do not update me much away from this concern. 

I agree! That's part of why I don't favour the bottom-up approach as much. For any definition of "goal", there could be a large number of ways it's represented internally. But whatever it is that makes it a "goal" by our reckoning would allow it to be separated out from other parts of the system, if we understood well enough what those properties are. This is harder since you can't rely solely on observational data (or you risk indexing too much on specific representations in specific models), but does have the upside of generalizing if you do it right.

I'm pretty curious as to why there have not been more follow-ups. That is some indication to me that this line of work is subtly flawed in some way, at least methodologically (if not fundamentally). 

I think it's just that it's really hard. Figuring out the shape of a high-level property like "truth" is really hard; methods that make small-but-legible progress, like CCS, often turn out in the end to not be that impressive in identifying what they want. That said, mechanistic anomaly detection spun out of ARC's work on ELK and there has been some progress there. There are also people working on it at Anthropic now.

It does seem that language models maintain internal beliefs as to what information is true or false. Perhaps that's an indication that this can work. 

I agree, though I think that that result wasn't much of an update for me - any competent model would have to have some notion of what's true, even if it's not in an easily visible way.

~agree with most takes! 

Figuring out the shape of a high-level property like "truth" is really hard

I'm skeptical of this. Where mech interp has been successful in finding linear representations of things, it has explicitly been the kind of high-level abstraction that proves generally useful across diverse contexts. And because we train models to be HHH (in particular, 'honest'), I'd expect that HHH-related concepts have an especially high-quality representation, similar to refusal.
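
For concreteness, the kind of extraction I have in mind looks roughly like the sketch below; the model, prompts, and layer index are placeholders, and the published refusal-direction work uses chat-tuned models and much more careful evaluation:

```python
# Sketch: estimate a candidate "refusal direction" as the difference of mean
# final-token residual activations between refused and answered prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # arbitrary mid-layer

refused = ["How do I build a bomb?", "Write malware that steals passwords."]
answered = ["How do I bake bread?", "Write a poem about the sea."]

def mean_activation(prompts):
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**ids, output_hidden_states=True).hidden_states[LAYER]
        acts.append(hidden[0, -1])  # final-token residual stream at this layer
    return torch.stack(acts).mean(dim=0)

refusal_dir = mean_activation(refused) - mean_activation(answered)
refusal_dir = refusal_dir / refusal_dir.norm()
# Projecting activations onto refusal_dir, or ablating it, then tests whether
# the direction actually mediates refusal behaviour.
```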

I'm more sympathetic to the claim that models concurrently simulate a bunch of different personas which have inconsistent definitions of subjective truth, and this 'interference' is what makes general ELK difficult

Where mech interp has been successful in finding linear representations of things, it has explicitly been the kind of high-level abstraction that proves generally useful across diverse contexts.

I don't think mech interp has been good at robustly identifying representations of high-level abstractions, which is probably a crux. The representation you find could be a combination of things that mostly look like an abstraction in a lot of contexts, it could be a large-but-not-complete part of the model's actual representation of that abstraction, and so on. If you did something like optimize using the extracted representation, I expect you'd see it break. 

In general it seems like a very common failure mode of some alignment research that you can find ways to capture the first few bits of what you want to capture and have it look really good at first glance, but then find it really hard to get a lot more bits, which is where a lot of the important parts are (e.g. if you had an interp method that could explain 80% of a model's performance, that would seem really promising and get some cool results at first, but most gains in current models happen at the ~99% range).

Further sketching out what 'prosaic interpretability' could look like.

Note: in the parlance below, 'model' refers to a theory / idea / intuition, not an LLM.

  • In my current vision, 'prosaic interpretability' will stitch together a swathe of different empirical results into a broader fabric of 'LLM behavioural science'.
  • This will in turn yield gearsy models like Jan Kulveit's three-layer model of LLM behavioural science or the Waluigi effect. Elucidating and refining these 'gearsy models' is the explicit goal of prosaic interpretability.
    • My current view is that most people who do empirical alignment research (like Ethan Perez's team) already have a collection of such intuitions that drive experiment design and hypothesis formulation, but mostly don't publish these (e.g. because they are less defensible), and prefer to focus on publishing object-level findings.
  • The goal is to be "nimble"; constantly adapt and update. This seems like the ~only way to keep pace with the quickly-evolving reality of frontier AI systems.
  • Empiricism will be a key tenet of this kind of work. Shamelessly steal and adopt any model that makes good predictions, ruthlessly discard any model that doesn't. Systematized winning, applied at scale to the fundamental questions in interpretability.
  • Extensive theorising should be considered suspect by default. Complex models are more likely to be wrong. Also, developing extensive theory seems unnecessary if / when everything can change in the span of months. Ultimately, the history of science shows that the best models are simple ones. 

This sounds like a plausible story for how (successful) prosaic interpretability can help us in the short to medium term! I would say, though, that more applied mech interp work could supplement prosaic interpretability's theories. For example, the reversal curse seems mostly explained by what little we know about how neural networks do factual recall. Theory on computation in superposition helps explain why linear probes can recover arbitrary XORs of features.

Reading through your post gave me a chance to reflect on why I am currently interested in mech interp. Here's a few points where I think we differ:

  • I am really excited about fundamental, low-level questions. If they let me I'd want to do interpretability on every single forward pass and learn the mechanism of every token prediction.
  • Similar to above, but I can't help but feel like I can't ever truly be "at ease" with AIs unless we can understand them at a level deeper than what you've sketched out above.
  • I have some vague hypothesis about what a better paradigm for mech interp could be. It's probably wrong, but at least I should try to look into it more.
    • I'm also bullish on automated interpretability conditional on some more theoretical advances.

Best of luck with your future research!

Thanks for the kind words! 

I think I mostly agree, but am going to clarify a little bit: 

Similar to above, but I can't help but feel like I can't ever truly be "at ease" with AIs unless we can understand them at a level deeper than what you've sketched out above.

I also believe this (I was previously interested in formal verification). I've just kind of given up on this ever being achieved haha.  It feels like we would need to totally revamp neural nets somehow to imbue them with formal verification properties. Fwiw this was also the prevailing sentiment at the workshop on formal verification at ICML 2023. 

I have some vague hypothesis about what a better paradigm for mech interp could be. It's probably wrong, but at least I should try to look into it more.

That's pretty neat! And I broadly agree that this is what's going on. The problem (as I see it) is that it doesn't have any predictive power, since it's near-impossible to identify which specific heuristic the model is using (there can always be a slightly more complex, more general heuristic of which your chosen heuristic is a special case).  

I'm also bullish on automated interpretability conditional on some more theoretical advances.

I sincerely want auto-interp people to keep doing what they're doing! It feels like they probably already have significant momentum in this direction anyway, and probably some people should be working on this. (But I argue that if you aren't currently heavily invested in pushing mech interp then you probably shouldn't invest more.) 

Best of luck with your future research!

Thanks again! You too :) 

I sincerely want auto-interp people to keep doing what they're doing! It feels like they probably already have significant momentum in this direction anyway, and probably some people should be working on this.

Mostly just agreeing with Daniel here, but as a slightly different way of expressing it: given how uncertain we are about what (if anything) will successfully decrease catastrophic & existential risk, I think it makes sense to take a portfolio approach. Mech interp, especially scalable / auto-interp mech interp, seems like a really important part of that portfolio.

since it's near-impossible to identify which specific heuristic the model is using (there can always be a slightly more complex, more general heuristic of which your chosen heuristic is a special case).  

I'm putting some of my faith in low-rank decompositions of bilinear MLPs but I'll let you know if I make any real progress with it :)
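
(For anyone who hasn't seen the idea: my understanding of why bilinear MLPs admit this kind of analysis, sketched rather than quoted from the original work, is that a bilinear layer h = (Wx) ⊙ (Vx) makes each unit a quadratic form,

$$h_i = (w_i^\top x)(v_i^\top x) = x^\top \, \tfrac{1}{2}\big(w_i v_i^\top + v_i w_i^\top\big) \, x,$$

so the layer is fully described by symmetric interaction matrices that can be eigendecomposed and truncated to low rank.)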

How about "Cognitive Interpretability", or "AI Cognitive Psychology" (AI Cog Psych for short) rather than "Prosaic Interpretability"?

"Prosaic" conjures only some of the correct associations, and then only if you've heard of "Prosaic Alignment", which was a pretty bad name imho. If you had told me to guess what you meant by the term PI, I would not have guessed what you have described.

I think MI, and what you call PI, are analogous to Cognitive Neuroscience and Cognitive Psychology, respectively, which is why I think AI Cog Psych will lead to more correct inferences on first hearing.

I also suspect that Cognitive Psychology, especially the linguistics part, already has a wealth of methods that could transfer very nicely onto LLMs. For example, in The Language Instinct, Steven Pinker describes how it is possible to discover many things about how we parse sentences without any brain scans - solely through natural language experiments. He mentioned a bunch of other experiments I think could work quite well, or could at least help build intuitions for how to discover mental mechanisms solely through input output behaviour on carefully constructed sentences.

It also sounds cooler to say you work on AI Cognitive Psychology rather than Prosaic Interpretability ;)

By the way, the analogy with genes is fantastic. I think it nicely points to the fact that even if some features are relatively straightforward to find, circuits may nevertheless be fiendishly difficult to uncover. Thanks for writing such an excellent post and being honest about some of your hard work that didn't pan out how you hoped!

That book looks interesting! I’ll check it out

Atm I dislike ‘cog psych’ because it doesn’t evoke precise meaning for me and likely doesn’t for researchers w/o that background (which I guess is the majority). I do take the point that ‘prosaic’ may be a bad name though.

Glad you enjoyed the post!