Thanks for writing out your thoughts on this! I agree with a lot of the motivations and big picture thinking outlined in this post. I have a number of disagreements as well, and some questions:
It's unfortunate that mech interp inherits the CNC paradigm, because despite many years of research, turns out it's really hard to do computational science on brains, so computational neuroscience hasn't made a huge amount of progress.
I strongly agree with this, and I hope more people in mech. interp. become aware of this. I would actually emphasize that in my opinion it's not just that it's hard to do computational science on brains, but that we don't have the right framework. Some weak evidence for this is exactly that we have an intelligent system that has existed for a few years now where experiments and analyses are easy to do, and we can see how far we've gotten with the CNC approach.
My main point of confusion with this post has to do with Parameter Decomposition as a new paradigm. I haven't thought about this technique much, but on a first reading it doesn't sound all that different from what you call the second-wave paradigm, just with activations replaced by parameters. For instance, I think I could take most of the last few sections of this post and rewrite them to make the same point about activations. Just for fun I'll try this out here, trying to argue for a new paradigm called "Activation Decomposition". (Just to be super clear, I don't think this is a new paradigm!)
You wrote:
Parameter Decomposition makes some different foundational assumptions than used by the Second-Wave. One of these assumptions arises because Parameter Decomposition offers a clear definition of 'feature' that hitherto eluded Second-Wave Mech Interp. What Second-Wave Mech Interp calls a 'feature' can be defined as ‘properties of the input that activate particular mechanisms’. Notably, 'features' are here defined with reference to mechanisms, which is great, because 'mechanisms' has a specific formal definition!
This re-definition implies a difference in one of the foundational assumptions of Second-Wave Mech Interp that ‘features are the fundamental unit of neural network’. Parameter Decomposition rejects this idea and contends that ‘mechanisms are the fundamental unit of neural networks’.
and I'll rewrite that here, putting my changes in bold:
**Activation** Decomposition makes some different foundational assumptions than used by the Second-Wave. One of these assumptions arises because **Activation** Decomposition offers a clear definition of 'feature' that hitherto eluded Second-Wave Mech Interp. What Second-Wave Mech Interp calls a 'feature' can be defined as ‘properties of the input that activate particular mechanisms’. Notably, 'features' are here **[in Activation Decomposition]** defined with reference to mechanisms **[which are circuits of linearly decomposed activations]**, which is great, because 'mechanisms' has a specific formal definition!
This re-definition implies a difference in one of the foundational assumptions of Second-Wave Mech Interp that ‘features are the fundamental unit of neural network’. **Activation** Decomposition rejects this idea and contends that ‘mechanisms are the fundamental unit of neural networks’.
Perhaps a simpler way to put it: isn't the current paradigm largely about decomposing activations? If that's the case, why is decomposing parameters so fundamentally different?
I think one thing that might be going on here is that people have been quite sloppy with words like feature, representation, computation, circuit, etc. (though I think it's totally excusable, and arguably even a good idea, to be sloppy about these particular things given the current state of our understanding!). When someone writes "features are the fundamental unit of neural networks", I think they often mean something closer to "representations are the fundamental unit of neural networks", or maybe "SAE latents are the fundamental unit of neural networks", and, importantly, with an implicit "and representations are only really representations if they are mechanistically relevant." That is why you see interventions of various types in current-paradigm mech interp papers.
Some work in this early period was quite neuroscience-adjacent; as a result, despite being extremely mechanistic in flavour, some of this work may have been somewhat overlooked e.g. Sussillo and Barak (2013).
This is a nitpick, and I don't think any of your main points rests on this, but I think the main reason this work was not used in any type of artificial neural network interp work at that time was that it is fundamentally only applicable to recurrent systems, and probably impossible to apply to e.g. standard convolutional networks. It's not even straightforward to apply to a lot of the types of recurrent systems used in AI today (to the extent they are even used), but probably one could push on that a bit with some effort.
As a final question, I am wondering what you think the implications are for what people should be doing, depending on whether mech interp is or is not pre-paradigmatic. Is there a difference between mech interp being in a not-so-great paradigm vs. being pre-paradigmatic in terms of what your median researcher should be thinking/doing/spending time on? Or is this just an intellectually interesting thing to think about? I am guessing that when a lot of people say that mech interp is pre-paradigmatic they really mean something closer to "mech interp doesn't have a useful/good/perfect paradigm right now". But I'm also not sure if there's anything here beyond semantics.
After thinking about this a bit more, the main point I’d want to make is less about recursive self-improvement and more just that there’s a lot more capability in these models than people realize.
Whether that capability is enough for recursive self-improvement is another question, and I’m not certain about it either way, but I think it’s at least plausible that it is. I will note that humanity improves in its knowledge and capability without architectural change. That’s a rough analogy to the type of improvement I’m imagining.
I'm starting to think recursive self-improvement is basically already possible with LLMs, even without any more training ever. I'm pretty shocked by how much better my coding LLMs have become just from taking care to give them the right meta-context and information systems. I feel like I've moved from prompting, to figuring out what context is needed in addition to the prompt, to spending a bunch of time and effort building a knowledge structure so that the LLM can figure out its own context for whatever it needs to get done. That has moved me from having LLMs write functions and scripts to having them write large multi-file chunks of entire repositories. And I keep having the thought, "OK, but now I'm building this knowledge system that it can traverse to decide its own relevant context, so why can't the LLM do that too? What would I need to set up for it to do that?" I'm starting to feel like that's a never-ending thing.
Thanks for writing this! I have been thinking lately about many of the issues in your "Why Won't Interpretability Be Reliable" section, and I mostly agree that this is the state of affairs. I often think of this from the perspective of the field of neuroscience. My experience there (in the subsection of neuro research that I believe is the most analogous to mech interp) is that these are basically the same fundamental issues that keep the field from progressing (though not the only ones).
Many in the interpretability field seem to (implicitly) think that if you took neuroscience and made access to neural activity a lot easier, added the ability to intervene arbitrarily on the system, and made it easy to run a lot more experiments, then all of neuroscience would be solved. From that set of beliefs it follows that, because neural networks don't have these issues, mech interp will be able to more or less apply the current neuroscience approach to neural networks and "figure it all out." While these points about ease of experiments and access to internals are important differences between neuro. research and mech. interp., I do not think they get past the fundamental issues. In other words, mech. interp. has more to learn from neuroscience's failures than from its successes (public post/rant coming soon!).
Seeing this post from you makes me positively update about the ability of interp. to contribute to AI Safety - it's important we see clearly the power and weaknesses of our approaches. A big failure mode I worry about is being overconfident that our interp. methods are able to catch everything, and then making decisions based on that overconfidence. One thing to do about such a worry is to put serious effort into understanding the limits of our approaches. This of course does happen to some degree already (e.g. there's been a bunch of stress testing of SAEs from various places lately), which is great! I hope when decisions are made about safety/deployment/etc., that the lessons we've learned from those types of studies are internalized and brought to bear, alongside the positives about what our methods do let us know/monitor/control, and that serious effort continues to be made to understand what our approaches miss.
why I shouldn't waste my time chasing this.
Some reasons that come to mind very quickly:
- Patch clamp experiments usually take place in slices bathed in artificial cerebrospinal fluid (ACSF). The ephys properties can vary widely based on the experimental prep (the angle at which the slice was taken, the temperature, the specific recipe used for the ACSF, the quality of the patcher, etc.).
Under the assumption that capturing the ephys properties of single neurons is important for WBE, it still seems unlikely to me that scaling up patch clamping is a viable path to that. More likely to work would be trying to scale up voltage imaging.
(For the record, I don't personally agree with that assumption, for reasons that overlap with Steven Byrnes's.)
I think this really depends on what "good" means exactly. For instance, if humans think it's good but we overestimate how good our interp is, and the AI system knows this, then the AI system can take advantage of our "good" mech interp to scheme more deceptively.
I'm guessing your notion of good must explicitly mean that this scenario isn't possible. But this really raises the question: how could we know whether our mech interp has reached that level of goodness?
Thanks, this is helpful. I'm still a bit unclear about how to use the term "amortized inference" correctly. Is the first example you gave, of training an AI model on (query, well-thought-out guess), an example of amortized inference, relative to training on (query, a bunch of reasoning + well-thought-out guess)?
This all sounds very reasonable to me! Thanks for the response. I agree that we are likely quite aligned about a lot of these issues.