I want to draw a distinction between two things:
I'm optimistic that we'll get more than zero out of ersatz interpretability, even as we scale to superintelligent AGI. As you point out, in model-based RL, just looking at the value function gives some nice information about how the AI is assessing its own prospects. That's trivial, and useful. I seem to recall that in the movie about AlphaGo, the DeepMind engineers were watching AlphaGo's value function to see whether it expected to win or lose.
Or if we want a strong but imperfect hint that our fancy AI is thinking about blueberries, we can just train a classifier to look at the whole array of network activations, where the supervisory signal is a simple ConvNet blueberry classifier with the same field-of-view as the fancy AI. It won't be perfect—maybe the AI is imagining blueberries without looking at them or whatever. But it won't be totally useless either. In fact I'm an advocate for literally having one such classifier for every word in the dictionary, at minimum. (See here, here (search for the words "more dakka"), and more discussion in a forthcoming post.)
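For concreteness, here's a rough PyTorch sketch of that kind of probe (everything here is a hypothetical placeholder: `fancy_ai`, `blueberry_convnet`, and the `get_all_activations` hook are made up, and the training loop is schematic rather than a real implementation):

```python
import torch
import torch.nn as nn

class ActivationProbe(nn.Module):
    """Linear probe mapping the fancy AI's concatenated activations to a
    'blueberry present?' logit."""
    def __init__(self, activation_dim: int):
        super().__init__()
        self.linear = nn.Linear(activation_dim, 1)

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        return self.linear(activations)

def probe_training_step(fancy_ai, blueberry_convnet, probe, optimizer, images):
    # Supervisory signal: a simple ConvNet blueberry classifier with the
    # same field of view as the fancy AI.
    with torch.no_grad():
        labels = (blueberry_convnet(images) > 0).float()    # (batch, 1) pseudo-labels
        acts = fancy_ai.get_all_activations(images)         # hypothetical hook: (batch, activation_dim)
    loss = nn.functional.binary_cross_entropy_with_logits(probe(acts), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

You'd then train one such probe per dictionary word, each supervised by its own cheap reference classifier.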
At the same time, I'm somewhat pessimistic about "real interpretability", or at least I'm concerned that it will hit a wall at some point. There are concepts in Ed Witten's head when he thinks about string theory that I'm just not going to be able to really understand, beyond "this concept has something vaguely to do with this other concept which has something vaguely to do with axions" or whatever.
I think real and ersatz interpretability represent different points on a spectrum, representing different levels of completeness. Each model has a huge collection of factors that decide its behavior. Better explanations abstract away more of those factors in a way humans can understand and use to accurately predict model behavior. Worse explanations cover fewer factors and are less able to reliably predict model behavior.
I’m relatively optimistic about how far we can get with real interpretability. Much of that comes from thinking that we can get pretty far with approaches we currently consider extreme. E.g., I think we can do something like knowledge distillation from AIs to humans by feeding AI internal activations to human brains through channels with wider bandwidth than visual senses. I.e., either through the peripheral nervous system or (more riskily) directly via brain-machine interface. So if you have an unknowable concept in an AI, you can target the knowledge distillation process at the concept and learn appropriate intuitions for representing and integrating the concept directly from the AI’s own representations.
I intend to further explore ideas in this space in a future post. Probably, I’ll title it “The case for optimism about radical interpretability”.
Am I correct in thinking that 'ersatz' and 'real' interpretability might differ in more aspects than just degree of interpretability? Ersatz interpretability is somewhat limited to explaining the typical case, whereas 'real' interpretability gives good reasoning even in the worst case. Interpretability might be hard to achieve in worst-case scenarios where some atypical wiring leads to wrong decisions?
Furthermore, I suspect I'm confusing transparency with interpretability. Even if we understand what each and every neuron does (radical transparency), the model might not be interpretable if what we see looks like gibberish.
If this seems correct so far, I elaborate on these points here.
Newbie question: What's the difference between transparency and interpretability? Follow-up question: Does everyone agree with that answer or is it not standardized?
They're likely to be interchangeable, sorry. Here I might've misused the words to try to tease out the difference: simply understanding how a given model works is not really insightful if the patterns themselves aren't understandable.
I think these patterns that seem nonsensical to humans might be a significant fraction of the patterns learned by deep networks. I was trying to understand the radical optimism, in contrast to my pessimism given this. The crux is that since we don't know what these patterns are and what they represent, then even if we figure out which neurons detect them and which tasks they contribute most to, we might not be able to do the downstream tasks we require transparency for, like diagnosing possible issues and providing solutions.
I think you're pointing to a special case of a more general pattern.
Just like there's a general factor of "athletic ability" which can be subdivided into many correlated components, I think "interpretability" can be split into many correlated components. Some of those components correspond to greater reliability in the worst case scenarios. Others might correspond to, e.g., predicting which architectural modifications would improve performance on typical input.
Worst case reliability is probably the most important component to interpretability, but I'm not sure it makes sense to call it "real interpretability". It's probably better to just call it "worst case" interpretability.
I like your reservations (which I think are well-written, sensible, and comprehensive) more than your optimism. The wireheading example is a good illustration of a place where we can find something coarse-grained in the brain that almost matches a natural human concept, but not quite, and the caveats there would be disastrous to ignore if trying to put load-bearing trust in interpreting superhuman AI.
The point of the wireheading example is that, in order for investigators in the 1960s to succeed that much, the brain must be MUCH more interpretable than current artificial neural networks. We should be able to make networks that are even more interpretable than the brain, and we should have much better interpretability techniques than neuroscientists from the 1960s.
My argument isn't "we can effectively adapt neuroscience interpretability to current ML systems". It's "The brain's high level interpretability suggests we can greatly improve the interpretability of current ML systems".
Sure. In fact, if you just want to make ML models as interpretable as humans, almost all of what I labeled "reservations" are in fact reasons for optimism. My take is colored by thinking about using this for AI alignment, which is a pretty hard problem. (To restate: you're right, and the rest of this comment is about why I'm still pessimistic about a related thing.)
It's kinda like realizability vs. non-realizability in ML. If you have 10 humans and 1 is aligned with your values and 9 are bad in normal human ways, you can probably use human-interpretability tools to figure out who is who (give them the Voight-Kampff test). If you have 10 humans and they're all unaligned, the feedback from interpretability tools will only sometimes help you modify their brains so that they're aligned with you, even if we imagine a neurosurgeon who could move individual neurons around. And if you try, I expect you'd end up with humans that are all bad but in weird ways.
What this means for AI alignment depends on the story you imagine us acting out. In my mainline story, we're not going to be in a situation where we need to build AIs that have some moderate chance of turning out to be aligned (and then need to interpret their motivations to tell). We're either going to be building AIs with a high chance of alignment or a very low chance, and we don't get much of a mulligan.
When interpreting human brains, we get plenty of excellent feedback. Calibrating a lie detector might be as easy as telling a few truths and a few lies while in an fMRI.
To be able to use similar approaches for interpreting AIs, it might be necessary to somehow get similar levels of feedback from the AIs. I notice I don't have the slightest idea whether feedback from an AI would be a few orders of magnitude harder to get than human feedback, a few orders of magnitude easier, or about the same.
Can we instruct GPT-3 to "consciously lie" to us?
My guess is the initial steps of getting the models to lie reliably are orders of magnitude harder than just asking a human to lie to you. Once you know how to prompt models into lying, it's orders of magnitude easier to generate lots of lie data from the models.
I think many people here are already familiar with the circuits line of research at OpenAI, though I think it's now mostly been abandoned.
I wasn’t aware that the circuits approach was abandoned. Do you know why they abandoned it?
I can only speculate, but the main researchers are now working on other things, e.g. at Anthropic. As to why they switched, I don't know. Maybe they weren't making progress fast enough, or Anthropic's mission seemed more important?
However, at least Chris Olah believes this is still a tractable and important direction, see the recent RFP by him for Open Phil.
I spent a few months in late 2021/early 2022 learning about various alignment research directions and trying to evaluate them. Quintin's thoughtful comparison between interpretability and 1960s neuroscience in this post convinced me of the strong potential for interpretability research more than I think anything else I encountered at that time.
I don't think the analogy to biological brains is quite as strong. For example, biological brains need to be "robust" not only to variations in the input, but also in a literal sense, to forceful impact or to parasites trying to control it. It intentionally has very bad suppressability, and this means there needs to be a lot of redundancy, which makes "just stick an electrode in that area" work. More generally, it is under many constraints that a ML system isn't, probably too many for us to think of, and it generally prioritizes safety over performance. Both lead away from the sort of maximally efficient compression that makes ML systems hard to interpret.
Analogously: Imagine a programmer would write the shortest program that does a given task. That would be terrible. It would be impossible to change anything without essentially redesigning everything, and trying to understand what it does just from reading the code would be very hard, and giving a compressed explanation of how it does that would be impossible. In practice, we don't write code like that, because we face constraints like those mentioned above - but its very easy to imagine that some optimization-based "automatic coder" would program like that. Indeed, on the occasion that we need to really optimize runtimes, we move in that direction ourselves.
So I don't think brains tell us much about the interpretability of the standard, highly optimized neural nets.
Firstly, thank you for your comment. I'm always glad to have good faith engagement on this topic.
However, I think you're assuming the worst case scenario in regards to interpretability.
biological brains need to be "robust" not only to variations in the input, but also in a literal sense, to forceful impact
Artificial networks are often trained with randomness applied to their internal states (dropout, gradient noise, etc). These seem like they'd cause more internal disruption (and are FAR more common) than occasional impacts.
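To make that concrete, here's a minimal sketch of the kind of internal randomness I mean (purely illustrative; the architecture, dropout rate, and noise scale are arbitrary):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes half the hidden activations on every training step
    nn.Linear(256, 10),
)
opt = torch.optim.SGD(net.parameters(), lr=0.1)

def noisy_training_step(x, y, noise_std=0.01):
    loss = nn.functional.cross_entropy(net(x), y)
    opt.zero_grad()
    loss.backward()
    # Gradient noise: perturb every gradient before the parameter update.
    for p in net.parameters():
        p.grad += noise_std * torch.randn_like(p.grad)
    opt.step()
    return loss.item()
```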
or to parasites trying to control it
Evolved resistance to parasite control seems like it should decrease interpretability, if anything. E.g., having multiple reward centers that are easily activated is a terrible idea for resisting parasite control. And yet, the brain does it anyways.
It intentionally has very bad suppressability
No it doesn't. One of my own examples of brain interpretability was:
A recent paper was able to reversibly disable conditioned fear responses from mice.
Which is a type of suppressability. Various forms of amnesia, aphasia and other perceptual blocks are other examples of suppressability in one form or another.
it generally prioritizes safety over performance
So more reliable ML systems will be more interpretable? Seems like a reason for optimism.
Imagine a programmer would write the shortest program that does a given task. That would be terrible. It would be impossible to change anything without essentially redesigning everything
Yes, it would be terrible. I think you've identified a major reason why larger neural nets learn more quickly than smaller nets. The entire point of training neural nets is that you constantly change each part of the net to better process the data. The internal representations of different circuits have to be flexible, robust and mutually interpretable to other circuits. Otherwise, the circuits won't be able to co-develop quickly.
One part of Knowledge Neurons in Pretrained Transformers I didn't include (but probably should have) is the fact that transformers re-use their input embedding in their internal knowledge representations. I.e., the feed forward layers that push the network to output target tokens represent their target token using the input embeddings of the target tokens in question. This would be very surprising if you thought that the network was just trying to represent its internal states using the shortest possible encoding. However, it's very much what you'd expect if you thought the internal representations needed to be easily legible across many different parts of the network.
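Here's a rough sketch of how one might poke at this on a small public model like GPT-2 (my own illustration, not code from the paper; the layer index and the cosine-similarity comparison are arbitrary choices):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")

emb = model.transformer.wte.weight.detach()                  # (vocab, d_model) input token embeddings
# Each row of c_proj.weight is the "value vector" one feed-forward neuron writes into the residual stream.
values = model.transformer.h[6].mlp.c_proj.weight.detach()   # (d_ff, d_model)

v = torch.nn.functional.normalize(values, dim=-1)
e = torch.nn.functional.normalize(emb, dim=-1)
sims = v @ e.T                                               # (d_ff, vocab) cosine similarities
best_sim, best_tok = sims.max(dim=-1)

# Show the feed-forward neurons whose value vectors most resemble some input embedding.
for idx in torch.topk(best_sim, 5).indices:
    print(f"neuron {idx.item()}: closest token {tok.decode([best_tok[idx].item()])!r} "
          f"(cosine sim {best_sim[idx].item():.2f})")
```

If internal representations were just arbitrary compressed codes, you'd expect these similarities to look like noise.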
I'll present a more extensive exploration of the "inner interpretability problem" in a future post arguing that there are deep theoretical reasons why easy to optimize systems also tend to be interpretable, and that highly optimized neural nets will learn to make themselves more interpretable.
Summary:
Definitional note: for this essay, "interpretable" serves as a catch-all for things like "we can abstractly understand the algorithms a system implements" and "we can modify internal states to usefully influence behavior in a predictable manner". Systems are more interpretable when it's easier to do things like that to them.
1: Strong, interpretable systems are possible
Imagine an alien gave you a collection of AI agents. The agents vary greatly in capability, from about as smart as a rat to human level (but not beyond). You don't know any details about the AI's internal algorithms, only that they use something sort of like neural nets. The alien follows completely inhuman design principles, and has put no effort whatsoever into making the AIs interpretable.
You are part of a larger community studying the alien AIs, and you have a few decades to tackle the problem. However, you operate under several additional constraints:
Please take a moment to predict how far your interpretability research would progress.
Would you predict any of the following?
If you base your predictions on how little progress we've made in ML interpretability, these advances seem very unlikely, especially given the constraints described.
And yet, in 1953, James Olds and Peter Milner discovered that running a current through certain regions in rats' brains was rewarding and influenced their behavior. In 1963, Robert Heath showed similar results in humans. Brain reward stimulation works in every vertebrate so far tested. Since the 1960s, we've had more advances in brain interpretability:
The lie detection research is particularly interesting to me, given its relation to detecting deception/misalignment in ML systems. I know that polygraphs are useless, but also that lots of studies claim to show otherwise. I'm not very familiar with lie detection based on richer brain activity data such as fMRI, EEG, direct brain recordings, etc. If anyone is more familiar with this field and able to help, I'd be grateful, especially if they can point out some issue in the research I've linked above that reveals it to be much less impressive than it looks.
2: Why does current interpretability lag?
These brain interpretability results seem much more impressive to me than I'd have expected given how hard ML interpretability has been and the many additional challenges involved with studying brains. I can think of four explanations for such progress:
Evolution made no effort to ensure brains are interpretable. And yet, we've made startling progress in interpreting them. This suggests mind design space contains high-performance, very interpretable designs that are easily found. We should be able to do even better than evolution by explicitly aiming for interpretable systems.
3: Interpretability for current ML systems
There's a widely held preconception that deep learning systems are inherently uninterpretable. I think this belief is harmful because it provides people with an excuse for not doing the necessary interpretability legwork for new models and discourages people from investigating existing models.
I also think this belief is false (or at least exaggerated). It's important not to overstate the interpretability issues with current state of the art models. Even though the field has put roughly zero effort into making such models interpretable, we've still had some impressive progress in many areas.
4: Measuring/improving interpretability for current systems
Knowledge Neurons in Pretrained Transformers is particularly interesting because they present something like a metric for interpretability. Figure 3 shows the relative change in probability mass assigned to correct answers after using interpretability tools to suppress the associated knowledge (similarly, Figure 4 shows the effects of amplifying knowledge).
We could use the average of these suppression ratios as a measure for interpretability (more precisely, amenability to current interpretability techniques). Then, we could repeatedly retrain the language model with different architectures, hyperparameters, regularizers, etc., and test their effects on the model's "suppressability".
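Something like the following sketch, where `suppressed_model` stands in for whatever attribution-plus-suppression procedure gets applied, and both models are assumed to have a HuggingFace-style interface that returns `.logits`:

```python
import torch

@torch.no_grad()
def average_suppression_ratio(model, suppressed_model, prompts, answer_ids):
    """Average relative drop in probability assigned to the correct answer
    token after suppressing the associated knowledge neurons.

    prompts: list of tokenized input-id tensors; answer_ids: list of ints.
    """
    ratios = []
    for input_ids, answer_id in zip(prompts, answer_ids):
        p_before = torch.softmax(model(input_ids).logits[0, -1], dim=-1)[answer_id]
        p_after = torch.softmax(suppressed_model(input_ids).logits[0, -1], dim=-1)[answer_id]
        ratios.append((p_before - p_after) / p_before)
    # Higher values mean the model is more amenable to the suppression technique.
    return torch.stack(ratios).mean()
```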
We can also test interventions aimed explicitly at improving interpretability. For example, we could compute the correlations between neuron activations, then include a term in the training loss that encourages sparse correlations. Ideally, this prompts the model to form small, sparsely connected internal circuitry. However, it could also prompt the neurons to have more complex, second-order relationships that the correlation term doesn't penalize.
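A minimal sketch of such a correlation penalty (my own illustration; which layer's activations to penalize and how heavily to weight the term are left open):

```python
import torch

def correlation_sparsity_penalty(activations: torch.Tensor) -> torch.Tensor:
    """activations: (batch, n_neurons). Penalizes the magnitude of the
    off-diagonal pairwise correlations between neuron activations."""
    a = activations - activations.mean(dim=0, keepdim=True)
    a = a / (a.std(dim=0, keepdim=True) + 1e-6)
    corr = (a.T @ a) / a.shape[0]                          # (n_neurons, n_neurons) correlation matrix
    off_diag = corr - torch.diag(torch.diagonal(corr))     # zero out the diagonal
    return off_diag.abs().mean()

# total_loss = task_loss + lambda_corr * correlation_sparsity_penalty(hidden_activations)
```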
Furthermore, suppressability is (as far as I can tell) a differentiable metric. It uses model gradients to attribute predictions to particular neurons, then suppresses the neurons with the highest attribution for a given prediction. We can use meta-gradients to get the gradients of the attributions, then try to maximize the difference between the suppressed and normal model outputs as our suppressability training objective.
This will hopefully encourage the model to put maximum attribution on the neurons most responsible for particular pieces of knowledge, making the network more interpretable. In an ideal world, the network will also make more of its internal computation legible to the suppression technique because doing so allows the suppression technique to have a greater impact on model outputs. Of course, explicitly training for this interpretability metric may cause the network to Goodhart the metric. Ideally, we'd have additional, independent interpretability metrics to compare against.
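Here's a toy sketch of that training objective, using a small stand-in network so the hidden layer is easy to expose. Everything here is illustrative: grad-times-activation attribution and a soft sigmoid mask stand in for the gradient-based attribution and hard suppression described above.

```python
import torch
import torch.nn as nn

class ToyNet(nn.Module):
    """Tiny stand-in for a real model, with an exposed hidden layer that we
    can attribute to and suppress."""
    def __init__(self, d_in=32, d_hidden=64, d_out=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x, neuron_mask=None):
        h = torch.relu(self.fc1(x))
        if neuron_mask is not None:
            h = h * neuron_mask          # suppress neurons by down-weighting them
        return self.fc2(h)

def suppressability_loss(net, x, target_class, temperature=20.0):
    # Normal forward pass, keeping the hidden activations in the graph.
    h = torch.relu(net.fc1(x))
    p_normal = torch.softmax(net.fc2(h), dim=-1)[:, target_class].mean()

    # Attribution: gradient of the target probability w.r.t. the hidden
    # activations, times the activations. create_graph=True lets the final
    # loss take meta-gradients through the attribution itself.
    grad_h = torch.autograd.grad(p_normal, h, create_graph=True)[0]
    attribution = (grad_h * h).mean(dim=0)                 # (d_hidden,)

    # Soft version of "suppress the highest-attribution neurons".
    mask = 1.0 - torch.sigmoid(temperature * (attribution - attribution.mean()))
    p_suppressed = torch.softmax(net(x, neuron_mask=mask), dim=-1)[:, target_class].mean()

    # Minimizing this (alongside the task loss) rewards the network for
    # concentrating each prediction in a few highly attributable neurons.
    return p_suppressed - p_normal
```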
Running interpretability experiments would also allow us to explore possible tradeoffs between interpretability and performance. I think we'll find there's a lot of room to improve interpretability with minimal performance loss. Take dropout for example: "The Implicit and Explicit Regularization Effects of Dropout" disentangles the different regularization benefits dropout provides and shows we can recover dropout's contributions by adding a regularization term to the loss and noise to the gradient updates.
We currently put very little effort into making state of the art systems interpretable. It's often possible to make large strides in under-researched areas. The surprising ease of brain interpretability suggests there are accessible ML approaches with high performance which are more interpretable than the current state of the art. Since we actually want interpretable systems, we should be able to do better still. Surely we can beat evolution at a goal it wasn't even trying to achieve?