Work done @ SERI-MATS.
This is the first in a short series of short posts about interpretability. In this post, I'm collecting some thoughts on why model agnostic interpretability is a worthwhile pursuit. I'll assume that the reader is sympathetic to arguments for interpretability in general. If you're not, maybe Neel can help.
Model agnostic interpretability methods are those which treat the model in question as a black box. They don't require access to gradients or activations, and make no assumptions about the model's architecture. The model inside could be a support vector machine; a deep neural network; a reinforcement learning agent; a set of water-filled pipes; or a human in a box with a set of instructions: any system that produces some output in response to some input. This is in contrast to model specific interpretability methods, which either require access to the internal state of the model, or make assumptions about its architecture.
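To make that contrast concrete, here's a minimal Python sketch of the only contract a model agnostic method gets to assume: a callable from inputs to outputs. The names (`BlackBoxModel`, `wrap_sklearn_classifier`) are illustrative rather than from any particular library, and the adapter assumes a fitted scikit-learn classifier with `predict_proba`.

```python
from typing import Callable

import numpy as np

# The entire contract a model agnostic method relies on: a callable from
# inputs to outputs. It could wrap an SVM, a transformer, or a human in a
# box with instructions; the explainer can't tell the difference.
BlackBoxModel = Callable[[np.ndarray], np.ndarray]

def wrap_sklearn_classifier(clf) -> BlackBoxModel:
    """Illustrative adapter: expose a fitted scikit-learn classifier
    through the black-box interface (assumes `clf` has `predict_proba`)."""
    return lambda x: clf.predict_proba(x.reshape(1, -1))[0]
```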
Model agnostic interpretability methods are entirely perturbation-based, meaning that they amount to different ways of changing the input and looking at how the output changes (what else is there to do?). It turns out that there are many ways to do this, and I will refer you to other excellent overviews rather than reiterating them here.
Here's an example of perturbation-based saliency mapping, a model agnostic interpretability method. Parts of the input are iteratively perturbed, and the resulting changes in the logit for the class 'dog' are mapped to the location of those perturbations.
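As a concrete sketch (not the exact method behind the figure), occlusion-style perturbation saliency can be written in a few lines against nothing more than a `predict` callable that returns class scores; the patch size and baseline value here are arbitrary illustrative choices.

```python
import numpy as np

def occlusion_saliency(predict, image, target_class, patch=8, baseline=0.0):
    """Slide an occluding patch over the image and record how much the
    target class score drops, using only forward calls to the model."""
    h, w = image.shape[:2]
    reference = predict(image)[target_class]
    saliency = np.zeros((h, w))
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = baseline  # blank out one patch
            drop = reference - predict(occluded)[target_class]
            saliency[y:y + patch, x:x + patch] = drop      # big drop => important region
    return saliency
```

A large drop in the 'dog' score when a patch is occluded marks that patch as salient, and nothing in the loop depends on what sits behind `predict`.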
Some of these methods (like perturbation-based saliency mapping) work with any kind of data. You could perform the same kind of iterative perturbation upon time-series, or text, or tabular inputs, or RL environments in a pretty straightforward manner. Other methods (like feature visualisation) rely on a searchable input space, which makes them harder to apply to arbitrary input types (although I suspect not impossible – more on that in an upcoming post).
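To gesture at what a model agnostic version of that might look like, here's a hedged sketch of feature-visualisation-style search using random hill-climbing over forward calls only. The `predict` callable, the input shape, and the step schedule are all assumptions for illustration; standard feature visualisation instead uses gradient ascent on the input, which is exactly what makes it model specific.

```python
import numpy as np

def black_box_feature_search(predict, target_class, shape=(64, 64, 3),
                             n_steps=2000, step_size=0.05, seed=0):
    """Hill-climb in input space towards inputs the model scores highly for
    `target_class`, without ever touching gradients or internals."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=shape)       # random starting input
    best = predict(x)[target_class]
    for _ in range(n_steps):
        candidate = np.clip(x + step_size * rng.normal(size=shape), 0.0, 1.0)
        score = predict(candidate)[target_class]
        if score > best:                        # keep perturbations that help
            x, best = candidate, score
    return x, best
```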
Model agnostic methods have some nice properties:
- They can compare models with different architectures, with low engineering overhead.
- They can capture gestalt, global phenomena in model behaviour, in a way that local, circuits-style interpretability cannot.
- Most importantly, they are robust to paradigm shifts in model architecture.
This last property is the one I'm most interested in. What if the looming AGI that keeps us up at night is not GPT-X, but some other architecture that our current interpretability methods won't transfer to? What if all the excellent people doing excellent interpretability work right now are building and learning things that will turn out to be irrelevant? Is this a legitimate concern?
Some questions:
1) How difficult is it to adapt model specific interpretability methods to arbitrary novel architectures? I plan to spend some research time on this in the near future. If it's quite difficult, then working on model agnostic methods is important. My intuition is that adapting existing model specific interpretability methods is probably non-trivial, and that's assuming the novel architecture is similar in kind, i.e. a feed-forward neural network trained using gradient descent.
2) How likely are we to see a paradigm shift in model architectures (that leads to AGI) large enough to break existing interpretability methods? (And, how long will we have before such a shift results in dangerous AGI? Will we have time to develop model specific interpretability methods for the new paradigm?) If this is likely (or, given the stakes, just possible), then working on model agnostic methods is important. I'm quite uncertain about this, and I expect opinions to differ widely, probably strongly correlated with timelines.
It seems to me that there is a moderately strong case for allocating resources to this kind of work if the answer to question one is 'non-trivial' and the answer to question two is at least 'somewhat likely'. I think these are reasonable answers (and no-one else seems to be doing model agnostic interpretability research), so here I am.
Thanks Jessica!
I like 1) and think this is worth doing. I believe that Mechanistic Interpretability researchers are already somewhat concerned about insights not generalising from toy models to larger models, let alone to novel architectures, so work on model agnostic methods could be useful within the current paradigm too.
Something to note: I'm not confident about the track record of model agnostic methods (such as saliency maps). I've heard from at least one ML researcher that saliency maps have a poor track record and have been shown to be unreliable in a variety of experiments. Do you know of any other examples of model agnostic interpretability methods which you think might be very useful? Maybe saliency maps don't matter as much as the idea of model agnostic methods, in which case feel free to disregard this. I've also heard of interest in generally approaching models as black boxes ("ML psychologist") while we try to understand them, so I don't think the value of this approach lies too heavily in specific prior methods.
With respect to 2), while I think this is reasonable, I believe the salient point is whether models from the current paradigm become dangerous soon enough to warrant more (or less) focus. Theoretically, the space of possible ML architecture paradigms producing doom is large, and the order in which they manifest is roughly the order in which we should solve them (i.e. align current systems, then the next paradigm's systems, then the one after that, each buying time).
However, I think there are good enough reasons to work on model agnostic methods that don't rely on AGI doom originating in a new paradigm.
Overall, very exciting! Good luck!
Hi Joseph! I'll briefly address the saliency map concern here – it likely originates from this paper, which showed that some types of saliency mapping methods had no more explanatory power than edge detectors. It's a great paper, and worth a read. The key thing to note is that this was only true of some gradient-based saliency mapping methods, which are, of course, model-specific. Gradients can be deceptive! Model agnostic, perturbation-based saliency mapping doesn't suffer from the same kind of problems – see p.12 here.