Work done @ SERI-MATS.
Evaluating interpretability methods (and so, developing good ones) is really hard because we have no ground truth. Or at least, no ground truth that we can compare our interpretations directly against.
The ground truth of a model's behaviour is provided by that model's architecture and its learned parameters. But puny humans are unable to interpret this: it's precise, in that it accurately explains the model's behaviour, but it's not interpretable. On the other end of the spectrum we have something like "This model classifies cats" – a statement that is really easy to interpret, but lacks something in the way of precision.
Precise <---------------------------------> Interpretable
^ Useful?
Imagine two interpretations, each generated by a different method with respect to the same model (say, a cat classifier). Method A indicates that the model has learned to use ears and whiskers to identify cats. Method B indicates that it uses eyes and tails. Assuming both are easy to interpret, can we tell which method is more precise? Which more faithfully represents what the model is truly doing?
If we had a method that reconciled precision and interpretability, how would we know?
Well, we can perform sanity checks on the interpretability methods, and throw away any that fail them. This seems good – it's at least objective – but it only really allows us to throw away obviously bad approaches. It doesn't say anything about what to do when sane interpretability methods disagree.
We could also look at the interpretations and see if they appear sensible to us. This is a widely used approach (Zeiler et al., Petsiuk et al., Fong et al., many, many more), and I think it's a terrible idea.
Example:
- We've made some new interpretability method that is supposed to help us understand which words a language model uses to identify hate speech in tweets. To see if it works properly, we compare the words highlighted by this interpretability method to the actual hateful words in the tweet (this check is sketched in code after the example). It gets them right! Our new interpretability method works!
- NO! We have fallen prey to a terrible assumption: that if a model performs well, it has learned to use the same features that a human would use. How we would perform a task is not the ground truth. How the model actually performs a task is, but we don't know that – it's what we're trying to find out!
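For concreteness, here's roughly what that evaluation pattern looks like as code. This is just a sketch: the scores, the annotated token indices, and the `overlap_score` helper are hypothetical stand-ins, not any real paper's pipeline.

```python
# A minimal sketch of the evaluation pattern above: score an interpretability
# method by how well its highlighted tokens overlap with the tokens a human
# annotator marked as hateful. All numbers and names here are made-up stand-ins.

def overlap_score(attribution_scores, human_token_idxs, k=2):
    """Fraction of the top-k attributed tokens that the human annotator also picked."""
    top_k = sorted(range(len(attribution_scores)),
                   key=lambda i: attribution_scores[i], reverse=True)[:k]
    return len(set(top_k) & set(human_token_idxs)) / k

# Suppose a human flagged tokens 2 and 5, and our method scores them highest:
scores = [0.05, 0.10, 0.90, 0.02, 0.01, 0.80, 0.03]
print(overlap_score(scores, human_token_idxs={2, 5}))  # 1.0 – "it works!"
```

A perfect score here only measures agreement with the human annotator – which is exactly the assumption being smuggled in.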
Example:
- We use gradient descent to optimise the input to a model such that it maximises the activation of a particular node, layer, or logit. This works, but results in a really noisy input that doesn't make sense to us – it seems like adversarial noise. So, we regularise the input, perturb it intermittently during optimisation, constrain it to the training data distribution, and voila – we have a nice optimised input that makes sense! We found a fur detector!
- NOOOOOO! We've optimised the input to maximise some output, sure. But we've also optimised it to maximise how much we like it. That's not what we wanted! That has nothing to do with what the model has actually learned, and how sensible an interpretation seems to us has no relation to the ground truth. (The optimisation loop itself is sketched below.)
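For reference, here's a rough sketch of that optimisation loop – not a faithful reproduction of any particular feature-visualisation method. `model` is assumed to be some differentiable image classifier and `target_logit` the class to maximise; both are placeholders.

```python
import torch

# A rough sketch of the procedure described above. The random jitter and the
# total-variation penalty stand in for the "make it look nice to humans"
# regularisers; everything else is ordinary gradient ascent on the input.

def optimise_input(model, target_logit, steps=256, lr=0.05, tv_weight=1e-3):
    x = torch.randn(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Perturb intermittently: a small random spatial roll each step.
        dx, dy = (int(torch.randint(-4, 5, (1,))) for _ in range(2))
        shifted = torch.roll(x, shifts=(dx, dy), dims=(2, 3))
        logit = model(shifted)[0, target_logit]
        # Regularise: total variation penalises high-frequency "adversarial" noise.
        tv = (x[..., 1:, :] - x[..., :-1, :]).abs().mean() \
           + (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
        # Maximise the logit, minimise the noise we don't like looking at.
        loss = -logit + tv_weight * tv
        loss.backward()
        opt.step()
    return x.detach()
```

Every term in `loss` beyond `-logit` is optimisation pressure toward our aesthetic preferences rather than toward anything the model has learned – which is the point being made above.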
I'm being a bit dramatic. These kinds of approaches can be useful, and god knows I love a good feature visualisation, same as anyone. But I'm worried about using stuff like this to determine how good our interpretability methods are. It's not an objective evaluation.
A small idea: what if we did have access to the ground truth? If we had a small, simple model that we completely understood (I'm looking at you, mechanistic interpretability people), we could use it as a truly objective benchmark for other interpretability methods. (This is super easy for model-agnostic saliency mapping – just use summation in place of the model, and then the ground truth saliency of each input element is exactly that element itself. If the saliency map your method produces isn't exactly the same as the input, it's not working perfectly – and moreover, you can see exactly where it's failing.)
There we go!
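Here's a minimal sketch of that summation benchmark, using gradient × input purely as an example attribution method – any saliency method could be dropped in instead. Since the model just sums its inputs, each element's true contribution to the output is that element's own value, so a perfect attribution equals the input exactly.

```python
import torch

# A minimal sketch of the summation benchmark. For a model that just sums its
# inputs, the ground-truth attribution of each element is the element itself,
# so the per-element error below should be exactly zero for a perfect method.

def sum_model(x):
    return x.sum()

def grad_times_input(model, x):
    # Example attribution method; swap in whatever method you want to test.
    x = x.clone().requires_grad_(True)
    model(x).backward()
    return (x.grad * x).detach()

x = torch.tensor([0.3, -1.2, 2.0, 0.0])
attribution = grad_times_input(sum_model, x)
print(attribution)               # should equal x exactly
print((attribution - x).abs())   # per-element error: shows exactly where it fails
```

A perturbation-based, model-agnostic method gets the same treatment: run it against `sum_model` and compare its output element by element with the input.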
So, one item on my list of posts to maybe get around to writing at some point is about what's missing from current work on interpretability, what bottlenecks would need to be addressed to get the kind of interpretability we ideally want for application to alignment, and how True Names in general and natural abstraction specifically fit into the picture.
The OP got about half the picture: current methods mostly don't have a good ground truth. People use toy environments to work around that, but then we don't know how well tools will generalize to real-world structures which are certainly more complex and might even be differently complex.
The other half of the picture is: what would a good ground truth for interpretability even look like? And as you say, the answer involves a really good understanding of reality.
Unpacking a bit more: "interpret" is a two-part word. We see a bunch of floating-point numbers in a net, and we interpret them as an inner optimizer, or we interpret them as a representation of a car, or we interpret them as Fourier components of some signal, or .... Claim: the ground truth for an interpretability method is a True Name of whatever we're interpreting the floating-point numbers as. The ground truth for an interpretability method which looks for inner optimizers is, roughly speaking, a True Name of inner optimization. The ground truth for an interpretability method which looks for representations of cars is, roughly speaking, a True Name of cars (which presumably routes through some version of natural abstraction). The reason we have good ground truths for interpretability in various toy problems is because we already know the True Names of all the key things involved in those toy problems – like e.g. modular addition and Fourier components.
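To make that last point concrete, here's a rough sketch of what "already knowing the True Name" buys you in the modular addition case. The embedding matrix `W_E` below is a random placeholder standing in for trained weights; the check is whether their energy concentrates on a small number of Fourier frequencies.

```python
import math
import torch

# A rough sketch of the ground-truth check available in the modular-addition
# toy problem: the known structure is "a few Fourier components over the
# residues mod p", so we can project an embedding matrix onto that basis and
# measure how concentrated the energy is. W_E is a random placeholder here.

p, d_model = 113, 64
W_E = torch.randn(p, d_model)  # stand-in for a trained embedding matrix

# Cosine and sine waves over the residues 0..p-1, one row per frequency.
residues = torch.arange(p).float()
freqs = torch.arange(1, p // 2 + 1).float()
angles = 2 * math.pi * torch.outer(freqs, residues) / p
basis = torch.cat([torch.cos(angles), torch.sin(angles)], dim=0)
basis = basis / basis.norm(dim=1, keepdim=True)

# Energy per frequency: a grokked network concentrates nearly all of it in a
# handful of frequencies, whereas random weights spread it out evenly.
energy = (basis @ W_E).pow(2).sum(dim=1)
print((energy.topk(5).values.sum() / energy.sum()).item())
```

Without that known basis – without the True Name – there is nothing privileged to project onto, which is the situation for most of the things we'd like to interpret in real networks.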