Re: the gorilla example, it seems worth noting that the solution that was actually deployed was to refuse to classify anything as a gorilla, at least as of 2018 (perhaps things have changed since then).
I guess this proves the superiority of the mechanistic interpretability technique "note that it is mechanistically possible for your model to say that things are gorillas" :P
Fair enough if you're interested in just talking about 'approaches to acquiring information wrt. AIs' and you'd like to call this interpretability.
There are not that many that I don't think are fungible with interpretability work :)
But I would describe most outer alignment work to be sufficiently different...
Outside of the interpretability research space, do you know of other interesting examples of different techniques being graded on different curves?
Electric vehicles? Early electric vehicles were worse than gas cars on all axes other than the theoretical promise of the technology. However, they were (and still are, i.e., Formula E) graded on separate curves. The fairly straightforward analogy I'm trying to make is that maybe it's worthwhile to treat early technologies gently, since I think most people are now pretty impressed by electric cars.
Although there are obviously significant differences here (consumer market vs. helping engineers, etc.), I think this could be a useful metaphor on which to try out the arguments in this sequence and judge their reasonableness.
Part 2 of 12 in the Engineer’s Interpretability Sequence.
A parable based on a true story
Remember Google’s infamous blunder from 2015 in which users found that one of its vision APIs often misclassified black people as gorillas? Consider a parable of two researchers who want to understand and tackle this issue.
The goal of this parable is to illustrate that when it comes to doing useful engineering work with models, a mechanistic understanding may not always be the best way to go. We shouldn’t think of something called “interpretability” as being fundamentally separate from other tools that can help us accomplish our goals with models. And we especially shouldn’t automatically privilege some methods over others. In some cases, highly involved and complex approaches may be necessary. But in other cases like Alice’s, the interesting, smart, and paper-able solution to the problem might not only be harder but could also be more failure-prone. This isn’t to say that Alice’s work could never lead to more useful insights down the road. But in this particular case, Alice’s smart approach was not as good as Bob’s simple one.
Interpretability is a means to an end.
Since I work on and think about interpretability every day, I have felt compelled to adopt a definition for it. In a previous draft of this post, I proposed defining an interpretability tool as “any method by which something novel about a system can be better predicted or described.” And I think this is ok, but I have recently stopped caring about any particular definition. Instead, I think the important thing to understand is that “interpretability” is not a term of any fundamental importance to an engineer.
The key idea behind this post is that whatever we call “interpretability” tools are entirely fungible with other techniques related to describing, evaluating, debugging, etc.
Does this mean that it’s the same thing as interpretability if we just calculate performance on a test set, train an adversarial example, do some model pruning, or make a prediction based on the dataset? Pretty much. For all practical intents and purposes, these things are all of a certain common type. Consider any of the following sentences.
All of these things are perfectly valid insights to use if they help us learn something we want to learn about a model or do something we want to do with it.
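To make the simplest of these concrete, here is a minimal sketch of "calculating performance on a test set" being used as a tool for learning something about a model. It is only an illustration under stated assumptions: the function name `per_class_error_rates`, the class names, and the toy labels and predictions are all hypothetical and are not taken from any real system or from the parable above.

```python
from collections import Counter


def per_class_error_rates(labels, predictions):
    """Return {class: fraction of that class's test examples that were misclassified}."""
    totals = Counter(labels)
    # Count errors by the true class of each misclassified example.
    errors = Counter(y for y, p in zip(labels, predictions) if y != p)
    return {c: errors[c] / totals[c] for c in totals}


# Hypothetical toy test set: true labels and a model's predictions.
labels = ["cat", "cat", "dog", "dog", "dog", "bird"]
predictions = ["cat", "dog", "dog", "dog", "cat", "bird"]

rates = per_class_error_rates(labels, predictions)
# The class with the highest error rate is a natural place to start debugging.
worst = max(rates, key=rates.get)
print(rates)
print("worst class:", worst)
```

Nothing here looks inside the model, yet the output still tells an engineer where the model fails most often and where to look next, which is the sense in which such a check is the same type of thing as an interpretability tool.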
If this seems like pushing for a hastily broad understanding, consider the alternative. Suppose that we think of interpretability as distinct from other tools – perhaps because we care a lot about mechanistic understandings. Then the concept becomes a fairly arbitrary and limiting term with respect to our goals for it. Krishnan (2020) argues against such a definition:
From an engineer’s perspective, it’s important not to grade different classes of solutions each on different curves. Any practical approach to interpretability must focus on eventually producing actionable insights that help us better design, develop, or deploy models. Anything that helps with this is fair game.
Mechanistic approaches to interpretability are not uniquely important for AI safety.
One objection to adopting a broad notion of interpretability might be that mechanistic notions of it seem uniquely useful for AI safety and hence worthy of unique attention. Mechanistic interpretability seems well-equipped for detecting ways that AI systems might secretly be waiting to betray us. To the extent that this is a concern (and it probably is a big one), wouldn’t that be a good reason to think of mechanistic interpretability separately from other approaches to engineering better models?
At this point, a debate over definitions is of fairly little importance. As long as we are clear and non-myopic, that’s all fine. But there are still two key things to emphasize.
First, there are many ways for AI to cause immense harm that don’t involve deceptive alignment. Recall that deceptive alignment failures are a subset of inner alignment failures, which are a subset of alignment failures, which are in turn a subset of safety failures. When it comes to issues that don’t stem from deception, we definitely should not restrict ourselves to mechanistic interpretability work.
Second, mechanistic interpretability is not uniquely useful for deceptive alignment. Better understanding how to address it will involve some further unpacking of the term “deception.” But for that discussion, please wait for EIS VIII :)
Questions