A Pragmatic Vision for Interpretability
Executive Summary

* The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability:
  * Trying to directly solve problems on the critical path to AGI going well [[1]]
  * Carefully choosing problems according to our comparative advantage
  * Measuring progress with empirical feedback on proxy tasks
* We believe that, on the margin, more researchers who share our goals should take a pragmatic approach to interpretability, both in industry and academia, and we call on people to join us
  * Our proposed scope is broad and includes much non-mech interp work, but we see this as the natural approach for mech interp researchers to have impact
  * Specifically, we’ve found that the skills, tools and tastes of mech interp researchers transfer well to important and neglected problems outside “classic” mech interp
  * See our companion piece for more on which research areas and theories of change we think are promising
* Why pivot now? We think that times have changed.
  * Models are far more capable, bringing new questions within empirical reach
  * We have been disappointed by the amount of progress made by ambitious mech interp work, from both us and others [[2]]
  * Most existing interpretability techniques struggle on today’s important behaviours, which involve large models, complex environments, agentic behaviour and long chains of thought
* Problem: It is easy to do research that doesn't make real progress.
  * Our approach: ground your work with a North Star - a meaningful stepping-stone goal towards AGI going well - and a proxy task - empirical feedback that stops you fooling yourself and that tracks progress toward the North Star.
  * “Proxy tasks” doesn't mean boring benchmarks. Examples include: interpret the hidden goal of a model organism; stop emergent misalignment without changing training data; predict what prompt changes will st
I think that most of the time when you need to classify something, you should use an LLM, not a probe.
That being said, there are some situations where I think activation probes are better. To help clarify my thinking, I wrote out the axes on which I currently think probes are sometimes better than LLMs (and possibly SOTA):
1. Efficiency -> when done on-policy, probes are extremely cheap and fast. See, for example, Anthropic's work on efficient misuse probes or our recent Gemini Probing paper (a minimal sketch of such a probe follows this list).
2. Safety -> for some things like deception, alignment faking, eval awareness, etc., you might not be able to trust the model's output, and probes (or other internals...
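
To make the efficiency point concrete, here is a minimal sketch of what a linear activation probe can look like, assuming you have already cached activations from some layer of a model. Everything below is illustrative rather than taken from the papers mentioned above: the synthetic activations, the planted concept direction, and the scikit-learn logistic regression are all stand-in assumptions. The point is simply that, once activations exist, training the probe is a single linear fit and inference is a dot product, which is why on-policy probes are so cheap relative to calling an LLM classifier.

```python
# Illustrative linear activation probe (not the probes from the papers cited
# above). In real usage, `activations` would be residual-stream activations
# captured from the model at a chosen layer; here they are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n_examples = 512, 2000

# Stand-in for cached activations: random vectors plus a planted "concept"
# direction whose presence defines the positive class.
concept_direction = rng.normal(size=d_model)
concept_direction /= np.linalg.norm(concept_direction)
labels = rng.integers(0, 2, size=n_examples)
activations = rng.normal(size=(n_examples, d_model))
activations += 3.0 * labels[:, None] * concept_direction

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.25, random_state=0
)

# The probe itself: a single logistic regression over activation vectors.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {probe.score(X_test, y_test):.3f}")

# At inference time, scoring a new activation is one dot product plus a
# sigmoid, which is negligible next to the forward pass that produced it.
scores = probe.predict_proba(X_test)[:, 1]
```

The cost contrast is the key design point: the probe adds essentially nothing on top of a forward pass you were running anyway, whereas an LLM-based classifier requires its own model call per example.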