A Pragmatic Vision for Interpretability
Executive Summary

* The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability:
  * Trying to directly solve problems on the critical path to AGI going well [1]
  * Carefully choosing problems according to our comparative advantage
  * Measuring progress with empirical feedback on proxy tasks
* We believe that, on the margin, more researchers who share our goals should take a pragmatic approach to interpretability, both in industry and academia, and we call on people to join us
  * Our proposed scope is broad and includes much non-mech interp work, but we see this as the natural approach for mech interp researchers to have impact
  * Specifically, we’ve found that the skills, tools and tastes of mech interp researchers transfer well to important and neglected problems outside “classic” mech interp
  * See our companion piece for more on which research areas and theories of change we think are promising
* Why pivot now? We think that times have changed.
  * Models are far more capable, bringing new questions within empirical reach
  * We have been disappointed by the amount of progress made by ambitious mech interp work, from both us and others [2]
  * Most existing interpretability techniques struggle on today’s important behaviours, e.g. those involving large models, complex environments, agentic behaviour and long chains of thought
* Problem: It is easy to do research that doesn't make real progress.
  * Our approach: ground your work with a North Star - a meaningful stepping-stone goal towards AGI going well - and a proxy task - empirical feedback that stops you fooling yourself and that tracks progress toward the North Star.
  * “Proxy tasks” doesn't mean boring benchmarks. Examples include: interpret the hidden goal of a model organism; stop emergent misalignment without changing training data; predict what prompt changes will st