Self-explaining SAE features
TL;DR * We apply the method of SelfIE/Patchscopes to explain SAE features – we give the model a prompt like “What does X mean?”, replace the residual stream on X with the decoder direction times some scale, and have it generate an explanation. We call this self-explanation. * The natural alternative is auto-interp, using a larger LLM to spot patterns in max activating examples. We show that our method is effective, and comparable with Neuronpedia’s auto-interp labels (with the caveat that Neuronpedia’s auto-interp used the comparatively weak GPT-3.5 so this is not a fully fair comparison). * We aren’t confident you should use our method over auto-interp, but we think in some situations it has advantages: no max activating dataset examples are needed, and it’s cheaper as you just run the model being studied (eg Gemma 2B) not a larger model like GPT-4. * Further, it has different errors to auto-interp, so finding and reading both may be valuable for researchers in practice. * We provide advice for using self-explanation in practice, in particular for the challenge of automatically choosing the right scale, which significantly affects explanation quality. * We also release a tool for you to work with self-explanation. * We hope the technique is useful to the community as is, but expect there’s many optimizations and improvements on top of what is in this post. Introduction This work was produced as part of the ML Alignment & Theory Scholars Program - Summer 24 Cohort, under mentorship from Neel Nanda and Arthur Conmy. SAE features promise a flexible and extensive framework for interpretation of LLM internals. Recent work (like Scaling Monosemanticity) has shown that they are capable of capturing even high-level abstract concepts inside the model. Compared to MLP neurons, they can capture many more interesting concepts. Unfortunately, in order to learn things with SAE features and interpret what the SAE tells us, one needs to first interpret these features