SAE feature geometry is outside the superposition hypothesis
Written at Apollo Research

Summary: Superposition-based interpretations of neural network activation spaces are incomplete. The specific locations of feature vectors contain crucial structural information beyond superposition, as seen in circular arrangements of day-of-the-week features and in the rich structures of feature UMAPs. We don't currently have good concepts for talking about this structure in feature geometry, but it is likely very important for model computation. An eventual understanding of feature geometry might look like a hodgepodge of case-specific explanations, or superposition supplemented with additional concepts, or plausibly an entirely new theory that supersedes superposition. To develop this understanding, it may be valuable to study toy models in depth and to do theoretical or conceptual work in addition to studying frontier models.

Epistemic status: Decently confident that the ideas here are directionally correct. I've been thinking these thoughts for a while, and recently got round to writing them up at a high level. Lots of people (including both SAE stans and SAE skeptics) have thought very similar things before, and some of them have written about it in various places too. Some of my views, especially on the merit of certain research approaches for tackling the problems I highlight, are presented here without my best attempt to argue for them.

What would it mean if we could fully understand an activation space through the lens of superposition? If you fully understand something, you can explain everything about it that matters to someone else, in terms of concepts you (and hopefully they) understand. So we can think about how well I understand an activation space by how well I can communicate to you what the activation space is doing, and we can test whether my explanation is good by seeing if you can construct a functionally equivalent activation space (which need not be completely identical, of course) solely from the information I have given.
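As a toy sketch of the distinction being drawn here (not from the post; all names and dimensions are hypothetical), consider the day-of-the-week example: under superposition, an activation is just a sparse combination of feature directions, but the circular arrangement of those directions is additional structure that a bag-of-features description discards.

```python
# Toy sketch: superposition says "which features fired"; feature geometry is
# about *where* the feature directions sit relative to each other.
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # hypothetical activation dimension

# Hypothetical day-of-week features arranged on a circle in a 2D subspace.
basis = np.linalg.qr(rng.normal(size=(d_model, 2)))[0]  # orthonormal 2D plane
angles = 2 * np.pi * np.arange(7) / 7
day_features = (np.cos(angles)[:, None] * basis[:, 0]
                + np.sin(angles)[:, None] * basis[:, 1])

# A superposition-style activation: sparse combination of feature directions.
coeffs = np.zeros(7)
coeffs[2] = 1.0  # only the "Wednesday" feature is active
activation = coeffs @ day_features

# A pure feature-list description stops at "feature 2 fired". The circular
# layout additionally makes adjacent days (Tue, Thu) more similar to
# Wednesday than distant days are -- structural information beyond sparsity.
sims = day_features @ day_features.T
print(np.round(sims[2], 2))
```

The point of the sketch is that two dictionaries with identical sparse codes but different geometric arrangements of their feature vectors would be functionally different, which is exactly the information a superposition-only explanation omits.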
Upvoted. The way I think about the case for ambitious interp is:
- There are a number of pragmatic approaches in AI safety that bottom out in finding a signal and optimising against it. Such approaches might all have ceilings of usefulness due to oversight problems. In the case of pragmatic interp, the ceiling is primarily set by the fact that if you want to use the interp techniques, you incentivise obfuscation. There are things you can do to slow the obfuscation down, e.g. hillclimbing on your signal is better than training against it, but this is true of most pragmatic approaches.
- Ambitious interp should have a much
...