[Blog] [Paper] [Visualizer]
Abstract:
Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since language models learn many concepts, autoencoders need to be very large to recover all relevant features. However, studying the properties of autoencoder scaling is difficult due to the need to balance reconstruction and sparsity objectives and the presence of dead latents. We propose using k-sparse autoencoders [Makhzani and Frey, 2013] to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier. Additionally, we find modifications that result in few dead latents, even at the largest scales we tried. Using these techniques, we find clean scaling laws with respect to autoencoder size and sparsity. We also introduce several new metrics for evaluating feature quality based on the recovery of hypothesized features, the explainability of activation patterns, and the sparsity of downstream effects. These metrics all generally improve with autoencoder size. To demonstrate the scalability of our approach, we train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens. We release code and autoencoders for open-source models, as well as a visualizer.
Well done and thank you! I don't feel qualified to judge exactly but this seems like a significant step forward. Curious to hear your thoughts on the question of "by what year will [insert milestone X] be achieved assuming research progress continues on-trend." Some milestones perhaps are in this tech tree https://www.lesswrong.com/posts/nbq2bWLcYmSGup9aF/a-transparency-and-interpretability-tech-tree but the one I'm most interested in is the "we have tools which can tell whether a model is scheming or otherwise egregiously misaligned, though if we trained against those tools they'd stop working" milestone.
To add some more concreteness: suppose we open up the model and find that it's basically just a giant k nearest neighbors (it obviously can't be literally this, but this is easiest to describe as an analogy). Then this would explain why current alignment techniques work and dissolves some of the mystery of generalization. Then suppose we create AGI and we find that it does something very different internally that is more deeply entangled and we can't really make sense of it because it's too complicated. Then this would imo also provide strong evidence that... (read more)