TL;DR: We train sparse autoencoders (SAEs) on artificial datasets of 2D points, which are arranged to fall into pre-defined, visually recognizable clusters. We find that the resulting SAE features are interpretable as a clustering algorithm via the natural rule "a point is in cluster N if feature N activates on...
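The setup in the teaser above can be sketched in a few lines of numpy. This is a minimal illustrative version, not the post's actual experiments: the cluster centers, hyperparameters, and the tiny hidden size are all invented here. It trains a one-layer ReLU sparse autoencoder on 2D points drawn from well-separated clusters, then applies the "most active feature = cluster label" reading.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2D data: three well-separated clusters (invented for illustration,
# not the datasets from the post).
centers = np.array([[4.0, 0.0], [-2.0, 3.5], [-2.0, -3.5]])
X = np.concatenate([c + 0.1 * rng.standard_normal((200, 2)) for c in centers])

mse_initial = (X ** 2).mean()  # reconstruction error of an all-zero decoder

# Minimal sparse autoencoder: ReLU encoder, linear decoder, L1 sparsity penalty.
# Hidden width 3 so each feature *can* specialize to one cluster.
d_in, d_hid = 2, 3
W_enc = 0.1 * rng.standard_normal((d_in, d_hid))
b_enc = np.zeros(d_hid)
W_dec = 0.1 * rng.standard_normal((d_hid, d_in))
lr, l1 = 0.02, 0.01
n = len(X)

for step in range(3000):
    h = np.maximum(X @ W_enc + b_enc, 0.0)   # sparse feature activations
    X_hat = h @ W_dec                        # linear reconstruction
    err = X_hat - X
    # Gradients of mean squared error plus L1 penalty on activations.
    g_dec = h.T @ err / n
    g_h = (err @ W_dec.T + l1 * np.sign(h)) * (h > 0)  # ReLU gradient mask
    g_enc = X.T @ g_h / n
    g_b = g_h.mean(axis=0)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
    b_enc -= lr * g_b

h = np.maximum(X @ W_enc + b_enc, 0.0)
mse_final = ((h @ W_dec - X) ** 2).mean()

# The clustering rule from the post: a point belongs to cluster N
# if feature N is its most active feature.
labels = h.argmax(axis=1)
```

Whether each feature actually specializes to one cluster depends on initialization and the sparsity coefficient; the post's point is that when it works, the learned dictionary *is* a clustering.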
These are some of my notes from reading Anthropic's latest research report, Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. TL;DR: In roughly descending order of importance: 1. It's great that Anthropic trained an SAE on a production-scale language model, and that the approach works to find interpretable features....
Introduction A recent popular tweet performed a "math magic trick", and I want to explain why it works and use that as an excuse to talk about cool math (functional analysis). The tweet in question: This is a cute magic trick, and like any good trick, it nonchalantly glosses over...
[3/7 Edit: I have rephrased the bolded claims in the abstract per this comment from Joseph Bloom, hopefully improving the light-to-heat ratio. Commenters have also suggested training on earlier layers and using untied weights, and in my experiments this increases the number of classifiers found, so the headline number should...
9 months ago I predicted trends I expected to see in AI over the course of 2023. Here's how I did (bold indicates it happened, italics that it didn't, plain text that it's unresolved): 1. ChatGPT (or a successor product from OpenAI) will have image-generating capabilities incorporated by end of 2023: 70% 2....
Readers may have noticed many similarities between Anthropic's recent publication Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (LW post) and my team's recent publication Sparse Autoencoders Find Highly Interpretable Directions in Language Models (LW post). Here I want to compare our techniques and highlight what we did similarly or...
This is a linkpost for Sparse Autoencoders Find Highly Interpretable Directions in Language Models We use a scalable and unsupervised method called Sparse Autoencoders to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M) for both residual stream and MLPs. We showcase monosemantic features, feature replacement for Indirect Object Identification (IOI),...