TL;DR: We train sparse autoencoders (SAEs) on artificial datasets of 2D points, which are arranged to fall into pre-defined, visually recognizable clusters. We find that the resulting SAE features are interpretable as a clustering algorithm via the natural rule "a point is in cluster N if feature N activates on...
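The setup in the teaser above can be sketched in a few lines of numpy. This is a minimal illustrative version, not the post's actual experiments: the cluster centers, hyperparameters, and the tiny hidden size are all invented here. It trains a one-layer ReLU sparse autoencoder on 2D points drawn from well-separated clusters, then applies the "most active feature = cluster label" reading.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2D data: three well-separated clusters (invented for illustration,
# not the datasets from the post).
centers = np.array([[4.0, 0.0], [-2.0, 3.5], [-2.0, -3.5]])
X = np.concatenate([c + 0.1 * rng.standard_normal((200, 2)) for c in centers])

mse_initial = (X ** 2).mean()  # reconstruction error of an all-zero decoder

# Minimal sparse autoencoder: ReLU encoder, linear decoder, L1 sparsity penalty.
# Hidden width 3 so each feature *can* specialize to one cluster.
d_in, d_hid = 2, 3
W_enc = 0.1 * rng.standard_normal((d_in, d_hid))
b_enc = np.zeros(d_hid)
W_dec = 0.1 * rng.standard_normal((d_hid, d_in))
lr, l1 = 0.02, 0.01
n = len(X)

for step in range(3000):
    h = np.maximum(X @ W_enc + b_enc, 0.0)   # sparse feature activations
    X_hat = h @ W_dec                        # linear reconstruction
    err = X_hat - X
    # Gradients of mean squared error plus L1 penalty on activations.
    g_dec = h.T @ err / n
    g_h = (err @ W_dec.T + l1 * np.sign(h)) * (h > 0)  # ReLU gradient mask
    g_enc = X.T @ g_h / n
    g_b = g_h.mean(axis=0)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
    b_enc -= lr * g_b

h = np.maximum(X @ W_enc + b_enc, 0.0)
mse_final = ((h @ W_dec - X) ** 2).mean()

# The clustering rule from the post: a point belongs to cluster N
# if feature N is its most active feature.
labels = h.argmax(axis=1)
```

Whether each feature actually specializes to one cluster depends on initialization and the sparsity coefficient; the post's point is that when it works, the learned dictionary *is* a clustering.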
These are some of my notes from reading Anthropic's latest research report, Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. TL;DR: In roughly descending order of importance: 1. It's great that Anthropic trained an SAE on a production-scale language model, and that the approach works to find interpretable features....
Introduction A recent popular tweet performed a "math magic trick", and I want to explain why it works and use that as an excuse to talk about cool math (functional analysis). The tweet in question: This is a cute magic trick, and like any good trick, it nonchalantly glosses over...
[3/7 Edit: I have rephrased the bolded claims in the abstract per this comment from Joseph Bloom, hopefully improving the light-to-heat ratio. Commenters have also suggested training on earlier layers and using untied weights, and in my experiments this increases the number of classifiers found, so the headline number should...
9 months ago I predicted trends I expected to see in AI over the course of 2023. Here's how I did (bold indicates it happened, italics that it didn't, plain text that it's unresolved): 1. ChatGPT (or a successor product from OpenAI) will have image-generating capabilities incorporated by end of 2023: 70% 2....
Readers may have noticed many similarities between Anthropic's recent publication Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (LW post) and my team's recent publication Sparse Autoencoders Find Highly Interpretable Directions in Language Models (LW post). Here I want to compare our techniques and highlight what we did similarly or...
This is a linkpost for Sparse Autoencoders Find Highly Interpretable Directions in Language Models We use a scalable and unsupervised method called Sparse Autoencoders to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M) for both residual stream and MLPs. We showcase monosemantic features, feature replacement for Indirect Object Identification (IOI),...