Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort
We would like to thank Atticus Geiger for his valuable feedback and in-depth discussions throughout this project.
tl;dr:
Activation patching is a common method for finding model components (attention heads, MLP layers, …) relevant to a given task. However, features rarely occupy entire components: instead, we expect them to form non-basis-aligned subspaces of these components.
We show that the obvious generalization of activation patching to subspaces is prone to a kind of interpretability illusion. Specifically, it is possible for a 1-dimensional subspace patch in the IOI task to significantly affect predicted probabilities by activating a normally dormant pathway outside the IOI circuit. At the same time, activation patching the entire MLP layer where this subspace lies has no such effect. We call this an "MLP-In-The-Middle" illusion.
We give a simple mathematical model of how this situation may arise more generally, along with a priori / heuristic arguments for why it may be common in real-world LLMs.
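To fix terminology, here is a minimal sketch of ordinary component-level activation patching (swapping in the full output of an MLP layer), written for a TransformerLens `HookedTransformer`; the model, layer, and prompts are illustrative placeholders, not necessarily the exact setup used in our experiments. A sketch of the subspace variant appears further below.

```python
# Minimal sketch of activation patching a whole component (an MLP layer's output),
# assuming a TransformerLens HookedTransformer. Layer choice and prompts are
# illustrative placeholders.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

LAYER = 8                                   # hypothetical layer to patch
HOOK_NAME = f"blocks.{LAYER}.hook_mlp_out"  # full output of that layer's MLP

# Two IOI-style prompts differing only in the quantity of interest (name order).
clean_tokens = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupt_tokens = model.to_tokens("When Mary and John went to the store, John gave a drink to")

# Cache the corrupted run, then splice its MLP output into the clean run.
_, corrupt_cache = model.run_with_cache(corrupt_tokens)

def patch_mlp_out(act, hook):
    # Overwrite the clean MLP output at every position with the corrupted value.
    return corrupt_cache[hook.name]

patched_logits = model.run_with_hooks(
    clean_tokens, fwd_hooks=[(HOOK_NAME, patch_mlp_out)]
)
```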
Introduction
The linear representation hypothesis suggests that language models represent concepts as meaningful directions (or subspaces, for non-binary features) in the much larger space of possible activations. A central goal of mechanistic interpretability is to discover these subspaces and map them to interpretable variables, as they form the “units” of model computation.
However, the residual stream activations (and maybe even the neuron activations!) mostly don’t have a privileged basis. This means that many meaningful subspaces won’t be basis-aligned; rather than iterating over possible neurons and sets of neurons, we need to consider arbitrary subspaces of activations. This is a much larger search space! How can we navigate it?
A natural approach to check “how well” a subspace represents a concept is to use a subspace analogue of the activation patching technique. You run the model on input A, but with the activation along the subspace taken from an input B that differs from A only in the value of the concept in question. If the subspace encodes the information used by the model to distinguish B from A, we expect to see a corresponding change in model behavior (compared to just running on A).
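Concretely, a subspace patch only swaps the projection of the activation onto the chosen subspace, leaving the orthogonal complement untouched. Here is a minimal sketch for a 1-dimensional subspace, again assuming a TransformerLens `HookedTransformer`; the hook point, token position, prompts, and direction `v` below are placeholders rather than the specific subspace studied in this post.

```python
# Minimal sketch of 1-dimensional subspace activation patching, assuming a
# TransformerLens HookedTransformer. The layer, position, prompts, and the
# direction v are illustrative placeholders.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

LAYER = 8                                    # hypothetical layer
HOOK_NAME = f"blocks.{LAYER}.hook_resid_pre" # residual stream entering the layer
POS = -1                                     # hypothetical token position to patch

# Unit vector spanning the subspace; in practice v would come from probing or
# optimization rather than being drawn at random.
v = torch.randn(model.cfg.d_model, device=model.cfg.device)
v = v / v.norm()

# Input A and an input B that differs from A only in the concept of interest.
tokens_A = model.to_tokens("When John and Mary went to the store, John gave a drink to")
tokens_B = model.to_tokens("When Mary and John went to the store, John gave a drink to")

# Cache B's activations and read off its coefficient along v.
_, cache_B = model.run_with_cache(tokens_B)
donor_coeff = cache_B[HOOK_NAME][0, POS] @ v

def patch_subspace(act, hook):
    # Replace A's component along v with B's component; everything orthogonal
    # to v is left unchanged.
    own_coeff = act[0, POS] @ v
    act[0, POS] = act[0, POS] + (donor_coeff - own_coeff) * v
    return act

# Run on input A with only the subspace patched in from B.
patched_logits = model.run_with_hooks(
    tokens_A, fwd_hooks=[(HOOK_NAME, patch_subspace)]
)
```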
Surprisingly, this check can pass for the wrong reasons: as we show below, a subspace patch can change the model's behavior by activating a normally dormant pathway, rather than by intervening on the representation the model actually uses to distinguish B from A.