Alex Makelov


SAEs Discover Meaningful Features in the IOI Task

TLDR: Recently, we wrote a paper proposing several evaluations of SAEs against "ground-truth" features computed with supervision for a given task (in our case, IOI [1]). However, we didn't optimize the SAEs much for performance in our tests. After putting the paper on arXiv, Alex carried out a more exhaustive...

Jun 5, 2024 · 15

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort. We would like to thank Atticus Geiger for his valuable feedback and in-depth discussions throughout this project. tl;dr: Activation patching is a common method for finding model components (attention heads, MLP layers, …) relevant to...

Aug 29, 2023 · 77

LESSWRONG