Summary
This post discusses a possible way to detect feature absorption: find SAE latents that (1) have a similar causal effect but (2) don't activate on the same token. We'll discuss the theory of how this method should work, and we'll also briefly go over why it doesn't work in practice.
Introduction
Feature absorption was introduced in A is for Absorption.
Their example: Assume the model has two features, 𝒜="this token is 'short'" and ℬ="this word starts with s". In the prompt "Sam is short", 𝒜 activates on "short", and ℬ activates on "Sam" and "short". Hence the L0 across those two features is 3. We train an SAE to recover these features.
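The activation pattern and L0 count in this example can be sketched in a few lines of Python (the feature definitions here are illustrative stand-ins for the post's hypothetical features, not real SAE latents):

```python
# The prompt "Sam is short", tokenized per word.
tokens = ["Sam", "is", "short"]

# Feature A: "this token is 'short'"
feat_A = [tok == "short" for tok in tokens]
# Feature B: "this word starts with s" (case-insensitive)
feat_B = [tok.lower().startswith("s") for tok in tokens]

# Total number of (feature, token) activations across both features.
l0 = sum(feat_A) + sum(feat_B)
print(l0)  # 3: A fires on "short"; B fires on "Sam" and "short"
```

Here L0 counts active (feature, token) pairs summed over the prompt, which is why the two features together give a count of 3.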
But the SAE is...