I keep coming back to the idea of interpreting the embedding matrix of a transformer. It’s appealing for several reasons: we know the entire data distribution, which is just the marginal probability of each token, so there’s no mystery about which features are data features and which are model features. We also know one sparse basis for the activations: the rows of the embedding. But that’s also clearly not satisfactory, because the embedding learns something! The thing it learns could be a sparse overbasis of non-token features, but the story for this would have to differ from the normal superposition story, in which features are placed into superposition by model components after they are computed (I find that story suss in other parts of the model too).
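To make the “we already know one sparse basis” point concrete, here is a minimal sketch (assuming GPT-2 loaded via HuggingFace transformers; the model choice is mine, purely for illustration): the embedding-layer activation for token t is literally row t of W_E, so the one-hot token basis reconstructs every activation exactly with L0 = 1.

```python
# Minimal sketch, assuming GPT-2 via HuggingFace transformers (my choice of
# model, purely for illustration): the embedding-layer activation for token t
# is literally row t of W_E, so the one-hot token basis is a perfectly sparse
# (L0 = 1) basis with zero reconstruction error.
import torch
from transformers import GPT2Model, GPT2Tokenizer

model = GPT2Model.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")

W_E = model.wte.weight.detach()                      # (n_vocab, d_model)
token_ids = tok("interpreting the embedding", return_tensors="pt").input_ids[0]

# Token-embedding activations are just indexed rows of W_E (the positional
# embedding is added afterwards and is ignored here).
acts = model.wte(token_ids).detach()                 # (seq, d_model)
assert torch.allclose(acts, W_E[token_ids])
```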
SAEs trained on the embedding do pretty well, but the task is much easier than in other layers because the dataset is deceptively small: there are only vocab-size many distinct inputs. Nonetheless, if the error were exactly zero, that would mean a sparse overbasis is certainly real here (even if not the full story). If the error were merely small, we might want to write it off as training noise. So I have a couple of experimental questions that would start this off.
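For concreteness, here is roughly what “an SAE trained on the embedding” means above. The architecture (ReLU encoder, L1 penalty), expansion factor, and hyperparameters are placeholder assumptions, and the embedding is a random stand-in you would swap for the real W_E (probably sampling rows by token frequency rather than uniformly).

```python
# Minimal sketch of an SAE trained on the embedding: a ReLU autoencoder with
# an L1 sparsity penalty, trained on rows of W_E. The architecture, expansion
# factor, and hyperparameters are illustrative assumptions; W_E here is a
# random stand-in you'd replace with the real embedding matrix.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_vocab, expansion = 768, 50257, 8
W_E = torch.randn(n_vocab, d_model)            # stand-in for the real W_E

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, n_features):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))            # sparse feature activations
        return self.dec(f), f

sae = SparseAutoencoder(d_model, expansion * d_model)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

for step in range(2000):
    # Uniform sampling over tokens; a real run would probably weight rows by
    # token frequency so the loss matches the actual activation distribution.
    batch = W_E[torch.randint(0, n_vocab, (256,))]
    recon, f = sae(batch)
    mse = (recon - batch).pow(2).mean()
    loss = mse + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final MSE: {mse.item():.5f}")
```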
If we find that 1) random embeddings do a lot worse and 2) basis pursuit doesn’t lead to error nodes that tend to zero over training, then we’re in business: the embedding matrix contains important structure that falls outside the superposition hypothesis. Is matrix binding going on? Are circles common? WHAT IS IT?
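A sketch of both checks. Assumptions on my part: “error nodes” are read as the residuals left after sparse-coding each embedding row against a candidate dictionary (e.g. an SAE decoder), basis pursuit is relaxed to lasso via sklearn’s sparse_encode, and the dictionary and embeddings below are random stand-ins for the real matrices and checkpoints.

```python
# Sketch of both checks. Assumptions: "error nodes" are read as the residuals
# left after sparse-coding each embedding row against a candidate dictionary
# (e.g. an SAE decoder), basis pursuit is relaxed to lasso, and every matrix
# below is a random stand-in for the real thing.
import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(0)
d_model, n_vocab, n_features = 64, 1000, 512

dictionary = rng.standard_normal((n_features, d_model))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

def residual_norms(embedding, dictionary, alpha=0.05):
    # codes: (n_vocab, n_features); residual = x - codes @ dictionary
    codes = sparse_encode(embedding, dictionary, algorithm="lasso_lars", alpha=alpha)
    return np.linalg.norm(embedding - codes @ dictionary, axis=1)

# Check 1: trained embedding vs a random embedding of the same shape.
trained_W_E = rng.standard_normal((n_vocab, d_model))   # swap in the real W_E
random_W_E = rng.standard_normal((n_vocab, d_model))
print("trained:", residual_norms(trained_W_E, dictionary).mean())
print("random: ", residual_norms(random_W_E, dictionary).mean())

# Check 2: do the residuals tend to zero over training? Run the same
# measurement on embedding checkpoints saved during training, e.g.:
# for step, W_E_ckpt in checkpoints:      # hypothetical checkpoint iterable
#     print(step, residual_norms(W_E_ckpt, dictionary).mean())
```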