How are activations in a transformer clustered together and what can we learn?
There has been a lot of progress using unsupervised methods (such as sparse autoencoders) to find monosemantic features in LLMs. However, is there a way to interpret activations without breaking them down into features?
The answer is yes. I contend that clustering activations (drawn from a large dataset of examples) is itself highly interpretable. The process is simple. Choose any example, then find the set of activations from other examples that are closest in L2 distance at your favorite activation stage. The semantic meaning of that stage can then be inferred by noticing what the examples in that set have in common.
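Here is a minimal sketch of that procedure. It assumes a Hugging Face GPT-2 model, takes the activation from the residual stream at a single hidden layer, and uses the final token's vector as the representation of each example; the layer choice, pooling, and toy dataset are all illustrative assumptions, not prescriptions from the post.

```python
# A minimal sketch of the nearest-neighbor procedure described above.
# Assumptions (illustrative only): GPT-2 small via Hugging Face, activations
# taken at one hidden layer, and one vector per example at the last token.

import torch
from transformers import GPT2Tokenizer, GPT2Model

LAYER = 6          # which hidden layer ("activation stage") to inspect -- an assumption
NUM_NEIGHBORS = 3  # how many L2-closest examples to retrieve

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True).eval()

def activation(text: str) -> torch.Tensor:
    """Return the hidden-state activation at LAYER for the last token of `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # tuple: (layer, batch, seq, d_model)
    return hidden_states[LAYER][0, -1]                 # last-token vector, shape (d_model,)

# A toy stand-in for the "large dataset of examples" -- in practice, thousands of snippets.
dataset = [
    "The cat sat on the mat.",
    "A dog chased the ball across the yard.",
    "The stock market fell sharply on Tuesday.",
    "Investors worried about rising interest rates.",
    "She poured the coffee into a chipped mug.",
]
dataset_acts = torch.stack([activation(t) for t in dataset])

# Pick any example as the query and list its L2-closest neighbors in the dataset.
query = "Bond yields climbed as traders priced in another rate hike."
dists = torch.cdist(activation(query).unsqueeze(0), dataset_acts).squeeze(0)
for i in dists.argsort()[:NUM_NEIGHBORS]:
    print(f"{dists[i]:.2f}  {dataset[i]}")
```

If the nearest neighbors of a finance-related query are themselves finance-related sentences, that similarity is the kind of shared meaning one would read off from the cluster at that stage.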
In some ways, cluster analysis...