Jakob Hansen

Here is the promised Colab notebook for exploring SAE features with TDA. It works on the top-k GPT2-small SAEs by default, but should be pretty easily adaptable to most SAEs available in sae_lens. The graphs will look a little different from the ones shown in the post because they are constructed directly from the decoder weight vectors rather than from feature correlations across a corpus.
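In case it helps, here is a rough sketch (not the notebook itself) of what building a graph from the decoder weights looks like. The release and sae_id strings are placeholders, and the exact `SAE.from_pretrained` return value varies between sae_lens versions, so treat this as illustrative rather than copy-paste ready:

```python
import torch
from sae_lens import SAE

# Placeholder release/sae_id -- swap in the actual top-k GPT2-small SAE identifiers.
result = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
)
# Some sae_lens versions return (sae, cfg_dict, sparsity), others just the SAE.
sae = result[0] if isinstance(result, tuple) else result

W_dec = sae.W_dec.detach()                           # (n_features, d_model) decoder directions
W_dec = W_dec / W_dec.norm(dim=-1, keepdim=True)     # unit-normalize each feature direction

# Cosine similarity between decoder directions; keep each feature's k nearest
# neighbours as graph edges.
k = 10
sims = W_dec @ W_dec.T
sims.fill_diagonal_(-1.0)                            # exclude self-edges
topk = sims.topk(k, dim=-1)

edges = [(i, int(j)) for i in range(sims.shape[0]) for j in topk.indices[i]]
```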

One of the interesting things I found while putting this together is a large group of "previous token" features, which are mostly misinterpreted by the LLM-generated explanations. Features like these have been noted in attention SAEs (e.g. https://www.alignmentforum.org/posts/xmegeW5mqiBsvoaim/we-inspected-every-head-in-gpt-2-small-using-saes-so-you-don), but I haven't seen much discussion of them, even though they seem very relevant for implementing induction heads. Their grouping together in the graph makes sense if they are all computed by, or used as input to, a single attention head, or more generally if some subspace of the residual stream is reserved for this kind of information. I haven't yet checked whether that is the case.
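For what it's worth, one quick way to test the shared-subspace idea would be to look at the singular value spectrum of the decoder vectors for just those features. A minimal sketch, assuming `W_dec` from the snippet above and a hypothetical list `prev_token_idx` of the suspected previous-token feature indices read off from the graph:

```python
import torch

# Hypothetical indices for illustration only -- in practice, read these off the graph.
prev_token_idx = [101, 2048, 7777]

subset = W_dec[prev_token_idx]                       # (n_prev_features, d_model)
subset = subset - subset.mean(dim=0, keepdim=True)
singular_values = torch.linalg.svdvals(subset)

# If these features really live in a low-dimensional subspace of the residual
# stream, most of the variance should be captured by the first few directions.
explained = (singular_values ** 2).cumsum(0) / (singular_values ** 2).sum()
print(explained[:10])
```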