Here is the promised Colab notebook for exploring SAE features with TDA. It works on the top-k GPT2-small SAEs by default, but should be pretty easily adaptable to most SAEs available in sae_lens. The graphs will look a little different from the ones shown in the post because they are constructed directly from the decoder weight vectors rather than from feature correlations across a corpus.
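To make "constructed directly from the decoder weight vectors" concrete, here is a rough sketch of one way to turn decoder directions into a graph. This is not the notebook's code: it swaps in a plain cosine-similarity k-nearest-neighbor graph as a stand-in for the TDA construction, and `W_dec`, `knn_graph_from_decoder`, and `k` are hypothetical names, with a random matrix in place of real SAE weights.

```python
import numpy as np

def knn_graph_from_decoder(W_dec: np.ndarray, k: int = 5) -> list[tuple[int, int, float]]:
    """Link each feature to its k most cosine-similar features.

    W_dec is assumed to have shape (n_features, d_model), one decoder
    direction per row. Returns undirected edges (i, j, similarity), i < j.
    """
    # Normalize rows so that dot products are cosine similarities.
    norms = np.linalg.norm(W_dec, axis=1, keepdims=True)
    unit = W_dec / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    # Exclude self-similarity from the neighbor search.
    np.fill_diagonal(sim, -np.inf)
    # Indices of the k largest similarities in each row (unordered).
    nbrs = np.argpartition(-sim, k, axis=1)[:, :k]
    edges: dict[tuple[int, int], float] = {}
    for i in range(sim.shape[0]):
        for j in nbrs[i]:
            j = int(j)
            key = (i, j) if i < j else (j, i)
            edges[key] = float(sim[i, j])
    return [(a, b, w) for (a, b), w in sorted(edges.items())]

# Stand-in for a real decoder matrix (assumption: 200 features, width 64).
rng = np.random.default_rng(0)
W_dec = rng.normal(size=(200, 64))
edges = knn_graph_from_decoder(W_dec, k=5)
```

With a real SAE you would pass its decoder matrix instead of the random one. Working from weights alone needs no corpus pass, which is the trade-off against the correlation-based graphs in the post.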
One of the interesting things I found while putting this together is a large group of "previous token" features, which are mostly misinterpreted by the LLM-generated exp...