Here is the promised Colab notebook for exploring SAE features with TDA. It works on the top-k GPT2-small SAEs by default, but should be pretty easily adaptable to most SAEs available in sae_lens. The graphs will look a little different from the ones shown in the post because they are constructed directly from the decoder weight vectors rather than from feature correlations across a corpus.
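For anyone who wants a starting point before opening the notebook, here is a minimal sketch (not the notebook's own code) of loading an SAE through sae_lens and pulling out the decoder directions the graphs are built from. The release and SAE IDs below are illustrative assumptions and would need to be swapped for the top-k GPT2-small release.

```python
# Minimal sketch: load a GPT2-small SAE via sae_lens and extract the decoder
# weight vectors. The release/sae_id strings are illustrative placeholders.
from sae_lens import SAE

# Depending on the sae_lens version, from_pretrained returns the SAE alone
# or a (sae, cfg_dict, sparsity) tuple.
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",       # assumed release name; swap for the top-k SAEs
    sae_id="blocks.8.hook_resid_pre",  # assumed hook point
)

# Each row of W_dec is one feature's decoder direction in the residual stream.
decoder_vectors = sae.W_dec.detach().cpu().numpy()  # shape: (n_features, d_model)
print(decoder_vectors.shape)
```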
One of the interesting things I found while putting this together is a large group of "previous token" features, which are mostly misinterpreted by the LLM-generated explanations. These have been noted in attention SAEs (e.g. https://www.alignmentforum.org/posts/xmegeW5mqiBsvoaim/we-inspected-every-head-in-gpt-2-small-using-saes-so-you-don), but I haven't seen much discussion of them, although they seem very relevant for implementing induction heads. The fact that they are grouped together in the graph makes sense if they are all computed by or used as the input to a single attention head, or more generally if there is some subspace of the residual stream reserved for this kind of information, although I haven't yet checked if this is the case.
No matter how I stretch or compress the digit 0, I can never achieve the two loops that are present in the digit 8.
Doesn't a 0 that is deformed by left and right pressure, so that its sides meet, seem to contradict this?
Sorry, I did not make the notion of deformation precise. The idea is that stretching and compressing cannot include attaching one part to another, or tearing it. The mathematical term is "homeomorphism", which is a one-to-one, onto, continuous map with a continuous inverse. The precise statement is that the figure 8 is not homeomorphic to the figure 0. A good place to look is
https://www.google.com/books/edition/Basic_Topology/NJbuBwAAQBAJ?hl=en&gbpv=1&printsec=frontcover
Yeah, I encountered the concept during my studies and was rather teasing in the hope of getting a great popular, easy-to-grasp explanation that would also fit the definition.
It's not easy to find a fitting visual analogy, TBH; I'd find one generally useful, since I think the concept enhances general thinking.
This article was written in response to a post on LessWrong from the Apollo Research interpretability team, and represents our initial attempt at acting on the topological data analysis suggestions.
In this post, we’ll look at some ways to use topological data analysis (TDA) for mechanistic interpretability. We’ll first show how one can apply TDA in a very simple way to the internals of convolutional neural networks to obtain information about the “responsibilities” of the various layers, as well as about the training process. For LLMs, though, simply approaching weights or activations “raw” yields limited insights, and one needs additional methods like sparse autoencoders (SAEs) to obtain useful information about the internals. We will discuss this methodology and give a few initial examples where TDA helps reveal structure in SAE feature geometry.
I. Topological Data Modeling
The term topology refers to the study of shape using methods that are insensitive to deformations such as stretching, compressing, or shearing. For example, topology does not “see” the difference between a circle and an ellipse, but it does recognize the difference between the digit 0 and the digit 8. No matter how I stretch or compress the digit 0, I can never achieve the two loops that are present in the digit 8. Shapes can often be represented by graphs or their higher dimensional analogues called simplicial complexes. For instance, one can think of a hexagon as modeling a circle, with the understanding that the modeling is accomplished with a small amount of error:
Of course data sets can have notions of shape, too. For example, here is a data set that we can recognize as having a circular shape, even though it consists only of sampled points and is not a complete circle.
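As a toy illustration (not taken from the post), one can sample noisy points from a circle and check that persistent homology recovers the loop; this sketch assumes numpy and the third-party ripser package are installed.

```python
# Toy illustration: sample noisy points around a circle and confirm that
# persistent homology sees one prominent loop (an H1 class).
import numpy as np
from ripser import ripser

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=200)
points = np.column_stack([np.cos(theta), np.sin(theta)])
points += 0.05 * rng.normal(size=points.shape)   # add a little noise

dgms = ripser(points, maxdim=1)["dgms"]
lifetimes = dgms[1][:, 1] - dgms[1][:, 0]        # persistence of each H1 class
print("most persistent loop lifetime:", lifetimes.max())
```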
A circular shape may be an indication of periodic behavior. In a mechanistic interpretability context, Engels et al. showed that some LLM SAE features are organized in a circular pattern, and that those features correspond to temporally periodic structures like days of the week or months of the year.
There are numerous other examples where periodic data is shown as a circle when graphed, notably in dynamical systems like predator/prey models.
II. Mapper
Mapper is the name for a family of methods that use topological ideas to build graphs representing data sets. The core concept behind Mapper is the nerve of a covering. A covering of a set $X$ is a family of subsets $U_0, U_1, \ldots, U_n \subset X$ such that $X \subseteq \bigcup_i U_i$. The nerve graph of the covering is the graph $N$ whose vertices $v_i$ correspond to the sets $U_i$, and where vertices $v_i$ and $v_j$ form an edge in $N$ if $U_i \cap U_j \neq \emptyset$, i.e. if $U_i$ and $U_j$ overlap. As an example, suppose the set $X$ is as shown below, with a covering by four sets colored red, yellow, blue, and green, with overlaps as indicated.
The nerve graph has four vertices, one for each of the covering sets. The vertices corresponding to the yellow and red sets are connected by an edge because they overlap. The vertices corresponding to the yellow and blue sets are not connected by an edge because they do not overlap. Building a graph representation of a dataset by constructing a good covering is a powerful technique, motivated by fundamental results like the nerve lemmas, which give guarantees about topological equivalence of a space with the nerve of a sufficiently nice cover of that space. There are numerous strategies for constructing such graph models motivated by this simple construction. Of course, the graphs constructed often have many more vertices than the model above. This kind of graph modeling is a part of an area of data science called Topological Data Analysis.
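To make the construction concrete, here is a minimal sketch (not the post's code) of building the nerve graph of a covering, where each cover set is represented as a set of data-point indices; it assumes the networkx package is available.

```python
# Minimal sketch of the nerve construction: one graph node per cover set,
# and an edge whenever two cover sets share at least one data point.
import networkx as nx

def nerve_graph(cover):
    """cover: list of sets of data-point indices, one set per cover element."""
    graph = nx.Graph()
    graph.add_nodes_from(range(len(cover)))
    for i in range(len(cover)):
        for j in range(i + 1, len(cover)):
            if cover[i] & cover[j]:      # non-empty intersection => edge
                graph.add_edge(i, j)
    return graph

# Example: four cover sets; sets 0-1 and 2-3 overlap, but 1 and 2 do not.
cover = [{0, 1, 2}, {2, 3, 4}, {5, 6}, {6, 7}]
print(sorted(nerve_graph(cover).edges()))  # [(0, 1), (2, 3)]
```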
Graphical modeling can be used to understand the internals of neural networks, as illustrated below, from this paper (a presentation can be found here).
These graphs were obtained from VGG16, a convolutional neural network pre-trained on ImageNet. For each layer, we constructed the data set of weight vectors for the layer's neurons, including only those vectors satisfying a certain local density threshold. One can see that in the first two layers the graph model is circular, and it shows that the weight vectors are concentrated around those which detect approximations to linear gradients. Later layers retain these but include additional ones. For example, layer four includes weight vectors which detect a horizontal line against a dark background. Layer five includes a white “bull's eye” and a crossing of two lines. Later layers include combinations of these. The coloring of the nodes encodes the number of data points in the set corresponding to the node, so red nodes contain more points than green or blue ones. These visualizations demonstrate the presence of geometric structure in VGG16's weight vectors, indicating that specific, interpretable features are learned at each layer.
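As a rough sketch of this kind of pipeline (not the paper's exact preprocessing, which also involves the density filtering mentioned above), one could extract the first-layer filters of a pretrained VGG16, normalize them, and feed them to the open-source KeplerMapper package; torchvision, scikit-learn, and kmapper are assumed to be installed, and the cover and clustering parameters are arbitrary choices.

```python
# Hedged sketch of a Mapper analysis of first-layer VGG16 filters.
import numpy as np
import kmapper as km
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from torchvision.models import vgg16, VGG16_Weights

model = vgg16(weights=VGG16_Weights.IMAGENET1K_V1)
w = model.features[0].weight.detach().numpy()     # (64, 3, 3, 3): 64 filters
vectors = w.reshape(w.shape[0], -1)               # one row per filter
vectors = vectors - vectors.mean(axis=1, keepdims=True)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

mapper = km.KeplerMapper(verbose=0)
lens = mapper.fit_transform(vectors, projection=PCA(n_components=2))
graph = mapper.map(
    lens, vectors,
    cover=km.Cover(n_cubes=10, perc_overlap=0.4),
    clusterer=DBSCAN(eps=0.5, min_samples=2),
)
mapper.visualize(graph, path_html="vgg16_layer1_mapper.html")
```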
A second example performs the same kind of analysis for a convolutional neural network with two hidden layers, but observes how the structure changes over the course of training. In this case, in the first layer, one can see roughly random behavior after 100 iterations, but after 200 iterations one sees concentration (as indicated by the redness) around points on the circular boundary, which correspond to the linear gradients as in VGG16. This pattern becomes even more pronounced after 400 iterations, but begins to degrade after 500 iterations. In the second layer, one sees a very weak circular boundary through the first 500 iterations, becoming more pronounced after that. One can hypothesize that the second layer is “compensating” for the degradation occurring in the first layer. The first layer has opted to retain the linear gradients in the vertical and horizontal directions, but has additionally included a black bull's eye on a lighter background. This is unexpected behavior, probably due to the small number of layers in this network. What we would have expected is behavior similar to that seen in VGG16 above, in which the earliest layers respond to the simplest local behavior, namely an edge, and later layers to more complex behaviors.
III. Mechanistic Interpretability
Apollo Research recently led the writing of an extensive report on open problems in mechanistic interpretability, a large portion of which focuses on open questions about SAEs. Some of the issues that stood out to us were:
IV. Graph Modeling of SAE features
Question 1 above concerns the need for a geometry on feature spaces. This is a key ingredient in the interpretability of features. We think TDA can help understand this feature geometry, and we'll show a few simple examples we've tried on the SAE features constructed by OpenAI for GPT-2-small. The graphs we build are constructed using BluelightAI's Cobalt software, which employs a variant of the Mapper technique outlined in Section II. We did need to implement a few workarounds to make this work, and we plan to share a cleaned-up Colab notebook detailing the process in the near future. The largest component of the graph constructed on these SAE features is displayed below. We used cosine similarity to compare features.
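Cobalt's own interface is not shown here; as a rough stand-in for the general idea, one can build a simple nearest-neighbor graph on the SAE decoder directions under cosine similarity and inspect its connected components. The array of decoder directions and the neighbor and distance thresholds below are illustrative assumptions.

```python
# Rough stand-in for the idea (not Cobalt): connect each SAE feature to its
# nearest neighbors under cosine distance of decoder directions, then look at
# the connected components of the resulting graph.
import numpy as np
import networkx as nx
from sklearn.neighbors import NearestNeighbors

# Assumed input: an (n_features, d_model) array of SAE decoder directions.
# A random stand-in is used here so the sketch runs on its own.
decoder_vectors = np.random.default_rng(0).normal(size=(1000, 768))

k = 5
nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(decoder_vectors)
distances, indices = nn.kneighbors(decoder_vectors)

graph = nx.Graph()
graph.add_nodes_from(range(decoder_vectors.shape[0]))
for i, (dists, nbrs) in enumerate(zip(distances, indices)):
    for d, j in zip(dists[1:], nbrs[1:]):   # skip the self-match in column 0
        if d < 0.5:                         # arbitrary cosine-distance cutoff
            graph.add_edge(i, int(j))

largest = max(nx.connected_components(graph), key=len)
print("largest component size:", len(largest))
```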
Each node of the graph corresponds to a collection of SAE features. Below we show selections A and B from the diagram above, and indicate which words or concepts trigger the features in each node or region. Each SAE feature activates with varying frequency on different sets of words, and each collection of features is labeled by the words that occur most frequently across the collection.
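A hedged sketch of how such node labels could be produced, assuming a precomputed mapping from each feature to its top activating tokens (the `top_tokens` dict below is a hypothetical placeholder):

```python
# Hypothetical illustration of node labeling: given the features grouped into a
# graph node and a mapping from feature index to its top activating tokens,
# label the node with the most common tokens across its features.
from collections import Counter

def label_node(feature_ids, top_tokens, n_labels=5):
    """feature_ids: feature indices in one graph node.
    top_tokens: dict mapping feature index -> list of top activating tokens."""
    counts = Counter()
    for f in feature_ids:
        counts.update(top_tokens.get(f, []))
    return [tok for tok, _ in counts.most_common(n_labels)]

# Example with made-up data:
top_tokens = {0: ["to", "with"], 1: ["to", "regarding"], 2: ["with", "to"]}
print(label_node([0, 1, 2], top_tokens))  # ['to', 'with', 'regarding']
```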
Selection A contains a three step progression, which looks like this:
(1) Regard to, attributed to, associated with, likening to
(2) Connect, link, belong, contribute, refer
(3) In conjunction with, listening to, according to, access to
All three have to do with relationships. (1) describes types of relationships, and those relationships are conceptual. (2) describes more explicit types of relationships, and (3) gives even more explicit and detailed forms of relating.
Selection B contains a “Y”-shape enclosed in the upper left, and we can interpret it like this:
V. Next steps for TDA and SAE features
We have a lot more ideas for how to use TDA to help better understand SAEs and neural network activation spaces more generally.
We looked at the geometric structure of SAE features themselves here, but we think these features may also be useful as a way to better understand the topological structure of activation space:
VI. Summary
We have demonstrated the use of topological data analysis in the study of SAEs for large language models, and obtained a conceptual understanding of groups of SAE features. This methodology is quite powerful, and holds promise for the mechanistic understanding of the internals of large language models.
VII. Acknowledgments
We thank Lee Sharkey for his helpful comments and suggestions.