This post is motivated by the observation in Open Problems in Mechanistic Interpretability by Sharkey, Chughtai, et al. that "SDL (sparse dictionary learning) leaves feature geometry unexplained", and that it is desirable to use geometric structures to gain interpretability for sparse autoencoder features. We strongly agree, and the goal of this post is to describe one method for imposing such structures on data sets in general.  Of course, it applies particularly to the case of sparse autoencoder features in LLMs.  The need for geometric structures on feature sets applies generally in the data science of wide data sets (those with many columns), such as those occurring as the activation data sets in complex neural networks.  We will give some examples in the life sciences, and conclude with one derived from LLMs.

Wide Data Sets

Many of the data sets that are now of great importance are equipped with a very large number of features.  For example, in genomics, the feature sets can be parametrized by sets of genes with thousands or tens of thousands of elements. Text data can be encoded by various types of embeddings, most of which have hundreds or thousands of features. When a data set is described via a data matrix, where the rows are the data points or samples and the columns are the features, a matrix with many features will be shorter than it is wide, and we call such data sets wide.  Wide data sets pose some special challenges for inference.  For example, if we are trying to predict an outcome variable from a very large number of independent variables, we should expect a small number of features to appear to correlate with the outcome simply by chance.  There are methods such as the Bonferroni correction or the Benjamini-Hochberg procedure for mitigating this problem, but we are often left with a lack of conviction about our answers.  Also, in many application areas (genomics is certainly one), a particular outcome really corresponds to a large group of features acting in concert rather than to a single feature.  It is then important to be able to identify such meaningful groups of features, and we can ask for methods that work easily and intuitively to construct and manipulate them.  Some methods which accomplish this come from topological data analysis, specifically variants of the Mapper methodology.  These methods have been used to provide graph models which summarize the behavior of the data points in a data set, and provide useful taxonomies for them.  However, if we think of a data set as represented by a data matrix, with the rows corresponding to the data points and the columns corresponding to the features, we can view the columns themselves as a data set, and apply the methodology to them.  One can think of this process as applying the Mapper methods to the transpose of the data matrix.
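As a minimal sketch of this transposed point of view (hypothetical names, not the authors' code): treating the columns as the data set amounts to defining a dissimilarity between columns, for example correlation distance, where columns that vary together across the samples are close.

```python
import math

def column_distance(A, i, j):
    """Correlation distance between columns i and j of a data matrix A,
    given as a list of rows.  Columns that rise and fall together get a
    distance near 0; anti-correlated columns get a distance near 2."""
    xi = [row[i] for row in A]
    xj = [row[j] for row in A]
    n = len(A)
    mi, mj = sum(xi) / n, sum(xj) / n
    cov = sum((a - mi) * (b - mj) for a, b in zip(xi, xj))
    si = math.sqrt(sum((a - mi) ** 2 for a in xi))
    sj = math.sqrt(sum((b - mj) ** 2 for b in xj))
    return 1.0 - cov / (si * sj)
```

A Mapper-style construction on the feature space would then be run on the columns using this (or any other) dissimilarity, exactly as one would run it on the rows.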

Topological Models

A topological model for a data set X is a collection of subsets U_1, ..., U_n of X so that

                                                      X = U_1 ∪ U_2 ∪ ⋯ ∪ U_n

The advantage of such models is that they admit graphical representations.  The graph associated to a topological model has as its vertex set the set {U_1, ..., U_n}, i.e. there is one vertex v_i for each of the sets U_i.  There are various criteria that one can use to decide whether or not a pair of vertices forms an edge in the graph.  One is the "nerve criterion", where the vertices v_i and v_j span an edge if and only if U_i ∩ U_j ≠ ∅.

                                                          Figure 1

In Figure 1 above, we see a "lollipop shape" covered by balls, and below it we see the corresponding nerve construction.  Variants could require that the cardinality of U_i ∩ U_j is bounded below by an integer k, or that the average density of points within U_i ∩ U_j is bounded below by a fixed positive real number.  Another criterion would be that two vertices v_i and v_j are connected if the average distance between pairs of points x and y, with x and y belonging to U_i and U_j respectively, is less than a given threshold.  There are many more possible such criteria.  Randomly chosen topological models are typically not very meaningful, but in the presence of a dissimilarity measure (likely a metric or distance function), one can construct models which represent the metric or relational properties of X effectively.  The main observation is that subsets of the vertex set correspond to subsets of X, since we can assign to a collection of vertices {v_i : i ∈ I} the subset given by the union of the corresponding sets U_i.
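The nerve criterion described above is simple enough to sketch directly (hypothetical names; each cover element is a set of data-point identifiers, and nothing here is specific to any one Mapper implementation):

```python
from itertools import combinations

def nerve_graph(cover):
    """Build the nerve of a cover: one vertex per subset, and an edge
    between two vertices whenever the corresponding subsets intersect.
    `cover` is a list of sets of data-point identifiers."""
    vertices = list(range(len(cover)))
    edges = [(i, j) for i, j in combinations(vertices, 2)
             if cover[i] & cover[j]]  # nonempty intersection spans an edge
    return vertices, edges
```

For example, the cover [{1, 2}, {2, 3}, {4}] yields three vertices and a single edge between the first two, since only those subsets overlap.  The variant criteria above (minimum overlap size, density, average distance) would simply replace the intersection test in the list comprehension.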

In order to obtain information about a data set from such models, we observe that a data point x (i.e. a row in the data matrix A) defines a function on the vertex set of a topological model as follows.  The point x corresponds to a function f_x on the set of columns of A, by setting f_x(c_j) = A(x, j), where c_j denotes the j-th column of A.  From f_x we obtain a function (which we also denote by f_x) on the vertices of the topological model of the feature space as follows.  For a vertex v corresponding to the subset U_v of the set of features (columns of A), we define f_x(v) to be the average value of the features in U_v on the data point x.  Explicitly, we have

                                          f_x(v) = (1 / |U_v|) ∑_{c_j ∈ U_v} A(x, j)

By choosing a color scale, the function f_x can be represented as a "graph heat map", where we color each node v using the color value associated with f_x(v).  Given a collection of data points S, we can also construct a function f_S by averaging the functions f_x over the points x in S.  This kind of analysis becomes very useful when we are trying to contrast the behaviors of features on distinct groups of data points.  The functions f_x and f_S often turn out to have reasonable continuity properties, making them interpretable.  We will now show this idea in action in a few cases.
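A minimal sketch of this averaging (hypothetical names: A is a list of rows, a node of the topological model is a set of column indices, and a group is a list of row indices):

```python
def node_value(A, x, node_features):
    """f_x(v): the average of data point x's values over the features
    belonging to a node v of the topological model."""
    return sum(A[x][c] for c in node_features) / len(node_features)

def group_node_value(A, rows, node_features):
    """f_S(v): the average of f_x(v) over a group S of data points,
    used to color one node of a graph heat map for that group."""
    return sum(node_value(A, x, node_features) for x in rows) / len(rows)
```

Coloring every node of the model by `group_node_value` for each cohort, on a shared color scale, produces the contrasting heat maps used in the examples below.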

Examples

We will show two examples from the general area of the life sciences, one in genomics and the other in the study of the gut microbiome.  We will then conclude with an example of the methods applied to the sparse autoencoder (SAE) features constructed by OpenAI for GPT2.  The first two examples demonstrate the value of geometric structure on feature sets in simpler contexts than LLMs, but we note that the Arc Institute has constructed SAE features for Evo2, their generative model for DNA sequences.  This means that topological methods and SAE methods have utility for generative models beyond LLMs.

Figure 2 below shows a topological model of the patients in a  study of breast cancer from the Netherlands Cancer Institute (NKI).  The paper describing the work is here. The features are mRNA expression levels of 1500 genes in each patient, so the data matrix has 1500 columns. There were 272 patients in the study.  

                                                            Figure 2

The model suggests three major groups, labeled Normal/Normal-Like, Basal-Like, and c-MYB+ tumors.  The Basal-Like cohort has poor prognosis, while the c-MYB+ group has perfect survival.  It is therefore of interest to compare the groups' behavior.  This is a wide data set, with many genes acting in concert, so it is useful to construct a topological graph model of the set of features, which gives clarity about the groups.  In Figure 3 below, there are three copies of a graph model of the set of genes used to construct the model of the set of patients, each colored by a heat map of the gene values for one of the groups.  The graph breaks up into two strongly connected pieces, with weak connections between the two.  We'll refer to them as the left and right subgraphs.

                                                           Figure 3

There is a clear differentiation between the three graph heat maps.  The distinction between the c-MYB+ and the Basal-Like tumors is very pronounced.  In the case of the c-MYB+ tumors, the bottom of the right subgraph is colored dark blue (so those genes are underexpressing) and the top is colored bright red (overexpressing).  On the other hand, for the Basal-Like group, the lower part of the right subgraph is colored strongly red and the upper part is primarily blue, but with a core group that is bright red.  So the distinction between the tops and bottoms of the graphs is characteristic of the two groups, except that there appears to be a group of genes within the upper part which in fact express strongly in all three groups.  The Basal-Like and Normal-Like groups produce relatively similar heat maps, but in the Normal-Like group the red colorings on the bottom and the blue colorings on the top appear smaller and less intense.  Also, the left subgraph appears slightly stronger red in the Normal-Like group than in the Basal-Like group, and significantly stronger than in the c-MYB+ group.  The paper referenced above includes some analysis of the genes driving the differences between the c-MYB+ and Basal-Like groups.

We offer an additional analysis in Figure 4 below, this one coming from a study of the gut microbiome by the Larry Smarr group at the University of California, San Diego. The features consist of abundances of bacterial subpopulations. This is also a wide data set, including thousands of distinct subpopulations.  There are four copies of a topological model of the subpopulations, colored by healthy patients (lower left), ulcerative colitis (lower right), and two forms of Crohn's disease (ileal and colitic) above. On the lower right we circle a group of subpopulations which is highly represented in ulcerative colitis and which is not in healthy patients.  Similarly, on the lower left, we circle a group which is heavily represented in the healthy patients and not well represented in either of the forms of Crohn's disease.  In both cases, we have identified the only regions (collections of subpopulations) in which there appears to be a significant distinction from the healthy patients.

                                                           Figure 4

The above two examples were constructed using the Mapper methodology. The examples below use a different methodology (instantiated in software such as Cobalt from BluelightAI).

Our final examples come from the analysis of sparse autoencoder features from OpenAI's GPT2-small large language model.  For generalities on sparse autoencoders, please look at this. These are features constructed from the internal states of GPT2-small, and are designed to be more interpretable.  It is also possible to find the tokens with the highest value for a particular feature, and that list of tokens is used to produce a kind of explanation for the feature, which can be retrieved.  We have built topological models on this set of features.  This allows us to produce groups of features which act similarly, and to produce explanations that cover all the features in a group.  In Figure 5 below, we display one such model for a subset of the features.  We have also indicated some coherent groups that appear in the model, providing a higher-level explanation for the features.

                                                         Figure 5

In this case, the explanations are syntactic, and not particularly semantic. In Figure 6 below, we have constructed a different topological model, in which one can see a long "flare". Each of the nodes along the flare contains several features, and we have identified three nodes along it: nodes A, B, and C.

                                                          Figure 6

 

Rough descriptions of nodes A, B, and C are as follows; here are some features contained in each of them.

Node A: 

  • Critique or commentary on social issues
  • Concepts related to dissent and leadership on social issues
  • Terms associated with urgency and social dynamics
  • Terms related to politics and governance
  • Groups of people or populations

Node B:

  • Terms indicating actions or intentions related to personal agency
  • Commentary expressing concern about societal conditions and issues
  • Phrases related to decision making
  • Verbs associated with actions or processes related to manipulation and change
  • Assertions and comparisons that imply truth

Node C:

  • Specific names or references to individuals involved in events or actions
  • Statements of opinion or judgment on social issues
  • References specifying the names and affiliations of people or organizations
  • Relevant statements about legality and social norms
  • Mentions of specific companies, organizations, or brands

We can make some high-level observations.  Node A contains high-level ideas, without anchoring to any particular action or situation. Node B contains terms with more action and definiteness. Node C contains mostly specific mentions of individuals or organizations.  As we move inward along the flare, the features become less conceptual and more action-oriented and specific.

We point out that in each node, we have selected subsets of features which have some conceptual coherence.  Each node also contains other characteristic groups.  For instance, Node C has a significant number of features attached to mathematical, numerical, and programming actions and ideas.  Node B has features related to sports and video games. Nevertheless, there appears to be a definite trend toward greater specificity and more action orientation as one moves inward along the flare.  Here you will find a colab notebook where you can look at models of the GPT2-small features.

Summary

We have supported the belief that geometric structure on sets of features is useful for data sets with large numbers of features, via the construction of topological or graph-theoretic models. The graph models increase the interpretability of results, in that they help identify meaningful groups of features which act in concert.  They also allow for exploration of data sets, in that one can easily represent the behavior of data points (samples) or groups of them via graph heat maps.
