Matthew A. Clarke, Hardik Bhatnagar and Joseph Bloom
This work was produced as part of the PIBBSS program summer 2024 cohort.
tl;dr
Sparse AutoEncoders (SAEs) are a promising method to extract monosemantic, interpretable features from large language models (LM)
SAE latents have recently been shown to be non-linear in some cases; here we show that they can also be non-independent, instead forming clusters of co-occurring latents
We ask:
How independent are SAE latents?
How does this depend on SAE width, L0 and architecture?
What does this mean for latent interpretability?
We find that:
Most latents are independent, but a small fraction form clusters where they co-occur more than expected by chance
The rate of co-occurrence and the size of these clusters decrease as SAE width increases, and these clusters form in both GPT2-Small ReLU SAEs and Gemma-2-2b JumpReLU SAEs (Gemma-Scope)
Clusters map interpretable subspaces, within which latents remain largely independently interpretable. However, we find cases of composition, where latents are best interpreted in the context of the cluster, as well as cases of co-occurrence driven by ambiguity, where co-occurrence may be useful as a measure of uncertainty
Key examples:
Composition: we observe a cluster that maps out discrete quantifiers ('one of', 'some of', 'all of') where mixtures of latents have predictable interpretations, and that these mixtures depend on relative activation strength. E.g. a mixture of latents for 'some of' and 'all of' will correlate with text ranging from 'many of' to 'almost all of' as the strength of the former decreases and the latter increases.
Ambiguity: we observe many clusters where SAE latents correspond to different meanings of a word, and multiple latents are active when the meaning is ambiguous, e.g. mapping the space between 'how' as in 'how are you?' and 'how' as in 'you don't realise how tough this is'
We believe our results show that SAE latents cannot be relied upon to be independent in all cases, and that their co-occurrence should be considered when interpreting them, especially in small SAEs
We made a website that lets you explore SAE latent clusters.
Summary
Sparse AutoEncoder (SAE) latents show promise as a method for extracting interpretable features from large language models (LM), but their overall utility for mechanistic understanding of LM remains unclear. Ideal features would be linear and independent, but we show that there exist SAE latents in GPT2-Small and Gemma-2-2b that display non-independent behaviour, especially in small SAEs. Rather than firing independently, latents co-occur in clusters that map out interpretable subspaces, leading us to ask how independent these latents are and how this depends on the SAE. We find that these subspaces show latents acting compositionally, as well as being used to resolve ambiguity in language, though SAE latents remain largely independently interpretable within these contexts. Latent clusters decrease in both size and prevalence as SAE width increases, suggesting this is a phenomenon of small SAEs with coarse-grained latents. Our findings suggest that, in some cases, a better understanding of how LM process information can be achieved by treating groups of SAE latents as functional units that are interpreted as a whole.
Introduction
In order to understand language model (LM) behaviour, we require a way to decompose their internals into interpretable, functional subunits, i.e. 'features'. These subunits must be mathematically defined in relation to the underlying LM, able to be given a clear semantic description, and useful for making testable predictions (Sharkey et al., 2024). There has been success in semantic labelling of neurons (Bills et al., 2023), as well as in using them to extract interpretable circuits (Wang et al., 2022, Nanda et al., 2023). However, we expect that neurons are in general hard to interpret, as an LM must be able to respond to, and respond with, more concepts than there are neurons. Thus neurons are often polysemantic, e.g. due to superposition (Elhage et al., 2022). We therefore require a better basic unit of model internals to work with, ideally one that is monosemantic, linear and independent.
Recently, Sparse AutoEncoder (SAE) latents have shown promise as meeting all the criteria proposed by Sharkey et al., 2024 for features for mechanistic interpretability: they are conceptually simple in relation to the underlying model; they are often readily interpretable (Bricken et al., 2023, Huben et al., 2024); and they may be causally relevant, such that we can steer with them (Templeton et al., 2024). There is therefore hope that these SAE latents can be used as a good proxy for the underlying LM features needed for scalable and effective mechanistic interpretability.
Ideally, it would be possible to extract LM features that are linear (Park et al., 2023) and independent. Recent work by Engels et al., (2024) shows that SAE latents are not all linear, but rather that some latents, such as those that fire on days of the week, map out irreducible, multi-dimensional subspaces.
In this work, we show:
that some SAE latents do not have independent activation distributions, and in some cases they show strong co-occurrence instead. This suggests that decomposing model internals may involve combinations of latents, which therefore may need to be interpreted collectively; i.e. an irreducible subspace. By clustering SAE latents via co-occurrence, we find many instances of interpretable subspaces. These remain largely interpretable by single SAE latents, but we demonstrate cases where features appear to be encoded as the composition of multiple latents.
Furthermore, co-occurrence occurs more strongly in small SAEs, and both the number and size of the clusters decrease as SAE size increases. This is reassuring evidence that SAE latents are independently interpretable in many cases, and that the fraction of such latents can be increased by increasing SAE size.
Figure 1: Local vs compositional features. Engels et al. (2024) show that the features corresponding to the days of the week form a circular subspace. However, it is unclear whether the SAE latents (left) are a good representation of the underlying LM features (right). SAEs are biased towards a local code, but a compositional encoding might be more flexible and so rewarded in the training of the underlying LM (Olah, 2023).
The clusters in our case studies often fall into two categories:
Composition: We observe cases that suggest compositional interaction between latents. The same information can be represented as a local or a compositional encoding (Olah, 2023). SAEs, due to their L1 sparsity constraint, are biased toward a local encoding, even when this is not a reflection of the underlying LM representation. Indeed, experiments in toy models that encode only compositional features show that SAEs are nevertheless unable to decompose these features, instead extracting composed latents (Anders et al., 2024) (Figure 1). We observe cases of potential compositional encoding, such as one latent detecting which day of the week a token is, with a second latent acting as a modifier that detects whether the token contains a space (GPT2-Small, Layer 0, Cluster 3240). Furthermore, we observe that the tokens that activate combinations of latents are predictable from the combination of the interpretations of the latents acting alone, e.g. a latent that fires on 'some of' and a latent that fires on 'all of' activate together on 'many of' (Gemma-2-2b, Layer 12, Cluster 4740), or the position of a token in a string apparently being measured by combinations of latents (GPT2-Small, Layer 8, Cluster 125).
Ambiguity: Natural language is inherently ambiguous and contextual, and this nature is essential to many forms of communication such as poetry and humour (Nerlich et al., 2001, Piantadosi et al., 2012). Further, much problematic behaviour stems from an LM being 'overconfident' in one of many possible understandings of an instruction or its own knowledge (Huang et al., 2023, Shah et al., 2022, Ouyang et al., 2022). To ensure safe and aligned behaviour in LM, it is important to be able to verify whether the model understands the multiple meanings of a query or a safety instruction, but this has proven challenging (Liu et al., 2024, Kim et al., 2024). Therefore, a method to detect which parts of a prompt are ambiguous, which potential meanings an LM is 'considering', and the degree to which it weighs these options is highly desirable. We observe that many clusters appear to use the relative activation strengths of latents encoding different meanings to disambiguate, e.g., uses of the word 'how' (GPT2-Small, layer 8, cluster 787), suggesting that co-occurrence can be driven by ambiguity and may be useful for quantifying uncertainty about the possible meanings of a word.
We investigate these behaviours in case studies using both GPT2-Small and the more recent and larger model Gemma-2-2b. We also present a Streamlit app allowing for exploration of a subset of the clusters we find at https://feature-cooccurrence.streamlit.app/, which we link to for all examples in this post, see also Appendix: Case Studies. Code to generate clusters and plot the figures from this paper is available at https://github.com/MClarke1991/sae_cooccurrence.
Background
Superposition: LM must be able to represent many features, often more than they have neurons. This leads neurons to be polysemantic. It is further hypothesised that this compression can extend to representing more features than there are dimensions by using sparse, but not orthogonal embeddings of features (Elhage et al., 2022), i.e. superposition.
Sparse AutoEncoders: An autoencoder learns efficient encodings of input data $x$ in an unsupervised manner by learning two functions: an encoding function $z = \sigma(W_E x + b_E)$ and a decoding function $\hat{x} = W_D z + b_D$. The nonlinearity $\sigma$ is typically a ReLU, although at the time of writing this differs between many competing architectures (Bussman et al., 2024, Rajamanoharan et al., 2024a, Rajamanoharan et al., 2024b, Braun et al., 2024). In a sparse autoencoder, the encoder is trained such that its output $z$ has as few active components as possible. This is typically achieved by a loss function that, in addition to penalising errors in the reconstruction of the input data, penalises the L1 norm of the SAE latent activations. Note that this penalises the summed magnitude of the hidden-layer activations, not directly the number of active latents (the L0). When the output dimension of $W_E$, i.e. the width of the SAE, is large, it is observed that the SAE latents behave like interpretable features of the underlying model (Bricken et al., 2023, Huben et al., 2024, Templeton et al., 2024).
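To make this concrete, here is a minimal sketch of a standard ReLU SAE of the kind described above. Initialisation, loss coefficients and centring conventions vary between implementations; this is illustrative rather than the exact training setup of the SAEs we use.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal ReLU SAE: reconstruct LM activations x under an L1 sparsity penalty."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        z = torch.relu(x @ self.W_enc + self.b_enc)   # sparse, non-negative latents
        x_hat = z @ self.W_dec + self.b_dec           # reconstruction of x
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty on the latent activations.
    # The penalty shrinks the summed magnitude of activations; it encourages,
    # but does not directly minimise, a low number of active latents (L0).
    mse = ((x - x_hat) ** 2).sum(dim=-1).mean()
    return mse + l1_coeff * z.abs().sum(dim=-1).mean()
```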
Sparse AutoEncoders for Feature Extraction: While there has been success in finding interpretable neurons in LM (Bills et al., 2023), many are polysemantic and hard to interpret, and so make poor features (Sharkey, 2024). SAEs have been shown to be able to extract interpretable latents (Huben et al., 2024) in both small and large LM (Bricken et al., 2023, Templeton et al., 2024, Lieberum et al., 2024), in part by taking features out of superposition (Sharkey et al., 2022). Steering on SAE latents (Templeton et al., 2024) had early success but has recently been shown to be prone to off-target effects, casting doubt on the reliability of SAE latents for interpretable and predictable steering (Anthropic, 2024). Similarly, it has been shown that SAE latents can 'absorb' some of the function of otherwise monosemantic latents (Chanin et al., 2024). The optimal SAE architecture is not yet clear, with recent work collating benchmarks of various such failure modes (Karvonen et al., 2024).
Linear Representations: It has been proposed that concepts might be represented linearly in large language models (Park et al., 2023) despite superposition (Elhage et al., 2022). The weak form of this, that many features are linear, is supported by work showing that categorical concepts form linear hierarchies in LM such as Gemma (Park et al., 2024).
Non-linear Representations: SAE latents do not appear to be linear in all cases, with some forming multi-dimensional, irreducible subspaces (Engels et al., 2024). This suggests the driver behind neuron polysemanticity may be more complex than superposition (Mendel, 2024), which in turn casts doubt on the idea that an LM can in principle be decomposed into purely monosemantic features (Smith, 2024). Recent work suggests that SAE latents can themselves be further decomposed into meta-latents (Bussmann et al., 2024), highlighting their own polysemantic nature.
Compositional Features: Concepts in the input and output data of LM can be encoded as local, compositional or similar codes (Olah, 2023). SAEs are not designed to decompose features (Till et al., 2024), and are poor at recovering such features (Anders et al., 2024), but there has been recent work on interpreting SAE latents in light of this (Ayonrinde et al., 2024, Anthropic, 2024).
Ambiguity: Ambiguity is a key part of natural language, but poses a difficulty for placing robust constraints on LM behaviour through approaches such as system prompts, and may contribute to problems such as inferring user goals (Christiano, 2018), mis-generalisation from training (Shah et al., 2022) and the inability of models to consistently understand when they should give fictional or factual answers (Huang et al., 2023). Efforts to address this include the construction of datasets of ambiguous and clarified sentence pairs (Liu et al., 2024), measurement of LM uncertainty through exploitation of semantic ambiguity (Kuhn et al., 2023) and training of LM to ask for clarification when prompted with an ambiguous sentence (Kim et al., 2024). Studies have also been done on recovering the belief states of LM from the residual stream (Shai et al., 2024).
SAE Latent Co-occurrence: Parsons et al., 2024 explored whether co-occurrence defined by latent attribution score correlation revealed latents that acted differently when steered on together, compared to random pairs of latents.
Methods
Figure 2: Overview of extraction of co-occurrence clusters
Our method consists of 6 stages (see Figure 2).
Measuring SAE latent co-occurrence
We use the pre-trained GPT2-Small residual stream SAEs, including SAEs of different sizes trained on layer 8 of GPT2-Small (widths 768, 1536, 3072, 6144, 12288, 24576, 49152), as well as the Gemma-Scope SAEs for Gemma-2-2b, accessed via SAELens. Using the record of SAE activations from training (the ActivationsStore), we extract batches of activations, each of 4096 tokens, and record which SAE latents were active on them. To reduce noise, we further consider a latent to be active only if its activation is above a threshold of 1.5 for GPT2-Small, unless otherwise stated.
Figure 3: SAE latent co-occurrence per token in GPT2-Small: (Left) Co-occurrence of latents on the same token across a sample of $2.048 \times 10^6$ tokens using the Neuronpedia 768-width SAE for GPT2-Small after Jaccard normalisation. (Right) Mean fraction (blue) and number (red) of SAE latents firing per token in the same sample size for all SAE widths in the Neuronpedia GPT2-Small feature splitting dataset.
We count the number of co-occurrences per token for different layers (0 and 8 in GPT2-Small; 0, 12, 18 and 21 in Gemma-2-2b). We observe that the mean number of latents active per token rises with SAE size but seems to stabilise around 30 latents per token in GPT2-Small (Figure 3, Appendix Figure 1). This means that the fraction of an SAE's latents active per token decreases as SAE width increases, in both GPT2-Small and Gemma-2-2b (Figure 3, Appendix Figure 1).
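A sketch of the counting step, assuming activations have already been gathered into an array of shape [n_tokens, d_sae] (the batching via the ActivationsStore is omitted):

```python
import numpy as np

def cooccurrence_counts(acts: np.ndarray, threshold: float = 1.5) -> np.ndarray:
    """Count, for each latent pair, the tokens on which both fire above threshold.

    acts: [n_tokens, d_sae] SAE latent activations.
    Returns a symmetric [d_sae, d_sae] count matrix; the diagonal holds each
    latent's total occurrence count."""
    active = (acts > threshold).astype(np.float32)   # binarise activations
    return active.T @ active                         # counts[i, j] = co-firing tokens
```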
Generating graphs of strongly co-occurring communities of SAE latents
Figure 4: Normalisation by Jaccard Similarity: (Left) Histogram of frequency of latent occurrence in the dataset. (Right) Schematic of Jaccard Similarity: the size of the intersection divided by the size of the union of the sample sets.
We further observe that a small subset of latents are active on very many tokens (Figure 4). To allow better comparison of the likelihood of co-occurrence between latents, and to down-weight co-occurrences that are due to one latent's activations being a superset over all tokens, we chose to normalise by Jaccard similarity (Jaccard, 1912) (Figure 4), i.e. we score the strength of co-occurrence between two latents as: $J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$
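Continuing the sketch above, normalisation takes the raw count matrix and rescales each pair by the size of the union of their firing sets:

```python
import numpy as np

def jaccard_normalise(counts: np.ndarray) -> np.ndarray:
    """J(A, B) = |A n B| / (|A| + |B| - |A n B|), from a co-occurrence count
    matrix whose diagonal holds the per-latent occurrence counts |A|."""
    occ = np.diag(counts)
    union = occ[:, None] + occ[None, :] - counts     # |A| + |B| - |A n B|
    jaccard = np.divide(counts, union,
                        out=np.zeros_like(counts, dtype=np.float64),
                        where=union > 0)
    np.fill_diagonal(jaccard, 0.0)                   # ignore self-similarity
    return jaccard
```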
Figure 5: Process of edge removal to form unconnected subgraphs: (Top) Comparison of the degree (number of edges) of nodes after Jaccard similarity normalisation (blue) and after further removal of low-weight edges. (Bottom) Graph of co-occurrences before (left) and after (right) removal of low-weight edges (right side shows only subgraphs of more than 1 node).
We represent the co-occurrences as a graph (Figure 5). After normalisation, we find a very large number of low-weight edges, associated with high node degree in the graph. In the expectation that communities within this graph will represent the most useful clusters of SAE latents, we remove low-weight edges through a binary search for a threshold that leads to the largest subgraph size being ≤ 200. This decomposes the graph into many subgraphs of size 1, which we refer to as isolated latents, as well as communities of more than one latent, which we investigate further.
Note, in the case of Gemma-2-2b, we found strong co-occurrence driven by activation on the BOS token, which had been removed prior to SAE training in the case of GPT2-Small, and this led to our binary search finding a very high threshold for edge weights to generate clusters of the required size. We therefore ignore cases where SAE latents are active on special tokens (PAD, BOS, EOS) when calculating the rate of feature occurrence and co-occurrence in the case of Gemma-2-2b.
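A sketch of this pruning step, continuing from the Jaccard matrix above (using networkx; the 200-node cap and the number of search iterations are the only tuned values):

```python
import networkx as nx
import numpy as np

def graph_at(jaccard: np.ndarray, thresh: float) -> nx.Graph:
    """Build the co-occurrence graph keeping only edges with weight >= thresh."""
    g = nx.Graph()
    g.add_nodes_from(range(jaccard.shape[0]))
    g.add_edges_from(map(tuple, np.argwhere(np.triu(jaccard, k=1) >= thresh)))
    return g

def find_clusters(jaccard: np.ndarray, max_size: int = 200, iters: int = 30):
    """Binary-search an edge-weight threshold so that the largest connected
    subgraph has at most max_size nodes, then return the multi-node clusters."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        largest = max(len(c) for c in nx.connected_components(graph_at(jaccard, mid)))
        if largest > max_size:
            lo = mid      # threshold too permissive: prune more edges
        else:
            hi = mid      # acceptable: try keeping more edges
    g = graph_at(jaccard, hi)
    # Size-1 components are 'isolated latents'; larger components are clusters.
    return [c for c in nx.connected_components(g) if len(c) > 1]
```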
Mapping subspaces using SAE latent clusters
Figure 6: Principal Component Analysis of a resulting subspace: We find prompts that contain activations of the SAE latents in a cluster, and then use a PCA to represent the vectors made up of these activations from the SAE latents within the cluster. This example is explored further in Figure 16.
We search through the training data for examples of prompts that activate latents within a cluster. We then use principal component analysis (PCA) on the vectors of activations restricted to the SAE latents in that cluster, to explore whether these latents are more explicable as a group (Figure 6).
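A sketch of this projection step, assuming the same [n_tokens, d_sae] activation array as above (sklearn's PCA stands in for whichever decomposition one prefers):

```python
import numpy as np
from sklearn.decomposition import PCA

def cluster_pca(acts: np.ndarray, cluster, n_components: int = 3):
    """PCA of token activations restricted to one cluster's latents.

    acts: [n_tokens, d_sae]; cluster: iterable of latent indices.
    Keeps only tokens on which at least one cluster latent is active, then
    projects the [n_kept, n_cluster_latents] activation vectors."""
    sub = acts[:, sorted(cluster)]
    mask = (sub > 0).any(axis=1)                     # tokens activating the cluster
    coords = PCA(n_components=n_components).fit_transform(sub[mask])
    return coords, mask
```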
SAE latents co-occur both more and less than chance
If we want to reason about latents as monosemantic units, then we want to know that our interpretation of a latent can be reasoned about independently of other latents. To assess this, we measure the rates at which latents co-occur with one another on the same token, and compare these to the expected rate of co-occurrence if the latents were independent ($E[\text{co-occurrence}(i,j)] = p(i)\,p(j)$ per token, for latents $i, j$ with per-token firing rates $p$).
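Concretely, over $n$ tokens the expected count for a pair is $n \, p(i) \, p(j)$; a sketch of the comparison:

```python
import numpy as np

def observed_vs_expected(acts: np.ndarray, threshold: float = 1.5):
    """Compare observed pairwise co-occurrence counts with the counts expected
    if latents fired independently at their empirical rates."""
    active = acts > threshold
    n_tokens = active.shape[0]
    p = active.mean(axis=0)                          # p(i): per-token firing rate
    expected = n_tokens * np.outer(p, p)             # E[count(i, j)] = n * p(i) * p(j)
    a = active.astype(np.float32)
    observed = a.T @ a                               # actual co-occurrence counts
    return observed, expected
```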
Figure 7: Co-occurrence as SAE size increases in GPT2-Small. (Left) Boxplot of SAE latent co-occurrence per token for different SAE widths (top, with outliers, bottom, without outliers). Blue: observed co-occurrence, red: expected co-occurrence.
We measure the rate of co-occurrence for a range of SAE widths probing the residual stream of layer 8 of GPT2-Small and layer 12 of Gemma-2-2b. We find that the rate of co-occurrence per token decreases monotonically as SAE size increases (Figure 7, Appendix Figures 2, 3, 4). We further observe that the distribution of latent co-occurrence is broader than the expectation, with many latent pairs co-occurring much more rarely than expected, but some co-occurring more frequently. These latter decrease in prevalence as SAE size increases (Figure 8, Appendix Figures 2, 3, 4), suggesting that the increased granularity of latents allows them to fire more independently, and that this phenomenon will reduce in importance as SAEs grow.
Figure 8: Density of co-occurrence in GPT2-Small Density of rates of co-occurrence (blue observed, red expected) for a sample of SAE sizes. Note that very sparse and potentially dead latents occur very rarely in large SAEs leading to 'spikes' with a left skew. Observed co-occurrence is always an integer, and so cannot generate a result lower than zero on the x-axis, hence we plot from Log10 Co-occurrence of zero.
SAE co-occurrence clusters are smaller and fewer in large SAEs
We measured cluster size as SAE width increased, and observed that subgraph size decreases as SAE size increases, suggesting fewer significant co-occurrences as the number of latents grows (Figure 9, Appendix Figure 5). This is clearest when considering subgraphs of size > 1, i.e. those with at least one case of co-occurrence. We similarly see that mean subgraph size decreases as L0 decreases in Gemma-2-2b (Appendix Figure 6). However, we find that L0 sparsity plateaus as SAE size increases in the case of GPT2-Small, both for SAE latents considered individually and for clusters considered as units (GPT2-Small: Figure 9, Gemma-2-2b: Appendix Figure 7). The sparsity of subgraphs closely matches that of individual latents in Gemma-2-2b when directly comparing the same layer at different L0 (Appendix Figure 8). We also see a decrease in the fraction of latents belonging to a cluster as width increases, though this plateaus (Appendix Figures 9, 10).
Figure 9: Change in cluster (subgraph) size with SAE width in GPT2-Small. (Left) mean size of clusters as SAE width increases vs mean size of clusters of size greater than 1 (i.e. excluding isolated latents) (dashed). (Right) L0 sparsity for individual SAE latents vs clusters (considering a cluster as active if any of the latents it is composed of are active) (dashed).
Co-occurrence relations detect groups of latents that map interpretable subspaces and can be interpreted as functional units
Through qualitative analysis, we find that co-occurrence clusters form groups that seem a priori coherent when looking at, for example, the tokens promoted, such as a cluster of months of the year (Figure 10). To explore this further, we searched for examples of text that activated latents within a cluster using the activations store, and plotted a PCA of latent activations for these examples.
Figure 10: Example clusters (GPT2-Small): Clusters of latents shown with Jaccard similarity (edge weight) as edge thickness, and the overall occurrence of a latent as the colour (light to dark). Each latent (node) is shown with its ID number (see Neuronpedia) and token factor (projection of the SAE decoder matrix onto the LM token embedding). Examples from the case studies of days-of-the-week (see Appendix Figure 11) and url-subdirectory position (see Figure 15).
Co-occurrence can be used to map cases of compositionality between latents
SAEs trained on Gemma-2-2b (Gemma Scope) encode qualitative statements about the number of items compositionally
Figure 11: Composition of latents map to interpretable regions of subspace mapped by cluster: (Left) PCA of subspace mapped by SAE latents, colour is most active latent, annotations show clusters defined by different compositions of latents (hub (red, latent 12257) is active in all cases, excluded from annotation for clarity) (Gemma-2-2b, layer 12, cluster 4740). (Right) Graph of co-occurrence between latents in the cluster, coloured to match left-hand plot. Note the presence of a single 'hub' latent and several 'spoke' latents.
We generate co-occurrence clusters for all SAE latents for the set of 16K-width SAEs trained on Gemma-2-2b, for layers 0, 12, 18 and 21. Of these, we searched through the PCAs of clusters of size 5-7 for interpretable cases.
We find that cluster 4740 of layer 12 fires on the token ' of' and appears to separate cases of different qualitative descriptions of amounts, e.g. 'one| of|' vs 'some| of|' vs 'all| of|' (e.g. 'also known as Saint Maximus of Aquila) is one of| the patron saints of L'Aquila, Italy.') (Figure 11).
The different latents attend to the following cases (bold added for emphasis):
latent 12257: strictly 'one of' e.g. 'This is, by far, one| of| our most popular packages'
latent 5004: 'one of n' e.g. 'the County Executive. The County Executive is just one| of| five members of the Board of School Estimate, and'
latent 15441: 'some of' e.g. 'But the Catalan coach hinted that some| of| the youngsters who have impressed in the United States could'
latent 12649: 'all of' e.g. 'All| of| them signed their extensions at different times, '
Examining the PCA, we observe that the 'hub' of the graph of co-occurrence relations (latent 12257) fires in all cases (Appendix Figure 12), but the 'spokes' act as modifiers that, in combination, activate in different cases. Further, we find that combinations of spoke latents, with this hub still active, correspond to distinct groups, which are predictable from a mixing of these basic categories (bold added for emphasis):
'one of n' (latent 5004) + 'all of' (latent 12649) = implied all of (such as 'both of' or 'each of'), e.g. 'constant during the course of the experiment. For each| of| the four datasets, the following experimental data are measured'
'some of' (latent 15441) + 'all of' (latent 12649) = 'most of' e.g. 'than Monday, he said. The ISO oversees most| of| the state's \npower grid.'
Figure 12: Latent compositional groups correlate with semantic groupings of contexts: (Left) PCA coloured by which combination of SAE latents is active for each case (note that a latent is considered active only if its activation strength is above 1.5). (Right) PCA coloured by context group (see Appendix Code 1 for grouping of semantic categories. The distinction between 'one of' and 'one of n' is the least clear, but this may in part be due to manual labelling of these examples missing potential implications of the total number far from the token that fires in the context.)
These categories form a continuum in the PCA, with the mixture of latents correlating strongly with the semantic mixtures (Figure 12). Thus qualitative descriptions of the number of items in a group appear to be encoded compositionally by SAE latents (Figure 13).
Figure 13: Relative Latent activation defines semantic groups: Mean activation for latents in each context group (See Appendix Code 1 for grouping of semantic categories).
There appears to be a relatively continuous shift in activation from one spoke latent to another as we move between semantic meanings (Figure 13). Motivated by this, we investigated whether the different semantic categories form a smooth continuum between 'some of' and 'all of'. We found that rather than the compositional encoding of these categories being merely Boolean, the relative strengths of the different latents encode different qualitative quantities, again as one would naively predict from the base categories. That is, if the activation of 'some of' is higher than that of 'all of' then one has a smaller quantity (e.g. 'many of'), but if the activation of 'some of' is lower than that of 'all of' then one has a larger quantity (e.g. 'almost all of') (Figure 14).
Figure 14: Relative Latent activation defines semantic groups continuously and predictably: Mean activation for latents in each context group (See Appendix Code 1 for grouping of semantic categories), focussing only on those groups that are formed by composition of 'some of' (latent 15441) and 'all of' (latent 12649) 'spoke' latents along with 'hub' latent (latent 12257) that is present in all cases.
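A sketch of how the comparison behind Figures 13 and 14 can be tabulated, assuming a table of examples with per-latent activations and a manually assigned semantic label (the file and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical table: one row per ' of' example, with activations of the
# "some of" latent (15441) and "all of" latent (12649), plus a manual
# context_group label such as 'many of', 'most of', 'almost all of'.
df = pd.read_csv("quantifier_examples.csv")

# Mean activation of each spoke latent per semantic group.
print(df.groupby("context_group")[["act_15441", "act_12649"]].mean())

# A simple continuum score: positive when the "all of" latent dominates.
# If the encoding is as described above, this should increase monotonically
# from 'many of' through 'most of' to 'almost all of'.
df["balance"] = df["act_12649"] - df["act_15441"]
print(df.groupby("context_group")["balance"].mean().sort_values())
```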
We see a similar cluster in layer 18 (cluster 59) with similar categories, suggesting that this persists between layers. We also see a similar cluster for layer 12 at a lower L0. The canonical 16K-width SAE for Gemma-2-2b layer 12 (as accessed through SAELens) has a mean L0 of 80.47; for layer 12, there are also SAEs of this width with mean L0 ranging from 22 to 445. For an L0 of 22 we find a similar cluster with 5 nodes, which appears to form much more distinct categories (cluster 111), suggesting less compositional encoding.
Compositional encoding of distance by SAE features in short URLs
Figure 15: Cluster of SAE latents measuring position in url subdirectories in GPT2-Small. (Left) Cluster of extracted latents corresponding to tokens within url subdirectories e.g.'.twitter.com/|e|2zNEIdX' shown with Jaccard similarity (edge weight) as edge thickness, and the overall occurrence of a latent as the colour (light to dark). Each latent (node) is shown with the ID number (see Neuronpedia) and token factor (projection of the SAE decoder matrix onto the LM token embedding). (Right) PCA of prompts containing these SAE latents, highlighting the number of characters within the token that the latents are firing on.
In layer 8 of GPT2-Small, using the 24576-width SAE from the feature splitting dataset on Neuronpedia, we find a co-occurrence cluster of 5 SAE latents that fires predominantly in the subdirectory of urls for social media, e.g. '.twitter.com/|e|2zNEIdX' (cluster 125). The latents in this cluster fire predominantly on tokens of single or double character length.
Figure 16: Measurement of position with SAE latents (Top) PCA of only those cases in which the token that activates the cluster is a single character, highlighting how far into the subdirectory the token fires (e.g. where the token that causes the SAE latent to activate is surrounded by '|', then in the string .twitter.com/|e|2zNEIdX it is the 0th character, and in .twitter.com/e2|z|NEIdX the 2nd, etc.). (Centre) Mean activation of each SAE latent for each position in the url subdirectory. (Bottom) Activation of SAE latents (green) on different tokens within an example prompt (see Neuronpedia). Here we show lengths between 0 and 10 as there are very few examples longer than this; for plots without this filter see Appendix Figure 13.
Focussing on the cases where the latent fires on a single character, we see that PCA separates the cases by the distance between the token and the beginning of the url subdirectory (e.g. where the token that causes the SAE latent to activate is surrounded by '|', in .twitter.com/|e|2zNEIdX it is the 0th character of the subdirectory, and in .twitter.com/e2|z|NEIdX the 2nd, etc.). Measuring the activity of the cluster latents, we see that they activate for different sections of the url subdirectory (see Figure 16 and Neuronpedia list). This suggests that the composition of these latents may be used to recover the position of tokens in the url subdirectory, but further work is needed to confirm the accuracy with which position is measured.
The exception to this is latent 19054, which activates predominantly on the lower-case characters of the string (Appendix Figure 14).
Co-occurrence relations detect sharing of 'day-of-the-week' features between SAE latents
Figure 17: Day-of-the-week cluster (GPT2-Small). (Left) Cluster of extracted latents corresponding to days of the week shown with Jaccard similarity (edge weight) as edge thickness, and the overall occurrence of a latent as the colour (light to dark). Each latent (node) is shown with the ID number (see Neuronpedia) and token factor (projection of the SAE decoder matrix onto the LM token embedding). (Right) PCA of prompts containing these SAE latents, highlighting those firing on tokens pertaining to a day of the week either alone (e.g. 'Monday') or with a space (e.g. ' Monday'). Note that in these and the following examples, while often the 'spokes' of the graph are associated with clusters in the PCA in a similar pattern, the order of features in the co-occurrence graph is not related to the order observed in the PCA.
In layer 0 we found an 8-latent cluster with a hub-and-spoke graph structure, where each latent promoted, and activated on, a day of the week (e.g. Monday) (see also these latents on Neuronpedia) (cluster 3240). Mapping out this subspace using PCA on latent activations, we see a similar pattern to that observed by Engels et al. (2024), where the latents form a ring of the days of the week, in order, in the PC2 vs PC3 plane. This suggests that the strength of the latent encodes the certainty that a token is e.g. a Monday, but that the direction of the latent encodes the relation of Monday to the other latents, namely that they have a correct ordering and are equally spaced.
In contrast to Engels et al. (2024), we further observe that the activations form concentric rings, with the outer and inner rings corresponding to tokens that contain a day of the week with a space (e.g. ' Monday'), while the middle ring corresponds to a day-of-the-week token without a space (e.g. 'Monday') (Figure 17). Additionally, the very inner ring may also contain tokens that are a shortened version of the day, e.g. ' Mon', still with a space (compare Figure 17 and Appendix Figure 15).
Figure 18: SAE latent activation strength in 'day-of-the-week' cluster. Each subplot shows the activity of one of the SAE latents for each point in the PCA (blue is low, yellow is high).
However, whether or not a token contains a space (e.g. 'Monday' vs ' Monday') is not encoded by the strength of the e.g. Monday latent activation (latent ID 3266), but rather the second ring is defined by the activation of the hub latent of the cluster (latent ID 8838) (Figure 18).
Figure 19: Latent activation for 'spoke' of the day-of-the-week cluster for the context '30 p.m.| Friday| and 6 a.m'. (Top left) Cluster of extracted latents corresponding to days of the week shown with Jaccard similarity (edge weight) as edge thickness, and the occurrence of a latent for this example prompt shown as the colour (light to dark). Each latent (node) is shown with the ID number (see Neuronpedia) and token factor (projection of the SAE decoder matrix onto the LM token embedding). (Top right) Activation of the latent corresponding to 'Friday' for all prompts (latent ID 3266). (Bottom left) Activation of latents for the example prompt, within the cluster highlighted in blue. (Bottom right) Position of the example prompt in the PCA (star).
Each 'spoke' latent defines one of the seven days of the week, and only fires on these tokens (Figures 18, 19), with strength corresponding to radius in the PCA, and potentially certainty, given that these latents fire more strongly on e.g. ' Friday' than ' Fri' (Figure 17, Appendix Figure 15). These latents are therefore independently interpretable, as is desirable for useful latent extraction.
Figure 20: Latent activation for 'hub' of the day-of-the-week cluster for the context 'pped Crusaders |Friday|, Nov. 16 at'. (Top left) Cluster of extracted latents corresponding to days of the week shown with Jaccard similarity (edge weight) as edge thickness, and the occurrence of a latent for this example prompt shown as the colour (light to dark). Each latent (node) is shown with the ID number (see Neuronpedia) and token factor (projection of the SAE decoder matrix onto the LM token embedding). (Top right) Activation of the latent corresponding to the hub for all prompts (latent ID 8838). (Bottom left) Activation of latents for the example prompt, within the cluster highlighted in blue. (Bottom right) Position of the example prompt in the PCA (star).
However, the tokens that do not contain a space are detected by the activation of a composition of both the 'spoke' latent for the day and the 'hub' latent (latent ID 8838), with both latents being roughly equally active (Figure 20).
Figure 21: Comparison of latents for 'hub' and 'spoke'. (Left) Neuronpedia profile for 'spoke' latent ID 14244, which appears to denote 'Tuesday' tokens. (Right) Neuronpedia profile for 'hub' latent ID 8838, which also activates primarily on 'Tuesday' tokens but which we observe as denoting the lack of a space in the token when seen in composition with the 'spoke' latents.
Analysed individually, the hub latent seems very similar to the 'spoke' Tuesday latent, and its role in a potentially compositional encoding would not be readily apparent without analysing it as part of the functional unit detected by this cluster. This suggests that in some cases latents are best understood as part of a larger functional unit, defined at least in part by their co-occurrence relations (Figure 21). We observe very similar behaviour with months of the year in the same model and layer (see Appendix Figure 11). We also observe similar hub-and-spoke clusters in layer 0 of GPT2-Small where the hub denotes a different modification to the tokens that the spokes fire on, such as: American vs British spelling (hub activity appears to correspond to 'ise' vs 'ize'); words ending in 'ist' vs 'ism' (hub activity appears to correspond to the suffix); and singular vs plural words for citizens of a country (hub activity appears to correspond to e.g. 'German' vs 'Germans').
Other examples of apparent compositionality in GPT2-Small and Gemma-2-2b (found through a non-exhaustive qualitative search) are listed in Appendix: Case Studies and the SAE Latent Co-occurrence Explorer.
Encoding of continuous properties in feature strength without compositionality
In Gemma-2-2b, we also see cases where latent strength correlates with a quantity (Figure 22, cluster 1370), or where a local code switches to an encoding by latent strength (Appendix Figures 15, 16; Gemma-2-2b, layer 21, cluster 511), despite the cluster forming on something for which a compositional code would seem intuitive.
Figure 22: Cluster separating number words: (Left) PC1 vs PC3 of PCA of subspace mapped by SAE latents (Gemma-2-2b, layer 0, cluster 1370), colour is the number word in the activating token. (Right) Graph of co-occurrence relations between latents. Colour is the overall occurrence of the feature in the dataset, edge weight is Jaccard normalised co-occurrence rate.
For example, in cluster 1370 on layer 0 of Gemma-2-2b (16K width SAE), we see separation in PCA directions 1 and 3 of activations by number. Oddly, this cluster activates for numbers from 'two' to 'ten', but not 'one', and only for words ('two') not for digits ('2') (Figure 22).
Figure 23: Separation by number driven by two latents: Activation strength of latents 8129 (left) and 6449 (right) for all points in the PCA.
In this cluster, activation strength of latents 8129 and 6449 corresponds to the size of the number, from two upwards (see Figure 23 and Appendix Figure 18). We find no other cluster for these words in this layer that separates them more cleanly, leaving the question of why distance in a url substring is handled by a potentially compositional encoding (see above), while this is apparently based on latent activation strength alone. This is not an isolated phenomenon, as we see a similar case for ordinal words (e.g. 'first', 'second', 'third'), but in that case we observe a switch from a local encoding of a latent per word to encoding apparently based on relative latent strength of latents that primarily fire on 'second' and 'third' (see Appendix Figure 16).
To investigate further, we train a linear probe to detect number words between one and ten, and not digits. We find that there is a direction in the activation space of the neurons that corresponds to number words from one to ten (see Appendix Figure 19). Searching for SAE latents that have high cosine similarity with this direction, we find that the latents within the cluster are represented in the top 10 latents most similar to the probe direction. However, there are also latents with higher cosine similarity that are not present in the cluster, e.g. latent 9869, which is maximally activated by the word 'one' (Figure 24).
This does not appear to be because we have incorrectly excluded these latents from the cluster, as they do not necessarily have high co-occurrence with other latents with directions similar to our probe. If we compare the rate of co-occurrence for pairs of latents in this group to the mean cosine similarity of these pairs, we see correlation for those latents in the cluster, but many latents with high similarity to the probe do not co-occur strongly with any of the other latents (Figure 24, Appendix Figure 19). That is, these latents are missing from our cluster because they do not co-occur with other latents representing number words, rather than because our clustering method failed to group them correctly (Figure 24, Appendix Figure 19). This suggests that this example of less interpretable clustering may be related to feature absorption (Chanin et al., 2024), where the other latents that are maximally active for number words (i.e. 'two', 'three', etc.) are somehow 'sharing' the properties of these numbers while the latent maximally activated by 'one' is not. This has been shown to be more common in JumpReLU SAEs (Karvonen et al., 2024) such as those used here for Gemma-2-2b from Gemma-Scope (Lieberum et al., 2024, Rajamanoharan et al., 2024a).
Figure 24: Latents that are related to number words cluster only in a subset of cases: (Left) Top 10 SAE latents in layer 0 by cosine similarity with the direction of a linear probe trained to detect neuron activations in this layer that are associated with number words from 'one' to 'ten' and exclude digits, red bars represent those latents in the co-occurrence cluster in Figure 22. (Right) Correlation between the co-occurrence (y-axis) and the mean cosine similarity with the linear probe direction for pairs of these features. Red points represent pairs that have edges (i.e. strong co-occurrence) in the cluster. We label pairs between the top 5 latents from the left hand plot. Note that the most similar latent to the linear probe (latent 9869) has a very low rate of co-occurrence with other latents associated with number words, despite high cosine similarity to the probe (bottom right).
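A sketch of the probe analysis described above, assuming residual-stream activations and binary number-word labels have been collected (LogisticRegression stands in for whichever linear probe one trains):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rank_latents_by_probe(resid: np.ndarray, labels: np.ndarray,
                          W_dec: np.ndarray, k: int = 10):
    """Train a linear probe for number-word tokens, then rank SAE latents by
    cosine similarity between their decoder directions and the probe direction.

    resid: [n_tokens, d_model] residual-stream activations.
    labels: 1 for number-word tokens ('one'..'ten'), 0 otherwise (incl. digits).
    W_dec: [d_sae, d_model] SAE decoder matrix."""
    probe = LogisticRegression(max_iter=1000).fit(resid, labels)
    direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
    dec_unit = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    cos = dec_unit @ direction
    return np.argsort(-cos)[:k], cos                 # top-k probe-aligned latents
```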
Ambiguity of beliefs may cause some cases of co-occurrence
Distinguishing between uses of the word 'how' in GPT2-Small
Figure 25: Cluster of SAE latents disambiguating uses of 'how' in GPT2-Small. (Left) Cluster of extracted latents corresponding to uses of the word 'how' shown with Jaccard similarity (edge weight) as edge thickness, and the overall occurrence of a latent as the colour (light to dark). Each latent (node) is shown with the ID number (see Neuronpedia) and token factor (projection of the SAE decoder matrix onto the LM token embedding). (Right) PCA of prompts containing these SAE latents, highlighting those whose context contains the word 'how' or 'how' and a question mark ('?').
I think it shows just| how| difficult this issue is.
’t realized just| how| much of a pain magic
kered to not realise just| how| dangerous this kind of rhetoric
but you don't realize| how| tough it is," Goodman
, the more I realize| how| important a sense of place
9 through 12 show just| how| powerful the social element of
my face to see just| how| properly they treated this transfer
was surprised to learn just| how| little research had been published
, you have no idea| how| happy you made me.
even he couldn't believe| how| hard he could throw it
Figure 26: Use of 'how' to mean degree, for example 'I think it shows just | how| difficult this issue is': (Top left) Cluster of extracted latents corresponding to uses of the word 'how', shown with Jaccard similarity (edge weight) as edge thickness, and the occurrence of a latent for this example prompt shown as the colour (light to dark). Each latent (node) is shown with the ID number (see Neuronpedia) and token factor (projection of the SAE decoder matrix onto the LM token embedding). (Top right) Position of the example prompt in the PCA (star). (Bottom left) Activation of latents for the example prompt, within the cluster highlighted in blue. (Bottom right) Examples of contexts for tokens leading to activation of latent ID 21547.
This set of latents maps a space that covers different grammatical uses of the word 'how' as an interrogative adverb, used to:
describe the degree of something ('how difficult this issue is', 'how much of a pain') (Figure 26)
describe the manner of something ('how doctors can ethically', 'how the EU should change') (Appendix Figure 21)
query the state or condition of something ('how did you guys meet?', '| How| are ya?') (Appendix Figure 22).
We do not see the fourth interrogative adverb use of 'how', the exclamative (e.g. 'How very interesting!').
We note that unlike the case of the days of the week, not all extrema of the PCA are defined by a single feature, with the 'manner' case leading to activation of latents 11726 and 23664. Nevertheless, these latents only activate in these cases, so the latents remain generally independently interpretable.
Sampling points from one PCA extremum to another, we find that latent activation within the cluster is split across multiple latents in the cases that lie between the extremes. The activation is predominantly split between the latents that define the different extremes, but the other latents in the cluster activate weakly as well. This may indicate that the ambiguity of prompts far from the extremes causes multiple latents to activate in order to accommodate the different potential readings, representing 'uncertainty' in the model (Figure 27, Appendix Figures 23, 24). It is notable that the examples where multiple latents are active are those where it is more difficult to distinguish a question from a statement (e.g. 'how I train you' could be a question or a statement, whereas 'How did you guys meet?' and 'how difficult this issue is' are clearer, see Figure 27).
Figure 27: Change in latent activation for examples on the continuum from 'how' as a question to 'how' as degree: (Top left) Position of examples in the PCA. (Top centre) Activation of SAE latents within the cluster. (Top right) Activation of SAE latents shown on the cluster graph. (Bottom) Activation of the main latent relating to 'how' as a question (latent ID 817) and 'how' as a matter of degree (latent ID 21576) for examples in the animation. For the trend in all points along PC2 see Appendix Figure 24.
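A sketch of this interpolation analysis: bin the examples along one principal component and average each cluster latent's activation per bin (coords and the restricted activations are as returned by the PCA step sketched in Methods):

```python
import numpy as np

def activation_profile_along_pc(coords: np.ndarray, sub_acts: np.ndarray,
                                pc: int = 1, n_bins: int = 10) -> np.ndarray:
    """Mean activation of each cluster latent in bins along one PC.

    coords: [n_examples, n_components] PCA coordinates.
    sub_acts: [n_examples, n_cluster_latents] cluster latent activations.
    Returns a [n_bins, n_cluster_latents] array of mean activations."""
    vals = coords[:, pc]
    edges = np.linspace(vals.min(), vals.max(), n_bins + 1)
    idx = np.clip(np.digitize(vals, edges) - 1, 0, n_bins - 1)
    return np.stack([
        sub_acts[idx == b].mean(axis=0) if (idx == b).any()
        else np.zeros(sub_acts.shape[1])               # empty bin: report zeros
        for b in range(n_bins)
    ])
```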
Distinguishing the type of entity whose possession is indicated by an apostrophe in Gemma-2-2b
In English, an apostrophe can be used to denote that the preceding word is a possessor, e.g. 'It was Adam's apple'. The possessor can be a named person, a generic person (e.g. 'The Defendant's case was rejected'), a collective (e.g. 'The company's profits decreased'), or a non-human or inanimate object (e.g. 'misalignment between the valve'|s| stem and seat'). We observe a cluster in layer 12 of Gemma-2-2b (cluster 4334) that appears to disambiguate these different cases, with a similar graph of co-occurrences to that observed in the prior example (Figure 28), firing either on the 's' after the apostrophe (e.g. valve'|s|), or on the apostrophe in cases where an 's' is conventionally omitted (e.g. sellers|'|).
Figure 28: Cluster that disambiguates possessor denoted by apostrophe: (Left) PCA of subspace mapped by SAE latents (Gemma-2-2b, layer 12, cluster 4334), colour is most active latent. (Right) Graph of co-occurrence relations between latents. Colour is the overall occurrence of the feature in the dataset, edge weight is Jaccard normalised co-occurrence rate.
The vertices of the PCA correspond to maximal activity of three of the latents, distinguishing the following cases:
(Left) References to the properties of inanimate or non-agentic objects (e.g. 'the *y*-axes|'| ranges'), primary latent 5799.
(Bottom) References to specific people or groups (e.g. 'Fisher'|s| exact test'), primary latent 9754.
(Top) References to generic people ('the sellers|'| promise'), primary latent 4572.
See Figure 29.
Context
" between distributions. Moreover, the *y*-axes|'| ranges were chosen to make the *I* and"
'$ is so much larger than the square lattice’|s| threshold $p_{c,s} = '
"int).\nYou should instead set the Rectangle fields|'| values with your constructor, and then have a separate"
"kyachuga is sent to defend Planet Southern Cross|'|s core against the Kyurangers before being destroyed"
"\u2009+\u2009T0). The quasistatic step'|s| work is[@b25][@b2"
' herein to include extrapolation.) The diffractive structure""|s| internal geometry need not be modeled, and electromagnetic interactions'
' player’s ability and a tool for predicting games|’| outcomes. The system has been tweaked over the years'
" of Landauer's principle. The result'|s| significance is that it opens new avenues of thought and"
".\n\nBuyers' Premium and Charges\n\nBuyer'|s| Premium Rates25% on the first $1"
", please leave now.<bos>Conversations (Woman'|s| Hour album)\n\nConversations is the debut album by"
' easy with LG online service and support. Owner’|s| Manuals, requesting a repair, software updates and warranty'
' drawn from $\\mathbb{P}$ using Fisher’|s| conditional correlation test with significance level $\\alpha = '
"984); see Winchester v. Lester'|s| of Minnesota, Inc., 983 F"
" variables were analyzed by χ2 test or Fisher'|s| exact test. Difference was considered statistically significant when p"
"ARK TRIP, FEB 22: WALKER'|S| CAY\n\n21:41 - As"
Context
" the buyers changed their position in reliance on the sellers|'| promise to paint the front of the house. Thus"
"2002. It is apparent from plaintiffs|'| papers that they intended to deduct these fees from their"
" scour the purchase agreement looking for the source of defendants|'| claim. At that point, after the entry of"
" and those records are more than adequate to meet plaintiffs|'| burden.\nDefendants also invite the court to"
" Regarding Jury Deliberations\n\nWe next consider the plaintiffs|'| argument that the trial court erred in denying their post"
' playgrounds, were joined by an increasing awareness of Majors|’| sex symbol status for young women -- with the actor'
" alone lay persons on a jury. To prove plaintiffs|'| claim at trial, plaintiffs' counsel could not rely"
"unconstitutional. The only issue presented in Jones|'| portion of the case is\n\nwhether an incumbent judge"
Figure 29: Subspace maps possession by a named person or group, a generic person, or a non-human or inanimate object: (Left) SAE latent activation strength for all examples in the PCA for latent 5799 (top), latent 9754 (centre) and latent 4572 (bottom). (Right) Context of tokens activating latents in the cluster for these extrema of the PCA. See also Appendix Figure 25.
Between these vertices we see an approximately linear decay in the activity of the primary latent of one vertex as the activity of another latent increases (see Figure 30 and Appendix Figure 26), suggesting that, as in GPT2-Small, in cases of ambiguity between different interpretations of a token, Gemma-2-2b SAE latents co-occur more strongly.
Figure 30: Change in latent activation for examples on the continuum from named persons or groups to generic persons: (Top left) Position of examples in the PCA. (Top centre) Activation of SAE latents within the cluster. (Top right) Activation of SAE latents shown on the cluster graph. (Bottom) Activation of the latents for selected points between extrema of the PCA. See also Appendix Figure 26.
We observe many such cases of apparent disambiguation in both GPT2-Small, and Gemma-2-2b, see Appendix: Case Studies and the SAE Latent Co-occurrence Explorer.
Discussion
The ideal features to describe an LM would be independent and linear (Park et al., 2023), but it may not be possible to extract such features. SAE latents have many desirable properties for explaining LM behaviour based on their internals, but recent work has shown that they are not always linear (Engels et al., 2024). In this work we investigate whether SAE latents fire independently, how this depends on SAE size and architecture, and what this means for SAE latent interpretability.
First, we show that SAE latents co-occur both more and less than expected if they were independent, both for ReLU (Bloom, 2024, GPT2-Small) and JumpReLU (Gemma-Scope, Gemma Team, 2024, Lieberum et al., 2024, Rajamanoharan et al., 2024a) SAEs. Secondly, we find that this phenomenon becomes less prevalent as SAE width increases.
However, we also observe that where SAE latents co-occur more than expected, i.e. in co-occurrence clusters, these clusters map out interpretable subspaces, with interpretations including: days of the week, recapitulating Engels et al., 2024; disambiguation of the grammar of a token; and measurement of the position of tokens within a string.
We observe two subgroups of clusters, which appear to be driven primarily by either compositionality in the underlying LM features, or by ambiguity and the multiple meanings of words in natural language. The former is particularly surprising, given the bias of SAEs against finding underlying features that act compositionally, and the lack of such recovery in toy examples (Anders et al., 2024). This suggests that SAEs can extract compositionally encoded features, and that co-occurrence clustering may be a method to detect this behaviour. Indeed, it may be necessary to consider clusters as a whole in order to properly interpret SAE latents in some cases. For example, the role that the hub latent in the days-of-the-week cluster (latent 8838) plays in composition with the other day-of-the-week latents is not apparent from examining it alone, as it appears very similar to the spoke latents (e.g. latent 3266) when examining e.g. tokens promoted and maximally activating examples of prompts. Similarly, we observe a cluster with many latents that maximally activate on the token 'first' but in composition appear to be specialised to distinguish 'first', 'second' and 'third' (Appendix Figure 17). However, this composition does appear to be predictable from the activity of latents examined alone; e.g. in the case of qualitative amounts in Gemma-2-2b, we observe that the relative activation of latents active for 'some of' (latent 15441) and 'all of' (latent 12649) correlates predictably and continuously with a range of tokens from 'many of' to 'almost all of'.
We also observe clusters that form on 'ambiguous' tokens, those that have multiple meanings, e.g. 'how'. This multi-layered meaning of words already shows signs of being encoded in the decoder weight structure of SAE latents (Bussmann et al., 2024, Shai et al., 2024), but here we observe a potential mechanism by which different meanings can be compared and weighted against one another. It is unclear how this changes as SAE width increases; there may be latents dedicated to more fine-grained meanings, but these are unlikely to be able to represent as many weightings between possible meanings as a mixture of activity across co-occurring latents.
Fortunately, within the subspaces mapped out by these clusters, SAE latents remain largely independently interpretable, and so we can be optimistic about the potential for SAE latents in general to be easily interpretable at scale. Furthermore, this phenomenon appears to decrease in relevance as SAE width increases. However, given the expense of training high-width SAEs, being able to find and interpret clusters will remain an important aspect of SAE-based mechanistic interpretability in many cases. Finally, recent work has suggested that the optimal SAE size depends on the task (Karvonen et al., 2024). As a compositional encoding may be easier to interpret than a large local code, our work shows another potential use of smaller SAEs.
Future Work
Alternate LM and SAE architectures
This work focussed on GPT2-Small (Radford et al., 2019) and SAEs trained with a standard ReLU (Bloom, 2024) and Gemma-2-2b (Gemma Team, 2024), using Gemma Scope (Lieberum et al., 2024), which uses JumpReLU (Rajamanoharan et al., 2024a). We find that the clusters in Gemma-2-2b tend to be less interpretable in general, and that, unlike GPT2-Small, the most active latents in a subspace mapped by a cluster are less likely to be the latents within the cluster. We also observe that latents that would be expected to co-occur do not, e.g. latents for 'two', 'three' etc co-occur, but not the latent for 'one' in layer 0 of Gemma-2-2b (see Figure 24). It is unclear whether this is a property of the SAE or of the underlying model. This could be clarified by expanding this work to SAEs trained on e.g. Llama3.1-8b and comparing to other SAE architectures trained on the same models, e.g. BatchTopK (Bussman et al., 2024), GatedSAE (Rajamanoharan et al., 2024b) and end-to-end (e2e) SAE (Braun et al., 2024).
Effect of SAE Sparsity on co-occurrence
Similarly, Gemma Scope SAEs have been released with different levels of L0 sparsity (Lieberum et al., 2024). In the case of clustering on qualitative amounts (see Figure 11), we find that similar clusters exist for lower-L0 SAEs but appear to be less compositional. It may be fruitful to explore the effects of this on co-occurrence further, with the hypothesis that sparser SAEs (lower L0) will show less compositional encoding and thus fewer co-occurrence clusters of this type.
Understanding the drivers of co-occurrence
Why co-occurrence occurs, why it decreases with larger SAEs, and whether this decrease continues until all latents are independent or some co-occurrence is unavoidable, all require further exploration. In particular, composition-driven and ambiguity-driven co-occurrence may display different patterns in this regard.
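As a starting point, the independence baseline used in the main text, and the Jaccard weights used for the cluster graphs, can both be computed directly from a binary latent-activation matrix. Below is a minimal sketch (our own, with random placeholder data standing in for thresholded SAE activations):

```python
import numpy as np

# Placeholder for a thresholded activation matrix [n_tokens, n_latents],
# e.g. SAE activations binarised at 1.5 as in the main text.
rng = np.random.default_rng(0)
acts = (rng.random((10_000, 512)) > 0.99).astype(float)

n_tokens = acts.shape[0]
counts = acts.sum(axis=0)                # occurrences per latent
p = counts / n_tokens                    # marginal firing probabilities
observed = acts.T @ acts                 # co-occurrence counts per latent pair
expected = n_tokens * np.outer(p, p)     # counts if latents fired independently

# Pairs with ratio >> 1 co-occur more than chance: candidate cluster members.
ratio = observed / np.maximum(expected, 1e-9)
np.fill_diagonal(ratio, 0)

# Jaccard similarity, the edge weight used in the cluster graphs:
# J(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|)
union = counts[:, None] + counts[None, :] - observed
jaccard = observed / np.maximum(union, 1)

i, j = np.unravel_index(np.argmax(ratio), ratio.shape)
print(f"most over-co-occurring pair: ({i}, {j}), ratio {ratio[i, j]:.1f}")
```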
It seems plausible that composition occurs especially in smaller SAEs, where only coarse-grained latents can be learned. This suggests a potential cause of SAE latent co-occurrence: a small SAE, which can only learn a small number of latents, is likely to extract more general latents than a large one, e.g. a small SAE might have a 'red' latent and a 'circle' latent, where a larger SAE could learn a more fine-grained latent such as 'red circle'. This means that although SAE latents are optimised for sparsity, and therefore biased towards a local code representation (Olah, 2023) of the true, underlying model features (Till et al., 2024, Anders et al., 2024), a small SAE cannot help but encode some features compositionally if it is to minimise reconstruction loss.
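To make the counting argument concrete, here is a toy illustration (our own construction, not from the paper's code): a compositional code covers a product space of concepts with only the sum of the attribute counts, while a local code needs one latent per concept.

```python
from itertools import product

colours = ["red", "green", "blue", "yellow"]
shapes = ["circle", "square", "triangle", "star"]

# Local code: one dedicated latent per concept.
local_latents = [f"{c} {s}" for c, s in product(colours, shapes)]

# Compositional code: one latent per attribute; a concept is a pair of
# co-active latents, which is exactly a co-occurrence cluster.
compositional_latents = colours + shapes

print(len(local_latents))          # 16 latents for 16 concepts
print(len(compositional_latents))  # 8 latents cover the same 16 concepts
```

A 16-latent SAE can afford the local code; an 8-latent SAE can only reach the same reconstruction targets by letting 'red' and 'circle' fire together.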
For example, the url-subdirectory cluster separates tokens that are only a single character apart, in strings roughly 10 characters long, using only 5 latents, suggesting a compositional encoding; a larger SAE may be able to dedicate a single latent to each position in the string, obviating the need for this. Thus, this may not reflect composition in the underlying LM at all. One way to test this would be to check whether these types of clusters occur less often in larger SAEs. As compositional codes can be more interpretable, this may prove to be an advantage of smaller-width SAEs.
A cluster formed due to ambiguity, on the other hand, may only grow in size as SAE latents split (Makelov et al., 2024): more potential meanings can be assigned explicit latents, but these may nevertheless need to be active at the same time to capture intentional ambiguity in, for example, word-play, humour and poetry. Conversely, a large SAE might be able to assign separate latents for a token used unambiguously as well as for every ambiguous combination, although this would not allow the different meanings to be weighted against one another. This latter case is complicated by the observation that, rather than splitting into single representations of a concept, latents may instead have certain functions 'absorbed' by other latents as splitting occurs (Chanin et al., 2024). We may also expect such co-occurrence to be driven by other kinds of ambiguity, e.g. when a model is 'considering' two potential courses of action.
Accuracy of compositional encoding
We find a cluster that appears to function as a way of measuring position in a string, in this case the subdirectory of a url. If this is truly a compositional encoding, we would expect to be able to recover more distinct positions than the number of latents involved, and so we aim to compare how well classifiers can recover token position from the SAE latent activations versus from the underlying neuron activations.
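A minimal sketch of the proposed comparison (our own; the random arrays below are placeholders for per-token activations of the five cluster latents, the corresponding residual-stream activations, and position labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
positions = rng.integers(0, 10, size=n)   # position in the subdirectory, 0..9
latent_acts = rng.random((n, 5))          # the 5 cluster latents (placeholder)
neuron_acts = rng.random((n, 768))        # residual stream, d_model = 768 (placeholder)

for name, X in [("cluster latents", latent_acts), ("residual stream", neuron_acts)]:
    probe = LogisticRegression(max_iter=1000)
    acc = cross_val_score(probe, X, positions, cv=5).mean()
    print(f"{name}: {acc:.2f} accuracy over 10 positions")

# With real data, recovering 10 positions well from only 5 latents would
# support a compositional / strength-based encoding of position.
```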
If this is the case, it further raises the questions of whether there are other such specialised positional latents, whether the latents in this cluster share other properties with known positional latents (Chughtai et al., 2024), and whether this provides the basis of a method for finding positional latents in general.
Similarly, initial extraction of clusters from co-occurrence is difficult because there are many low-weight edges and high-degree nodes. Examining the highest-degree nodes revealed latents with high cosine similarity to the position embedding matrix, suggesting that these are positional latents, as seen by Chughtai et al., 2024. This may be another method for extracting positional latents more generally.
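A sketch of this check (our naming; in practice `W_dec` would be loaded from the SAE and `W_pos` from GPT2-Small, e.g. `model.W_pos` in TransformerLens):

```python
import torch

W_dec = torch.randn(24576, 768)   # placeholder SAE decoder [n_latents, d_model]
W_pos = torch.randn(1024, 768)    # placeholder position embeddings [n_ctx, d_model]

dec = W_dec / W_dec.norm(dim=-1, keepdim=True)
pos = W_pos / W_pos.norm(dim=-1, keepdim=True)

# For each latent, its best cosine similarity against any position embedding;
# high values flag candidate positional latents.
sims = (dec @ pos.T).max(dim=-1).values
print(torch.topk(sims, k=20).indices)
```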
Can we relate co-occurrence to causal mediation in a meaningful way?
Figure 31: Example of tree of rules for classification. Simplified case of classifying a fruit into botanical categories.
We find examples of co-occurrence creating a subspace that can be understood as performing classification, e.g. the 'how' cluster classifying different grammatical uses of the word. In cases with more defined rules, e.g. fruit classification (Figure 31), one would expect latents to form clear, nested relationships that can be recovered by our method and used to derive the underlying ruleset. For example, in this toy case 'peach' and 'plum' would co-occur with one another, and also with both 'fleshy fruit' and 'large pit'; but 'grape' and 'tomato' would co-occur only with 'fleshy fruit'. Thus the classification ruleset could be recovered from the nesting of latent co-occurrence, and this could serve to identify rule-based classifying circuits within LM. However, this assumes that both the coarse-grained (e.g. 'fleshy fruit') and fine-grained (e.g. 'plum') latents exist within the same SAE of a given width, which might not be the case (see Bussmann et al., 2024a and Bussmann et al., 2024b).
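A toy implementation of this idea (entirely hypothetical latents, our own construction): represent each latent by the set of examples it fires on, and read the ruleset off the strict nesting of those sets.

```python
# Which latents fire on which example? (hypothetical, following Figure 31)
examples = {
    "peach":  {"fleshy fruit", "large pit", "peach"},
    "plum":   {"fleshy fruit", "large pit", "plum"},
    "grape":  {"fleshy fruit", "grape"},
    "tomato": {"fleshy fruit", "tomato"},
}

# Invert: the set of examples each latent fires on.
firing = {}
for item, latents in examples.items():
    for latent in latents:
        firing.setdefault(latent, set()).add(item)

# Latent B sits below latent A in the ruleset if B's firing set nests
# strictly inside A's, i.e. B always co-occurs with A but not vice versa.
for a, fire_a in firing.items():
    for b, fire_b in firing.items():
        if a != b and fire_b < fire_a:
            print(f"'{b}' implies '{a}'")
# e.g. 'large pit' implies 'fleshy fruit'; 'peach' implies 'large pit'.
```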
Conclusion
SAE latents will be most easily interpretable if they are independent. We find that this is not always the case: instead, some latents form co-occurrence clusters that map out interpretable but often non-linear subspaces. Nevertheless, the latents remain independently interpretable in many cases, and this behaviour decreases in prevalence as SAE width increases. Despite this, we observe cases where one can only understand a latent in the context of its co-occurrence relations. Understanding the drivers of this will be an important part of ensuring that SAEs, and mechanistic interpretability, are useful for the wider goal of ensuring safe AI: to predict and correct unsafe behaviour we require model features that do not correspond to only one part of that behaviour, or apply only in certain contexts, but rather support an exhaustive understanding of all routes to unsafe behaviour. This work demonstrates one part of how this can be accomplished, by showing how SAE latents can form larger functional units that we can detect and understand.
Thanks to Clem von Stengel, Andy Arditi, Jan Bauer, Kola Ayonrinde and Fernando Rosas for useful discussions, and Owen Parsons for feedback on the draft. Thanks to the entire PIBBSS team for their support and for providing funding for this project. Thanks also to grant providers funding Joseph Bloom during the time he mentored this project.
Appendix: Case Studies
GPT2-Small
Years: hub activity appears to correspond to whether there is a space in the token.
Gemma-2-2b
Amount 'of': e.g. latents firing for 'some of' and 'all of' denote these groups when active alongside the latent for 'one of', but when combined denote 'many of' (see main results).
Regular vs superlative/comparative forms of 'low' and 'high': latent 8811 appears to correspond to whether the word is in its regular (e.g. 'high') or superlative/comparative (e.g. 'highest'/'higher') form, while the other latents control whether the word being modified is 'low' or 'high'.
Type of name with apostrophe: distinguishes inanimate or non-human objects possessing a quality, vs generic persons possessing something, vs named persons or groups possessing something, as denoted by an apostrophe.
Number and fraction of SAE latents active in Gemma-2-2b for different SAE widths and L0
Appendix Figure 1: (Left) Change with SAE width. (Right) Change with SAE mean L0. (Red) Number of latents active. (Blue) Fraction of latents active.
SAE latent co-occurrence vs expectation for different SAE widths in layer 8 of GPT2-Small
Appendix Figure 2: (Left) Boxplot of SAE latent co-occurrence per token for different SAE widths with y rescaled to log10 for GPT2-Small. (Right) Density of rates of co-occurrence for a sample of SAE sizes. Note that very sparse and potentially dead latents occur very rarely in large SAEs leading to 'spikes' with a left skew. Blue is observed co-occurrence, red is expected co-occurrence.
SAE latent co-occurrence vs expectation for different SAE widths in layer 12 of Gemma-2-2b
Appendix Figure 3: (Left) Boxplot of SAE latent co-occurrence per token for different SAE widths with y rescaled to log10 for Gemma-2-2b. (Right) Density of rates of co-occurrence for a sample of SAE sizes. Note that very sparse and potentially dead latents occur very rarely in large SAEs leading to 'spikes' with a left skew. Blue is observed co-occurrence, red is expected co-occurrence.
SAE latent co-occurrence vs expectation for different SAE L0 in layer 12 of Gemma-2-2b
Appendix Figure 4: (Left) Boxplot of SAE latent co-occurrence per token for different SAE L0 with y rescaled to log10 for Gemma-2-2b. (Right) Density of rates of co-occurrence for a sample of SAE L0 values. Note that very sparse and potentially dead latents occur very rarely in large SAEs leading to 'spikes' with a left skew. Blue is observed co-occurrence, red is expected co-occurrence.
Mean subgraph size with SAE width in GPT2-Small and Gemma-2-2b
Appendix Figure 5: (Left) mean size of clusters as SAE width increases vs mean size of clusters of size greater than 1 (i.e. excluding isolated latents) (dashed) in GPT-2. (Right) in Gemma-2-2b.
Mean subgraph size with SAE L0 for Gemma-2-2b
Appendix Figure 6: Mean size of clusters as SAE mean L0 increases, vs mean size of clusters of size greater than 1 (i.e. excluding isolated latents) (dashed).
Mean feature and subgraph sparsity vs SAE width for GPT2-Small and Gemma-2-2b
Appendix Figure 7: L0 sparsity for individual SAE latents vs clusters (considering a cluster as active if any of the latents it is composed of are active) vs SAE width (dashed) (Left, GPT2-Small, right, Gemma-2-2b).
Mean feature and subgraph sparsity vs SAE L0 for Gemma-2-2b
Appendix Figure 8: L0 sparsity for individual SAE latents vs clusters (considering a cluster as active if any of the latents it is composed of are active) vs SAE L0 (dashed).
Fraction of SAE latents in a cluster vs SAE width in GPT2-Small and Gemma-2-2b
Appendix Figure 9: (Left) fraction of latents in cluster as SAE width increases in GPT-2. (Right) in Gemma-2-2b.
Fraction of SAE latents in a cluster vs SAE L0 in Gemma-2-2b
Appendix Figure 10: Fraction of latents in a cluster as SAE L0 increases in Gemma-2-2b.
Appendix Figure 11: Month-of-the-year cluster. (Top left) Cluster of extracted latents corresponding to months of the year shown with Jaccard similarity (edge weight) as edge thickness, and the overall occurrence of a latent as the colour (light to dark). Each latent (node) is shown with the ID number (see Neuronpedia) and token factor (projection of the SAE decoder matrix onto the LM token embedding). (Top right) PCA of prompts containing these SAE latents, highlighting those firing on tokens pertaining to a month of the year either alone (e.g. 'October') or with a space (e.g. ' October'). (Bottom left) Activation strength of 'spoke' latent ID 3877. (Bottom right) Activation strength of 'hub' latent ID 10676.
We observe a similar phenomenon with months of the year, with only two rings of latents; the inner ring is once again defined by the hub latent (latent ID 10676), which once again denotes the lack of a space in the activating token (Appendix Figure 11, cluster 2644 in our app). Interestingly, this cluster lacks a 'spoke' latent for the month of May, despite there being an SAE latent that appears to correspond to this month (latent ID 21089). This SAE latent is isolated (it does not cluster with anything else).
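The 'token factor' labels in these cluster figures are described as the projection of the SAE decoder matrix onto the LM token embedding; a sketch of one way to compute such labels (our naming; placeholder weights, load the real ones from the SAE and model in practice):

```python
import torch

W_dec = torch.randn(24576, 768)   # placeholder SAE decoder [n_latents, d_model]
W_E = torch.randn(50257, 768)     # placeholder token embedding [n_vocab, d_model]

cluster = [3877, 10676]           # e.g. a 'spoke' and the 'hub' from Appendix Figure 11
scores = W_dec[cluster] @ W_E.T   # [len(cluster), n_vocab]

# The highest-scoring tokens per latent; decode with the tokenizer in practice.
print(scores.topk(k=5, dim=-1).indices)
```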
Appendix Figure 13: (Top) PCA analysis of only those cases in which the token that activates the cluster is a single character, highlighting how far into the subdirectory the token fires (where the token that causes the SAE latent to activate is surrounded in '|': in the string .twitter.com/|e|2zNEIdX the token is the 0th character, in .twitter.com/e2|z|NEIdX the 2nd, etc). (Bottom) Mean activation of each SAE latent for each position in the url subdirectory.
Measurement of token position in url subdirectory in layer 8 of GPT2-Small with 24576 width SAE (cluster 125)
Appendix Figure 14: Lowercase character detection in url subdirectories. (Left) Number of times a latent occurs in samples of prompts containing activations of the SAE latents in cluster 125 for GPT2-Small, 24K width SAE, layer 8 (see Neuronpedia list). (Right) Maximum activating tokens shown in green for latent ID 19054.
Day of the week latent in layer 0 of GPT2-Small with 24576 width SAE
Appendix Figure 15: Day-of-the-week cluster showing only full days of the week. PCA of prompts containing these SAE latents, highlighting those firing on tokens pertaining to a day of the week either alone (e.g. 'Monday') or with a space (e.g. ' Monday'), but not including shortened cases such as 'Mon'. Note how, compared to Figure 17, this mainly affects the innermost ring, i.e. where the 'spoke' latent activation is weakest.
Ordinal numbers (e.g. first, second, third) show a switch from a local code to encoding by strength of latent activation (Gemma-2-2b, layer 21, cluster 511)
Examining the 16K SAE for layer 21 of Gemma-2-2b, we find a cluster of latents that activate on ordinal words (e.g. 'first', 'second', 'third') (cluster 511). We observe that the PCA of prompts containing tokens that activate these latents separates into clusters ordered by ordinal word up until 'fourth'/'fifth', after which there is no longer clear separation (see Appendix Figure 16). To clarify this, we also perform PCA on a custom set of prompts with equal numbers of ordinal words from 'Zeroeth' to 'Tenth' (Appendix Figure 17).
Appendix Figure 16: (Top) PCA of subspace mapped by SAE latents, colour is ordinal number in the activating token. (Bottom, left) Graph of co-occurrence relations between latents. Colour is the overall occurrence of the feature in the dataset, edge weight is Jaccard normalised co-occurrence rate. (Bottom, right) Mean activation strength of SAE latents within cluster for different ordinal numbers.
We observe that the lower ordinal words have latents that activate more specifically, although note that three of the latents activate strongly on the word 'first' in Neuronpedia max activating examples (latents 2795, 6539 and 7341), whereas higher ordinals more strongly activate the latent associated with higher numbers (latent 901). However, analysing these as a cluster, the relative strengths of the activations suggest a different interpretation, with latent 7341 activating most strongly on 'first' but latents 2795 and 6539 activating roughly equally on all ordinal words (see Appendix Figures 16, 17). This again shows how interpreting latents that form co-occurrence clusters as a group, and in context, can reveal the differences between apparently redundant SAE latents.
Thus there is a transition from a local encoding of the words, with a single latent per word, to encoding by the strength of latent 901, or possibly the relative strength of latent 901 vs latent 523 (Appendix Figure 16).
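The relative-strength view in Appendix Figure 17 (bottom right) can be reconstructed as follows (a sketch with placeholder data; the names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["first", "second", "third", "fourth", "fifth"]
labels = rng.integers(0, len(words), size=1000)   # which ordinal each token is
acts = rng.random((1000, 4))                      # activations of 4 cluster latents

# Mean activation of each cluster latent per ordinal word...
mean_acts = np.stack([acts[labels == w].mean(axis=0) for w in range(len(words))])
# ...rescaled so each word's profile sums to one, exposing encoding by
# *relative* strength rather than a one-latent-per-word local code.
relative = mean_acts / mean_acts.sum(axis=1, keepdims=True)

for word, profile in zip(words, relative):
    print(word, np.round(profile, 2))
```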
Appendix Figure 17: (Top) PCA of subspace mapped by SAE latents, colour is ordinal number in the activating token, using custom prompts to ensure equal numbers of ordinal words in the dataset. (Bottom, left) Mean activation strength of SAE latents within cluster for different ordinal numbers. (Bottom, right) Relative mean activation strength of SAE latents within cluster for different ordinal numbers (normalised to sum to one).
Appendix Figure 18: Mean relative activation strength of latents in Gemma-2-2b, layer 0, cluster 1370 for different number words, normalised to control for the fewer cases of higher numbers and to show relative activation strength.
Appendix Figure 19: (Left) Training metrics for the linear probe for number words in layer 0 of Gemma-2-2b. (Right) Heatmap of raw co-occurrence between the top 10 latents most cosine-similar to the direction of the probe; red highlights indicate pairs connected in layer 0 cluster 1370 in Gemma-2-2b.
Classification of the uses of the word 'how' in layer 8 of GPT2-Small with 24576 width SAE (cluster 787)
Exploration of the subspace disambiguating 'how' in layer 8 of GPT2-Small with 24576 width SAE (cluster 787)
Context
"to craft guidelines on| how| doctors can ethically use
massive partisan divide over| how| they view the threat from
come up with plans of| how| customers can withdraw their funds
and for more clarity on| how| the different
soon to draw conclusions about| how| much wetland methane emissions
fed into the debate surrounding| how| to bring down the 50
the UK's demands on| how| the EU should change –
Police gave us recommendations on| how| to secure the facility,
with our international partners on| how| we'll use
it would disclose details on| how| climate change may affect its
Appendix Figure 21: Use of 'how' to mean manner, for example 'to craft guidelines on| how| doctors can ethically use': (Top left) Cluster of extracted latents corresponding to uses of the word 'how', shown with Jaccard similarity (edge weight) as edge thickness, and the occurrence of a latent for this example prompt shown as the colour (light to dark). Each latent (node) is shown with the ID number (see Neuronpedia) and token factor (projection of the SAE decoder matrix onto the LM token embedding). (Top right) Position of the example prompt in the PCA (star). (Bottom left) Activation of latents for the example prompt, with those in the cluster highlighted in blue. (Bottom right) Examples of contexts for tokens leading to activation of latent ID 11726 and latent ID 23664 in layer 8 of GPT2-Small with 24576 width SAE (cluster 787).
Context
.AG:| How| did you guys meet?
was wrong.|How| is it that you can
10, top panel).| How| bad is his rampage of
chosen representative.|How| important are early
third quarter.|How| much Orange County housing can
new isn't new?| How| do you not realize you
<|endoftext|> said. "|How| do you say that to
smith: Good morning.| How| are ya?
a rotten parking job.| How|'s that red truck going
<|endoftext|> this gas station.| How| about it?"
Appendix Figure 22: Use of 'how' in a question, for example '.AG:| How| did you guys meet?': (Top left) Cluster of extracted latents corresponding to uses of the word 'how', shown with Jaccard similarity (edge weight) as edge thickness, and the occurrence of a latent for this example prompt shown as the colour (light to dark). Each latent (node) is shown with the ID number (see Neuronpedia) and token factor (projection of the SAE decoder matrix onto the LM token embedding). (Top right) Position of the example prompt in the PCA (star). (Bottom left) Activation of latents for the example prompt, with those in the cluster highlighted in blue. (Bottom right) Examples of contexts for tokens leading to activation of latent ID 817 in layer 8 of GPT2-Small with 24576 width SAE (cluster 787).
Appendix Figure 23: Change in latent activation for examples on the continuum from 'how' as a matter of degree to 'how' as a matter of manner: (Top left) Position of examples in PCA. (Top centre) Activation of SAE latents within the cluster. (Top right) Activation of SAE latents shown on the co-occurrence graph, in GPT2-Small, layer 8, 24K width SAE, cluster 787.
Appendix Figure 24: Change in latent activation for examples on the continuum from 'how' as a question to 'how' as degree: (Top left) Position of examples in PCA. (Top centre) Activation of SAE latents within the cluster. (Top right) Activation of SAE latents shown on the co-occurrence graph. (Bottom) Activation of the main latent relating to 'how' as a question (latent ID 817) and 'how' as a matter of degree (latent ID 21576) for all examples in the entire PCA, plotted against PC2, in GPT2-Small, layer 8, 24K width SAE, cluster 787.
Appendix Figure 26: (Top left) Position of examples in PCA. (Top centre) Activation of SAE latents within the cluster. (Top right) Activation of SAE latents shown on the co-occurrence graph. (Bottom) Activation of the latents for selected points between the extrema of the PCA (from named persons to inanimate or non-human objects) in Gemma-2-2b, layer 12, cluster 4334.
Recently, Sparse AutoEncoders (SAEs) latents have shown promise as meeting all the criteria proposed by Sharkey et al., 2024 for features for mechanistic interpretability, in that: they are conceptually simple in relation to the underlying model; their latents are often readily interpretable (Bricken et al., 2023, Huben et al 2024); and have may be causally relevant such that we can steer with them (Templeton et al., 2024). Therefore, there is hope that these SAE latents can be used as a good proxy for underlying LM features needed for scalable and effective mechanistic interpretability.
Ideally, it would be possible to extract LM features that are linear (Park et al., 2023) and independent. Recent work by Engels et al., (2024) shows that SAE latents are not all linear, but rather that some latents, such as those that fire on days of the week, map out irreducible, multi-dimensional subspaces.
In this work, we show:
Cluster formation in the case studies we observe often fall into two categories:
We investigate these behaviours in case studies using both GPT2-Small and the more recent and larger model Gemma-2-2b. We also present a Streamlit app allowing for exploration of a subset of the clusters we find at https://feature-cooccurrence.streamlit.app/, which we link to for all examples in this post, see also Appendix: Case Studies. Code to generate clusters and plot the figures from this paper is available at https://github.com/MClarke1991/sae_cooccurrence.
Background
Superposition: LM must be able to represent many features, often more than they have neurons. This leads neurons to be polysemantic. It is further hypothesised that this compression can extend to representing more features than there are dimensions by using sparse, but not orthogonal embeddings of features (Elhage et al., 2022), i.e. superposition.
Sparse AutoEncoders: An autoencoder learns efficient encodings of input data x in an unsupervised manner by learning two functions: an encoding function WE + bE and a decoding function WD + bE. Between application of these matrices a nonlinearity such as a ReLU is typically applied, although at time of writing these differ between many competing architectures (Bussman et al., 2024, Rajamanoharan et al., 2024a, Rajamanoharan et al., 2024b, Braun et al., 2024). In a sparse autoencoder, the encoding is trained such that the output of WE has as few active states as possible. This is typically achieved by a loss function that, in addition to penalising for errors in the reconstruction of the input data, penalises the L1-coefficient of the SAE encoding. Note that this means that the sum of the activations of the hidden layer are optimised to be less than or equal to 1, but not that the number of activations is 1. When the output of WE, i.e. the width of the SAE, is large, then it is observed that the SAE latents behave like interpretable features of the underlying model (Bricken et al., 2023, Huben et al 2024, Templeton et al., 2024).
Note that we follow (Lieberum et al., 2024) refer to the directions learned by the SAE as latents, to disambiguate from the underlying LM features, in contrast with earlier work that uses 'feature' for both (e.g. Bricken et al., 2023, Rajamanoharan et al., 2024a, Rajamanoharan et al., 2024b).
Related Work
Sparse AutoEncoders for Feature Extraction: While there has been success in finding interpretable neurons in LM (Bills et al., 2023), many are polysemantic and hard to interpret, and so make poor features (Sharkey, 2024). SAEs have been shown to be able to extract interpretable latents (Huben et al., 2024) in both small and large LM (Bricken et al., 2023, Templeton et al., 2024, Lieberum et al., 2024), in part through taking features out of superposition (Sharkey et al., 2022). Steering on SAE latents (Templeton et al., 2024) had early success but has recently been shown to be prone to off-target effects, casting doubt on the reliability of SAE latents for interpretable and predictable steering (Anthropic, 2024). Similarly, it has been shown that SAE latents can 'absorb' some of the function of otherwise monosemantic latents (Chanin et al., 2024). Optimal SAE architecture is not yet clear, with recent work collating benchmarks of various such failure modes (Karnoven et al., 2024).
Linear Representations: It has been proposed that concepts might be represented linearly in large language models (Park et al., 2023) despite superposition (Elhage et al., 2022). The weak form of this, that many features are linear, is supported by work showing that categorical concepts form linear hierarchies in LM such as Gemma (Park et al., 2024).
Non-linear Representations: SAE latents do not appear to be linear in all cases, with some forming multi-dimensional, irreducible subspaces (Engels et al., 2024). This suggests the driver behind neuron poly-semanticity may be more complex than superposition (Mendel, 2024), and in turn casts doubt on the idea that an LM can be decomposed into purely monosemantic features in principle, (Smith, 2024). Recent work suggests that SAE latents can themselves be further decomposed into meta-latents (Bussmann et al., 2024) highlighting their own polysemantic nature.
Compositional Features: Concepts in input and output data of LM can be encoded as local, compositional or similar codes (Olah, 2023). SAEs are not designed to decompose features (Till et al., 2024), and are poor at recovering such (Anders et al., 2024), but there has been recent work on interpreting SAE latents in light of this (Ayonrinde et al., 2024, Anthropic, 2024).
Ambiguity: Ambiguity is a key part of natural language, but poses a difficulty for placing robust constraints on LM behaviour through approaches such as system prompts, and may contribute to problems such as inferring user goals (Christiano (2018)), mis-generalisation from training (Shah et al., 2022) and the inability of models to consistently understand when they should give fictional or factual answers (Huang et al., 2023). Efforts to address this include the construction of datasets of ambiguous and clarified sentence pairs (Liu et al 2024), measurement of LM uncertainty through exploitation of semantic ambiguity (Kuhn el al., 2023) and training of LM to ask for clarification when prompted with an ambiguous sentence (Kim et al 2024). Studies have also been done on recovering the belief states of LM from the residual stream (Shai et al., 2024).
SAE Latent Co-occurrence: Parsons et al., 2024 explored whether co-occurrence defined by latent attribution score correlation revealed latents that acted differently when steered on together, compared to random pairs of latents.
Methods
Figure 2: Overview of extraction of co-occurrence clusters
Our method consists of 6 stages (see Figure 2).
Measuring SAE latent co-occurrence
We use the pre-trained GPT2-Small residual stream SAEs, including SAEs of different sizes trained on layer 8 of GPT2-Small (sizes 768, 1536, 3072, 6144, 12288, 24576, 49152), as well as the Gemma-scope SAES for Gemma-2-2b, using SAE Lens. Using the record of SAE activations from training (the ActivationsStore), we extract batches of activations, each of 4096 tokens, and record which SAE latents were active in these. We further consider a latent to be active only if it is active above a threshold of 1.5 for GPT2-Small, unless otherwise stated, to reduce noise.
Figure 3: SAE latent co-occurrence per token in GPT2-Small: (Left) Co-occurrence of latents on the same token across a sample of 2.048 x 106 tokens using the Neuronpedia 768 width SAE for GPT2-Small after Jaccard Normalisation. (Right) Mean fraction (blue) and number (red) of SAE latents firing per token in the same sample size for all SAE widths in the Neuronpedia GPT2-Small feature splitting dataset.
We count the number of co-occurrences per token for different layers (0 and 8 in GPT2-Small, 0, 12, 18 and 21 in Gemma-2-2b). We observe that the mean number of latents active per token rises with SAE size but seems to stabilise around 30 latents per token in GPT2-Small (Figure 3, Appendix Figure 1). This means that the fraction of the latents of an SAE occurring per token reduces with size as the SAE width increases, in GPT2-Small and Gemma-2-2b (Figure 3, Appendix Figure 1).
Generating graphs of strongly co-occurring communities of SAE latents
Figure 4: Normalisation by Jaccard Similarity: (Left) Histogram of frequency of latent occurrence in the dataset. (Right) Schematic of Jaccard Similarity: the size of the intersection divided by the size of the union of the sample sets.
We further observe that a small subset of latents are active on very many tokens (Figure 4). To allow better comparison of the likelihood of co-occurrence between latents, and to weight for co-occurrences that are not due to the activations of one latents being a superset over all tokens, we chose to normalise by Jaccard Similarity (Jaccard, 1912) (Figure 4), i.e. we consider the strength of co-occurrence between two latents to be scored as: J(A,B)=|A∩B||A∪B|=|A∩B||A|+|B|−|A∩B|
Figure 5: Process of edge removal to form unconnected subgraphs: (Top) Comparison of the number degree (number of edges) of nodes after Jaccard similarity normalisation (blue) and after further removal of low weight edges. (Bottom) Graph of co-occurrences before (left) and after (right) removal of low weight edges (right side shows only graphs of size greater than 1 node).
We represent the co-occurrences as a graph (Figure 5). After normalisation, we find a very large number of low-weight edges, associated with a high-degree for nodes in the graph. In the expectation that communities within this graph will represent the most useful clusters of SAE latents, we remove low weight edges through a binary search for a threshold that leads to the largest subgraph size being ≤ 200. This decomposes the graph into many subgraphs of size 1, which we refer to as isolated latents, as well as communities of > 1 latents which we investigate further.
Note, in the case of Gemma-2-2b, we found strong co-occurrence driven by activation on the BOS token, which had been removed prior to SAE training in the case of GPT2-Small, and this led to our binary search finding a very high threshold for edge weights to generate clusters of the required size. We therefore ignore cases where SAE latents are active on special tokens (PAD, BOS, EOS) when calculating the rate of feature occurrence and co-occurrence in the case of Gemma-2-2b.
Mapping subspaces using SAE latent clusters
Figure 6: Principal Component Analysis of a resulting subspace: We find prompts that contain activations of the SAE latents in a cluster, and then use a PCA to represent the vectors made up of these activations from the SAE latents within the cluster. This example is explored further in Figure 16.
We search through the training data for examples of prompts that activate latents within a cluster. We then use principal component analysis (PCA) to represent the vectors made up of the same latent activations from only the SAE latents in that cluster and so explore if these latents are more explicable as a group (Figure 6).
Code to generate clusters and plot the figures from this paper is available at https://github.com/MClarke1991/sae_cooccurrence. Selected results can be explored in https://feature-cooccurrence.streamlit.app/.
Results
SAE latents co-occur both more and less than chance
If we want to reason about latents as monosemantic units, then we want to know that our interpretation of a latent can be reasoned about independently of other latents. To assess this, we measure the rates of co-occurrence of latents with one another on the same token, and compare this to the expected rate of co-occurrence if the latents were independent (E[co-occurrences(i,j)] = p(i) * p(j) for latents i, j).
Figure 7: Co-occurrence as SAE size increases in GPT2-Small. (Left) Boxplot of SAE latent co-occurrence per token for different SAE widths (top, with outliers, bottom, without outliers). Blue: observed co-occurrence, red: expected co-occurrence.
We measure the rate of co-occurrence for a range of SAE widths probing the residual stream of layer 8 of GPT2-Small and layer 12 for Gemma-2-2b. We find that the rate of co-occurrence per token decreases monotonically as SAE size increases (Figure 7, Appendix Figure 2, 3, 4). We further observe that the distribution of latent co-occurrence is broader than the expectation, with many latents co-occurring much more rarely, but some occurring more frequently than expected. These latter decrease in prevalence as SAE size increases (Figure 8, Appendix Figure 2, 3, 4), suggesting that the increased granularity of latents allows them to fire more independently, and that this phenomenon will reduce in importance.
Figure 8: Density of co-occurrence in GPT2-Small Density of rates of co-occurrence (blue observed, red expected) for a sample of SAE sizes. Note that very sparse and potentially dead latents occur very rarely in large SAEs leading to 'spikes' with a left skew. Observed co-occurrence is always an integer, and so cannot generate a result lower than zero on the x-axis, hence we plot from Log10 Co-occurrence of zero.
SAE co-occurrence clusters are smaller and fewer in large SAEs
We measured the size of the clusters as the SAE size increased, and observed that subgraph size decreases as SAE size increases, suggesting fewer significant co-occurrences as the number of latents grows (Figure 9, Appendix Figure 5). This is clearest when considering subgraphs that are of size > 1 i.e. where there is at least one case of co-occurrence. We similarly see that the mean subgraph size decreases as L0 decreases in Gemma-2-2b (Appendix Figure 6). However, we find that the L0 sparsity plateaus as SAE size increases in the case of GPT2-Small, both for SAE latents considered individually and when considering clusters as a unit (GPT2-Small Figure 9, Gemma-2-2b Appendix Figure 7). We find that the sparsity of subgraphs matches that of features closely in Gemma-2-2b when directly comparing the same layer with different L0 (Appendix Figure 8). We also see a decrease in fraction of features in a cluster as width increases, though this plateaus (Appendix Figure 9, 10).
Figure 9: Change in cluster (subgraph) size with SAE width in GPT2-Small. (Left) mean size of clusters as SAE width increases vs mean size of clusters of size greater than 1 (i.e. excluding isolated latents) (dashed). (Right) L0 sparsity for individual SAE latents vs clusters (considering a cluster as active if any of the latents it is composed of are active) (dashed).
Co-occurrence relations detect groups of latents that map interpretable subspaces and can be interpreted as functional units
Through qualitative analysis of clusters, we find that co-occurrence clusters form groups that seem a priori coherent when looking at, for example, the token promoted, such as a cluster of months of the year (Figure 10). In order to explore this further, we searched for examples of text that activated latents within a cluster using the activations store, and plotted a PCA of latent activations for these examples.
Figure 10: Example clusters (GPT2-Small): Cluster of latents shown with Jaccard similarity (edge weight) as edge thickness, and the overall occurrence of a latent as the colour (light to dark). Each latent (node) is shown with the ID number (see Neuronpedia) and token factor (projection of the SAE decoder matrix onto the LM token embedding). (Left) Example from case study of days-of-the-week (see Appendix Figure 11) and url-subdirectory position (see Figure 15).
Co-occurrence can be used to map cases of compositionality between latents
SAEs trained on Gemma-2-2b (Gemma Scope) encode qualitative statements about the number of items compositionally
Figure 11: Composition of latents map to interpretable regions of subspace mapped by cluster: (Left) PCA of subspace mapped by SAE latents, colour is most active latent, annotations show clusters defined by different compositions of latents (hub (red, latent 12257) is active in all cases, excluded from annotation for clarity) (Gemma-2-2b, layer 12, cluster 4740). (Right) Graph of co-occurrence between latents in the cluster, coloured to match left-hand plot. Note the presence of a single 'hub' latent and several 'spoke' latents.
We generate co-occurrence clusters for all SAE latents for the set of 16K width SAEs trained on Gemma-2-2b, for layers 0, 12, 18 and 21. Of these, we searched through PCA for clusters of size 5-7 for interpretable cases.
We find that cluster 4740 of layer 12 fires on the token ' of' and appears to separate cases of different qualitative descriptions of amounts, e.g. 'one| of|' vs 'some| of|' vs 'all| of|' (e.g. 'also known as Saint Maximus of Aquila) is one of| the patron saints of L'Aquila, Italy.') (Figure 11).
The different latents attend to the following cases (bold added for emphasis):
Examining the PCA, we observe that the 'hub' of the graph of co-occurrence relations (latent 12257) fires in all cases (Appendix Figure 12), but the 'spokes' act as modifiers that, in combination, activate in different cases. Further, we find that combinations of spoke latents, with this hub still active, correspond to distinct groups, which are predictable from a mixing of these basic categories (bold added for emphasis):
Figure 12: Latent compositional groups correlate with semantic groupings of contexts: (Left) PCA coloured by what combination of SAE latents are active for each case (note that feature is considered active only if above 1.5 activation strength). (Right) PCA coloured by context group (See Appendix Code 1 for grouping of semantic categories. The distinction between 'one of' and 'one of n' is the least clear, but this may in part be due to manual labelling of these examples missing potential implications of the total number far from the token that fires in the context.)
These categories form a continuum in the PCA, with the mixture of latents correlating strongly with the semantic mixtures (Figure 12). Thus qualitative descriptions of the number of items in a group appear to be encoded compositionally by SAE latents (Figure 13).
Figure 13: Relative Latent activation defines semantic groups: Mean activation for latents in each context group (See Appendix Code 1 for grouping of semantic categories).
There appears to be a relatively continuous shift in the activation of one spoke feature to another as we shift between semantic meanings (Figure 13). Motivated by this, we investigated how much the different semantic categories form a smooth continuum between 'some of' and 'all of'. We found that rather than compositional encoding of these categories by latents being merely Boolean, the relative strengths of the different latents encoded different qualitative quantities, again, as one would naively predict from the base categories. That is, if the activation of 'some of' is higher than 'all of' then one has a smaller quantity (e.g. 'many of') but if the activation of 'some of' is lower than 'all of' then one has a larger quantity (e.g. 'almost all of') (Figure 14).
Figure 14: Relative Latent activation defines semantic groups continuously and predictably: Mean activation for latents in each context group (See Appendix Code 1 for grouping of semantic categories), focussing only on those groups that are formed by composition of 'some of' (latent 15441) and 'all of' (latent 12649) 'spoke' latents along with 'hub' latent (latent 12257) that is present in all cases.
We see a similar cluster in layer 18 (cluster 59) with similar categories, suggesting that this persists between layers. We also see a similar cluster for layer 12 at a lower L0. The canonical 16K width SAE for Gemma-2-2b layer 12 (as accessed through SAElens) has a mean L0 of 80.47. For layer 12, there are also SAEs of this width with mean L0 of 22-445). For L0 of 22 we find a similar cluster with 5 nodes, and that it appears to form much more distinct categories (cluster 111), suggesting less compositional encoding.
Compositional encoding of distance by SAE features in short URLs
Figure 15: Cluster of SAE latents measuring position in url subdirectories in GPT2-Small. (Left) Cluster of extracted latents corresponding to tokens within url subdirectories e.g.'.twitter.com/|e|2zNEIdX' shown with Jaccard similarity (edge weight) as edge thickness, and the overall occurrence of a latent as the colour (light to dark). Each latent (node) is shown with the ID number (see Neuronpedia) and token factor (projection of the SAE decoder matrix onto the LM token embedding). (Right) PCA of prompts containing these SAE latents, highlighting the number of characters within the token that the latents are firing on.
In layer 8 of GPT2-small, using the 24546 width SAE from the latent splitting dataset on Neuronpedia, we find a co-occurrence cluster of 5 SAE latents that fires predominantly in the subdirectory of urls for social media e.g. '.twitter.com/|e|2zNEIdX' (cluster 125). The latents in this cluster fire predominantly on tokens of single or double character length.
Figure 16: Measurement of position with SAE latents (Top) PCA analysis of only those cases in which the token that activates the cluster is a single character, highlighting how far into the subdirectory the token fires (e.g. where the token that causes the SAE latent to activate is surrounded in '|', then in the string .twitter.com/|e|2zNEIdX is the 0th character, .twitter.com/e2|z|NEIdX is the third etc). (Centre) Mean activation of each SAE latent for each position in the url subdirectory. (Bottom) Activation of SAE latents (green) on different tokens within an example prompt (see Neuronpedia). Here we show lengths between 0 and 10 as there are very few examples longer than this, for plots without this filter see Appendix Figure 13
Focussing on the cases where the latent fires on a single character, we see that PCA separates the cases by the distance between the token and the beginning of the url subdirectory (e.g. in the string .twitter.com/|e|2zNEIdX is the 0th character of the subdirectory, .twitter.com/e2|z|NEIdX is the 2nd etc, where the token that causes the SAE latent to activate is surrounded in '|'). Measuring the activity of the cluster latents, we see that they activate for different sections of the url subdirectory (see Figure 16 and Neuronpedia list). This suggests that the composition of these latents may be used to recover the position of tokens in the url-subdirectory, but further work is needed to confirm the accuracy with which position is measured.
The exception to this is latent 19054, which activates predominantly on the lower-case characters of the string (Appendix Figure 14).
Co-occurrence relations detect sharing of 'day-of-the-week' features between SAE latents
Figure 17: Day-of-the-week cluster (GPT2-Small). (Left) Cluster of extracted latents corresponding to days of the week shown with Jaccard similarity (edge weight) as edge thickness, and the overall occurrence of a latent as the colour (light to dark). Each latent (node) is shown with the ID number (see Neuronpedia) and token factor (projection of the SAE decoder matrix onto the LM token embedding). (Right) PCA of prompts containing these SAE latents, highlighting those firing on tokens pertaining to a day of the week either alone (e.g. 'Monday') or with a space (e.g. ' Monday'). Note that in these and the following examples, while often the 'spokes' of the graph are associated with clusters in the PCA in a similar pattern, the order of features in the co-occurrence graph is not related to the order observed in the PCA.
In layer 0 we found a 8 latents cluster with a hub and spoke graph structure, where each latent promoted and activated on a day of the week (e.g. Monday) (see also these latents on Neuronpedia) (cluster 3240). Mapping out this subspace, using PCA on latent activations, we see a similar pattern to that observed by Engels' et al. (2024), where the latents form a ring of the days of the week in order in the PC2 vs PC3 direction. This suggests that the strength of the latent encodes the certainty that a token is e.g. a Monday, but that the direction of the latent encodes the relation of Monday to these other latents, namely that they have a correct ordering and are equally spaced.
In contrast to Engels' et al. (2024), we further observe that the activations form concentric rings, with the outer ring and inner rings corresponding to tokens that contain a day of the week with a space (e.g. ' Monday'), while the middle ring corresponds to a day of the week token without a space (e.g 'Monday') (Figure 17). Additionally, the very inner ring may also contain tokens that are a shortened version of the day e.g. ' Mon' still with a space (compare Figure 17 and Appendix Figure 15). Note that in these and the following examples, while often the 'spokes' of the graph are associated with clusters in the PCA in a similar pattern, the order of features in the co-occurrence graph is not related to the order observed in the PCA.
However, whether or not a token contains a space (e.g. 'Monday' vs ' Monday') is not encoded by the strength of the e.g. Monday latent activation (latent ID 3266), but rather the second ring is defined by the activation of the hub latent of the cluster (latent ID 8838) (Figure 18).
Figure 19: Latent activation for 'spoke' of the day-of-the-week cluster for the context '30 p.m.| Friday| and 6 a.m'. (Top left) Cluster of extracted latents corresponding to days of the week shown with Jaccard similarity (edge weight) as edge thickness, and the occurrence of a latent for this example prompt shown as the colour (light to dark). Each latent (node) is shown with the ID number (see Neuronpedia) and token factor (projection of the SAE decoder matrix onto the LM token embedding). (Top right) Activation of the latent corresponding to 'Friday' for all prompts (latent ID 3266). (Bottom left) Activation of latents for the example prompt, within the cluster highlighted in blue. (Bottom right) Position of the example prompt in the PCA (star).
Each 'spoke' latent defines one of the seven days of the week, and only fires on these tokens (Figures 18 , 19), with strength corresponding to radius in the PCA, and potentially certainty, given that these latents fire more strongly on e.g. ' Friday' than ' Fri' (Figure 17, Appendix Figure 15). These latents therefore are independently interpretable, as is desirable for useful latent extraction.
Figure 20: Latent activation for 'hub' of the day-of-the-week cluster for the context 'pped Crusaders |Friday|, Nov. 16 at'. (Top left) Cluster of extracted latents corresponding to days of the week shown with Jaccard similarity (edge weight) as edge thickness, and the occurrence of a latent for this example prompt shown as the colour (light to dark). Each latent (node) is shown with the ID number (see Neuronpedia) and token factor (projection of the SAE decoder matrix onto the LM token embedding). (Top right) Activation of the latent corresponding to the hub for all prompts (latent ID 8838). (Bottom left) Activation of latents for the example prompt, within the cluster highlighted in blue. (Bottom right) Position of the example prompt in the PCA (star).
However, the tokens that do not contain a space are detected by the activation of a composition of both the 'spoke' latent for the day and the 'hub' latent (latent ID 8838), with both latents being roughly equally active (Figure 20).
Figure 21: Comparison of latents for 'hub' and 'spoke'. (Left) Neuronpedia profile for spoke latent ID 14244 that appears to denote 'Tuesday' tokens. (Right) Neuronpedia profile for 'hub' latent latent ID 8838 which also activates primarily on 'Tuesday' tokens but we observe as denoting lack of a space in the token when seen in composition with the 'spoke' latents.
Analysed individually, the hub latent seems very similar to the 'spoke' Tuesday feature, and this role in a potentially compositional encoding would not be readily apparent without analysing this latent as part of the functional unit detected by this cluster, suggesting that in some cases these latents can be best understood as part of a larger functional unit defined at least in part by their co-occurrence relations (Figure 21). We observe very similar behaviour with months of the year in the same model and layer (see Appendix Figure 11). We also observe similar hub and spoke clusters where the hub denotes a different modification to the tokens that the spokes fire on, in layer 0 of GPT2-Small, such as: American vs British Spelling: hub activity appears to correspond to 'ise' vs 'ize', Words ending in 'ist' vs 'ism': hub activity appears to correspond to 'ist' vs 'ism' and Singular vs plural words for citizens of a country: hub activity appears to correspond to e.g. 'German' vs 'Germans'.
Other examples of apparent compositionality in GPT2-Small and Gemma-2-2b (found through a non-exhaustive qualitative search) are listed in Appendix: Case Studies and the SAE Latent Co-occurrence Explorer.
Encoding of continuous properties in feature strength without compositionality
In Gemma-2-2b, we also see cases where the latent strength correlates with a quantity (Figure 22, cluster 1370), or where a local code switches to one where there is encoding by latent strength (Appendix Figures 15, 16, Gemma-2-2b, layer 21, cluster 511), despite a cluster forming on something that it would be intuitive for a compositional code to form.
Figure 22: Cluster separating number words: (Left) PC1 vs PC3 of PCA of subspace mapped by SAE latents (Gemma-2-2b, layer 0, cluster 1370), colour is the number word in the activating token. (Right) Graph of co-occurrence relations between latents. Colour is the overall occurrence of the feature in the dataset, edge weight is Jaccard normalised co-occurrence rate.
For example, in cluster 1370 on layer 0 of Gemma-2-2b (16K width SAE), we see separation in PCA directions 1 and 3 of activations by number. Oddly, this cluster activates for numbers from 'two' to 'ten', but not 'one', and only for words ('two') not for digits ('2') (Figure 22).
In this cluster, activation strength of latents 8129 and 6449 corresponds to the size of the number, from two upwards (see Figure 23 and Appendix Figure 18). We find no other cluster for these words in this layer that separates them more cleanly, leaving the question of why distance in a url substring is handled by a potentially compositional encoding (see above), while this is apparently based on latent activation strength alone. This is not an isolated phenomenon, as we see a similar case for ordinal words (e.g. 'first', 'second', 'third'), but in that case we observe a switch from a local encoding of a latent per word to encoding apparently based on relative latent strength of latents that primarily fire on 'second' and 'third' (see Appendix Figure 16).
To investigate further, we train a linear probe to detect number words between one and ten, and not digits. We find that there is a direction in the activation space of the neurons that corresponds to number words from one to ten (see Appendix Figure 19). Searching for SAE latents that have high cosine similarity with this direction, we find that those latents that are within the cluster are represented in the top 10 latents most similar to the direction of the probe. However, there are also latents that are not present in the cluster, e.g. latent 9869, which is maximally activated by the word 'one' (Figure 24), which has higher cosine similarity.
This does not appear to be because we have incorrectly excluded these from the cluster, as these latents do not necessarily have high co-occurrence with other latents with similar direction to our probe. If we compare the rate of co-occurrence for pairs of latents in this group, to the mean cosine similarity of these pairs, we see correlation for those latents in the cluster, but many latents with high similarity for the probe that do not co-occur strongly with any of other latents (Figure 24, Appendix Figure 19). That is, these latents are missing from our cluster because they do not exhibit co-occurrence with other latents representing number words, rather than because our clustering method failed to group them correctly (Figure 24, Appendix Figure 19). This suggests that this example of less interpretable clustering may be related to feature absorption (Chanin et al., 2024), where the other latents that are maximally active for number words (i.e. 'two', 'three') etc are somehow 'sharing' the properties of these numbers while the latent that is maximally activated by 'one' is not. This has been shown to be more common in JumpReLU SAEs (Karnoven et al., 2024) such as those used here for Gemma-2-2b from Gemma-Scope (Lieberum et al., 2024, Rajamanoharan et al., 2024a).
Figure 24: Latents that are related to number words cluster only in a subset of cases: (Left) Top 10 SAE latents in layer 0 by cosine similarity with the direction of a linear probe trained to detect neuron activations in this layer that are associated with number words from 'one' to 'ten' and exclude digits, red bars represent those latents in the co-occurrence cluster in Figure 22. (Right) Correlation between the co-occurrence (y-axis) and the mean cosine similarity with the linear probe direction for pairs of these features. Red points represent pairs that have edges (i.e. strong co-occurrence) in the cluster. We label pairs between the top 5 latents from the left hand plot. Note that the most similar latent to the linear probe (latent 9869) has a very low rate of co-occurrence with other latents associated with number words, despite high cosine similarity to the probe (bottom right).
Ambiguity of beliefs may cause some cases of co-occurrence
Distinguishing between uses of the word 'how' in GPT2-Small
Figure 25: Cluster of SAE latents disambiguating uses of 'how' in GPT2-Small. (Left) Cluster of extracted latents corresponding to uses of the word 'how' shown with Jaccard similarity (edge weight) as edge thickness, and the overall occurrence of a latent as the colour (light to dark). Each latent (node) is shown with the ID number (see Neuronpedia) and token factor (projection of the SAE decoder matrix onto the LM token embedding). (Right) PCA of prompts containing these SAE latents, highlighting those whose context contains the word 'how' or 'how' and a question mark ('?').
In layer 8 of GPT2-Small, using the 24576 width SAE from the feature splitting dataset on Neuronpedia, we find a co-occurrence cluster of 5 SAE latents (cluster 787; see also on Neuronpedia). These latents fire predominantly on the word 'how', with those that are part of a question clustering together (Appendix Figure 20).
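For reference, the Jaccard-normalised edge weights used throughout can be computed from a binary latent-activation matrix roughly as follows; this is an illustrative sketch (acts_binary and the edge threshold are hypothetical), not the exact pipeline:

import numpy as np
import networkx as nx

def jaccard_graph(acts_binary, threshold=0.2):
    """Build a co-occurrence graph from a (n_tokens, n_latents) binary matrix.

    Edge weight is the Jaccard similarity of the sets of tokens on which
    each pair of latents fires: |A & B| / |A | B|.
    """
    co = acts_binary.T @ acts_binary            # |A & B| for every pair
    counts = np.diag(co)                        # |A| for every latent
    union = counts[:, None] + counts[None, :] - co
    with np.errstate(divide="ignore", invalid="ignore"):
        jaccard = np.where(union > 0, co / union, 0.0)

    g = nx.Graph()
    n = acts_binary.shape[1]
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if jaccard[i, j] >= threshold:
                g.add_edge(i, j, weight=jaccard[i, j])
    return g

# Clusters could then be taken as, e.g., the connected components
# of the thresholded graph:
# clusters = list(nx.connected_components(jaccard_graph(acts_binary)))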
Figure 26: Use of 'how' to mean degree for example 'I think it shows just | how| difficult this issue is': (Top left) Cluster of extracted latents corresponding to uses of the word 'how' shown with Jaccard similarity (edge weight) as edge thickness, and the occurrence of a latent for this example prompt shown as the colour (light to dark). Each latent (node) is shown with the ID number (see Neuronpedia) and token factor (projection of the SAE decoder matrix onto the LM token embedding). (Top right) Position of the example prompt in the PCA (star). (Bottom left) Activation of latents for the example prompt, within the cluster highlighted in blue. (Bottom right) Examples of contexts for tokens leading to activation of latent ID 21547.
This set of latents maps a space that covers different grammatical uses of the word 'how' as an interrogative adverb, either to:
ask a direct question (e.g. 'How did you guys meet?')
indicate manner (e.g. 'guidelines on how doctors can ethically use')
indicate degree (e.g. 'how difficult this issue is')
We do not see the fourth interrogative adverb use of 'how', the exclamative (e.g. 'How very interesting!').
We note that, unlike the case of the days of the week, not all extrema of the PCA are defined by a single latent, with the 'manner' case leading to activation of both latents 11726 and 23664. Nevertheless, these latents only activate in these cases, so they remain generally independently interpretable.
Sampling points from one PCA extremum to another, we find that latent activation within the cluster is split across multiple latents for cases between the extremes. The activation is predominantly split between the latents that define the different extremes, but the other latents in the cluster also activate weakly. This may indicate that the ambiguity of prompts far from the extremes causes multiple latents to activate to accommodate the different potential readings, representing 'uncertainty' in the model (Figure 27, Appendix Figure 23, Appendix Figure 24). Notably, the examples where multiple latents are active are those where it is harder to distinguish a question from a statement (e.g. 'how I train you' could be either, whereas 'How did you guys meet?' and 'how difficult this issue is' are clear-cut, see Figure 27).
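A sketch of how such a sweep between extrema might be performed, assuming cluster_acts holds the activations of the cluster's latents for each prompt in the PCA (names hypothetical):

import numpy as np
from sklearn.decomposition import PCA

# cluster_acts: (n_prompts, n_cluster_latents) activations of the cluster's
# latents on tokens where at least one of them fires (hypothetical input).

def activations_along_pc(cluster_acts, pc=1, n_samples=10):
    """Project prompts into PCA space, then pick prompts at evenly spaced
    positions along one principal component and return their latent activations."""
    coords = PCA(n_components=3).fit_transform(cluster_acts)
    axis = coords[:, pc]
    # Pick the prompt nearest to each of n_samples evenly spaced targets.
    targets = np.linspace(axis.min(), axis.max(), n_samples)
    idx = [int(np.abs(axis - t).argmin()) for t in targets]
    return axis[idx], cluster_acts[idx]  # positions and per-latent activations

# Between the extremes, activation is typically split across several latents
# rather than concentrated in the latent defining either extremum.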
Figure 27: Change in latent activation for examples on the continuum from 'how' as a question to 'how' as degree: (Top left) Position of examples in PCA. (Top centre) Activation of SAE latents within the cluster. (Top right) Activation of SAE latents within the cluster shown on the cluster graph. (Bottom) Activation of the main latent relating to 'how' as a question (latent ID 817) and 'how' as a matter of degree (latent ID 21576) for examples in the animation. For the trend in all points along PC2 see Appendix Figure 24.
Distinguishing the type of entity whose possession is indicated by an apostrophe in Gemma-2-2b
In English, an apostrophe can be used to denote that the preceding word is a possessor, e.g. 'It was Adam's apple'. The possessor can be a named person, a generic person (e.g. 'The Defendant's case was rejected'), a collective (e.g. 'The company's profits decreased') or a non-human or inanimate object (e.g. 'misalignment between the valve'|s| stem and seat'). We observe a cluster in layer 12 of Gemma-2-2b (cluster 4334) that appears to disambiguate these cases, with a graph of co-occurrences similar to that observed in the prior example (Figure 28). Its latents fire either on the 's' after the apostrophe (e.g. valve'|s|) or on the apostrophe itself in cases where the 's' is conventionally omitted (e.g. sellers|'|).
Figure 28: Cluster that disambiguates possessor denoted by apostrophe: (Left) PCA of subspace mapped by SAE latents (Gemma-2-2b, layer 12, cluster 4334), colour is most active latent. (Right) Graph of co-occurrence relations between latents. Colour is the overall occurrence of the feature in the dataset, edge weight is Jaccard normalised co-occurrence rate.
The vertices of the PCA correspond to the maximal activity of three of the latents, distinguishing possession by:
a named person or group
a generic person
a non-human or inanimate object
See Figure 29.
Figure 29: Subspace maps possession by named person or group, generic person, or non-human or inanimate objects: (Left) SAE latent activation strength for all examples in the PCA for latent 5799 (top), latent 9754 (centre) and latent 4572 (bottom). (Right) Context of tokens activating latents in the cluster for these extrema of the PCA. See also Appendix Figure 25.
Between these vertices we see an approximately linear decay of the primary latent of one vertex while the latent of the neighbouring vertex increases in activity (see Figure 30 and Appendix Figure 26), suggesting that, as in GPT2-Small, Gemma-2-2b SAE latents co-occur more strongly in cases of ambiguity between different interpretations of a token.
Figure 30: Change in latent activation for examples on the continuum from named persons or groups to generic persons: (Top left) Position of examples in PCA. (Top centre) Activation of SAE latents within the cluster. (Top right) Activation of SAE latents within the cluster shown on the cluster graph. (Bottom) Activation of the latents for selected points between extrema of the PCA. See also Appendix Figure 26.
We observe many such cases of apparent disambiguation in both GPT2-Small, and Gemma-2-2b, see Appendix: Case Studies and the SAE Latent Co-occurrence Explorer.
Discussion
The ideal features to describe an LM would be independent and linear (Park et al., 2023), but it may not be possible to extract such features. SAE latents have many desirable properties for explaining LM behaviour based on their internals, but recent work has shown that they are not always linear (Engels et al., 2024). In this work we investigate whether SAE latents fire independently, how this depends on SAE size and architecture, and what this means for SAE latent interpretability.
First, we show that SAE latents co-occur both more and less than expected if they were independent, for both ReLU (Bloom, 2024, GPT2-Small) and JumpReLU (Gemma-Scope, Gemma Team, 2024, Lieberum et al., 2024, Rajamanoharan et al., 2024a) SAEs. Second, we find that this phenomenon becomes less prevalent as SAE width increases.
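Concretely, under independence the expected co-occurrence rate of two latents is the product of their individual firing rates; a minimal sketch of the comparison, with a hypothetical binary activation matrix:

import numpy as np

def cooccurrence_vs_expected(acts_binary):
    """Compare observed pairwise co-occurrence rates with the rates expected
    if latents fired independently (product of marginal firing rates)."""
    n_tokens = acts_binary.shape[0]
    p = acts_binary.mean(axis=0)                  # per-latent firing rate
    observed = (acts_binary.T @ acts_binary) / n_tokens
    expected = np.outer(p, p)                     # independence assumption
    return observed, expected                     # compare off-diagonal entries

# Pairs of latents co-occurring far above (or below) the expected rate are
# the candidates for clustering (or mutual exclusion).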
However, we also observe that where latents co-occur more than expected, i.e. in co-occurrence clusters, these clusters map out interpretable subspaces, with interpretations including: days of the week, recapitulating Engels et al. (2024); disambiguation of the grammar of a token; and measurement of the position of tokens within a string.
We observe two subgroups of clusters, which appear to be driven primarily either by compositionality in the underlying LM features, or by ambiguity and multiple meanings of words in natural language. The former is particularly surprising, given that SAEs are biased against recovering underlying features that act compositionally, and that such recovery has not been seen in toy examples (Anders et al., 2024). This suggests that SAEs can extract compositionally encoded features, and that co-occurrence clustering may be a method to detect this behaviour. Indeed, in some cases it may be necessary to consider clusters as a whole in order to properly interpret SAE latents. For example, the role that the hub latent in the days-of-the-week cluster (latent 8838) plays in composition with the other day-of-the-week latents is not apparent from examining it alone, as it appears very similar to the leaf latents (e.g. latent 3266) when examining, e.g., promoted tokens and max activating prompts. Similarly, we observe a cluster with many latents that maximally activate on the token 'first' but that in composition appear specialised to distinguish 'first', 'second' and 'third' (Appendix Figure 17). In other cases, however, the composition is predictable from the activity of latents examined alone: for qualitative amounts in Gemma-2-2b, the relative activation of latents active for 'some of' (latent 15441) and 'all of' (latent 12649) correlates predictably and continuously with a range of tokens from 'many of' to 'almost all of'.
We also observe clusters that form on 'ambiguous' tokens, i.e. those with multiple meanings, e.g. 'how'. This multi-layered meaning of words already shows signs of being encoded in the decoder weight structure of SAE latents (Bussmann et al., 2024, Shai et al., 2024), but here we observe a potential mechanism by which different meanings can be compared and weighted against one another. It is unclear how this changes as SAE width increases: there may be latents dedicated to more fine-grained meanings, but a fixed set of such latents is unlikely to represent as many weightings between possible meanings as a mixture of co-occurring activity can.
Fortunately, within the subspaces mapped out by these clusters, SAE latents remain largely independently interpretable, so we can be optimistic about the potential for SAE latents in general to be interpretable at scale. Furthermore, this phenomenon appears to decrease in relevance as SAE width increases. However, given the expense of training large SAEs, being able to find and interpret clusters will remain an important aspect of SAE-based mechanistic interpretability in many cases. Finally, recent work has suggested that the optimal SAE size depends on the task (Karvonen et al., 2024). As a compositional encoding may be easier to interpret than a large local code, our work shows another potential use of smaller SAEs.
Future Work
Alternate LM and SAE architectures
This work focussed on GPT2-Small (Radford et al., 2019), with SAEs trained with a standard ReLU (Bloom, 2024), and Gemma-2-2b (Gemma Team, 2024), using Gemma Scope SAEs (Lieberum et al., 2024), which use JumpReLU (Rajamanoharan et al., 2024a). We find that the clusters in Gemma-2-2b tend to be less interpretable in general, and that, unlike GPT2-Small, the most active latents in a subspace mapped by a cluster are less likely to be the latents within the cluster. We also observe that latents that would be expected to co-occur do not, e.g. latents for 'two', 'three' etc. co-occur, but not the latent for 'one' in layer 0 of Gemma-2-2b (see Figure 24). It is unclear whether this is a property of the SAE or of the underlying model. This could be clarified by expanding this work to SAEs trained on, e.g., Llama3.1-8b, and by comparing other SAE architectures trained on the same models, e.g. BatchTopK (Bussmann et al., 2024), GatedSAE (Rajamanoharan et al., 2024b) and end-to-end (e2e) SAEs (Braun et al., 2024).
Effect of SAE Sparsity on co-occurrence
Similarly, Gemma Scope SAEs have been released with different levels of L0 sparsity (Lieberum et al., 2024). We find, in the case of clustering on qualitative amounts (see Figure 11), that similar clusters exist for lower-L0 SAEs that appear to be less compositional. It may be fruitful to explore the effects of sparsity on co-occurrence further, with the hypothesis that sparser SAEs (lower L0) will lead to less compositional encoding and thus fewer co-occurrence clusters of this type.
Understanding the drivers of co-occurrence
Why co-occurrence occurs, why it decreases with larger SAEs, and whether this will continue until all latents are independent or some co-occurrence is unavoidable, all require further exploration. In particular, composition-driven and ambiguity-driven co-occurrence may display different patterns in this regard.
It seems plausible that composition may occur especially in smaller SAEs, where only coarse-grained latents can be learned. This suggests a potential cause for SAE latent co-occurrence: small SAEs, which can only learn a small number of latents, are likely to extract more general latents than large SAEs, e.g. a small SAE might have a 'red' latent and a 'circle' latent, whereas a larger SAE could learn a more fine-grained latent such as 'red circle'. Thus, although SAE latents are optimised for sparsity and therefore biased towards a local-code representation (Olah, 2023) of the true, underlying model features (Till et al., 2024, Anders et al., 2024), a small SAE cannot but have some compositionality if it is to minimise reconstruction loss.
For example, the url-subdirectory cluster separates tokens that are only a single character apart, in strings that are approximately 10 characters long, with only 5 latents, suggesting a compositional encoding; a larger SAE may be able to dedicate a single latent to each position in the string, obviating the need for this. Thus, this may not reflect composition in the underlying LM at all. One way to test this is to check whether these types of clusters occur less often in larger SAEs. As compositional codes can be more interpretable, this may prove to be an advantage of smaller-width SAEs.
A cluster formed due to ambiguity, on the other hand, may only grow in size as SAE latents split (Makelov et al., 2024): as more potential meanings are assigned explicit latents, these may nevertheless need to be active at the same time to capture intentional ambiguity in, for example, word-play, humour and poetry. Conversely, a large SAE might be able to assign separate latents for a token used unambiguously as well as for each possible ambiguous combination, although this would not allow the meanings to be weighted against one another. This latter case is complicated by the observation that, rather than splitting into single representations of a concept, latents may instead have certain functions 'absorbed' by other latents as splitting occurs (Chanin et al., 2024). We may also expect such co-occurrence to be driven by other kinds of ambiguity, e.g. when a model is 'considering' two potential courses of action.
Accuracy of compositional encoding
We find a cluster that appears to function as a way of measuring position in a string, in this case the subdirectory of a url. If this is truly acting compositionally, we would expect to be able to recover position with greater accuracy than the number of latents involved would allow under a local code, and so we aim to compare how well classifiers can recover token position from SAE latent activations versus the underlying neuron activations.
If this is the case, it further raises the questions of whether there are other such specialised positional latents, whether these latents have other properties in common with known positional latents (Chughtai et al., 2024), and whether this provides the basis of a general method for finding positional latents.
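A sketch of the planned comparison, assuming aligned arrays of SAE latent activations, neuron activations and ground-truth positions (all names hypothetical):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical inputs: for each token inside a url subdirectory,
# sae_acts:    (n_tokens, n_cluster_latents) SAE latent activations
# neuron_acts: (n_tokens, d_model) underlying neuron activations
# positions:   (n_tokens,) ground-truth character position in the subdirectory

def position_recovery_accuracy(features, positions):
    """Cross-validated accuracy of a linear classifier predicting token position."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, features, positions, cv=5).mean()

# If the 5 cluster latents encode position compositionally, accuracy from
# sae_acts should exceed chance for more positions than there are latents:
# acc_sae = position_recovery_accuracy(sae_acts, positions)
# acc_neurons = position_recovery_accuracy(neuron_acts, positions)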
Similarly, initial extraction of clusters from co-occurrence is made difficult by the many low-weight edges and high-degree nodes. Extracting the highest-degree nodes yielded latents with high cosine similarity to the position embedding matrix, suggesting these are positional latents, as seen by Chughtai et al., 2024. This may be another method for extracting positional latents more generally.
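A sketch of this check, assuming access to the SAE decoder matrix W_dec and the model's learned position-embedding matrix W_pos (e.g. as exposed by TransformerLens); the names are hypothetical:

import numpy as np

def max_cosine_with_positions(W_dec, W_pos):
    """For each SAE latent, the maximum cosine similarity between its decoder
    direction and any row of the position embedding matrix."""
    dec = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    pos = W_pos / np.linalg.norm(W_pos, axis=1, keepdims=True)
    return np.abs(dec @ pos.T).max(axis=1)

# High-degree nodes in the raw co-occurrence graph can then be checked
# against this score to flag candidate positional latents.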
Can we relate co-occurrence to causal mediation in a meaningful way?
Figure 31: Example of tree of rules for classification. Simplified case of classifying a fruit into botanical categories.
We find examples of co-occurrence creating a subspace that can be understood as performing classification, e.g. the 'how' cluster classifying different grammatical uses of the word. In cases with more defined rules, e.g. fruit classification (Figure 31), one would expect latents to form clear, nested relationships that can be recovered by our method and used to derive the underlying ruleset. For example, in this toy case 'peach' and 'plum' would co-occur with one another, and also with both 'fleshy fruit' and 'large pit'; but 'grape' and 'tomato' would co-occur only with 'fleshy fruit'. Thus the classification rule-set can be recovered from the nesting of latent co-occurrence, and this could serve to identify rule-based classifying circuits within LM. However, this assumes that both the coarse-grained (e.g. 'fleshy fruit') and fine-grained (e.g. 'plum') latents exist within the same SAE of a given width, which might not be the case (see Bussmann et al., 2024).
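A toy sketch of how the nesting could be read off: if each latent is summarised by the set of inputs on which it fires, strict containment of firing sets recovers the hierarchy (the sets below are illustrative, not measured):

# Toy illustration of recovering a classification hierarchy from co-occurrence.
firing_sets = {
    "fleshy_fruit": {"peach", "plum", "grape", "tomato"},
    "large_pit":    {"peach", "plum"},
    "peach":        {"peach"},
    "plum":         {"plum"},
}

def nesting(sets):
    """Return (child, parent) pairs where the child's firing set is strictly
    contained in the parent's, i.e. the child always co-occurs with the parent."""
    pairs = []
    for child, cs in sets.items():
        for parent, ps in sets.items():
            if child != parent and cs < ps:
                pairs.append((child, parent))
    return pairs

print(nesting(firing_sets))
# e.g. ('peach', 'large_pit'), ('peach', 'fleshy_fruit'),
#      ('large_pit', 'fleshy_fruit'), ...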
Conclusion
SAE latents would be most easily interpretable if they were independent. We find that this is not always the case: instead, latents form clusters of co-occurrence that map out interpretable but often non-linear subspaces. Nevertheless, the latents remain independently interpretable in many cases, and this behaviour decreases in prevalence as SAE width increases. Despite this, we observe cases where one can only understand a latent in the context of its co-occurrence relations. Understanding the drivers of this will be an important part of ensuring that SAEs, and mechanistic interpretability, are useful for the wider goal of safe AI: to predict and correct unsafe behaviour, we require model features that do not correspond to only one part of that behaviour, or apply only in certain contexts, but rather support an exhaustive understanding of all routes to unsafe behaviour. This work shows one part of how this can be accomplished, by demonstrating how SAE latents can form larger functional units that we can detect and understand.
Author Contributions
Conceptualisation: MAC, JB; Data Curation: MAC; Investigation: MAC, JB; Methodology: MAC, HB, JB; Project Administration: JB; Resources: JB; Software development: MAC, HB, JB; Supervision: JB; Visualisation: MAC, HB, JB; Writing – Original Draft: MAC, JB; Writing - Review & Editing: MAC, HB, JB.
Acknowledgements
Thanks to Clem von Stengel, Andy Arditi, Jan Bauer, Kola Ayonrinde and Fernando Rosas for useful discussions, and Owen Parsons for feedback on the draft. Thanks to the entire PIBBSS team for their support and for providing funding for this project. Thanks also to grant providers funding Joseph Bloom during the time he mentored this project.
Appendix
Case Studies: Interpretable clusters that can be explored in the SAE Latent Co-occurrence Explorer.
This is not an exhaustive list, rather a subset of apparently interpretable subspaces found by qualitative analysis.
Compositionality
GPT2-Small
Gemma-2-2b
Ambiguity
GPT2-Small
Gemma-2-2b
Number and fraction of SAE latents active in Gemma-2-2b for different SAE widths and L0
Appendix Figure 1: (Left) Change with SAE width (Right) Change with SAE mean L0. (Red) Number of latents active. (Blue) fraction of latents active.
SAE latent co-occurrence vs expectation for different SAE widths in layer 8 of GPT2-Small
Appendix Figure 2: (Left) Boxplot of SAE latent co-occurrence per token for different SAE widths with y rescaled to log10 for GPT2-Small. (Right) Density of rates of co-occurrence for a sample of SAE sizes. Note that very sparse and potentially dead latents occur very rarely in large SAEs leading to 'spikes' with a left skew. Blue is observed co-occurrence, red is expected co-occurrence.
SAE latent co-occurrence vs expectation for different SAE widths in layer 12 of Gemma-2-2b
Appendix Figure 3: (Left) Boxplot of SAE latent co-occurrence per token for different SAE widths with y rescaled to log10 for Gemma-2-2b. (Right) Density of rates of co-occurrence for a sample of SAE sizes. Note that very sparse and potentially dead latents occur very rarely in large SAEs leading to 'spikes' with a left skew. Blue is observed co-occurrence, red is expected co-occurrence.
SAE latent co-occurrence vs expectation for different SAE L0 in layer 12 of Gemma-2-2b
Appendix Figure 4: (Left) Boxplot of SAE latent co-occurrence per token for different SAE L0 with y rescaled to log10 for Gemma-2-2b. (Right) Density of rates of co-occurrence for a sample of SAE sizes. Note that very sparse and potentially dead latents occur very rarely in large SAEs leading to 'spikes' with a left skew. Blue is observed co-occurrence, red is expected co-occurrence.
Mean subgraph size with SAE width in GPT2-Small and Gemma-2-2b
Appendix Figure 5: (Left) mean size of clusters as SAE width increases vs mean size of clusters of size greater than 1 (i.e. excluding isolated latents) (dashed) in GPT-2. (Right) in Gemma-2-2b.
Mean subgraph size with SAE L0 for Gemma-2-2b
Appendix Figure 6: (Left) mean size of clusters as SAE mean L0 increases vs mean size of clusters of size greater than 1 (i.e. excluding isolated latents) (dashed).
Mean feature and subgraph sparsity vs SAE width for GPT2-Small and Gemma-2-2b
Appendix Figure 7: L0 sparsity for individual SAE latents vs clusters (considering a cluster as active if any of the latents it is composed of are active) vs SAE width (dashed) (Left, GPT2-Small, right, Gemma-2-2b).
Mean feature and subgraph sparsity vs SAE L0 for Gemma-2-2b
Appendix Figure 8: L0 sparsity for individual SAE latents vs clusters (considering a cluster as active if any of the latents it is composed of are active) vs SAE L0 (dashed).
Fraction of SAE latents in a cluster vs SAE width in GPT2-Small and Gemma-2-2b
Appendix Figure 9: (Left) fraction of latents in cluster as SAE width increases in GPT-2. (Right) in Gemma-2-2b.
Fraction of SAE latents in a cluster vs SAE L0 in Gemma-2-2b
Appendix Figure 10: (Left) fraction of latents in cluster as SAE L0 increases in Gemma-2-2b.
Compositionality in Month of the year in layer 0 of GPT2-Small, cluster 2644
Appendix Figure 11: Month-of-the-year cluster. (Top left) Cluster of extracted latents corresponding to months of the year shown with Jaccard similarity (edge weight) as edge thickness, and the overall occurrence of a latent as the colour (light to dark). Each latent (node) is shown with the ID number (see Neuronpedia) and token factor (projection of the SAE decoder matrix onto the LM token embedding). (Top right) PCA of prompts containing these SAE latents, highlighting those firing on tokens pertaining to a month of the year either alone (e.g. 'October') or with a space (e.g. ' October'). (Bottom left) Activation strength of 'spoke' latent ID 3877. (Bottom right) Activation strength of 'hub' latent ID 10676.
We observe a similar phenomenon with months of the year, with only two rings of latents, and the inner ring once again being defined by the hub latent (latent ID 10676), and this hub latent once again denoting the lack of a space in the activating token (Appendix Figure 11, cluster 2644 in our app). Interestingly, this cluster also lacks a 'spoke' latent for the month of May, despite there being an SAE latent that appears to correspond to this month (latent ID 21089). This SAE latent is isolated (does not form a cluster with anything else).
SAE latent activation strength in PCA for 'n of' cluster (Gemma-2-2b, layer 12, cluster 4740)
Appendix Figure 12: Activation strength for each SAE latent in Gemma-2-2b, layer 12, cluster 4740 for PCA directions PC2 vs PC3.
Activations with length within url subdirectory (GPT2-Small, layer 8, cluster 125)
Appendix Figure 13: (Top) PCA analysis of only those cases in which the token that activates the cluster is a single character, highlighting how far into the subdirectory the token fires (e.g. where the token that causes the SAE latent to activate is surrounded by '|', then in the string .twitter.com/|e|2zNEIdX the 'e' is the 0th character, and in .twitter.com/e2|z|NEIdX the 'z' is the 2nd, etc.). (Bottom) Mean activation of each SAE latent for each position in the url subdirectory.
Measurement of token position in url subdirectory in layer 8 of GPT2-Small with 24576 width SAE (cluster 125)
Appendix Figure 14: Lowercase character detection in url subdirectories. (Left) number of times a latent occurs in samples of prompts containing activations of the SAE latents in cluster 125 for GPT2-Small, 24K width SAE, layer 8 (see list here). (Right) Maximum activating tokens shown in green for latent ID 19054.
Day of the week latent in layer 0 of GPT2-Small with 24576 width SAE
Appendix Figure 15: Day-of-the-week cluster showing only full days of the week. PCA of prompts containing these SAE latents, highlighting those firing on tokens pertaining to a day of the week either alone (e.g. 'Monday') or with a space (e.g. ' Monday') but not including shortened cases e.g. 'Mon'. Note how compared to Figure 17 this mainly affects the innermost ring, i.e. where the 'spoke' latent activation is weakest.
Ordinal numbers (e.g. first, second, third) show switch from local code to encoding in strength of latent activation (Gemma-2-2b, layer 21, cluster 511)
Examining the 16K SAE for layer 21 of Gemma-2-2b, we find a cluster of latents that activate on ordinal words (e.g. 'first', 'second', 'third') (cluster 511). We observe that the PCA of prompts containing tokens that activate these latents separates into clusters in the order of these ordinal words up until 'fourth'/'fifth', after which there is no longer clear separation (see Appendix Figure 16). To clarify this, we also perform PCA for a custom set of prompts with equal numbers of ordinal words from 'Zeroeth' to 'Tenth' (Appendix Figure 17).
Appendix Figure 16: (Top) PCA of subspace mapped by SAE latents, colour is ordinal number in the activating token. (Bottom, left) Graph of co-occurrence relations between latents. Colour is the overall occurrence of the feature in the dataset, edge weight is Jaccard normalised co-occurrence rate. (Bottom, right) Mean activation strength of SAE latents within cluster for different ordinal numbers.
We observe that the lower ordinal words have latents that activate more specifically, although note that three of the latents activate strongly on the word 'first' in Neuronpedia max activating examples (latents 2795, 6539 and 7341), whereas higher numbers more strongly activate the latent associated with higher numbers (latent 901). However, analysing these as a cluster, the relative strengths of their activations suggest a different interpretation, with latent 7341 activating most strongly on 'first', but latents 2795 and 6539 activating roughly equally on all ordinal words (see Appendix Figures 16 and 17). This again shows how interpreting latents that form co-occurrence clusters as a group, and in context, can reveal the differences between apparently redundant SAE latents.
Thus there is a transition from a local encoding of the words, with a single latent per word, to an encoding by the strength of latent 901, or possibly the relative strength of latent 901 vs latent 523 (Appendix Figure 16).
Appendix Figure 17: (Top) PCA of subspace mapped by SAE latents, colour is ordinal number in the activating token, using custom prompts to ensure equal numbers of ordinal words in the dataset. (Bottom, left) Mean activation strength of SAE latents within cluster for different ordinal numbers. (Bottom, right) Relative mean activation strength of SAE latents within cluster for different ordinal numbers (normalised to sum to one).
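The normalisation used in Appendix Figures 17 and 18 is a simple row-wise rescaling; a sketch, with a hypothetical matrix of mean activations per ordinal word:

import numpy as np

# mean_acts: (n_words, n_cluster_latents) mean activation of each cluster
# latent on tokens for each ordinal word (hypothetical input).

def relative_strength(mean_acts):
    """Normalise each row to sum to one, so rows show the relative share of
    cluster activation carried by each latent for that ordinal word."""
    totals = mean_acts.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(totals > 0, mean_acts / totals, 0.0)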
Relative latent strengths in Gemma-2-2b layer 0 cluster 1370
Appendix Figure 18: Mean relative activation strength of latents in Gemma-2-2b, layer 0, cluster 1370 for different number words. Normalised to be equal strength to control for fewer cases of higher numbers and show relative activation strength.
Linear Probes for Layer 0 Cluster 1370 in Gemma-2-2b
Appendix Figure 19: (Left) Training metrics for linear probe for number words in layer 0 of Gemma-2-2b. (Right) Heatmap of raw co-occurrence between the top 10 latents most cosine-similar to the direction of the probe; red highlights indicate pairs connected in Layer 0 Cluster 1370 in Gemma-2-2b.
Classification of the uses of the word 'how' in layer 8 of GPT2-Small with 24576 width SAE (cluster 787)
Appendix Figure 20: SAE latent activation strength. Each subplot shows the activity of one of the SAE latents for each point in the PCA (blue is low, yellow is high) in GPT2-Small layer 8, 24K width SAE, cluster 787.
Exploration of the subspace disambiguating 'how' in layer 8 of GPT2-Small with 24576 width SAE (cluster 787)
Appendix Figure 21: Use of 'how' to mean manner for example 'to craft guidelines on| how| doctors can ethically use': (Top left) Cluster of extracted latents corresponding to uses of the word 'how' shown with Jaccard similarity (edge weight) as edge thickness, and the occurrence of a latent for this example prompt shown as the colour (light to dark). Each latent (node) is shown with the ID number (see Neuronpedia) and token factor (projection of the SAE decoder matrix onto the LM token embedding). (Top right) Position of the example prompt in the PCA (star). (Bottom left) Activation of latents for the example prompt, within the cluster highlighted in blue. (Bottom right) Examples of contexts for tokens leading to activation of latent ID 11726 and latent ID 23664 in layer 8 of GPT2-Small with 24576 width SAE (cluster 787).
Appendix Figure 22: Use of 'how' as a question for example '. AG:| How| did you guys meet?': (Top left) Cluster of extracted latents corresponding to uses of the word 'how' shown with Jaccard similarity (edge weight) as edge thickness, and the occurrence of a latent for this example prompt shown as the colour (light to dark). Each latent (node) is shown with the ID number (see Neuronpedia) and token factor (projection of the SAE decoder matrix onto the LM token embedding). (Top right) Position of the example prompt in the PCA (star). (Bottom left) Activation of latents for the example prompt, within the cluster highlighted in blue. (Bottom right) Examples of contexts for tokens leading to activation of latent ID 817 in layer 8 of GPT2-Small with 24576 width SAE (cluster 787).
Feature Activation Strength along the axes of PCA for 'how' cluster (GPT2-Small, Layer 8, 24K width SAE, cluster 787)
Appendix Figure 23: Change in latent activation for examples on the continuum from 'how' as a matter of degree to 'how' as a matter of manner: (Top left) Position of examples in PCA. (Top centre) Activation of SAE latents within the cluster. (Top right) Activation of SAE latents within the cluster shown on the cluster graph, in GPT2-Small, Layer 8, 24K width SAE, cluster 787.
Appendix Figure 24: Change in latent activation for examples on the continuum from 'how' as a question to 'how' as degree: (Top left) Position of examples in PCA. (Top centre) Activation of SAE latents within the cluster. (Top right) Activation of SAE latents within the cluster shown on the cluster graph. (Bottom) Activation of the main latent relating to 'how' as a question (latent ID 817) and 'how' as a matter of degree (latent ID 21576) for all examples in the entire PCA, plotted against PC2, in GPT2-Small, Layer 8, 24K width SAE, cluster 787.
Strength of all latents for 'possessor' cluster (Gemma-2-2b layer 12 cluster 4334)
Appendix Figure 25: Activation strength for each SAE latent in Gemma-2-2b layer 12 cluster 4334 for PCA directions PC2 vs PC3.
Feature Activation Strength along the axes of PCA for 'possessor' cluster (Gemma-2-2b, layer 12, 16K width, cluster 4334)
Appendix Figure 26: (Top left) Position of examples in PCA. (Top centre) Activation of SAE latents within the cluster. (Top right) Activation of SAE latents within the cluster shown on the cluster graph. (Bottom) Activation of the latents for selected points between extrema of the PCA (from named persons to inanimate or non-human objects) in Gemma-2-2b layer 12 cluster 4334.
Code for grouping categories in Gemma-2-2b Layer 12 Cluster 4740 ('one of')
import pandas as pd

# pca_df is assumed to be a DataFrame with one row per example, and
# context_col the name of the column holding the text context around the
# activating token (both defined earlier in the analysis pipeline).
# Non-capturing groups (?:...) avoid pandas' warning that str.contains
# only matches, and does not extract, capture groups.
number_words = (
    r"\b(?:two|three|four|five|six|seven|eight|nine|ten"
    r"|first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth"
    r"|these|least|any|each)\b"
)

groups = {
    # 'one of' with no other number word in the context
    "one_of": pca_df[context_col].str.contains(r"\bone of\b")
    & ~pca_df[context_col].str.contains(number_words),
    # 'one of' alongside an explicit count or ordinal, e.g. 'one of three'
    "one_of_n": pca_df[context_col].str.contains(r"\bone of\b")
    & pca_df[context_col].str.contains(number_words),
    # paired or distributive quantifiers
    "each_of": pca_df[context_col].str.contains(
        r"\b(?:both of|each of|neither of|every one of|either of)\b"
    ),
    "some_of": pca_df[context_col].str.contains(r"\bsome of\b"),
    # near-universal quantifiers
    "most_of": pca_df[context_col].str.contains(
        r"\b(?:many of|most of|almost all of|nearly all of)\b"
    ),
    # 'all of', excluding 'almost/nearly all of'
    "all_of": pca_df[context_col].str.contains(r"\ball of\b")
    & ~pca_df[context_col].str.contains(r"\b(?:almost|nearly)\b"),
}
Appendix Code 1: Code for grouping semantic categories for plots of Gemma-2-2b Layer 12 Cluster 4740. See also https://github.com/MClarke1991/sae_cooccurrence.