TL;DR: There may be a fundamental problem with interpretability work that attempts to understand neural networks by decomposing their individual activation spaces in isolation: it seems likely to find features of the activations (features that help explain the statistical structure of activation spaces) rather than features of the model (the features the model’s own computations make use of).

Written at Apollo Research

Introduction

Claim: Activation space interpretability is likely to give us features of the activations, not features of the model, and this is a problem.

Let’s walk through this claim.

What do we mean by activation space interpretability? Interpretability work that attempts to understand neural networks by explaining the inputs and outputs of their layers in isolation. In this post, we focus in particular on the problem of decomposing activations, via techniques such as sparse autoencoders (SAEs), PCA, or just by looking at individual neurons. This is in contrast to interpretability work that leverages the wider functional structure of the model and incorporates more information about how the model performs computation. Examples of existing techniques using such information include Transcoders, end-to-end SAEs, and joint activation/gradient PCAs.

What do we mean by “features of the activations”? Sets of features that help explain or make manifest the statistical structure of the model’s activations at particular layers. One way to try to operationalise this is to ask for decompositions of model activations at each layer that try to minimise the description length of the activations in bits.
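As a rough sketch of what such an operationalisation could look like (our own illustrative notation, not a standard definition), one could score a dictionary decomposition $a \approx \sum_i c_i(a)\, d_i$ of the activations by a two-part code length:

$$\mathrm{DL} \;\approx\; \underbrace{\mathbb{E}_a\Big[\sum_{i \,:\, c_i(a) \neq 0} \big(\log_2 N + b\big)\Big]}_{\text{bits to name and quantise the active coefficients}} \;+\; \underbrace{\mathbb{E}_a\Big[\text{bits to encode the residual } a - \sum_i c_i(a)\, d_i\Big]}_{\text{reconstruction cost}}$$

where $N$ is the dictionary size and $b$ is the number of bits spent per nonzero coefficient. Sparser, more accurate decompositions of the activation distribution score better under a criterion like this, whether or not the model’s own computations use the directions $d_i$.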

What do we mean by “features of the model”? The set of features the model itself actually thinks in, the decomposition of activations along which its own computations are structured, features that are significant to what the model is doing and how it is doing it. One way to try to operationalise this is to ask for the decomposition of model activations that makes the causal graph of the whole model as manifestly simple as possible: We make each feature a graph node, and draw edges indicating how upstream nodes are involved in computing downstream nodes. To understand the model, we want the decomposition that results in the most structured graph with the fewest edges, with meaningfully separate modules corresponding to circuits that do different things. 

Our claim is pretty abstract and general, so we’ll try to convey the intuition behind it with concrete and specific examples.

Examples illustrating the general problem 

In the following, we will often use SAEs as a stand-in for any technique that decomposes individual activation spaces into sets of features. But we think the problems these examples are trying to point to apply in some form to basically any technique that tries to decompose individual activation spaces in isolation.[1]

1. Activations can contain structure of the data distribution that the models themselves don’t ‘know’ about.  

Consider a simple model that takes in a two-dimensional input $(x_1, x_2)$ and computes some scalar function of the two, $f(x_1, x_2)$. Suppose that for all data points in the data distribution, the input data $(x_1, x_2)$ falls on a very complicated one-dimensional curve. Also, suppose that the trained model is blind to this fact and treats the two input variables as entirely independent (i.e. none of the model’s computations make use of the relationship between $x_1$ and $x_2$). If we were to study the activations of this model, we might notice this curve (or a transformed version of it) and think it meaningful.
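A minimal sketch of this setup (our own toy construction; the curve, the hand-built weights, and all numbers are arbitrary illustrative choices):

```python
# Toy version of Example 1: the "model" computes f(x1, x2) = g(x1) + h(x2) and
# never uses the relationship between x1 and x2, but the data happens to lie on
# a 1-D curve (x2 = sin(3*x1)). A decomposition of the activations sees that
# curve anyway.
import numpy as np

rng = np.random.default_rng(0)

# Data distribution: inputs lie on a 1-D curve in the 2-D input space.
x1 = rng.uniform(-1, 1, size=5000)
x2 = np.sin(3 * x1)                       # structure of the data, unused by the model

# A hand-built model whose two halves process x1 and x2 independently.
W1 = rng.normal(size=(1, 8))              # pathway that only reads x1
W2 = rng.normal(size=(1, 8))              # pathway that only reads x2
acts = np.concatenate(
    [np.maximum(0, x1[:, None] @ W1),     # activations caused by x1
     np.maximum(0, x2[:, None] @ W2)],    # activations caused by x2
    axis=1,
)                                         # shape (5000, 16)

# The cross-pathway covariance is far from zero, so PCA directions mix the two
# pathways: they describe the data's curve, not anything the model's own
# computation treats as a unit (no weight connects the two pathways).
acts_c = acts - acts.mean(0)
cov = acts_c.T @ acts_c / len(acts_c)
print("typical cross-pathway covariance:", np.abs(cov[:8, 8:]).mean())
print("top PCA eigenvalues:", np.round(np.linalg.eigvalsh(cov)[::-1][:4], 3))
```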

In general, data distributions used for training (and often also interpreting) neural networks contain a very large amount of information about the process that created said dataset. For all non-toy data distributions, the distribution will reflect complex statistical relationships of the universe. A model with finite capacity can't possibly learn to make use of all of these relationships. Since activations are just mathematical transformations of inputs sampled from this data distribution, by studying neural networks through their distribution of activations, we should expect to see many of those unused relationships in the activations. So, fully understanding the model’s activations can in a sense be substantially harder than fully understanding what the model is doing. And if we don’t look at the computations the model is carrying out on those activations before we try to decompose them, we might struggle to tease apart properties of the input distribution and properties of the model.[2]

2. The learned feature dictionary may not match the “model’s feature dictionary”.

Now let’s consider another one-dimensional curve, this time embedded in a ten-dimensional space.[3] One of the nice things about sparse dictionary methods like SAEs is that they can approximate curves like this pretty well, using a large dictionary of features with sparse activation coefficients. If we train an SAE with a dictionary of size 500 on this manifold, we might find 500 features, only a very small number of which are active at a time, corresponding to different tiny segments of the curve.[4]
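As a rough sketch of what this looks like (a toy curve and a vanilla SAE of our own construction; the dictionary size, sparsity penalty and training details are arbitrary choices):

```python
# Train a small SAE on points from a 1-D curve embedded in 10-D and check
# where on the curve each latent fires.
import torch

torch.manual_seed(0)

def curve(t):  # a smooth 1-D curve in 10 dimensions, parameterised by t in [0, 1]
    freqs = torch.arange(1, 6).float()
    return torch.cat([torch.sin(2 * torch.pi * freqs * t[:, None]),
                      torch.cos(2 * torch.pi * freqs * t[:, None])], dim=1)  # (n, 10)

# A vanilla SAE: overcomplete dictionary, ReLU codes, L1 sparsity penalty.
d_act, d_dict, l1_coeff = 10, 500, 3e-3
W_enc = torch.nn.Parameter(torch.randn(d_act, d_dict) * 0.1)
b_enc = torch.nn.Parameter(torch.zeros(d_dict))
W_dec = torch.nn.Parameter(torch.randn(d_dict, d_act) * 0.1)
opt = torch.optim.Adam([W_enc, b_enc, W_dec], lr=1e-3)

for step in range(3000):
    batch = curve(torch.rand(1024))
    codes = torch.relu(batch @ W_enc + b_enc)
    recon = codes @ W_dec
    loss = ((recon - batch) ** 2).mean() + l1_coeff * codes.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():  # keep dictionary directions unit norm
        W_dec /= W_dec.norm(dim=1, keepdim=True).clamp(min=1e-8)

# For each surviving latent, look at which values of t it fires on: latents tend
# to tile the curve in small arcs, regardless of how the model itself carves up
# this subspace for its downstream computation.
with torch.no_grad():
    t_eval = torch.rand(20000)
    codes = torch.relu(curve(t_eval) @ W_enc + b_enc)
    for i in torch.nonzero(codes.sum(0) > 0)[:5].flatten():
        t_active = t_eval[codes[:, i] > 0]
        print(f"latent {i.item():3d} fires for t in "
              f"[{t_active.min():.2f}, {t_active.max():.2f}]")
```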

Suppose, however, that the model actually thinks of this single dense data-feature as a sparse set of 100 linear directions. We term this set of directions the “model’s dictionary”. The model’s dictionary approximates most segments of the curve with lower resolution than our dictionary, but it might approximate some crucial segments a lot more finely. MLPs and attention heads downstream in the model carry out computations on these 100 sparsely activating directions. The model’s decomposition of the ten-dimensional space into 100 sparse features and our decomposition of the space into 500 sparse features are necessarily quite different. Some features and activation coefficients in the two dictionaries might be closely related, but we should not expect most to be. If we are not looking at what the model does with these activations downstream, how can we tell that the feature dictionary we find matches the model’s feature dictionary? When we perform the decomposition, we don’t know yet what parts of the curve are more important for what the model is computing downstream, and thus how the model is going to think about and decompose the ten-dimensional subspace. We probably won’t even be aware in the first place that the activations we are decomposing lie on a one-dimensional curve without significant extra work.[5]

3. Activation space interpretability can fail to find compositional structure. 

Suppose our model represents four types of object in some activation space: {blue square, red square, blue circle, red circle}.[6] We can think of this as the direct product space {blue, red} × {square, circle}. Suppose the model’s 'true features' are colour and shape, in the sense that later layers of the model read the 'colour' variable and the 'shape' variable independently. Now, suppose we train an SAE with 4 dictionary elements on this space. SAEs are optimised to achieve high sparsity: few latents should be active on each forward pass. An SAE trained on this space will therefore learn the four latents {blue square, red square, blue circle, red circle} (the "composed representation"), rather than {blue, red} × {square, circle} (the "product representation"), as the former has sparsity 1, while the latter has sparsity 2. In other words, the SAE learns features that are compositions of the model’s features.

Can we fix this by adjusting the sparsity penalty? Probably not. Any sparse-dictionary approach set to decompose the space as a whole will likely learn this same set of four latents, as this latent set is sparser, with shorter description length than the product set. 
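The counting is easy to make concrete (a toy encoding of our own, assuming exactly one colour and one shape per input):

```python
# L0 (average number of active latents) of the composed vs product dictionaries.
import itertools
import numpy as np

colours, shapes = ["blue", "red"], ["square", "circle"]
objects = list(itertools.product(colours, shapes))   # the four represented objects

# Composed dictionary: one latent per (colour, shape) pair.
composed = {obj: i for i, obj in enumerate(objects)}
# Product dictionary: one latent per colour plus one latent per shape.
product = {name: i for i, name in enumerate(colours + shapes)}

def encode_composed(obj):
    code = np.zeros(len(composed)); code[composed[obj]] = 1.0; return code

def encode_product(obj):
    code = np.zeros(len(product))
    code[product[obj[0]]] = 1.0; code[product[obj[1]]] = 1.0; return code

def l0(encode):
    return np.mean([np.count_nonzero(encode(obj)) for obj in objects])

print("composed L0:", l0(encode_composed))   # 1.0
print("product  L0:", l0(encode_product))    # 2.0
# A sparsity-optimised SAE prefers the composed dictionary, even if the model's
# downstream computation reads colour and shape separately.
```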

While we could create some ansatz for our dictionary learning approach that specifically privileges the product configuration, this is cheating. How would we know the product configuration and not the composed configuration matches the structure of the model’s downstream computations in this case in advance, if we only look at the activations in isolation? And even if we do somehow know the product configuration is right, how would we know in advance to look for this specific 2x2 structure? In reality, it would additionally be embedded in a larger activation space with an unknown number of further latents flying around besides just shape and colour. 

4. Function approximation creates artefacts that activation space interpretability may fail to distinguish from features of the model.

This one is a little more technical, so we’ll take it in two stages. First, a very simplified version, then something that’s closer to the real deal.

Example 4a: Approximating $y = x^2$. Suppose we have an MLP layer that takes a scalar input $x$ and is trained to approximate the scalar output $y = x^2$. The MLP comprises a $W_{\text{in}}$ matrix (a vector, really) of shape $1 \times 10$ that maps $x$ to some pre-activation. This gets mapped through a ReLU, giving a 10-dimensional activation vector $a$. Finally, these activations are mapped to a scalar output via some $W_{\text{out}}$ matrix of shape $10 \times 1$. Thus, concretely, this model is tasked with approximating $x^2$ via a linear combination of ten functions of the form $\mathrm{ReLU}(w_i x + b_i)$. Importantly, the network only cares about one direction in the 10-dimensional activation space, the one which gives a good approximation of $x^2$ and is projected off by $W_{\text{out}}$. There are 9 other orthogonal directions in the hidden space. Unless we know in advance that the network is trying to compute $x^2$, this important direction will not stick out to us. If we train an SAE on the hidden activations, or do a PCA, or perform any other activation decomposition of our choice, we will get out a bunch of directions, and likely none of them will be $W_{\text{out}}$.[7] What makes the $W_{\text{out}}$ direction special is that the model uses it downstream (which, here, means that this direction is special because it appears in $W_{\text{out}}$). But that information can't be found in the hidden activations alone. We need more information.
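A sketch of the corresponding experiment (our own quick version, in the spirit of footnote 7; the architecture and hyperparameters are arbitrary):

```python
# Train a 1 -> 10 -> 1 ReLU MLP on y = x^2, run PCA on its hidden activations,
# and check how well the principal directions align with the readout W_out.
import torch

torch.manual_seed(0)
mlp = torch.nn.Sequential(torch.nn.Linear(1, 10), torch.nn.ReLU(), torch.nn.Linear(10, 1))
opt = torch.optim.Adam(mlp.parameters(), lr=1e-2)

for step in range(5000):
    x = torch.rand(256, 1) * 4 - 2                 # x uniform in [-2, 2]
    loss = ((mlp(x) - x ** 2) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    x = torch.linspace(-2, 2, 2000).unsqueeze(1)
    acts = torch.relu(mlp[0](x))                   # hidden activations, shape (2000, 10)
    w_out = mlp[2].weight.squeeze()                # the direction the model reads off, shape (10,)

    centred = acts - acts.mean(0)
    _, _, vh = torch.linalg.svd(centred, full_matrices=False)  # rows of vh: principal directions

    cos = torch.nn.functional.cosine_similarity(vh, w_out.expand_as(vh), dim=1)
    print("|cos(PC_i, W_out)|:", [f"{c:.2f}" for c in cos.abs()])
# Typically no single principal direction lines up cleanly with W_out: the
# direction that matters downstream is not privileged by the statistics of the
# hidden activations alone.
```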

Example 4b: Circuits in superposition. The obvious objection to Example 4a is that $W_{\text{out}}$ is natively a rank-one matrix, so the fact that only one direction in the 10-dimensional activation space matters is trivial and obvious to the researcher. So while we do need to use some information that isn’t in the activations, it’s a pretty straightforward thing to find. But if we extend the above example to something more realistic, it’s not so easy anymore. Suppose the model is computing a bunch of the above multi-dimensional circuits in superposition. For example, take an MLP layer instead with 40,000 neurons, computing 80,000 functions of (sparse) scalar inputs, each of which requires 10 neurons to compute and writes its result to a 10,000-dimensional residual stream.[8][9] Each of these 80,000 circuits would then occupy some ten-dimensional subspace of the 40,000-dimensional activation space of the MLP, meaning the subspaces must overlap. Each of these subspaces may only have one direction that actually matters for downstream computation.

Our SAE/PCA/activation-decomposition-of-choice trained on the activation space will not be able to tell which directions are actually used by the model, and which are an artefact of computing the directions that do matter. They will decompose these ten-dimensional subspaces into a bunch of directions, which almost surely won’t line up with the important ones. To make matters worse, we might not immediately know that something went wrong with our decomposition. All of these directions might look like they relate to some particular subtask when studied through the lens of e.g. max activating dataset examples, since they’ll cluster along the circuit subspaces to some extent. So the decomposition could actually look very interesting and interpretable, with a lot of directions that appear to somewhat-but-not-quite make sense when we study them. However, these many directions will seem to interact with the next layer in a very complicated manner.

The general problem 

Not all the structure of the activation spaces matters for the model’s computations, and not all the structure of the model’s computations is manifest in the structure of individual activation spaces. 

So, if we are trying to understand the model by first decomposing its activation spaces into features and then looking at how these features interact and form circuits, we might get a complete mess of interactions that do not make the structure of the model and what it is doing manifest at all. We need to have at least some relevant information about how the model itself uses the activations before we pick our features, and include that information in the activation space decomposition. Even if our goal is just to understand an aspect of the model’s representation enough for a use case like monitoring, looking at the structure of the activation spaces rather than the structure of the model’s computations can give us features that don’t have a clean causal relationship to the model’s structure and which thus might mislead us. 

What can we do about this?

If the problem is that our decomposition methodologies lack relevant information about the network, then maybe the solution is giving them more of it. How could we try to do this?

Guess the correct ansatz. We can try to make a better ansatz for our decompositions by guessing in advance how model computations are structured. This requires progress on interpretability fundamentals, through e.g. understanding the structure and purpose of feature geometry better. Note however that the current favoured roadmap for making progress on those topics seems to be “decompose the activations well, understand the resulting circuits and structure, and then hope this yields increased understanding”. This may be a bit of a chicken-and-egg situation. 

Use activations (or gradients) from more layers. We can try to use information from more layers to look for decompositions that simplify the model as a whole. For example, we can decompose multiple layers simultaneously and impose a sparsity penalty on connections between features in different layers. Other approaches that fall vaguely in this category include end-to-end SAEs, Attribution Dictionary Learning, Transcoders, and Crosscoders.[10]
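To give a flavour of the simplest version of this idea, here is a very simplified sketch of one possible loss (our own toy formulation; it is not the loss used by any of the methods cited above): decompose two adjacent activation sites jointly, and ask that each layer-2 feature be explained by only a few layer-1 features.

```python
# Toy joint decomposition of two adjacent activation sites a1 and a2 with a
# sparsity penalty on the connections between their features.
import torch

def joint_decomposition_loss(a1, a2, sae1, sae2, M, l1_codes=1e-3, l1_conn=1e-3):
    """a1, a2: activations at two adjacent sites for the same batch of inputs.
    sae1, sae2: (W_enc, b_enc, W_dec) parameter triples for each site.
    M: (d_dict1, d_dict2) matrix of connection strengths between features."""
    def run_sae(a, sae):
        W_enc, b_enc, W_dec = sae
        codes = torch.relu(a @ W_enc + b_enc)
        recon = codes @ W_dec
        return codes, ((recon - a) ** 2).mean()

    c1, recon1 = run_sae(a1, sae1)
    c2, recon2 = run_sae(a2, sae2)

    # Layer-2 codes should be predictable from layer-1 codes through M, and M
    # should be sparse: together these terms favour pairs of decompositions
    # whose cross-layer feature graph is simple, not just decompositions that
    # reconstruct each layer well in isolation.
    conn = ((c1 @ M - c2) ** 2).mean() + l1_conn * M.abs().mean()
    code_sparsity = l1_codes * (c1.abs().mean() + c2.abs().mean())
    return recon1 + recon2 + code_sparsity + conn
```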

Use weights instead of or to supplement activations. Most interpretability work studies activations and not weights. There are good reasons for this: activations are lower dimensional than weights. The curse of dimensionality is real. However, weights, in a sense, contain the entire functional structure of the model, because they are the model. It seems in principle possible to decompose weights into circuits directly, by minimising some complexity measure over some kind of weight partitioning, without any intermediary step of decomposing activations into features at all. This would be a reversal of the standard, activations-first approach, which aims to understand features first and later understand the circuits. Apollo Research are currently trying this.

 

Thanks to Andy Arditi, Dan Braun, Stefan Heimersheim and Lee Sharkey for feedback.

 

  1. ^

    Unless we cheat by having extra knowledge about the model’s true features that lets us choose the correct form of the decomposition before we even start.

  2. ^

    An additional related concern is that we might end up with different conclusions about our model if we study it through a different data-distribution-lens. This seems problematic if our end goal is to study the model, which surely has some ground truth set of features it uses, independently of the data-lens used to extract them. Empirically, we do find that the set of SAE features we discover is highly dependent on the (SAE training) dataset.

  3. ^

    Data on this manifold is importantly not actually representable as a set of sparsely activating discrete features.

  4. ^

    If we train SAEs on `blocks.0.hook_resid_pre` of gpt2-small, we find such a set, corresponding to the positional encoding.

  5. ^

    Though note this particular case is easy mode, since the curve is low-dimensional and easy to guess. We should not expect it to be this easy in general to find the structure of interest.

  6. ^

    This example is inspired by this Anthropic blog post.

  7. ^

    This seems like an easy experiment to do!

  8. ^

    Note that this doesn’t have to be a continuous function like $x^2$; a boolean circuit, e.g. one evaluating some logical statement as True/False, works just as well. The fundamental problem here is that many operations can’t and won’t be computed using only a single neuron per layer, but rather require a specific linear combination of multiple neurons. So implementing them almost inevitably produces extra structure in the activations that won’t be used. This is not a problem with algorithmic tasks specifically.

  9. ^

    See circuits in superposition for an explanation of how to compute more functions in a layer than we have neurons.

  10. ^

    Of course, just doing something with activations or gradients is not enough; you have to do something that successfully deals with the kinds of counterexamples we list above. We doubt the vanilla version of any currently public technique does this for all relevant counterexamples or even all counterexamples we list here.

Comments

wesg:

“This seems like an easy experiment to do!”

Here is Sonnet 3.6's 1-shot output (colab) and plot below. I asked for PCA for simplicity.

Looking at the PCs vs x, PC2 is kinda close to giving you x^2, but indeed this is not an especially helpful interpretation of the network.

Good post!

I played around with the $y = x^2$ example as well and got similar results. I was wondering why there are two more dominant PCs: if you assume there is no bias, then the activations will all look like $x \cdot \mathrm{ReLU}(W_{\text{in}})$ (for $x > 0$) or $|x| \cdot \mathrm{ReLU}(-W_{\text{in}})$ (for $x < 0$), and I checked that the two directions found by the PCA approximately span the same space as $\{\mathrm{ReLU}(W_{\text{in}}), \mathrm{ReLU}(-W_{\text{in}})\}$. I suspect something similar is happening with bias.

In this specific example there is a way to get the true direction $W_{\text{out}}$ from the activations: by doing a PCA on the gradients (of the output with respect to the activations). In this case, it is easily explained by computing the gradient by hand: it will be a multiple of $W_{\text{out}}$.

See the second to last paragraph. The gradients of downstream quantities with respect to the activations contain information and structure that is not part of the activations. So in principle, there could be a general way to analyse the right gradients in the right way on top of the activations to find the features of the model. See e.g. this for an attempt to combine PCAs of activations and gradients together.

Thanks for the reference, I wanted to illuminate the value of gradients of activations in this toy example as I have been thinking about similar ideas.

I personally would be pretty excited about attribution dictionary learning, but it seems like nobody has done that on bigger models yet.

In my limited experience, attribution-patching style attributions tend to be a pain to optimise for sparsity. Very brittle. I agree it seems like a good thing to keep poking at though. 

Did you use something like what is described here? By brittle, do you mean w.r.t. the sparsity penalty (and other hyperparameters)?

The third term in that. Though it was in a somewhat different context related to the weight partitioning project mentioned in the last paragraph, not SAE training.

Yes, brittle in hyperparameters. It was also just very painful to train in general. I wouldn't straightforwardly extrapolate our experience to a standard SAE setup though, we had a lot of other things going on in that optimisation. 

I see, thanks for sharing!

I found this clear and useful, thanks. Particularly the notes about compositional structure. For what it’s worth, I’ll repeat here a comment from ILIAD: there seems to be something in the direction of SAEs, approximate sufficient statistics/information bottleneck, the work of Achille-Soatto, and SLT (Section 5, iirc), which I had looked into after talking with Olah and Wattenberg about feature geometry but which isn’t currently a high priority for us. Somebody might want to pick that up.

Are you suggesting that there should be a formula similar to the one in Proposition 5.1 (or 5.2) that links information about the activations with the LC as a measure of basin flatness?

I agree that the ultimate goal is to understand the weights. Seems pretty unclear whether trying to understand the activations is a useful stepping stone towards that. And it's hard to be sure how relevant theoretical toy example are to that question.

Excellent work, and I think you raise a lot of really good points, which help clarify for me why this research agenda is running into issues. I think it also ties in to my concerns about activation space work engendered by recent success in latent obfuscation (https://arxiv.org/abs/2412.09565v1).

In a way that does not affect the larger point, I think that your framing of the problem of extracting composed features may be slightly too strong: in a subset of cases, e.g. if there is a hierarchical relationship between features (https://www.lesswrong.com/posts/XHpta8X85TzugNNn2/broken-latents-studying-saes-and-feature-co-occurrence-in) SAEs might be able to pull out groups of latents that act compositionally (https://www.lesswrong.com/posts/WNoqEivcCSg8gJe5h/compositionality-and-ambiguity-latent-co-occurrence-and). The relationship to any underlying model compositional encoding is unclear, this probably only works in a few cases, and generally does not seem like a scalable approach, but I think that SAEs may be doing something more complex/weirder than only finding composed features. 

Thank you. Yes, our claim isn’t that SAEs only find composed features. Simple counterexample: make a product space of two spaces with $n$ dictionary elements each, with an average of $k$ features active at a time in each factor space. Then the dictionary of $n^2$ composed features has an $L_0$ of $k^2$, whereas the dictionary of $2n$ factored features has an $L_0$ of $2k$, so (for $k > 2$) a well-tuned SAE will learn the factored set of features. Note however that just because the dictionary of $2n$ factored features is sparser doesn’t mean that those are the features of the model. The model could be using the $n^2$ composed features instead, because that’s more convenient for the downstream computations somehow, or for some other reason.

Our claim is that an SAE trained on the activations at a single layer cannot tell whether the features of the model are in the composed representation or the factored representation, because the representation the model uses need not be the representation with the lowest $L_0$.

Nice post! I think these are good criticisms that don't justify the title. Points 1 through 4 are all (specific, plausible) examples of ways we may interpret the activation space incorrectly. This is worth keeping in mind, and I agree that just looking at the activation space of a single layer isn't enough, but it still seems like a very good place to start. 

 

A layer’s activations form a relatively simple space, constructed by the model, that contains all the information the model needs to make its prediction. This makes it a great place to look if you’re trying to understand how the model is thinking.