TL;DR

SAE features are often less decomposable than their feature descriptions imply. By using a prompting technique to test potential sub-components of individual SAE features (for example, decomposing Einstein into "German", "physics", "relativity" and "famous"), I found widely divergent behaviour in how decomposable these features were. I built an interactive visualization to explore these differences by feature. The key finding is that although many features can be decomposed in a human-intuitive way, as in the Einstein example above, many cannot, and these point to more opaque model behaviour.

Motivation

The goal of this writeup is to explore the atomicity and decomposability of SAE features. How precisely do feature descriptions capture the sets of inputs that will cause a feature to activate? Are there cases where the inputs that activate a feature are non-intuitive and unrelated to concepts that we might expect to be related?

Apart from being an interesting area of exploration, I think this is also an important question for alignment. SAE features represent our current best attempt at inspecting model behaviour in a human-interpretable way. Non-intuitive feature decompositions might indicate the potential for alignment failures.

Prior Work

I was partly inspired by this work on “meta-SAEs” (training SAEs on decoder directions from other SAEs) because it clearly demonstrated that SAE features aren’t necessarily atomic, and that it is possible to decompose them into more granular latent dimensions. I was curious whether it was possible to come at this problem from a different direction: given a particular SAE feature, can we generate inputs that “should” activate it, covering areas a human would think of as related, and observe which of them actually do?

Methodology

I used the pretrained Gemma Scope SAEs and randomly sampled features from layer 20 of the 2B parameter model. The choice of layer 20 was somewhat arbitrary; as a layer deeper in the model, the hope was to be able to work with more high-level and abstract features. I prompted ChatGPT to produce, for each feature:
 

  • A set of five sub-components that “comprised” that feature, as discussed in the Einstein example above.
  • For each sub-component, a set of three “activating phrases” – sentences that we would expect to activate this sub-component if it were a feature. For example, for “physics” above, we might generate "he went into the lab to conduct experiments on optics, electromagnetism and gravity to discover the laws of the universe."

The results were returned in JSON in order to be machine-consumable.
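For concreteness, here is a minimal sketch of this step, assuming the OpenAI Python client; the exact prompt wording, model name and JSON schema shown are illustrative rather than the exact ones used.

```python
import json
from openai import OpenAI

client = OpenAI()

# Illustrative prompt template; field names and wording are hypothetical.
DECOMPOSE_PROMPT = """The SAE feature is described as: "{description}".
List 5 sub-components that together comprise this concept. For each
sub-component, write 3 short phrases that should strongly evoke it.
Respond as JSON: {{"subcomponents": [{{"name": "...", "phrases": ["..."]}}]}}"""

def decompose_feature(description: str) -> dict:
    # Ask the model for a machine-consumable decomposition of the feature.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any JSON-capable chat model works
        messages=[{"role": "user",
                   "content": DECOMPOSE_PROMPT.format(description=description)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# e.g. decompose_feature("terms related to Albert Einstein")
# -> {"subcomponents": [{"name": "physics", "phrases": [...]}, ...]}
```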

I then measured the activation of the original feature, using the original SAE, on each of the activating phrases (15 in total – 3 for each sub-component). In practice I ended up using the Neuronpedia API to run most of this, since I was compute-constrained.
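The same measurement can also be done locally. The sketch below is a best-guess version using TransformerLens and sae_lens; the Gemma Scope release string, SAE id and hook name are from memory and may need adjusting to the current registry.

```python
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE

device = "cuda" if torch.cuda.is_available() else "cpu"
model = HookedTransformer.from_pretrained("gemma-2-2b", device=device)

# Older sae_lens versions return (sae, cfg_dict, sparsity); adjust if needed.
sae, _, _ = SAE.from_pretrained(
    release="gemma-scope-2b-pt-res-canonical",
    sae_id="layer_20/width_16k/canonical",
    device=device,
)

def max_activation(feature_idx: int, phrase: str) -> float:
    """Max activation of one SAE feature over the tokens of a phrase."""
    _, cache = model.run_with_cache(phrase)
    resid = cache["blocks.20.hook_resid_post"]  # [1, seq, d_model]
    feats = sae.encode(resid)                   # [1, seq, d_sae]
    return feats[0, :, feature_idx].max().item()

# e.g. max_activation(6772, "During the German occupation of Denmark ...")
```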

Results

Streamlit visualization to view this breakdown by feature

I found significant differences in how cleanly a given feature could be decomposed. There were three main classes of behaviour (a rough scoring rule is sketched after this list):

  • Features that did decompose cleanly and produced activations across all of the sub-components we generated.
    • About 35% of the features analyzed fit into this category.
  • Features that decomposed into more specific behaviour than expected.
    • These corresponded to features that had a wider description on Neuronpedia but, when prompted, consistently activated more strongly on specific sub-components or subsets of the original meaning.
    • An example of this is feature 6772, which is described as terms related to historical and political context in Denmark, but tends to activate mostly on inputs related to war, especially the German occupation during World War 2. Browsing Neuronpedia for this feature, we can indeed see that many of the top activations are related to war and specifically occupation.
  • Features that I was unable to activate at all using this automated technique.
    • Around a third of features consistently failed to produce nonzero activations using this technique. A large part of this is due to the prompting technique producing text descriptions rather than the actual character patterns that activate the feature (e.g. producing a description of curly braces instead of the actual {} characters the feature is likely to activate on).
    • However, other features such as feature 2779 remain somewhat mysterious. This one is labelled “statistical terms and variables related to research studies”, but it does not activate on seemingly related concepts such as regressions, correlations and p-values, and its top activations look unrelated.
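As a rough operationalisation of these three buckets (the threshold and coverage cutoff here are illustrative, not tuned), a feature can be classified from the per-sub-component maximum activations:

```python
def classify_feature(subcomponent_max_acts: dict[str, float],
                     eps: float = 1e-6,
                     coverage_for_clean: float = 1.0) -> str:
    """Bucket a feature by how many of its sub-components activated it.

    subcomponent_max_acts maps sub-component name -> max activation of the
    original feature over that sub-component's three generated phrases.
    """
    active = [v > eps for v in subcomponent_max_acts.values()]
    coverage = sum(active) / len(active)
    if coverage == 0:
        return "failed to activate"
    if coverage >= coverage_for_clean:
        return "decomposed cleanly"
    return "more specific than description"
```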

Limitations and Future Work

This prompting technique has limitations: it doesn’t directly analyze SAE or model internals to capture feature meaning. It also has advantages, since it is more model- and technology-agnostic. Some specific limitations:

  • The automatically-generated prompts might fail to capture the true behaviour of the feature (describing curly braces instead of generating a pattern that contains them), and better prompting might fix this very quickly.
  • A larger number of features needs to be analyzed.
  • Capturing and decomposing a feature entirely is inherently a noisy problem, and it’s unclear what it would mean to do this in an entirely principled manner.
  • It isn’t obvious that a model would break down concepts or represent them internally in the same way a human might.