> But I was quietly surprised by how many features they were using in their sparse autoencoders (respectively 1M, 4M, or 34M). Assuming Claude Sonnet has the same architecture as GPT-3, its residual stream has dimension 12K so the feature ratios are 83x, 333x, and 2833x, respectively[1]. In contrast, my team largely used a feature ratio of 2x, and Anthropic's previous work "primarily focus[ed] on a more modest 8× expansion". It does make sense to look for a lot of features, but this seemed to be worth mentioning.
There's both theoretical work (i.e. this theory work) and empirical experiments (e.g. in memorization) demonstrating that models seem to be able to "know" O(quadratically) many things in the size of their residual stream.[1] My guess is Sonnet is closer to Llama-70b in size (residual stream width ~8.2k), so this suggests ~67M features naively, and also that 34M is reasonable.
Also worth noting that a lot of their 34M features were dead, so the number of actual features is quite a bit lower.
You might also expect to need O(param count) parameters in the SAE to recover the features; since each feature costs roughly d_model parameters, for a 70B model with residual stream width 8.2k that's ~8.5M (≈ 70B/8192) features.
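Spelling out the arithmetic behind these two estimates (the d_model and parameter count are the Llama-70B-ish assumptions above, not known Sonnet numbers):

```python
d_model = 8192      # assumed residual stream width (Llama-70B-like)
n_params = 70e9     # assumed total parameter count

quadratic_estimate = d_model ** 2     # "knows O(d_model^2) things" -> feature count
param_estimate = n_params / d_model   # O(n_params) SAE params at ~d_model per feature

print(f"quadratic estimate: {quadratic_estimate / 1e6:.0f}M features")   # ~67M
print(f"param-count estimate: {param_estimate / 1e6:.1f}M features")     # ~8.5M
```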
It seems plausible to me that a 70b model stores ~6 billion bits of memorized information. Naively, you might think this requires around 500M features (supposing that each "feature" represents 12 bits, which is probably a bit optimistic).
I don't think SAEs will actually work at this level of sparsity though, so this is mostly beside the point.
I'm pretty skeptical of a view like "scale up SAEs and get all the features".
(If you wanted "feature" to mean something.)
70b storing 6b bits of pure memorized info seems quite reasonable to me, maybe a bit high. My guess is there's a lot more structure to the world that the models exploit to "know" more things with fewer memorized bits, but this is a pretty low confidence take (and perhaps we disagree on what "memorized info" means here). That being said, SAEs as currently conceived/evaluated won't be able to find/respect a lot of the structure, so maybe 500M features is also reasonable.
> I don't think SAEs will actually work at this level of sparsity though, so this is mostly beside the point.
I agree that SAEs don't work at this level of sparsity and I'm skeptical of the view myself. But from a "scale up SAEs to get all features" perspective, it sure seems pretty plausible to me that you need a lot more features than people used to look at.
I also don't think the Anthropic paper OP is talking about has come close to the Pareto frontier for size <> sparsity <> trainability.
Strong upvote fellow co-author! lol
I agree we shouldn't interpret features by their max-activation, but I think the activation magnitude really does matter. Removing smaller activations affects downstream CE less than removing larger activations (though it still hurts, so the small activations do matter). A weighted percentage of feature activation captures this better, i.e. (sum of all Golden Gate activations)/(sum of all activations).
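A toy version of that weighted percentage (made-up arrays and a hypothetical helper, just to show the computation):

```python
import numpy as np

def weighted_interp_fraction(acts: np.ndarray, is_golden_gate: np.ndarray) -> float:
    # fraction of total activation mass on tokens the interp label says are on-topic
    return float((acts * is_golden_gate).sum() / acts.sum())

acts = np.array([0.1] * 90 + [5.0] * 10)       # many weak, few strong activations
labels = np.array([0] * 90 + [1] * 10)         # only the strong ones look Golden-Gate-y
print(weighted_interp_fraction(acts, labels))  # ~0.85 by weight, vs 0.10 by token count
```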
I do believe "lower-activating examples don't fit your hypothesis" is bad because of circuits. If you find out that "Feature 3453 is a linear combination of the Golden Gate (GG) feature and the positive sentiment feature" then you do understand this feature at high GG activations, but not low GG + low positive sentiment activations (since you haven't interpreted low GG activations).
Your "code-error" feature example is good. If it only fits "code-error" at the largest feature activations & does other things, then if we ablate this feature, we'll take a capabilities hit because the lower activations were used in other computations. But, let's focus on the lower activations which we don't understand are being used in other computations bit. We could also have "code-error" or "deception" being represented in the lower activations of other features which, when co-occurring, cause the model to be deceptive or write code errors.
[Although, Anthropic showed evidence against this by ablating the code-error feature & running on errored code, which predicted a non-error output]
Anthropic suggested that if you have a feature that occurs 1/Billion tokens, you need 1 Billion features. You also mention finding important features. I think SAEs find features on the dataset you give them. For example, we trained an SAE on only chess data (on a chess-finetuned Pythia model) & all the features were on chess data. I bet if you trained it on code, it'd find only code features (note: I do think some semantic & token-level features would generalize to other domains).
Pragmatically, if there are features you care about, then it's important to train the SAE on many texts that exhibit that feature. This is also true for the safety relevant features.
In general, I don't think you need these 1000x feature expansions. Even a 1x feature expansion will give you sparse features (because of the L1 penalty). If you want your model to [have positive personality traits] then you only need to disentangle those features.
[Note: I think your "SAE's don't find all Othello board state features" does not make the point that SAE's don't find relevant features, but I'd need to think for 15 min to clearly state it which I don't want to do now, lol. If you think that's a crux though, then I'll try to communicate it]
They said 82% of features had a max correlation of at most 0.3, which I agree is a lot (wait, does this imply that 18% of their million billion features correlated even more???). I think this is the strongest evidence for "the neuron basis is not as good as SAEs", which I'm unsure anyone is still arguing against, but it makes sense as a sanity check.
However, some neurons are monosemantic, so it makes sense for SAE features to also find those (though again, 18% of a million billion features have a correlation higher than 0.3?).
> We additionally confirmed that feature activations are not strongly correlated with activations of any residual stream basis direction.
I'm sure they actually found very strongly correlated features specifically for the outlier dimensions in the residual stream, which Anthropic has previous work showing are basis-aligned (unless Anthropic trains their models in ways that don't produce an outlier dimension, which there is existing literature on).
[Note: I wrote a lot. Feel free to respond to this comment in parts!]
I do believe "lower-activating examples don't fit your hypothesis" is bad because of circuits. If you find out that "Feature 3453 is a linear combination of the Golden Gate (GG) feature and the positive sentiment feature" then you do understand this feature at high GG activations, but not low GG + low positive sentiment activations (since you haven't interpreted low GG activations).
Yeah, this is the kind of limitation I'm worried about. Maybe for interpretability purposes, it would be good to pretend we have a gated SAE which only kicks in at ~50% max activation. So when you look at the active features all the "noisy" low-activation features are hidden and you only see "the model is strongly thinking about the Golden Gate Bridge". This ties in to my question at the end of how many tokens have any high-activation feature.
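Concretely, the kind of filtered view I have in mind (a rough sketch; the dicts are hypothetical inputs):

```python
def strong_features(token_feature_acts: dict[int, float],
                    feature_max_acts: dict[int, float],
                    threshold: float = 0.5) -> dict[int, float]:
    # keep only features whose activation on this token is >= threshold * their max
    return {i: a for i, a in token_feature_acts.items()
            if a >= threshold * feature_max_acts[i]}

# Only the strongly-active feature survives the cut:
print(strong_features({3453: 9.0, 101: 0.4}, {3453: 10.0, 101: 6.0}))  # {3453: 9.0}
```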
> Anthropic suggested that if you have a feature that occurs 1/Billion tokens, you need 1 Billion features. You also mention finding important features. I think SAEs find features on the dataset you give them.
This matches my intuition. Do you know if people have experimented on this and written it up anywhere? I imagine the simplest thing to do might be having corpuses in different languages (e.g. English and Arabic), and to train an SAE on various ratios of them until an Arabic-text-detector feature shows up.
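The check I'd run at each mixing ratio, as a sketch (the array names are hypothetical; you'd get them from your own SAE/tokenizer pipeline):

```python
import numpy as np

def best_language_detector(feature_acts: np.ndarray, is_arabic: np.ndarray):
    # feature_acts: (n_tokens, n_features) SAE activations; is_arabic: 0/1 per token
    f = feature_acts - feature_acts.mean(axis=0)
    y = is_arabic - is_arabic.mean()
    corrs = (f.T @ y) / (np.linalg.norm(f, axis=0) * np.linalg.norm(y) + 1e-8)
    best = int(np.argmax(np.abs(corrs)))
    return best, float(corrs[best])

# Toy arrays just to show the call; in the real experiment you'd sweep the Arabic
# fraction (0.1%, 1%, 10%, ...) and see when a strong detector feature appears.
rng = np.random.default_rng(0)
print(best_language_detector(rng.random((1000, 64)), rng.integers(0, 2, 1000).astype(float)))
```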
> I'm sure they actually found very strongly correlated features specifically for the outlier dimensions in the residual stream, which Anthropic has previous work showing are basis-aligned (unless Anthropic trains their models in ways that don't produce an outlier dimension, which there is existing literature on).
That would make sense, assuming they have outlier dimensions!
> I decided to do a check by tallying the "More Safety Relevant Features" from the 1M SAE to see if they reoccur in the 34M SAE (in some related form).
I don't think we can interpret their list of safety-relevant features as exhaustive. I'd bet (80% confidence) that we could find 34M-SAE features corresponding to at least some of the 1M-SAE features you listed, given access to their UMAP browser. Unfortunately we can't do this without Anthropic support.
Non-exhaustiveness seems plausible, but then I'm curious how they found these features. They don't seem to be constrained to an index range, and there seem to be nicely matched pairs like this, which I think isn't indicative of random checking:
Around the corner I've got a post that makes use of this post in the final section: Relationships among words, metalingual definition, and interpretability.
These are some of my notes from reading Anthropic's latest research report, Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.
TL;DR
In roughly descending order of importance:
I'll now expand some of these into sections. Finally, I'll suggest some follow-up research/tests that I'd love to see Anthropic (or a reader like you) try.
A Feature Isn't Its Highest Activating Examples
Let's look at the Golden Gate Bridge feature because it's fun and because it's a good example of what I'm talking about. Here's my annotated version of Anthropic's diagram:
I think Anthropic successfully demonstrated (in the paper and with Golden Gate Claude) that this feature, at very high activation levels, corresponds to the Golden Gate Bridge. But on a median instance of text where this feature is active, it is "irrelevant" to the Golden Gate Bridge, according to their own autointerpretability metric! I view this as analogous to naming water "the drowning liquid", or naming Boeing "the door-exploding company". Yes, in extremis, water and Boeing are associated with drowning and door blowouts, but any interpretation that ends there would be limited.
Anthropic's work writes around this uninterpretability in a few ways: naming the feature based on the top examples, highlighting the top examples, pinning the intervention model to 10x the activation (vs. 0.1x its top activation), and showing subsamples from evenly spaced intervals (vs deciles). I think it would be illuminating if they added to their feature browser page some additional information about the fraction of instances in each subsample, e.g., "Subsample Interval 2 (0.4% of activations)".
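Here's a sketch of the statistic I'm asking for: given all of a feature's (nonzero) activations, what fraction falls into each evenly spaced subsample interval? (The skewed toy data is made up.)

```python
import numpy as np

def interval_fractions(acts: np.ndarray, n_intervals: int = 5) -> np.ndarray:
    edges = np.linspace(0, acts.max(), n_intervals + 1)
    counts, _ = np.histogram(acts, bins=edges)
    return counts / counts.sum()

# Made-up skewed activations: the top interval ends up holding a tiny sliver.
rng = np.random.default_rng(0)
print(np.round(interval_fractions(rng.exponential(size=100_000)), 4))
```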
Whether a feature is or isn't its top activating examples is important because it constrains its usefulness:
Finding Specific Features
I'm still on my hobbyhorse of asking whether SAEs can find "all" features, or even a specific set of them of interest. This is something Anthropic does not address, and is separate from what they call "specificity". (Their specificity is p(feature is relevant | feature activates); my concern is p(feature is found by the SAE | feature is important).)
Ideally, the SAE would consistently find important features. But does the SAE consistently find any features?
I decided to do a check by tallying the "More Safety Relevant Features" from the 1M SAE to see if they reoccur in the 34M SAE (in some related form). By my count (see table below), 7/22 of them reoccur, and 15/22 do not. Since less than a third of features reoccur (despite a great increase in the number of features), I take this as evidence that the current approach to SAEs does not find a consistent set of features. This limits what we can expect SAEs to do: even if there's one special feature in Claude that would completely solve AI alignment, whether the SAE finds it may come down to the training seed, or (worse) the SAE may be predisposed against finding it.
My tally (feel free to skip):
Architecture - The Classics, but Wider
Architecture-wise, it seems Anthropic found that the classics work best: they are using a 1-hidden-layer neural network with ReLU activation, untied weights, and biases on the encoder and decoder. There's no special trick here like ghost grads, end-to-end SAEs, or gated SAEs.
Anthropic has also shifted their focus from the MLP layer of the transformer to the residual stream. The sparsity loss term has been rearranged so that the decoder matrix can have unnormalized rows while still contributing the same amount to sparsity loss. I greatly appreciate that Anthropic has spelled out their architecture, including subtler steps like their normalization.
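For concreteness, here's a minimal sketch of the architecture as I understand it (my paraphrase, not Anthropic's code; input normalization and training details are omitted): one hidden ReLU layer, untied encoder/decoder weights, biases on both, and a sparsity penalty weighted by decoder row norms so the decoder rows needn't be normalized.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):                             # x: (batch, d_model) residual stream
        f = torch.relu(x @ self.W_enc + self.b_enc)   # feature activations
        x_hat = f @ self.W_dec + self.b_dec           # reconstruction
        return x_hat, f

def sae_loss(x, x_hat, f, W_dec, l1_coeff: float = 5.0):
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()
    sparsity = (f * W_dec.norm(dim=-1)).sum(dim=-1).mean()  # f_i * ||W_dec_i||
    return recon + l1_coeff * sparsity
```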
But I was quietly surprised by how many features they were using in their sparse autoencoders (respectively 1M, 4M, or 34M). Assuming Claude Sonnet has the same architecture as GPT-3, its residual stream has dimension 12K so the feature ratios are 83x, 333x, and 2833x, respectively[1]. In contrast, my team largely used a feature ratio of 2x, and Anthropic's previous work "primarily focus[ed] on a more modest 8× expansion". It does make sense to look for a lot of features, but this seemed to be worth mentioning.
Correlations - Strangely Large?
Anthropic measured the correlations between their feature activations and the previous neurons, finding they were often near 0.3, and said that was pretty small. But unless I'm misunderstanding something, a correlation of 0.3 is very high!
I'll quote them in full before explaining my confusion (emphasis added):
So here's what I understand Anthropic as doing here: pick a feature at random. Look at its activations on some text (say, 100K tokens), and for each of the ~240K previous neurons[2], compute that neuron's activations on the same 100K tokens and its correlation with the feature's activations. The reported number is the maximum of this correlation over all the neurons.
But for a large number of samples, a correlation of 0.3 is insanely large! I wrote some python code that simulates a random process like this, and it doesn't even crack a correlation of 0.02!
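Here's a stripped-down sketch of the kind of null simulation I mean (not the exact script, and with a smaller neuron count for speed):

```python
import numpy as np

# One random sparse "feature" vs many independent random "neurons" on n_tokens samples,
# reporting the max correlation. The expected max only grows like
# sqrt(2*log(n_neurons)/n_tokens), so even 240K neurons stays well under 0.02.

rng = np.random.default_rng(0)
n_tokens, n_neurons, chunk = 100_000, 20_000, 500

feature = np.maximum(rng.standard_normal(n_tokens) - 1.5, 0)  # sparse, ReLU-like
f = (feature - feature.mean()) / feature.std()

max_corr = 0.0
for _ in range(n_neurons // chunk):
    neurons = rng.standard_normal((n_tokens, chunk), dtype=np.float32)
    neurons = (neurons - neurons.mean(axis=0)) / neurons.std(axis=0)
    corrs = f @ neurons / n_tokens                            # Pearson correlations
    max_corr = max(max_corr, float(np.abs(corrs).max()))

print(max_corr)  # ~0.015 here -- nowhere near 0.3
```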
My takeaway from this is the opposite of Anthropic's: the features are far more correlated with neurons than you'd expect by chance, even if they are not strongly correlated in an absolute sense. So I'm confused, and either I'm mistaken or the author of that section is.
Can anyone find a simple random process (ie write a modification to my simulator) that yields a correlation of 0.3 without strongly weighting individual neurons?
Future Tests
Here are some tests I'd love Anthropic to run to build on this work:
The size of the residual stream in Claude 3 Sonnet is not public knowledge. But as an estimate: this market has Claude 3 Opus (the larger version) at 1-2T parameters in its 25-75th percentiles. So let's bound Sonnet's size at 1T. Assuming Claude 3 uses the "standard" GPT-3 architecture, including n_layers=96, a residual stream of 30K puts it at 1T parameters. Thus I'm reasonably confident that the residual stream studied in this paper is ≤30K, so the feature ratios are ≥ 33x, 133x, 1133x.
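As a rough sketch of that arithmetic, using the standard ~12 · n_layers · d_model² approximation for a dense transformer's non-embedding parameters:

```python
n_layers, d_model = 96, 30_000   # assumed "standard" architecture with a 30K residual stream
print(12 * n_layers * d_model ** 2 / 1e12)  # ~1.04 (trillion parameters)
```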
If Sonnet has the same architecture as GPT-3, it would be 240K neurons = (12K residual stream dimension) * (48 layers) * (4 MLP neurons per layer per residual stream dimension).