Thanks Joel. I appreciated this. Wish I had time to write my own version of this. Alas.
Previously I’ve seen the rule of thumb “20-100 for most models”. Anthropic says:
We were saying this, and I think this might be an area of debate in the community for a few reasons. It could be that the "true L0" is actually very high. It could be that low-activating features aren't contributing much to your reconstruction, and so aren't actually an issue in practice. It's possible the right L1 or L0 is affected by model size, context length, or other details which aren't being accounted for in these debates. A thorough study examining post-hoc removal of low-activating or low-norm features could help. FWIW, it's not obvious to me that L0 should be lower / higher, and I think we should be careful not to cargo-cult the stat. Probably we're not at too much risk here, since we're discussing this out in the open already.
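A minimal sketch of that kind of post-hoc study, assuming an `sae` object with `encode`/`decode` methods and a batch of model activations `acts` (all names here are placeholders, not anyone's actual codebase): zero out features below a threshold and see how much the reconstruction degrades.

```python
# Sketch: does dropping low-activating features actually hurt reconstruction?
# `sae` and `acts` (shape (batch, d_model)) are assumed placeholders.
import torch

@torch.no_grad()
def recon_mse_after_threshold(sae, acts, threshold):
    feats = sae.encode(acts)                                   # (batch, n_features)
    kept = torch.where(feats >= threshold, feats, torch.zeros_like(feats))
    recon = sae.decode(kept)                                   # (batch, d_model)
    return ((recon - acts) ** 2).mean().item()

# Sweep thresholds: if MSE barely moves as the threshold rises, the
# low-activating features aren't doing much for reconstruction.
# for t in (0.0, 0.05, 0.1, 0.2, 0.4):
#     print(t, recon_mse_after_threshold(sae, acts, t))
```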
Having multiple different-sized SAEs for the same model seems useful. The dashboard shows feature splitting clearly. I hadn’t ever thought of comparing features from different SAEs using cosine similarity and plotting them together with UMAP.
Different SAEs, same activations. Makes sense, since it's notionally the same vector space. Apollo did this recently when comparing e2e vs vanilla SAEs. I'd love someone to come up with better measures of UMAP quality, as the primary issue with UMAPs is the risk of arbitrariness.
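For concreteness, here is a rough sketch of that comparison under some assumptions: `W_dec_a` and `W_dec_b` are decoder matrices from two SAEs trained on the same activations, we take best-match cosine similarity across the pair, and embed all decoder directions jointly with UMAP.

```python
# Sketch only: `W_dec_a` and `W_dec_b` are assumed (n_features, d_model)
# numpy arrays of decoder directions from two SAEs on the same model site.
import numpy as np
import umap  # pip install umap-learn

def normalise(W):
    return W / np.linalg.norm(W, axis=1, keepdims=True)

def best_match_cosine(W_a, W_b):
    # For each feature in A, cosine similarity to its nearest feature in B.
    return (normalise(W_a) @ normalise(W_b).T).max(axis=1)

def joint_umap(W_a, W_b, seed=0):
    combined = np.concatenate([normalise(W_a), normalise(W_b)])
    labels = np.array([0] * len(W_a) + [1] * len(W_b))  # which SAE each point came from
    embedding = umap.UMAP(metric="cosine", random_state=seed).fit_transform(combined)
    return embedding, labels  # scatter-plot the embedding, coloured by labels
```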
Neither of these plots seems great. They both suggest to me that these SAEs are “leaky” in some sense at lower activation levels, but in opposite ways:
This could be bad. Could also be that the underlying information is messy and there's interference or other weird things going on. Not obvious that it's bad measurement as opposed to messy phenomena imo. Trying to distinguish the two seems valuable.
4. On Scaling
Yup. Training simultaneously could be good, but it's an engineering challenge. I would start by reimplementing good proofs of concept that suggest this is feasible and show how to do it. I'd also like to point out that this isn't the first time a science has had this issue.
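As one purely illustrative shape for such a proof of concept (the `model.layers` attribute, the SAE API, and the tensor-vs-tuple handling are all assumptions), you could feed one SAE per layer from a single forward pass of the base model:

```python
# Sketch: train an SAE on every layer using one forward pass per batch.
# `model.layers`, the SAE encode/decode API, and the layer output format
# are placeholder assumptions.
import torch

def train_step_all_layers(model, saes, optimizers, batch_tokens):
    caches = {}
    handles = [
        layer.register_forward_hook(
            # default arg i=i binds the layer index at definition time
            lambda _mod, _inp, out, i=i: caches.__setitem__(
                i, (out[0] if isinstance(out, tuple) else out).detach()
            )
        )
        for i, layer in enumerate(model.layers)
    ]
    with torch.no_grad():
        model(batch_tokens)              # one forward pass populates every cache
    for h in handles:
        h.remove()

    for i, sae in enumerate(saes):
        acts = caches[i].reshape(-1, caches[i].shape[-1])   # (tokens, d_model)
        recon = sae.decode(sae.encode(acts))
        loss = ((recon - acts) ** 2).mean()                 # plus an L1 term in practice
        optimizers[i].zero_grad()
        loss.backward()
        optimizers[i].step()
```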
On some level I think this challenge directly parallels bioinformatics / gene sequencing. They needed a Human Genome Project because it was expensive and ambitious and individual actors couldn't do it on their own. But collaborating is hard. Maybe EA in particular can get the ball rolling here faster than it otherwise would. The NDIF / Bau Lab might also be a good banner to line up behind.
I didn’t notice many innovations here -- it was mostly scaling pre-existing techniques to a larger model than I had seen previously. The good news is that this worked well. The bad news is that none of the old challenges have gone away.
Agreed. I think the point was basically scale. Criticisms along the lines of "this isn't tackling the hard part of the problem or proving interp is actually useful" are unproductive if that wasn't the intention. Anthropic has 3 teams now and counting doing this stuff. They're definitely working on a bunch of harder / other stuff that maybe focuses on the real bottlenecks.
1. How Many Features are Active at Once?
Previously I’ve seen the rule of thumb “20-100 for most models”. Anthropic says:
2. Splitting SAEs
Having multiple different-sized SAEs for the same model seems useful. The dashboard shows feature splitting clearly. I hadn’t ever thought of comparing features from different SAEs using cosine similarity and plotting them together with UMAP.
3. Leaky Features
Neither of these plots seems great. They both suggest to me that these SAEs are “leaky” in some sense at lower activation levels, but in opposite ways:
For reference, here are the meanings of the specificity scores:
Note the low bar for a score of 1, and compare how much of the area the 0- and 1-scored activations take up. It looks to me like we can really only trust features above a rough 0.3-0.4 activation level. But note what a small fraction of the total activations have that strength!
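To put a number on "what a small fraction", something like the following sketch would do, assuming `feature_acts` is a 1-D array of one feature's activations over a dataset (the array name and the 0.35 default are placeholders):

```python
# Sketch: fraction of a feature's non-zero activations that clear a threshold
# expressed as a fraction of that feature's max activation.
import numpy as np

def fraction_above(feature_acts, rel_threshold=0.35):
    acts = feature_acts[feature_acts > 0]
    if acts.size == 0:
        return 0.0
    return float((acts >= rel_threshold * acts.max()).mean())

# e.g. compare fraction_above(feature_acts, 0.3) with fraction_above(feature_acts, 0.4)
```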
4. On Scaling
They don’t give the exact model size (either the depth or d_model). But as a very rough estimate, suppose their model has depth 100. This paper is about a single layer (somewhere in the middle of the model). Imagine doing all of this work 100 times! First, the cost of training multiple SAEs, and then the cost of analyzing them. The analysis can probably be mostly automated, but that’s still going to be expensive (and take time).
I’m interested in ideas for training SAEs for all layers simultaneously, but if you imagine SAEs expanding the model by 32x (for example), then this would naively take 32x the compute of training the original model, or at least 32x memory if not 32x FLOPs. (This analysis is naive because they’re expanding the residual stream, not the actual MLP / attention parameters, but it should be directionally correct).
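To make the naive version of that argument concrete, here is a back-of-envelope sketch; every number in it is an illustrative assumption, not a figure from the paper.

```python
# Back-of-envelope sketch of the naive scaling argument above.
# Every number here is an illustrative assumption, not a figure from the paper.
d_model = 4096        # assumed residual stream width
expansion = 32        # SAE hidden width = expansion * d_model
n_layers = 100        # rough depth guess used in the text
train_tokens = 1e9    # assumed number of SAE training tokens

# One SAE forward pass per token: encoder + decoder matmuls,
# each ~2 * d_model * (expansion * d_model) FLOPs.
fwd_flops_per_token = 2 * (2 * d_model * expansion * d_model)
train_flops_per_token = 3 * fwd_flops_per_token   # crude forward + backward multiplier

total_flops = train_flops_per_token * train_tokens * n_layers

# Parameter memory if every layer's SAE is held in memory at once (fp32):
params_per_sae = 2 * d_model * expansion * d_model
param_bytes = params_per_sae * n_layers * 4

print(f"total SAE training FLOPs ~ {total_flops:.1e}")
print(f"parameter memory, all layers ~ {param_bytes / 1e9:.0f} GB (fp32)")
```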
All of this work is going to look very similar across different layers, with subtle shifts in meaning (in the same way that the Logit Lens treats all layers as meaning the same thing but the Tuned Lens corrects for this).
5. Good News / Bad News
I didn’t notice many innovations here -- it was mostly scaling pre-existing techniques to a larger model than I had seen previously. The good news is that this worked well. The bad news is that none of the old challenges have gone away.
6. Features Still Seem Crude and Hard to Steer With
We’d really like to understand the model’s model of the world. For example, when working with my coworkers I have a good idea of what they know or don’t, including some idea of their general background, strengths and weaknesses, what we’ve worked on together, and their current state of mind. I’d expect language models to model their interlocutor in a similar way, but the best we can currently say is “300 features, including Openness and Honesty, etc, are active.”
7. Predicting the Presence of a Feature
There’s a very nice fit on this curve.
8. Missing Features
You could potentially solve this by scaling SAEs way up, but that just makes the compute challenges even worse. I haven’t seen any research on this but you could imagine training the SAE to generate specific features that you want to appear.
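One hypothetical version of "training the SAE to generate a feature you want": add a supervised auxiliary loss that pushes a designated latent to fire on examples labelled with the target concept. This is a sketch of the idea, not an established method; the SAE API, coefficients, and hinge-style penalty are all assumptions.

```python
# Hypothetical sketch: reserve latent 0 for a concept we have labels for, and
# penalise it for being quiet on positives or active on negatives.
import torch
import torch.nn.functional as F

def sae_loss_with_seeded_feature(sae, acts, concept_labels,
                                 l1_coeff=1e-3, seed_coeff=1.0, margin=1.0):
    feats = sae.encode(acts)                       # (batch, n_features), non-negative
    recon = sae.decode(feats)
    recon_loss = ((recon - acts) ** 2).mean()
    sparsity_loss = feats.abs().sum(dim=-1).mean()

    pos = concept_labels.bool()                    # (batch,) concept present / absent
    seed_loss = acts.new_zeros(())
    if pos.any():
        seed_loss = seed_loss + F.relu(margin - feats[pos, 0]).mean()   # fire on positives
    if (~pos).any():
        seed_loss = seed_loss + feats[~pos, 0].mean()                   # stay quiet otherwise

    return recon_loss + l1_coeff * sparsity_loss + seed_coeff * seed_loss
```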
9. The Thatcher Feature
Why does the Thatcher feature treat her name so inconsistently, in terms of which tokens it fires strongly or weakly on?
10. The Lincoln Feature
The Lincoln feature is remarkably clean and clearly shows how smoothly the model handles different tokenizations.
11. The Rwanda Feature
It feels a bit implausible to me that this is genuinely a Rwanda feature if it doesn't fire on "Rwanda" at all.
12. The Los Angeles Feature
323 and 213 are apparently LA area codes. Why does the feature fire more strongly on later tokens, which aren't LA-specific? Similarly with the URLs.
13. Activation Strength vs Attribution
14. Other Takes