I'd definitely be interested if you have any takes on tied vs. untied weights.
It seems like the goal is to get the maximum expressive power for the autoencoder, while still learning features that are linear. So are untied weights good because they're more expressive, or are they bad because they encourage the model to start exploiting nonlinearity?
Note: My take does not necessarily represent the takes of my coauthors (Hoagy, Logan, Lee, Robert, etc.). Or it might, but they may frame it differently. Take this as strictly my take.
My take is that the goal isn't strictly to get maximum expressive power under the assumptions detailed in Toy Models of Superposition. For instance, Anthropic found that FISTA-based dictionaries didn't work as well as sparse autoencoders, even though they are better in the sense that they can achieve lower reconstruction loss at the same level of sparsity. We might find that the sparsity-monosemanticity link breaks down at higher levels of autoencoder expressivity, although this needs to be rigorously tested.
To answer your question: I think Hoagy thinks that tied weights are more similar to how an MLP might use features during a forward pass, which would involve extracting the feature through a simple dot-product. I'm not sure I buy this, as having untied weights is equivalent to allowing the model to express simple linear computations like 'feature A activation = dot product along feature A direction - dot product along feature B direction', which could be a form of denoising if A and B were mutually exclusive but non-orthogonal features.
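To make the denoising point concrete, here is a minimal NumPy sketch. It is a toy illustration, not anything from either paper: the 2D directions `d_a` and `d_b`, the 60-degree angle between them, and the helper `activation` are all assumptions chosen for the example. It compares a tied encoder row (the feature direction itself) with an untied row of the form "direction A minus direction B" on an input where only feature B is active.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Two unit-norm, non-orthogonal feature directions (hypothetical, 60 degrees apart).
theta = np.pi / 3
d_a = np.array([1.0, 0.0])
d_b = np.array([np.cos(theta), np.sin(theta)])

# Tied weights: the encoder row for feature A is its decoder direction.
tied_enc_a = d_a
# Untied weights: the encoder row can be "along A minus along B".
untied_enc_a = d_a - d_b

def activation(enc_row, x):
    """Feature activation: ReLU of a single dot product (bias omitted)."""
    return float(relu(enc_row @ x))

# Inputs where exactly one of the mutually exclusive features is active.
x_only_b = d_b
x_only_a = d_a

tied_on_b = activation(tied_enc_a, x_only_b)      # spurious: A fires because of overlap with B
untied_on_b = activation(untied_enc_a, x_only_b)  # overlap is cancelled, ReLU clips to zero
untied_on_a = activation(untied_enc_a, x_only_a)  # A still fires, at a reduced scale

print(tied_on_b, untied_on_b, untied_on_a)
```

In this toy setup the tied row reads off about 0.5 of feature A when only B is present, while the untied row reads zero and still responds to A. Whether trained autoencoders actually learn encoder rows like this is an empirical question the sketch does not address.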
Good question! I started writing and when I looked up I had a half-dozen takes, so sorry if these are rambly. Also let me give the caveat that I wasn't on the training side of the project so these are less informed than Hoagy, Logan, and Aidan's views:
This shows that we are in a period of multiple discovery.
I interpret this as a period when many teams are racing for low-hanging fruit, and it reinforces that we are in scientific race dynamics where no one team is critical, and slowing down research requires stopping all teams.
This is cool! These sparse features should be easily "extractable" by the transformer's key, query, and value weights in a single layer. Therefore, I'm wondering if these weights can somehow make it easier to "discover" the sparse features?
This is something we're planning to look into! From the paper:
Future efforts could also try to improve feature dictionary discovery by incorporating information about the weights of the model or dictionary features found in adjacent layers into the training process.
Exactly how to use them is something we're still working on...
Readers may have noticed many similarities between Anthropic's recent publication Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (LW post) and my team's recent publication Sparse Autoencoders Find Highly Interpretable Directions in Language Models (LW post). Here I want to compare our techniques and highlight what we did similarly or differently. My hope in writing this is to help readers understand the similarities and differences, and perhaps to lay the groundwork for a future synthesis approach.
First, let me note that we arrived at similar techniques in similar ways: both Anthropic and my team follow the lead of Lee Sharkey, Dan Braun, and beren's [Interim research report] Taking features out of superposition with sparse autoencoders, though I don't know how directly Anthropic was inspired by that post. I believe both our teams were pleasantly surprised to find out the other was working along similar lines, which serves as a form of replication.
Some disclaimers: This list may be incomplete. I didn't give Anthropic a chance to give feedback on this, so I may have misrepresented some of their work, including by omission. Any mistakes are my own fault.
Target of Dictionary Learning/Sparse Autoencoding
A primary difference is that we looked for language model features in different parts of the model. My team trained our sparse autoencoder on the residual stream of a language model, whereas Anthropic trained on the activations in the MLP layer.
These objects have some fundamental differences. For instance, the residual stream is (potentially) almost completely linear whereas the MLP activations have just gotten activated, so their values will be positive-skewed. However, it's encouraging that this technique seems to work on both the MLP layer and residual stream. Additionally, my coauthor Logan Riggs successfully applied it to the output of the attention sublayer, so both in theory and in practice the dictionary learning approach seems to work well on each part of a language model.
Language Model Used
Another set of differences comes from which language model our teams used to train the autoencoders. My team used Pythia-70M and Pythia-410M, whereas Anthropic's language model was custom-trained for this study (I think). Some differences in the language model architectures:
Sparse Autoencoder Architecture
Similarities:
But some significant differences remain:
In other words, we perform this calculation:
$$\hat{x} = W^T \, \mathrm{ReLU}(W x + b)$$
whereas Anthropic does this calculation:
$$\hat{x} = V^T \, \mathrm{ReLU}(W x + b) + c$$
more precisely, Anthropic writes their calculation in these terms:
$$\bar{x} = x - b_d, \qquad f = \mathrm{ReLU}(W_e \bar{x} + b_e), \qquad \hat{x} = W_d f + b_d$$
which is equivalent to the above with $b = -W_e b_d + b_e$, $W = W_e$, $V^T = W_d$, and $c = b_d$.
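The relationship between these forms can be checked numerically. The following is a minimal NumPy sketch under stated assumptions: the function names (`tied_sae`, `untied_sae`, `anthropic_form`), the tiny dimensions, and the random parameters are all hypothetical and chosen for illustration; none of this is either team's actual code. It verifies that the pre-centered form matches the untied form after folding the decoder bias into the encoder bias, and that the tied form is the special case where the decoder reuses the encoder matrix and there is no output bias.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tied_sae(x, W, b):
    """x_hat = W^T ReLU(W x + b): decoder is the transpose of the encoder."""
    return W.T @ relu(W @ x + b)

def untied_sae(x, W, V, b, c):
    """x_hat = V^T ReLU(W x + b) + c: separate decoder matrix and output bias."""
    return V.T @ relu(W @ x + b) + c

def anthropic_form(x, W_e, W_d, b_e, b_d):
    """Pre-centered form: subtract the decoder bias before encoding."""
    x_bar = x - b_d
    f = relu(W_e @ x_bar + b_e)
    return W_d @ f + b_d

# Toy sizes and random parameters (illustrative only).
rng = np.random.default_rng(0)
d_model, d_dict = 4, 8
W_e = rng.normal(size=(d_dict, d_model))
W_d = rng.normal(size=(d_model, d_dict))
b_e = rng.normal(size=d_dict)
b_d = rng.normal(size=d_model)
x = rng.normal(size=d_model)

# Map the pre-centered parameters onto the untied form:
# W = W_e, V^T = W_d, b = b_e - W_e b_d, c = b_d.
x_hat_anthropic = anthropic_form(x, W_e, W_d, b_e, b_d)
x_hat_untied = untied_sae(x, W=W_e, V=W_d.T, b=b_e - W_e @ b_d, c=b_d)
forms_agree = np.allclose(x_hat_anthropic, x_hat_untied)

# The tied form is the untied form with V = W and c = 0.
x_hat_tied = tied_sae(x, W_e, b_e)
tied_is_special_case = np.allclose(
    x_hat_tied, untied_sae(x, W=W_e, V=W_e, b=b_e, c=np.zeros(d_model))
)

print(forms_agree, tied_is_special_case)
```

The sketch only covers the forward pass. It says nothing about training details such as the loss or normalization, which may differ between the two setups.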
Sparse Autoencoder Training
There are two main differences between how we trained our sparse autoencoders and how Anthropic trained theirs:
Checking Success
[Epistemic status warning: I'm less sure I've fully captured Anthropic's work in this section.]
Finally, how did we decide the features were interpretable?
Our team also performed these measures:
Anthropic also performed these measures:
Thanks to Logan and Aidan for feedback on an earlier draft of this post.