seems great for mechanistic anomaly detection! very intuitive to map ADP to surprise accounting (I was vaguely trying to get at a method like ADP here)
Agree! I'd be excited by work that uses APD for MAD, or even just work that applies APD to Boolean circuit networks. We did consider using them as a toy model at various points, but ultimately opted to go for other toy models instead.
(btw typo: *APD)
IMO most exciting mech-interp research since SAEs, great work
I think so too! (assuming it can be made more robust and scaled, which I think it can)
And thanks! :)
We're aware of model diffing work like this, but I wasn't aware of this particular paper.
It's probably an edge case: They do happen both to be in weight space and to be suggestive of weight space linearity. Indeed, our work was informed by various observations from a range of areas that suggest weight space linearity (some listed here).
On the other hand, our work focused on decomposing a given network's parameters. But the line of work you linked above seems more in pursuit of model editing and understanding the difference between two similar models, rathe...
It would be interesting to meditate on the question "What kind of training procedure could you use to get a meta-SAE directly?" And I think answering this relies in part on a mathematical specification of what you want.
At Apollo we're currently working on something that we think will achieve this. Hopefully will have an idea and a few early results (toy models only) to share soon.
So I believe I had in mind "active means [is achieved deliberately through the agent's actions]".
I think your distinction makes sense. And if I ever end up updating this article, I would consider incorporating it. However, I think the reason I didn't make this distinction at the time is that the difference is pretty subtle.
The mechanisms I labelled as "strictly active" are the kind of strategy that it would be extremely improbable to implement successfully without some sort of coherent internal representations to help orchestrate the actions requir...
Extremely glad to see this! The Guez et al. model has long struck me as one of the best instances of a mesaoptimizer and it was a real shame that it was closed source. Looking forward to the interp findings!
Both of these seem like interesting directions (I had parameters in mind, but params and activations are too closely linked to ignore one or the other). And I don't have a super clear idea, but something like representational similarity analysis between SwitchSAEs and regular SAEs could be interesting. This is just one possibility of many though. I haven't thought about it for long enough to be able to list many more, but it feels like a direction with low-hanging fruit for sure. For papers, here's a good place to start for RSA: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3730178/
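To gesture at what that comparison might look like, here's a minimal sketch (my own toy formulation, not taken from the RSA paper above; the .npy filenames and array shapes are hypothetical, assuming you've saved feature activations from both SAE variants on the same set of inputs):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(acts):
    """Representational dissimilarity matrix: pairwise correlation distances
    between the activation patterns evoked by each input."""
    return pdist(acts, metric="correlation")  # condensed upper-triangle vector

# Hypothetical saved activations on the same n_inputs inputs:
# shapes (n_inputs, n_switch_features) and (n_inputs, n_regular_features).
switch_acts = np.load("switch_sae_acts.npy")
regular_acts = np.load("regular_sae_acts.npy")

# The RDMs have the same length regardless of how many features each SAE has,
# so the two representational geometries can be compared directly.
rho, _ = spearmanr(rdm(switch_acts), rdm(regular_acts))
print(f"RSA similarity (Spearman rho between RDMs): {rho:.3f}")
```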
Great work! Very excited to see work in this direction. (In fact, I didn't know you were working on this, so I'd expressed enthusiasm for MoE SAEs in our recent list of project ideas, published just a few days ago!)
Comments:
...Following Fedus et al., we route to a single expert SAE. It is possible that selecting several experts will improve
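For readers unfamiliar with the routing choice under discussion, here's a rough sketch of the difference between single-expert and top-k routing (not the authors' implementation; `router` and `experts` are hypothetical modules):

```python
import torch

def route_topk(x, router, experts, k=1):
    """Sketch of expert-SAE routing: k=1 routes each input to a single expert
    (Switch-style, following Fedus et al.); k>1 combines several experts,
    weighted by softmaxed router scores over the selected experts."""
    logits = router(x)                           # (batch, n_experts)
    top_scores, top_idx = logits.topk(k, dim=-1)
    weights = torch.softmax(top_scores, dim=-1)  # renormalise over chosen experts
    recon = torch.zeros_like(x)
    for rank in range(k):
        for e, expert in enumerate(experts):
            mask = top_idx[:, rank] == e         # inputs sent to expert e at this rank
            if mask.any():
                recon[mask] += weights[mask, rank : rank + 1] * expert(x[mask])
    return recon
```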
Thanks! Fixed now
I've no great resources in mind for this. Off the top of my head, a few examples of ideas that are common in comp neuro that might have bought mech interp some time if more people in mech interp were familiar with them when they were needed:
- Polysemanticity/distributed representations/mixed selectivity
- Sparse coding
- Representational geometry
I'm not sure what future mech interp needs will be, so it's hard to predict which ideas or methods common in neuro will be useful (topological data analysis? Dynamical systems?). But it ju...
I'll add a strong plus one to this and a note for emphasis:
Representational geometry is already a long-standing theme in computational neuroscience (e.g. Kriegeskorte et al., 2013).
Overall I think mech interp practitioners would do well to pay more attention to ideas and methods in computational neuroscience. I think mech interp as a field has a habit of overlooking some hard-won lessons learned by that community.
I'm pretty sure that there's at least one other MATS group (unrelated to us) currently working on this, although I'm not certain about any of the details. Hopefully they release their research soon!
There's recent work published on this here by Chris Mathwin, Dennis Akar, and me. The gated attention block is a kind of transcoder adapted for attention blocks.
Nice work by the way! I think this is a promising direction.
Note also the similar, but substantially different, use of the term transcoder here, whose problems were pointed out to me by...
Trying to summarize my current understanding of what you're saying:
Yes all four sound right to me.
To avoid any confusion, I'd just add an emphasis that the descriptions are mathematical, as opposed to semantic.
...I'd guess you have intuitions that the "short description length" framing is philosophically the right one, and I probably don't quite share those and feel more confused how to best think about "short descriptions" if we don't just allow arbitrary Turing machines (basically because deciding what allowable "parts" or mathematical objects are seems
Hm, I think of the description of the (network, dataset) as scaling multiplicatively with the size of the network and the size of the dataset. In the thread with Erik above, I touched a little bit on why:
"SAEs (or decompiled networks that use SAEs as the building block) are supposed to approximate the original network behaviour. So SAEs are mathematical descriptions of the network, but not of the (network, dataset). What's a mathematical description of the (network, dataset), then? It's just what you get when you pass the dataset through the network; this datum interacts with thi...
Is there some formal-ish definition of "explanation of (network, dataset)" and "mathematical description length of an explanation" such that you think SAEs are especially short explanations? I still don't think I have whatever intuition you're describing, and I feel like the issue is that I don't know how you're measuring description length and what class of "explanations" you're considering.
I'll register that I prefer using 'description' instead of 'explanation' in most places. The reason is that 'explanation' invokes a notion of understanding, which requ...
Thanks Aidan!
I'm not sure I follow this bit:
In my mind, the reconstruction loss is more of a non-degeneracy control to encourage almost-orthogonality between features.
I don't currently see why reconstruction would encourage features to be different directions from each other in any way unless paired with an L_p penalty where 0 < p < 1. And I specifically don't mean L1, because in toy data settings with recon+L1, you can end up with features pointing in exactly the same direction.
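For concreteness, a minimal sketch of what I mean (my own toy formulation): swapping the usual L1 term for an L_p penalty with 0 < p < 1 makes it costly to spread one feature's activation mass across two dictionary elements that point in the same direction, whereas L1 is indifferent to that split, since a + b is the same however you divide it.

```python
import torch

def sae_loss(x, x_hat, acts, p=0.5, sparsity_coeff=1e-3):
    """Reconstruction loss plus an L_p sparsity penalty on feature activations.
    p=1 recovers the standard L1 penalty. For 0 < p < 1, a^p + b^p >= (a+b)^p,
    so splitting the same activation mass across duplicate dictionary elements
    is penalised more than concentrating it in one element."""
    recon_loss = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = acts.abs().pow(p).sum(dim=-1).mean()
    return recon_loss + sparsity_coeff * sparsity
```

(In practice the gradient of |a|^p blows up near zero for p < 1, so this is a sketch of the incentive, not a recommendation to train with it naively.)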
Thanks Erik :) And I'm glad you raised this.
One of the things that many researchers I've talked to don't appreciate is that, if we accept networks can do computation in superposition, then we also have to accept that we can't just understand the network alone. We want to understand the network's behaviour on a dataset, where the dataset contains potentially lots of features. And depending on the features that are active in a given datum, the network can do different computations in superposition (unlike in a linear network that can't do s...
So, for models that are 10 terabytes in size, you should perhaps be expecting a "model manual" which is around 10 terabytes in size.
Yep, that seems reasonable.
I'm guessing you're not satisfied with the retort that we should expect AIs to do the heavy lifting here?
Or perhaps you don't think you need something which is close in accuracy to a full explanation of the network's behavior.
I think the accuracy you need will depend on your use case. I don't think of it as a globally applicable quantity for all of interp.
For instance, maybe to 'aud...
Thanks for this feedback! I agree that the task & demo you suggested should be of interest to those working on the agenda.
It makes me a bit worried that this post seems to implicitly assume that SAEs work well at their stated purpose.
There were a few purposes proposed, and at multiple levels of abstraction, e.g.
I'm g...
I'm curious if you believe that, even if SAEs aren't the right solution, there realistically exists a potential solution that would allow researchers to produce succinct, human-understandable explanations that allow for recovering >75% of the training compute of model components?
There isn't any clear reason to think this is impossible, but there are multiple reasons to think this is very, very hard.
I think highly ambitious bottom-up interpretability (which naturally pursues this sort of goal) seems like a decent bet overall, but seems unlikely to su...
This is a good idea and is something we're (Apollo + MATS stream) working on atm. We're planning on releasing our agenda related to this and, of course, results whenever they're ready to share.
Makes sense! Thanks!
Great! I'm curious, what was it about the sparsity penalty that you changed your mind about?
Hey thanks for your review! Though I'm not sure that either this article or Cunningham et al. can reasonably be described as a reproduction of Anthropic's results (by which I assume you're talking about Bricken et al.), given their relative timings and contents.
Comments on the outcomes of the post:
Here is a reference that supports the claim using simulations: https://royalsocietypublishing.org/doi/10.1098/rspb.2008.0877
But I think you're right to flag it - other references don't really support it as the main reason for stripes: https://www.nature.com/articles/ncomms4535
Thanks Akash!
I agree that this feels neglected.
Markus Anderljung recently tweeted about some upcoming related work from Jide Alaga and Jonas Schuett: https://twitter.com/Manderljung/status/1663700498288115712
Looking forward to it coming out!
Bilinear layers - Not confident at all! They might make network structure more amenable to mathematical analysis, so they might help? But as yet there aren't any empirical interpretability wins that have come from bilinear layers. (A minimal sketch of the layer itself is included after these answers.)
Dictionary learning - This is one of my main bets for comprehensive interpretability.
Other areas - I'm also generally excited by the line of research outlined in https://arxiv.org/abs/2301.04709
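To make the bilinear-layers answer above concrete, here's a minimal sketch of the layer I have in mind, in the sense of the GLU variants of Shazeer (2020) (a sketch only):

```python
import torch
import torch.nn as nn

class BilinearLayer(nn.Module):
    """y = (x W) * (x V): a gated linear unit with no elementwise nonlinearity.
    Because the output is purely quadratic in x, the layer can be written in
    closed form with a third-order tensor, which is what might make its
    structure more amenable to mathematical analysis."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.V = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        return self.W(x) * self.V(x)
```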
No theoretical reason - The method we used in the Interim Report to combine the two losses into one metric was pretty cursed. It's probably just better to use L1 loss alone and reconstruction loss alone and then combine the findings. But having plots for both losses would have added more plots without much gain for the presentation. It was also the metric where it was hardest to discern the difference between full recovery and partial recovery, because the differences were kind of subtle. In future work, some way to use the losses to measure feature recovery will probably be re-introduced. It probably just won't be the way we used in the Interim Report.
I strongly suspect this is the case too!
In fact, we might be able to speed up the learning of common features even further:
Pierre Peigné at SERIMATS has done some interesting work that looks at initialization schemes that speed up learning. If you initialize the autoencoders with a sample of datapoints (e.g. initialize the weights with a sample from the MLP activations dataset), each of which we assume to contain a linear combination of only a few of the ground truth features, then the initial phases of feature recovery are much faster*. We haven't ha...
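Roughly the kind of initialization being described, as I understand it (a sketch, not Pierre's actual code; `activations` would be the MLP activations dataset):

```python
import torch

def init_dictionary_from_data(activations, n_dict_elements):
    """Initialise dictionary directions from randomly sampled datapoints.
    Each sampled activation vector is assumed to be a linear combination of only
    a few ground-truth features, so it already lies close to the span of a small
    number of true feature directions, speeding up early feature recovery."""
    idx = torch.randperm(activations.shape[0])[:n_dict_elements]
    samples = activations[idx]                            # (n_dict_elements, d_activations)
    return samples / samples.norm(dim=-1, keepdim=True)   # unit-norm dictionary rows
```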
And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven't observed others to notice on their own.
I just want to point out that I've written a long list of such obstacles in this article: Circumventing interpretability: How to defeat mind-readers
I believe the example of deep deception that Nate describes in this post is actually a combination of several methods described in that post.
I'll quote the parts of this post that correspond to particular interpretability circumvention methods in the ot...
Thanks for your interest!
The autoencoder losses reported are the train losses. And you're right to point at noise potentially being an issue. It's my strong suspicion that some of the problems in these results are due to there being an insufficient number of data points to train the autoencoders on LM data.
> I would also be interested to test a bit more if this method works on toy models which clearly don't have many features, such as a mixture of a dozen of gaussians, or random points in the unit square (where there is a lot of room "in the corne...
My usual starting point is “maybe people will make a model-based RL AGI / brain-like AGI”. Then this post is sorta saying “maybe that AGI will become better at planning by reading about murphyjitsu and operations management etc.”, or “maybe that AGI will become better at learning by reading Cal Newport and installing Anki etc.”. Both of those things are true, but to me, they don’t seem safety-relevant at all.
Hm, I don't think this quite captures what I view the post as saying.
...Maybe what you’re thinking is: “Maybe Future Company X will program a
That's correct. 'Correlated features' could ambiguously mean "Feature x tends to activate when feature y activates" OR "When we generate feature direction x, its distribution is correlated with feature y's". I don't know if both happen in LMs. The former almost certainly does. The latter doesn't really make sense in the context of LMs, since features are learned, not sampled from a distribution.
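A toy-data sketch of the two readings (the generative choices here are entirely made up, just to pin the distinction down):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 64, 10_000

# Reading 1: correlated *activations* -- feature y is more likely to fire when x fires.
x_active = rng.random(n_samples) < 0.1
y_active = np.where(x_active, rng.random(n_samples) < 0.5, rng.random(n_samples) < 0.05)

# Reading 2: correlated *directions* -- the feature vectors themselves are geometrically
# similar (only meaningful when directions are sampled, as in toy models, not learned).
dir_x = rng.normal(size=d)
dir_x /= np.linalg.norm(dir_x)
dir_y = dir_x + 0.3 * rng.normal(size=d)   # a noisy copy of dir_x, hence high cosine sim
dir_y /= np.linalg.norm(dir_y)

print("P(y active | x active):", y_active[x_active].mean())
print("cosine(dir_x, dir_y):", float(dir_x @ dir_y))
```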
There should be a neat theoretical reason for the clean power law where L1 loss becomes too big. But it doesn't make intuitive sense to me - it seems like if you just add some useless entries in the dictionary, the effect of losing one of the dimensions you do use on reconstruction loss won't change, so why should the point where L1 loss becomes too big change? So unless you have a bug (or some weird design choice that divides loss by number of dimensions), those extra dimensions would have to be changing something.
The L1 loss on the activations does indee...
In the toy datasets, the features have the same scale (uniform from zero to one when active multiplied by a unit vector). However in the NN case, there's no particular reason to think the feature scales are normalized very much (though maybe they're normalized a bit due to weight decay and similar). Is there some reason this is ok?
Hm, it's a great point. There's no principled reason for it. Equivalently, there's no principled reason to expect the coefficients/activations for each feature to be on the same scale either. We should probably look into a ...
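For reference, the contrast being pointed at, in toy-data form (a hypothetical generative sketch, not our actual data-generating code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d, n_samples = 256, 64, 1_000

directions = rng.normal(size=(n_features, d))
directions /= np.linalg.norm(directions, axis=-1, keepdims=True)  # unit-norm feature vectors

active = rng.random((n_samples, n_features)) < 0.02                # sparse activation mask
coeffs = rng.random((n_samples, n_features)) * active              # uniform(0,1) when active

# Toy setting: all features share the same scale.
data_normalised = coeffs @ directions

# Closer to the NN setting: each feature has its own (unknown) scale.
scales = np.exp(rng.normal(size=n_features))                       # e.g. log-normal spread
data_unnormalised = (coeffs * scales) @ directions
```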
This sounds really reasonable. I had only been thinking of a naive version of interpretability tools in the loss function that doesn't attempt to interpret the gradient descent process. I'd be genuinely enthusiastic about the strong version you outlined. I expect to think a lot about it in the near future.
Thanks! Amended.
Figure 3 in this paper (AtP*, Kramar et al.) illustrates the point nicely: https://arxiv.org/abs/2403.00745