Armaan A. Abraham - LessWrong

Deep sparse autoencoders yield interpretable features too

This is great. I'm a bit surprised you get such a big performance improvement from adding additional sparse layers; all of my experiments above have been adding non-sparse layers, but it looks like the MSE benefit you're getting with added sparse layers is in the same ballpark. You have certainly convinced me to try muon.

Another approach that I've (very recently) found quite effective in reducing the number of dead neurons with minimal MSE hit has been adding a small penalty term on the standard deviation of the encoder pre-act (i.e., before the top-k) means across the batch dimension. This has basically eliminated my dead neuron woes and this is what I'm currently running with. I'll probably try this in combination with muon sometime over the next couple of days.

And these ideas all sound great.

Deep sparse autoencoders yield interpretable features too

Armaan A. Abraham1mo20

I will look into these optimizers, thank you for the tip!

I was aware that dead neurons get excluded from the auto-interpretability pipeline. My comment about dead neurons affecting the score was more about the effective reduction in sparse dimension due to the dead neurons being an issue for the neurons that are alive.

I would be very interested in any progress you make related to more granular automated interpretability information. Do you currently have any ideas for what this might look like? I've given it a tiny bit of thought, but haven't gotten very far.

Deep sparse autoencoders yield interpretable features too

Armaan A. Abraham1mo40

Ah, I was unaware of that paper and it is indeed relevant to this, thank you! Yes, by "dense" or "non-sparse" layer, I mean a nonlinearity. So, that paper's MLP SAE is similar to what I do here, except it is missing MLPs in the decoder. Early on, I experimented with such an architecture with encoder-only MLPs, because (1) as to your final point, the lack of nonlinearity in the output potentially helps it fit into other analyses and (2) it seemed much more likely to me to exhibit monosemantic features than an SAE with MLPs in the decoder too. But, after seeing some evidence that its dead neuron problems reacted differently to model ablations than both the shallow SAE and the deep SAE with encoder+decoder MLPs, I decided to temporarily drop it. I figured that if I found that the encoder+decoder MLP SAE features were interpretable, this would be a more surprising/interesting result than the encoder-only MLP SAE and I would run with it, and if not, I would move to the encoder-only MLP SAE.

I trained on 7.5e9 tokens.

As I mentioned in my response to your first question, I did experiment early on with the encoder-only MLP, but the architectures in this post are the only ones I looked at in depth for GPT2.

This is a good point, and I should have probably included this in the original post. As you said, one of the major limitations of this approach is that the added nonlinearities obscure the relationship between deep SAE features and upstream / downstream mechanisms of the model. In the scenario where adding more layers to SAEs is actually useful, I think we would be giving up on this microscopic analysis, but also that this might be okay. For example, we can still examine where features activate and generate/verify human explanations for them. And the idea is that the extra layers would produce features that are increasingly meaningful/useful for this type of analysis.

LESSWRONG
LW

Posts

Wikitag Contributions

Comments