User Comment Replies

Deep sparse autoencoders yield interpretable features too

This is great. I'm a bit surprised you get such a big performance improvement from adding additional sparse layers; all of my experiments above have been adding non-sparse layers, but it looks like the MSE benefit you're getting with added sparse layers is in the same ballpark. You have certainly convinced me to try Muon.

Another approach that I've (very recently) found quite effective in reducing the number of dead neurons with minimal MSE hit is adding a small penalty term on the standard deviation of the encoder pre-activations (i.e., before the top-k) mea...

Deep sparse autoencoders yield interpretable features too

Armaan A. Abraham 2mo 2 0

I will look into these optimizers, thank you for the tip!

I was aware that dead neurons get excluded from the auto-interpretability pipeline. My comment about dead neurons affecting the score was more about the effective reduction in sparse dimension due to the dead neurons being an issue for the neurons that are alive.

I would be very interested in any progress you make related to more granular automated interpretability information. Do you currently have any ideas for what this might look like? I've given it a tiny bit of thought, but haven't gotten very far.

3 luciaquirke 2mo

I tried stacking top-k layers ResNet-style on MLP 4 of TinyStories-8M and it worked nicely with Muon: fraction of variance unexplained dropped by 84% when going from 1 to 5 layers (similar gains to 5xing width and k), but the number of dead neurons still grew with the number of layers. However, dropping the learning rate a bit from the preset value seemed to reduce them significantly without loss in performance, to around 3% (not pictured).

Still ideating, but I have a few ideas for improving the information-add of Delphi:

* For feature explanation scoring it seems important to present a mixture of activating and semantically similar non-activating examples to the explainer and to the activation classifier, rather than a mixture of activating and random (probably very dissimilar) examples. We're introducing a few ways to do this, e.g. using the neighbors option to generate the non-activating examples. I suspect a lot of token-in-context features are being incorrectly explained as token features when we use random non-activating examples.
* I'm interested in weighting feature interpretability scores by their firing rate, to avoid incentivizing sneaking through a lot of superposition in a small number of latents (especially for things like matryoshka SAEs where not all latents are trained with the same loss function).
* I'm interested in providing the "true" and unbalanced accuracy given the feature firing rates, perhaps after calibrating the explainer model to use that information.
* I think it would be cool to log the % of features with perfect interpretability scores, or another metric that pings features which sneak through polysemanticity at low activations.
* Maybe measuring agreement between explanation generations on different activation quantiles would be interesting? Like if a high quantile is best interpreted as "dogs at the park" and a low quantile just "dogs" we could capture that.
  * Like a measure of specificity drop-off

https://github.com/EleutherAI
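To make the second and third ideas concrete, here is a minimal sketch in plain Python. It is not Delphi's API: the function names, the example scores and the firing rates are all hypothetical, and it assumes each latent already has a scalar interpretability score plus a true-positive and true-negative rate from the activation classifier. The first function takes a firing-rate-weighted mean of the scores; the second re-weights the classifier's per-class rates by the latent's base rate instead of a balanced 50/50 mix.

```python
def firing_rate_weighted_score(scores, firing_rates):
    """Mean interpretability score, weighted by each latent's firing rate."""
    total = sum(firing_rates)
    if total == 0:
        raise ValueError("all firing rates are zero")
    return sum(s * r for s, r in zip(scores, firing_rates)) / total


def unbalanced_accuracy(tpr, tnr, firing_rate):
    """Accuracy expected at the latent's real base rate.

    tpr: fraction of activating examples classified correctly
    tnr: fraction of non-activating examples classified correctly
    firing_rate: fraction of tokens on which the latent fires
    """
    return firing_rate * tpr + (1.0 - firing_rate) * tnr


# Hypothetical numbers, for illustration only.
scores = [0.9, 0.6]          # per-latent interpretability scores
firing_rates = [0.001, 0.1]  # per-latent firing rates
weighted = firing_rate_weighted_score(scores, firing_rates)

balanced_acc = (0.8 + 0.99) / 2  # what a 50/50 evaluation mix reports
true_acc = unbalanced_accuracy(tpr=0.8, tnr=0.99, firing_rate=0.01)
```

With these made-up numbers the weighted mean is pulled toward the frequently firing latent, and the base-rate accuracy is dominated by the true-negative rate, which is one reason a rarely firing latent can look better or worse than its balanced score suggests.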

Deep sparse autoencoders yield interpretable features too

Armaan A. Abraham 2mo 4 0

Ah, I was unaware of that paper and it is indeed relevant to this, thank you! Yes, by "dense" or "non-sparse" layer, I mean a nonlinearity. So, that paper's MLP SAE is similar to what I do here, except it is missing MLPs in the decoder. Early on, I experimented with such an architecture with encoder-only MLPs, because (1) as to your final point, the lack of nonlinearity in the output potentially helps it fit into other analyses and (2) it seemed much more likely to me to exhibit monosemantic features than an SAE with MLPs in the decoder too. But, after see...

4 Logan Riggs 2mo

I agree. There is a tradeoff here between the L0/MSE curve & circuit-simplicity. I guess another problem (w/ SAEs in general) is that optimizing for L0 leads to feature absorption. However, I'm unsure of a metric (other than the L0/MSE) that does capture what we want.
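For reference, one point on an L0/MSE curve can be computed roughly as follows. This is a minimal sketch in plain Python with hypothetical names and toy data; it assumes L0 means the average number of nonzero latents per input and MSE is averaged over every element of the reconstruction.

```python
def l0_and_mse(latents, inputs, reconstructions):
    """Return (mean nonzero latents per example, mean squared reconstruction error)."""
    # L0: count the active (nonzero) latents in each example, then average.
    l0 = sum(sum(1 for a in row if a != 0) for row in latents) / len(latents)
    # MSE: squared error over every element of every example.
    sq_err = [
        (x - y) ** 2
        for x_row, y_row in zip(inputs, reconstructions)
        for x, y in zip(x_row, y_row)
    ]
    mse = sum(sq_err) / len(sq_err)
    return l0, mse


# Toy data, for illustration only: 2 examples, 4 latents, 2 input dimensions.
latents = [[0.0, 1.5, 0.0, 2.0], [0.0, 0.0, 0.3, 0.0]]
inputs = [[1.0, 2.0], [3.0, 4.0]]
reconstructions = [[1.0, 1.0], [3.0, 2.0]]
l0, mse = l0_and_mse(latents, inputs, reconstructions)
```

Sweeping the sparsity setting (for example k in a top-k SAE) and recording this pair at each setting traces out the curve; neither number says anything about feature absorption, which is the gap noted above.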

All of Armaan A. Abraham's Comments + Replies