All of Eoin Farrell's Comments + Replies

Interesting, thanks for sharing! Are there specific existing ideas you think would be valuable for people to look at in the context of SAEs & language models, but that they are perhaps unaware of?

Thanks! 

One cheap and lazy approach is to see how many of your features have high cosine similarity with the features of an existing L1-trained SAE (e.g. "900 of the 2048 features detected by the L0-approx-trained model had cosine sim > 0.9 with one of the 2048 features detected by the L1-trained model").

I looked at the cosine sims between the L1-trained reference model and one of my SAEs presented above and found:

  • 2501 out of 24576 (10%) of the features detected by the L0-approx-trained model had cosine sim > 0.9 with one of the 24576 ...
faul_sname
The other baseline would be to compare one L1-trained SAE against another L1-trained SAE -- if you see a similar approximate "1/10 have cossim > 0.9, 1/3 have cossim > 0.8, 1/2 have cossim > 0.7" pattern, that's not definitive proof that both approaches find "the same kind of features" but it would strongly suggest that, at least to me.
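A minimal sketch of the comparison discussed above, assuming each SAE's features are available as the rows of a NumPy array (for example, its decoder directions). The function names and the toy data are hypothetical and not taken from either model in this thread.

```python
import numpy as np

def max_cosine_sims(features_a, features_b):
    """For each row of features_a, return its highest cosine similarity
    with any row of features_b. Both arrays have shape (n_features, d_model)."""
    a = features_a / np.linalg.norm(features_a, axis=1, keepdims=True)
    b = features_b / np.linalg.norm(features_b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)

def matched_fractions(features_a, features_b, thresholds=(0.9, 0.8, 0.7)):
    """Fraction of features in A whose best match in B exceeds each threshold."""
    best = max_cosine_sims(features_a, features_b)
    return {t: float((best > t).mean()) for t in thresholds}

# Toy data (an assumption for illustration): half of the candidate features
# are copied from the reference, the other half are random directions.
rng = np.random.default_rng(0)
reference = rng.normal(size=(8, 64))
candidate = reference.copy()
candidate[4:] = rng.normal(size=(4, 64))
print(matched_fractions(candidate, reference))
```

Running the same function on two independently trained L1 SAEs would give the L1-vs-L1 baseline. For dictionaries with tens of thousands of features the full similarity matrix gets large, so computing it in row chunks may be needed.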

Did you ever run just the L0-approx & sparsity-frequency penalty separately? It's unclear if you're getting better results because the L0 function is better or because there are fewer dead features.

 

Good point - this was also somewhat unclear to me. What I can say is that when I run with the L0-approx penalty only, without the sparsity-frequency penalty, I get lots of dead features (50% or more) and a substantially worse MSE (a factor of a few higher), similar to when I run with only an L1 penalty. When I run with the sparsity-freque...
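For reference, a rough sketch of how a dead-feature fraction like the one quoted above could be measured. It assumes a feature counts as dead if it never activates above a threshold on an evaluation batch; the function name, the `eps` threshold, and the array layout are assumptions, not details from the experiments described here.

```python
import numpy as np

def dead_feature_fraction(activations, eps=0.0):
    """activations: array of shape (n_samples, n_features) holding the SAE's
    hidden activations. A feature is treated as dead if it never exceeds eps."""
    ever_active = (activations > eps).any(axis=0)
    return float(1.0 - ever_active.mean())
```

The result depends on how many samples are evaluated: a rarely firing feature can look dead on a small batch.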