Limitations on the Interpretability of Learned Features from Sparse Dictionary Learning
Overview

When I began my attempt to replicate Anthropic's Towards Monosemanticity paper, I had high expectations for how interpretable the extracted features should be. If I judged a feature to be uninterpretable, I attributed this to either sub-optimal sparse autoencoder hyperparameters, a bug in my implementation, or possible shortcomings of...