Open Source Automated Interpretability for Sparse Autoencoder Features
Background
Sparse autoencoders recover a diversity of interpretable, monosemantic features, but present an intractable problem of scale to human labelers. We investigate different techniques for generating and scoring text explanations of SAE features.
Key Findings
* Open source models generate and evaluate text explanations of SAE features...
Hi! I know that this post is now almost 5 months old, but I feel like I need to ask some clarifying questions and point out a few things about your methodology that I don't completely understand or agree with.
How do you source the sentences used for the scoring method? Are they all from top activations? This is not explicitly mentioned in the methodology section, although in the footnote you do say you have 3 high activations and 3 low activations. Am I correct in understanding that there are no cases with no activations at all?
Are the sentences shown individually or in batches?
I'm not sure I understand the reasoning behind your simulation scoring method and its validity. You...