TL;DR
Sparse autoencoders (SAEs) present a promising direction toward automating mechanistic interpretability, but they are not without flaws. One known issue with the original sparse autoencoders is the feature suppression effect, which is caused by the conflict between the L2 and L1 losses under the unit-norm constraint on the SAE decoder. In theory, this effect becomes more pronounced when the inputs have high norms. Another observation is that training SAEs on multiple layers simultaneously results in inconsistent L0 norms for feature activations across layers: in some layers L0 is on the order of 10^2, while in others it is on the order of 10^1. Moreover, the residual states that are fed into the SAEs for training also have different...
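To make the suppression mechanism concrete, below is a minimal sketch of that objective in PyTorch (the class and argument names, e.g. SparseAutoencoder and l1_coeff, are illustrative rather than taken from our training code): an L2 reconstruction term plus an L1 penalty on feature activations, with each decoder direction constrained to unit norm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """Standard SAE: L2 reconstruction + L1 sparsity, unit-norm decoder rows."""

    def __init__(self, d_model: int, d_hidden: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_hidden) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.W_dec = nn.Parameter(torch.randn(d_hidden, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        self.l1_coeff = l1_coeff

    def forward(self, x: torch.Tensor):
        # Each decoder row (one feature direction) is rescaled to unit norm,
        # so reconstructing a high-norm input requires large activations,
        # which the L1 term then penalizes -- the root of feature suppression.
        W_dec = self.W_dec / self.W_dec.norm(dim=-1, keepdim=True)
        acts = F.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        recon = acts @ W_dec + self.b_dec
        l2_loss = (recon - x).pow(2).sum(dim=-1).mean()          # reconstruction
        l1_loss = self.l1_coeff * acts.abs().sum(dim=-1).mean()  # sparsity
        return recon, acts, l2_loss + l1_loss
```

Because the L1 penalty scales with activation magnitude while the decoder norm is fixed, layers whose residual streams have larger norms feel more sparsity pressure for the same relative reconstruction quality. Rescaling each layer's input to a common norm before encoding, which is what the normalized_ runs are meant to capture, is one way to equalize this pressure and make L0 statistics comparable across layers.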
The additional experiment under Experiment-Performance Verification (Figure 11) compares normalized_1 and baseline_1 on layer 5, which have almost identical L0. The result showed no observable difference.