Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions
Tl;dr: In this post we present the exploratory phase of a project that aims to study neural networks by applying static local learning coefficient (LLC) estimation to specific alterations of them. We introduce a new method, Feature Targeted (FT) LLC estimation, and study its ability to distinguish SAE-trained features...
Regarding the point about low FT-LLC: I agree it's not immediately obvious that random directions should have lower FT-LLC than "feature directions". It's something we have thought about, and in fact we weren't sure exactly what to expect.
On the one hand, there is the model you propose, in which random directions are sums of separate circuits, so their FT-LLC is expected to be larger than that of feature directions, because their computation depends on the union of the parameter sets of the features involved.
On the other hand one can imagine other models too, for example one could think of the computation of the layer activation as being composed of many different directions...