Lidor Banuel Dabbah

Regarding the point about low FT-LLC: I agree it's not immediately obvious that random directions should have a lower FT-LLC than "feature directions". It's something we have thought about, and in fact we weren't sure what to expect.

On the one hand, there is the model you propose, in which a random direction is a sum of separate circuits, so its FT-LLC is expected to be larger than that of a feature direction, since its computation depends on the union of the parameter sets of the features involved.

On the other hand, one can imagine other models too. For example, one could think of the computation of the layer activation as being composed of many different directions, each having its own circuit computing the thresholded feature corresponding to it. Then we get a "background noise" of influence from correlated features on the FT-LLC of any direction, random and non-random alike. In this case, I think we should expect features to have a higher FT-LLC, composed of the "background noise" plus the influence of the dedicated circuit.
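To make the contrast between the two models concrete, one can write their predictions schematically (the decomposition and the symbols below are illustrative labels, not quantities we directly estimate). Under the sum-of-circuits model, ignoring parameter sharing between circuits,

$$\hat{\lambda}_{\mathrm{FT}}(v_{\mathrm{rand}}) \approx \sum_{i \in S(v_{\mathrm{rand}})} \hat{\lambda}_{\mathrm{FT}}(f_i) > \hat{\lambda}_{\mathrm{FT}}(v_{\mathrm{feat}}),$$

where $S(v_{\mathrm{rand}})$ is the set of features with non-negligible weight in the random direction. Under the background-noise model,

$$\hat{\lambda}_{\mathrm{FT}}(v) \approx \lambda_{\mathrm{bg}} + \lambda_{\mathrm{circ}}(v), \qquad \lambda_{\mathrm{circ}}(v_{\mathrm{feat}}) > \lambda_{\mathrm{circ}}(v_{\mathrm{rand}}) \approx 0,$$

so the first model predicts a higher FT-LLC for random directions and the second predicts a higher FT-LLC for feature directions.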

These experiments were our attempt to get a sense of what actually happens in real neural networks, in order to ground our more basic research (both theoretical and empirical, on simpler models and scenarios). The results seem to give some evidence in favor of models like the "background noise" one, but more research is needed before settling this question. We do think these results inform the kinds of scenarios and questions we would like to model and answer in more basic settings.

Also, note that our experiments were on layer 9 of GPT-2-small, which is quite close to the end of the network, and it's reasonable to expect this to influence the FT-LLC values we see for features compared to random directions. We think it's interesting to see what happens for layers closer to the beginning of the network, and we are currently checking that.

Regarding the second point: now that we have finished the exploratory phase and have the evidence and grounding we wanted, we are indeed turning to more basic research. We are aware of the limitations of the numeric methods we use, and a major question for us was whether the numbers the method produces are informative enough to capture the kind of phenomena we ultimately care about, namely meaningful localized computations within the model. In this sense these results also strengthened our belief that this is indeed the case, though this is a single experiment, and we are conducting more to verify that the results generalize.

We agree that comparing the numeric FT-LLC estimate to a count of zero Hessian eigenvalues is a good idea; thank you for suggesting it. We will run that comparison and share the results.
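For concreteness, here is a minimal sketch of what such a comparison could look like on a toy loss. This is illustrative only: the toy loss, the tolerance `tol`, and the helper `count_zero_hessian_eigs` are all hypothetical choices, and a real comparison would use the restricted loss from the FT-LLC estimation rather than a dense toy Hessian.

```python
# Illustrative sketch only: count near-zero Hessian eigenvalues at a trained
# parameter point, the quantity one would compare against a numeric LLC
# estimate. The toy loss and the tolerance `tol` are hypothetical choices.
import torch

def count_zero_hessian_eigs(loss_fn, params, tol=1e-6):
    """Count eigenvalues of the loss Hessian at `params` with |eigenvalue| < tol."""
    H = torch.autograd.functional.hessian(loss_fn, params)
    eigs = torch.linalg.eigvalsh(H)  # the Hessian of a scalar loss is symmetric
    return int((eigs.abs() < tol).sum().item()), eigs

# Toy degenerate loss L(w) = (w0 * w1)^2: on the set {w1 = 0} the loss is flat
# in the w0 direction, so the Hessian there has exactly one zero eigenvalue.
loss = lambda w: (w[0] * w[1]) ** 2
w_star = torch.tensor([0.5, 0.0])

n_zero, eigs = count_zero_hessian_eigs(loss, w_star)
print(n_zero, eigs.tolist())  # -> 1 [0.0, 0.5]
```

On a real network the dense Hessian is intractable, so one would instead count near-zero eigenvalues via Hessian-vector products and an iterative eigensolver (e.g. Lanczos). Note also that zero-eigenvalue counting only sees the quadratic part of the degeneracy, while the LLC can be further reduced by higher-order flatness, so we would expect the two numbers to agree in direction rather than exactly.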

With respect to both points: in retrospect it would have been better to address them directly in the post. Thanks for the comment; we will add a reference to it for the benefit of future readers.