Epistemic status: This is a rough-draft write-up about a thought experiment I did. Reasonably confident about the broad arguments being made here. That said, I haven't spent a lot of time rigorously polishing or reviewing my writing, so minor inaccuracies may be present. 

Interpretability Illusions from Max-Activating Examples

When interpreting an SAE feature, one common technique is to look at the max-activating examples and try to extrapolate a pattern from them. However, this approach has two flaws:

Premise 1: It can be hard to extrapolate the correct pattern. Correctly identifying a pattern relies on an accurate understanding of the semantics present in the text. Many hypotheses may be plausible given the data, and the actual pattern could be non-obvious. It seems easy to make a mistake when doing this. 

Premise 2: Max-activating examples may be anomalous. The most likely element of a distribution can look very different from the typical set. Conclusions drawn based on one (or a few) highly activating examples may turn out to be incorrect when evaluated against the majority of examples. 

In the following discussion, I outline a proposal to interpret SAE features using the singular value decomposition of SAE activation patterns, which I think neatly addresses both of these issues. 

Activation Pattern SVD

Suppose we have a set of $F$ SAE features, which we would like to interpret using a dataset of $D$ unique context windows. To do this, we compute the activation matrix $A \in \mathbb{R}^{D \times F}$, where $A_{ij}$ describes the activation of feature $j$ on (the last token of) context window $i$. We then compute the singular value decomposition $A = U \Sigma V^\top$ and take the top $k$ components, giving the truncated factorization $A \approx U_k \Sigma_k V_k^\top$. (Assume that we have a good way of choosing $k$, e.g. by looking for an "elbow point" in a reconstruction loss curve.)
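
To make the recipe concrete, here is a minimal sketch of the decomposition step in Python/NumPy. The activation matrix, the dataset sizes, and the elbow heuristic for choosing $k$ are all stand-ins made up for illustration; in practice $A$ would be collected by running the SAE over the dataset.

```python
import numpy as np

# Toy stand-in for the real activation matrix: D context windows x F SAE features.
# In practice A[i, j] would be feature j's activation on (the last token of) context i;
# here it is just random non-negative data.
rng = np.random.default_rng(0)
D, F = 2_000, 512                                   # hypothetical dataset / SAE sizes
A = np.maximum(rng.standard_normal((D, F)), 0.0)    # SAE activations are non-negative

# Thin SVD, then truncate to the top-k components.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Crude "elbow point" heuristic: keep enough components to explain most of the energy.
residual = 1.0 - np.cumsum(S**2) / np.sum(S**2)
k = int(np.argmax(residual < 0.05)) + 1

U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]         # A ≈ U_k @ np.diag(S_k) @ Vt_k
```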

Remark 3: The SVD defines activation prototypes. Note that $A \approx U_k \Sigma_k V_k^\top$; each column of $U_k$ describes a "prototypical" activation pattern of a general SAE feature over all $D$ context windows in the dataset. 

Remark 4: The activation patterns are linear combinations of prototypes. Define the coefficient matrix $C = \Sigma_k V_k^\top \in \mathbb{R}^{k \times F}$. Each column of $C$ contains the coefficients used to reconstruct the activation pattern of a given SAE feature as a linear combination of prototypes: for feature $j$, $A_{:,j} \approx U_k C_{:,j}$. 
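
Continuing the sketch above (same made-up names), Remarks 3 and 4 correspond to just a few lines:

```python
# Columns of U_k are the k "prototypical" activation patterns over the D context windows.
prototypes = U_k                        # shape (D, k)

# Coefficient matrix C = Sigma_k V_k^T: column j holds feature j's coefficients.
C = S_k[:, None] * Vt_k                 # shape (k, F)

# Feature j's activation pattern is approximately a linear combination of the prototypes.
j = 123
reconstruction = prototypes @ C[:, j]   # shape (D,)
rel_error = np.linalg.norm(reconstruction - A[:, j]) / np.linalg.norm(A[:, j])
print(f"Relative reconstruction error for feature {j}: {rel_error:.3f}")
```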

Conjecture 5: $C$ is an isometric embedding. The Euclidean distance between columns $C_{:,i}$ and $C_{:,j}$ is an approximation of the Jaccard distance between the activation patterns of features $i$ and $j$. 

(Note: I am somewhat less certain about Conjecture 5 than the preceding remarks. Well-reasoned arguments for or against are appreciated.) 
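
One half of the conjecture does seem straightforward under the definitions above: since $U_k$ has orthonormal columns, mapping coefficient vectors to (rank-$k$ approximated) activation patterns preserves Euclidean distances,

$$\lVert C_{:,i} - C_{:,j} \rVert_2 = \lVert U_k C_{:,i} - U_k C_{:,j} \rVert_2 \approx \lVert A_{:,i} - A_{:,j} \rVert_2 .$$

The substantive remaining claim is that Euclidean distance between activation patterns is a good proxy for the Jaccard distance between their supports (the sets of contexts on which the two features fire).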

How does SVD help? 

Now, let's return to the problems discussed above. I will outline how this proposal solves both of them. 

Re: hardness. Here, we have reduced the problem of interpreting $F$ activation patterns to interpreting $k \ll F$ prototypes, which we expect to be more tractable. This may also resolve issues with feature splitting. A counterpoint here is that we expect a prototype to be "broader" (e.g. activating on "sycophantic" or "hallucinatory" inputs in many contexts), and hence less interpretable, than an SAE feature, which is often highly context-specific (e.g. activating only on text where the Golden Gate Bridge is mentioned). 

Re: anomalousness. Since we started out with full information about the SAE feature's activations on every input in the dataset, rather than just the top few examples, we expect this problem to be largely resolved. 

Concrete experiments

Some specific experiments that could be run to validate the ideas here: 

  1. Cf. Remark 3, look at the "prototypical" activation patterns and see whether they're more interpretable than typical SAE features. 
  2. Cf. Conjecture 5, compute the coefficient matrix $C$ and the pairwise Euclidean distances between its columns, then correlate these with the ground-truth Jaccard distances between activation patterns (a sketch follows below). 
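
A minimal sketch of experiment 2, reusing the toy matrices from the earlier snippets; the binarization threshold and the use of Spearman correlation are my own assumptions.

```python
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Ground-truth Jaccard distances between activation patterns, treating each
# feature's pattern as the set of context windows on which it fires (activation > 0).
supports = (A > 0).T                          # shape (F, D), boolean
jaccard_d = pdist(supports, metric="jaccard")

# Euclidean distances between the corresponding columns of the coefficient matrix C.
euclid_d = pdist(C.T, metric="euclidean")

# Conjecture 5 predicts a strong (rank) correlation between the two.
rho, _ = spearmanr(jaccard_d, euclid_d)
print(f"Spearman correlation between Jaccard and Euclidean distances: {rho:.3f}")
```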

Conclusion

The proposal outlined here is conceptually simple but pretty computationally intensive, and I'm unsure whether it's principled. Nonetheless, it seems like something somebody should try. Feedback is greatly appreciated!
