Epistemic status: This is a rough-draft write-up about a thought experiment I did. Reasonably confident about the broad arguments being made here. That said, I haven't spent a lot of time rigorously polishing or reviewing my writing, so minor inaccuracies may be present.
Interpretability Illusions from Max-Activating Examples
When interpreting an SAE feature, one common technique is to look at the max-activating examples, and try to extrapolate a pattern from there. However, this approach has two flaws, namely:
Premise 1: It can be hard to extrapolate the correct pattern. Correctly identifying a pattern relies on an accurate understanding of the semantics present in the text. Many hypotheses may be consistent with the data, and the actual truth could be non-obvious. It seems easy to make a mistake when doing this.
Premise 2: Max-activating examples may be anomalous. The most likely element of a distribution can look very different from the typical set. Conclusions drawn from one (or a few) highly activating examples may turn out to be incorrect when evaluated against the majority of examples.
In the following discussion, I outline a proposal to interpret SAE features using the singular value decomposition of SAE activation patterns, which I think neatly addresses both of these issues.
Activation Pattern SVD
Suppose we have a set of $M$ SAE features, which we would like to interpret using a dataset of $N$ unique context windows. To do this, we compute the activation matrix $A \in \mathbb{R}^{N \times M}$, where $A_{ij}$ is the activation of feature $j$ on (the last token of) context window $i$. We then compute the singular value decomposition $A = U \Sigma V^\top$ and keep the top $k$ singular values and vectors. (Assume that we have a good way of choosing $k$, e.g. by looking for an "elbow point" in a reconstruction-loss curve.)
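For concreteness, here is a minimal sketch of this computation in numpy. Everything below is illustrative: the random matrix `A` is just a sparse, non-negative stand-in for real SAE activations, and `k = 20` is an arbitrary placeholder rather than a principled choice.

```python
import numpy as np

# Stand-in for real SAE activations: a sparse, non-negative N x M matrix
# (N context windows, M features). Purely illustrative data.
rng = np.random.default_rng(0)
N, M = 1000, 512
A = np.maximum(rng.normal(size=(N, M)), 0)  # ReLU-like sparsity, as in an SAE

# Thin SVD, then keep the top-k singular triplets.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 20  # placeholder; ideally chosen from an elbow in the reconstruction-loss curve
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]
```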
Remark 3: The SVD defines activation prototypes. Note that $U \in \mathbb{R}^{N \times k}$; each column (a vector in $\mathbb{R}^N$) describes a "prototypical" activation pattern of a general SAE feature over all context windows in the dataset.
Remark 4: The activation patterns are linear combinations of prototypes. Define the coefficient matrix $C = \Sigma V^\top \in \mathbb{R}^{k \times M}$. Each column (a vector in $\mathbb{R}^k$) contains the coefficients used to reconstruct the activation pattern of a given SAE feature as a linear combination of prototypes, i.e. $A \approx U C$.
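Continuing the sketch above, computing $C$ and checking the rank-$k$ reconstruction takes a couple of lines:

```python
# Coefficient matrix C = Sigma V^T, shape (k, M); column j gives the
# prototype coefficients for SAE feature j, so that A ~= U_k @ C.
C = S_k[:, None] * Vt_k
A_hat = U_k @ C
rel_err = np.linalg.norm(A - A_hat) / np.linalg.norm(A)
print(f"relative reconstruction error at k={k}: {rel_err:.3f}")
```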
Conjecture 5: $C$ is an isometric embedding. The Euclidean distance between columns $C_i, C_j$ is an approximation of the Jaccard distance between the activation patterns of features $i, j$.
(Note: I am somewhat less certain about Conjecture 5 than the preceding remarks. Well-reasoned arguments for or against are appreciated.)
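One cheap empirical check, under the assumed setup from the sketches above, is to compare the two distances directly across all feature pairs using scipy. A strong rank correlation would be suggestive evidence for the conjecture, not a proof:

```python
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Euclidean distances between feature columns of C ...
euclid = pdist(C.T, metric="euclidean")
# ... vs. Jaccard distances between the sets of contexts each feature fires on.
fires = A > 0  # boolean activation patterns, one row per context window
jaccard = pdist(fires.T, metric="jaccard")

rho, _ = spearmanr(euclid, jaccard)
print(f"Spearman correlation between the two distances: {rho:.3f}")
```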
How does SVD help?
Now, let's return to the problems discussed above. I will outline how this proposal solves both of them.
Re: hardness. Here, we have reduced the problem of interpreting $M$ activation patterns to interpreting $k$ prototypes, which we expect to be more tractable. This may also resolve issues with feature splitting. A counterpoint here is that we expect a prototype to be "broader" (e.g. activating on "sycophantic" or "hallucinatory" inputs in many contexts), and hence less interpretable, than an SAE feature, which is often highly context-specific (e.g. activating only on text where the Golden Gate Bridge was mentioned).
Re: anomalousness. Since we started out with full information about each SAE feature's activations on every input in the dataset, we expect this problem to be largely resolved.
Concrete experiments
Some specific experiments which could be run to validate the ideas here: sweep $k$ and look for an elbow in the reconstruction-loss curve; manually inspect the top prototypes for interpretability; and empirically test Conjecture 5 (e.g. as sketched above). A rough sketch of the first of these follows.
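A minimal version of the elbow-point sweep, reusing the (illustrative) matrices from the earlier sketches:

```python
import matplotlib.pyplot as plt

# Sweep the truncation rank and record relative reconstruction error;
# an "elbow" in this curve is one heuristic for choosing k.
ranks = range(1, 101)
total = np.linalg.norm(A)
errors = [
    np.linalg.norm(A - (U[:, :r] * S[:r]) @ Vt[:r, :]) / total
    for r in ranks
]

plt.plot(list(ranks), errors)
plt.xlabel("k (truncation rank)")
plt.ylabel("relative reconstruction error")
plt.show()
```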
Conclusion
The proposal outlined here is conceptually simple but fairly computationally intensive, and I'm unsure whether it's principled. Nonetheless, it seems like something simple that somebody should try. Feedback is greatly appreciated!