‘Feature’ is overloaded terminology

In the interpretability literature, it’s common to overload ‘feature’ to mean three separate things:

  1. Some distinctive, relevant attribute of the input data. For example, being in English is a feature of this text.

  2. A general activity pattern across units which a network uses to represent some feature in the first sense. So we might say that a network represents the feature (1) that the text is in English with a linear feature (2) in layer 32.

  3. The elements of some process for discovering features, like an SAE. An SAE learns a dictionary of activation vectors, which we hope correspond to the features (2) that the network actually uses. It is common to simply refer to the elements of the SAE dictionary as ‘features’ as well. For example, we might say something like ‘the network represents features of the input with linear features in its representation space, which are recovered well by feature 3242 in our SAE’. (See the sketch just after this list for what this dictionary looks like concretely.)
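
To make sense (3) concrete, the sketch below shows the kind of object an SAE learns, in the standard encoder/decoder form from the SAE literature; the class and parameter names (`SparseAutoencoder`, `W_dec`, `n_latents`) are illustrative rather than taken from any particular codebase. The rows of the decoder matrix are the dictionary elements - the ‘SAE latents’ - and a latent is ‘active’ on an input when its entry in the sparse code is non-zero.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch; names and architecture details are illustrative."""

    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, n_latents) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_latents))
        # Each row of W_dec is one dictionary element: an 'SAE latent' in
        # sense (3), which we *hope* matches a feature in sense (2).
        self.W_dec = nn.Parameter(torch.randn(n_latents, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, acts: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # Sparse code: mostly zero after the ReLU; non-zero entries say
        # which latents are active on this input.
        latents = torch.relu((acts - self.b_dec) @ self.W_enc + self.b_enc)
        # Reconstruction: a sparse non-negative combination of dictionary rows.
        recon = latents @ self.W_dec + self.b_dec
        return latents, recon
```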

This seems bad: at best it’s sloppy and confusing, and at worst it begs the question about the interpretability or usefulness of SAE features. These three things need not coincide, so it seems worth carefully distinguishing between them by giving them different names. The terminology we prefer is to reserve ‘feature’ for the conceptual senses of the word (1 and 2) and to use alternative terminology for case 3, such as ‘SAE latent’ instead of ‘SAE feature’. So we might say, for example, that a model uses a linear representation for a Golden Gate Bridge feature, which is recovered well by an SAE latent. We have tried to follow this terminology in our recent Gemma Scope report, and we think the added precision helps in thinking about feature representations and SAEs more clearly.

To illustrate with an example, we might ask whether a network has a feature for numbers - i.e., whether it has any kind of localized representation of numbers at all. We can then ask what the format of this feature representation is: for example, how many dimensions does it use, where is it located in the model, and does it have any kind of geometric structure? We could then separately ask whether it is discovered by a feature discovery algorithm - i.e., is there a latent in a particular SAE that describes it well? These are all distinct questions, and we should use terminology that is able to distinguish between them.
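To show what answering that third question might look like in practice, here is a hedged sketch (our illustration, not a method from the Gemma Scope report): given a feature direction estimated some other way, e.g. with a linear probe, we can check whether any decoder row of an SAE points in a similar direction. All names below are hypothetical.

```python
import torch

def latent_feature_alignment(W_dec: torch.Tensor, probe_dir: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between a known feature direction (e.g. from a
    linear probe) and every row of an SAE decoder matrix W_dec, of shape
    (n_latents, d_model). A high maximum suggests some latent recovers the
    feature; this is a crude check, not a full evaluation protocol."""
    dec_unit = W_dec / W_dec.norm(dim=-1, keepdim=True)
    probe_unit = probe_dir / probe_dir.norm()
    return dec_unit @ probe_unit  # shape (n_latents,)

# Hypothetical usage: does any latent line up with a 'numbers' probe direction?
# sims = latent_feature_alignment(sae.W_dec, numbers_probe_direction)
# print(sims.argmax(), sims.max())
```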

We (the DeepMind language model interpretability team) started using this terminology in the Gemma Scope report, but we didn’t really justify the decision much there, so I thought it was worth making the argument separately.

[-]leogao

I like the word "atom" to refer to units inside an SAE

But they're not atomic! See e.g. the phenomenon of feature splitting, and the fact that UMAP finds structure between semantically similar features.

(In fairness, atoms are also not very atomic)

[-]leogao

Extremely valid, you've convinced me that atom is probably a bad term for this

I think atom would be a pretty good choice as well, but I think the actual choice of terminology is less important than making the distinction. I used 'latent' here because that's what we used in our paper.

Thanks for writing this up, Lewis! I'm very happy with this change; I think the term "SAE feature" is kinda sloppy and anti-conducive to clear thinking, and I hope the rest of the field adopts this too.

This is the clearest explanation I have seen on this topic, thank you!