lewis smith

I guess you are thinking about holes in a p-type semiconductor?

I don't think I agree (perhaps obviously) that it's better to think about the issues in the post in terms of physics analogies than in terms of the philosophy of mind and language. If you are thinking about how a mental representation represents some linguistic concept, then Dennett and Wittgenstein (and others!) are addressing the same problem as you, in a way that virtual particles really are not.

I think 'atom' would be a pretty good choice as well, but the actual choice of terminology is less important than making the distinction. I used 'latent' here because that's what we used in our paper.

‘Feature’ is overloaded terminology

In the interpretability literature, it’s common to overload ‘feature’ to mean three separate things:

  1. Some distinctive, relevant attribute of the input data. For example, being in English is a feature of this text.

  2. A general activity pattern across units which a network uses to represent some feature in the first sense. So we might say that a network represents the feature (1) that the text is in English with a linear feature (2) in layer 32.

  3. The elements of some process for discovering features, like an SAE. An SAE learns a dictionary of activation vectors, which we hope correspond to the features (2) that the network actually uses. It is common to simply refer to the elements of the SAE dictionary as ‘features’ as well. For example, we might say something like ‘the network represents features of the input with linear features in its representation space, which are recovered well by feature 3242 in our SAE.’

This seems bad: at best it’s a bit sloppy and confusing, and at worst it begs the question about the interpretability or usefulness of SAE features. It seems important to distinguish carefully between these concepts in case they don’t coincide, and we think it’s worth giving them different names. The terminology we prefer is to reserve ‘feature’ for the conceptual senses of the word (1 and 2) and to use alternative terminology for case 3, like ‘SAE latent’ instead of ‘SAE feature’. So we might say, for example, that a model uses a linear representation for a Golden Gate Bridge feature, which is recovered well by an SAE latent. We have tried to follow this terminology in our recent Gemma Scope report, and we think the added precision helps us think about feature representations, and about SAEs, more clearly.

To illustrate with an example, we might ask whether a network has a feature for numbers, i.e. whether it has any kind of localised representation of numbers at all. We can then ask what the format of this feature representation is: how many dimensions does it use, where is it located in the model, does it have any geometric structure, and so on. We can then separately ask whether it is discovered by a feature discovery algorithm, i.e. is there a latent in a particular SAE that describes it well? These are all distinct questions, and we want a terminology that is able to distinguish between them.
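To make the distinction concrete, here is a small, purely illustrative sketch: random vectors stand in for real model activations and a real SAE, and the names (`feature_direction`, `sae_decoder`) are mine rather than anything from Gemma Scope. The point is just that ‘does the model have this feature?’ (senses 1 and 2) and ‘does some SAE latent recover it?’ (sense 3) are separate questions.

```python
import numpy as np

# Purely illustrative: random stand-ins for a model's representation space and
# an SAE dictionary. Names and numbers here are made up, not Gemma Scope code.
d_model, n_latents = 16, 64
rng = np.random.default_rng(0)

# Sense (2): a direction the model might use to represent some input feature
# (sense 1), e.g. 'this text is in English'.
feature_direction = rng.normal(size=d_model)
feature_direction /= np.linalg.norm(feature_direction)

# Sense (3): the SAE's learned dictionary. Its columns are what we call
# *latents*; whether they line up with the model's feature directions is an
# empirical question, not something to assume.
sae_decoder = rng.normal(size=(d_model, n_latents))
sae_decoder /= np.linalg.norm(sae_decoder, axis=0, keepdims=True)

# 'Is this feature recovered well by some latent?' can then be asked directly,
# for example by comparing directions.
cosines = sae_decoder.T @ feature_direction
best_latent = int(np.argmax(np.abs(cosines)))
print(f"closest latent is {best_latent}, |cosine| = {abs(cosines[best_latent]):.2f}")
```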

We (the DeepMind language model interpretability team) started using this terminology in the Gemma Scope report, but we didn't really justify the decision much there, so I thought it was worth making the argument separately.

I'm not that confident about the statistical query dimension (I assume that's what you mean by SQ dimension?), but I don't think it's applicable: SQ dimension is about the difficulty of a task (e.g. binary parity), whereas explicit vs. tacit representations are properties of an implementation, so it's kind of apples to oranges.

To take the chess example again, one way to rank moves is to explicitly compute some kind of rule or heuristic from the board state, another is to do some kind of parallel search, and yet another is to use a neural network or something similar. The first one is explicit, the second is (maybe?) more tacit, and the last is unclear. I think stronger variations of the LRH kind of assume that the neural network must be 'secretly' explicit, but I'm not really sure this is necessary.

But I don't think any of this is really affected by the SQ dimension, because it's the same task in all three cases (and we could possibly come up with examples which had identical performance?).

But maybe I'm not quite understanding what you mean.

I'm glad you liked it.

I definitely agree that the LRH and the interpretability of the linear features are separate hypotheses; that was what I was trying to get at by having monosemanticity as a separate assumption from the LRH. I think these are logically independent: there could be some explicit representation such that everything corresponds to an interpretable feature, but in a format more complicated than linear (i.e. monosemanticity is true but the LRH is false); or, as you say, the network could in some sense be mostly manipulating features, but those features could be very hard to understand (LRH true, monosemanticity false); or they could both just be the wrong frame. I definitely think it would be good if we spent a bit more effort clarifying these distinctions; I hope this essay made some progress in that direction, but I don't think it's the last word on the subject.

I agree that coming up with experiments which would test the LRH in isolation is difficult. But maybe this should be more of a research priority; we ought to be able to formulate a version of the strong LRH which makes strong empirical predictions. I think something along the lines of https://arxiv.org/abs/2403.19647 is maybe going in the right direction here. In a shameless self-plug, I hope that LMI's recent work on open-sourcing a massive SAE suite (Gemma Scope) will let people test out this sort of thing.

Having said that, one reason I'm a bit pessimistic is that stronger versions of the LRH do seem to predict that there is some set of 'ground truth' features that a wide enough or well-tuned enough SAE ought to converge to (perhaps there should be some 'phase change' in the scaling graphs as you sweep the hyperparameters), but AFAIK we have been unable to find any evidence for this, even in toy models.

I don't want to overstate this point, though; I think part of the reason for the excitement around SAEs is that this was genuinely quite good science: the Toy Models paper proposed some theoretical reasons to expect linear representations in superposition, which implied that something like SAEs should recover interesting representations, and then SAEs were quite successful! (This is why I say in the post that I think there's a reasonable amount of evidence for at least the weak LRH.)

I'm not entirely sure I follow here; I am thinking of compositionality as a feature of the format of a representation (Chris Olah has a good note on this here: https://transformer-circuits.pub/2023/superposition-composition/index.html). I think whether we should expect one kind of representation or another is an interesting question, but ultimately an empirical one: there are some theoretical arguments for linear representations (basically that it should be easy for NNs to make decisions based on them), but the biggest reason to believe in them is just that people have genuinely found lots of examples of linear mediators that seem quite robust (e.g. Golden Gate Claude, Neel's work on refusal directions).
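To be concrete about what I mean by 'linear mediator' evidence, here is a hedged sketch of the usual difference-of-means test plus a projection ablation, with synthetic arrays standing in for real residual-stream activations; it is a schematic of that style of experiment, not anyone's actual code.

```python
import numpy as np

# Synthetic stand-ins for residual-stream activations on prompts with and
# without some concept; in a real experiment these come from the model.
rng = np.random.default_rng(1)
d_model = 32
acts_with_concept = rng.normal(size=(100, d_model)) + 3.0 * np.eye(d_model)[0]
acts_without_concept = rng.normal(size=(100, d_model))

# Candidate linear mediator: the difference-of-means direction.
direction = acts_with_concept.mean(axis=0) - acts_without_concept.mean(axis=0)
direction /= np.linalg.norm(direction)

def ablate(acts, d):
    """Project the (unit-norm) direction d out of every activation vector."""
    return acts - np.outer(acts @ d, d)

# The interventional test: remove the direction and check whether the
# concept-linked quantity (here, just the projection itself) goes away.
ablated = ablate(acts_with_concept, direction)
print("mean projection before:", float((acts_with_concept @ direction).mean()))
print("mean projection after: ", float((ablated @ direction).mean()))
```

In the real experiments the interesting step is checking that this kind of intervention changes model behaviour (e.g. refusals), not just the projection, which is what makes the linear claim more than correlational.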

Maybe this is on us for not including enough detail in the post, but I'm pretty confident that you would lose your bet no matter how you operationalised it. We did compare ITO to using the encoder to pick features (using the top k) and then optimising the weights on those features at inference time, and to learning a post hoc scale to address the 'shrinkage' problem where the encoder systematically underweights features, and gradient pursuit consistently outperformed both of them. So I think that gradient pursuit doesn't just fiddle around with low weights; it also chooses features 'better'.

With respect to your threshold suggestion: the structure of the specific algorithm we used (gradient pursuit) means that if GP has selected a feature, it tends to assign it quite a high weight, so I don't think that would do much; SAE encoders tend to have many more features close to zero, because it's structurally hard for them to avoid this. I would almost turn your argument around: I think that low-activating features in a normal SAE are likely not to be particularly interesting or interpretable either, as the structure of an SAE makes it difficult to avoid features that activate spuriously because of interference.

One quirk of gradient pursuit that is a bit weird is that it will almost always choose a new feature which is orthogonal to the span of features selected so far, which does seem a little artificial.
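For concreteness, here is a from-memory sketch of the matching-pursuit family that gradient pursuit belongs to; it is not our exact implementation, and the dictionary here is just random, but it shows the two properties mentioned above: selection is driven by the residual, and coefficients on the selected support are updated jointly, which is why selected features tend to end up with substantial weights.

```python
import numpy as np

def gradient_pursuit(x, D, n_steps=10):
    """Greedy sparse inference of coefficients a with x ~ D @ a.

    D is a (d_model, n_latents) dictionary with unit-norm columns. This is a
    sketch of gradient pursuit in the Blumensath & Davies sense, written from
    memory; it is illustrative rather than the implementation from the snippet.
    """
    a = np.zeros(D.shape[1])
    support = set()
    for _ in range(n_steps):
        residual = x - D @ a
        grad = D.T @ residual                        # (negative) gradient of 0.5*||x - D a||^2
        support.add(int(np.argmax(np.abs(grad))))    # greedily grow the support
        idx = sorted(support)
        g = np.zeros_like(a)
        g[idx] = grad[idx]                           # restrict the update to the support
        Dg = D @ g
        step = (residual @ Dg) / (Dg @ Dg + 1e-12)   # exact line search along g
        a += step * g
    return a

# Toy usage: recover a 3-sparse code from a random unit-norm dictionary.
rng = np.random.default_rng(2)
D = rng.normal(size=(32, 128))
D /= np.linalg.norm(D, axis=0, keepdims=True)
true_a = np.zeros(128)
true_a[[3, 40, 99]] = [2.0, -1.5, 1.0]
x = D @ true_a
a_hat = gradient_pursuit(x, D, n_steps=15)
print("recovered support:", np.nonzero(np.abs(a_hat) > 0.1)[0])
```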

Whether the way it chooses features is actually better for interpretability is difficult to say. As we say in the post, we did manually inspect some examples and couldn't spot any obvious problems with the ITO decompositions, but we haven't done a properly systematic, double-blind comparison of ITO to encoder 'explanations' in terms of interpretability, because it's quite expensive for us in terms of time.

I think it's too early to say whether ITO is 'really' helping or not, but I am pretty confident it's worth more exploration, which is why we are spreading the word about this specific algorithm in this snippet (even though we didn't invent it). I think using GP at train time, getting rid of the SAE framework altogether, is also worth exploring, to be honest. But at the moment it's still quite hard to give sparse decompositions an 'interpretability score' which is objective and not too expensive to produce, so it's a bit difficult to see how we would evaluate something like this. (I think auto-interp could be a reasonable way of screening ideas like this once we are running it more easily.)

I think there is a fairly reasonable theoretical argument that non-SAE decompositions won't work well for superposition (because the NN can't actually be using an iterative algorithm to read features), but to be honest I don't think I've seen any empirical evidence that this is either true or false, and I don't think we should rule out that non-SAE methods would just work much better; they do work much better for almost every other sparse optimisation problem, AFAIK.

Yeah, I agree with everything you say; it's just that I was trying to remind myself of enough SLT to give a 'five minute pitch' for SLT to other people, and I didn't like the idea of hanging it off the ReLU.

I guess the intuition behind the hierarchical nature of the models leading to singularities is the permutation symmetry between the hidden channels, which is kind of an easy thing to understand.
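Spelled out in the standard two-layer toy case (this is the textbook example, nothing specific to the models discussed here):

```latex
% h hidden units with activation \sigma:
f_\theta(x) = \sum_{i=1}^{h} a_i \,\sigma(w_i^\top x).
% Permutation symmetry: swapping any two hidden units,
% (a_i, w_i) \leftrightarrow (a_j, w_j), leaves f_\theta unchanged.
% On the locus where two units coincide, w_i = w_j, only the sum a_i + a_j matters:
f_\theta(x) = (a_i + a_j)\,\sigma(w_i^\top x) + \sum_{k \neq i,j} a_k \,\sigma(w_k^\top x),
% so the direction (a_i, a_j) \mapsto (a_i + t, a_j - t) is exactly flat, the Fisher
% information degenerates there, and the model is singular in Watanabe's sense.
```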

I get and agree with your point about approximate equivalences, though I have to say that I think we should be careful! One reason I'm interested in SLT is that I spent a lot of time during my PhD on Bayesian approximations to NN posteriors. I think SLT is one reasonable explanation of why this never yielded great results, but I think hand-wavy intuitions about 'oh well, the posterior is probably sorta Gaussian' played a big role in the idea's longevity.

Yeah, it's not totally clear what this 'nearly singular' thing would mean. Intuitively, it might be that there's a kind of 'hidden singularity' in the space of this model that might affect the behaviour, like the singularity in a dynamical model with a phase transition, but I'm just guessing.
