Lucius Bushnaq - LessWrong

AI notkilleveryoneism researcher, focused on interpretability.

Personal account, opinions are my own.

I have signed no contracts or agreements whose existence I cannot mention.

AI notkilleveryoneism researcher, focused on interpretability.

Personal account, opinions are my own.

I have signed no contracts or agreements whose existence I cannot mention.

My guess is this is obvious, but IMO it seems extremely unlikely to me that bee-experience is remotely as important to care about as cow experience.

I agree with this, but would strike the 'extremely'. I don't actually have gears level models for how some algorithms produce qualia. 'Something something, self modelling systems, strange loops' is not a gears level model. I mostly don't think a million neuron bee brain would be doing qualia, but I wouldn't say I'm extremely confident.

Consequently, I don't think people who say bees are likely to be conscious are so incredibly obviously making a mistake that we have to go looking for some signalling explanation for them producing those words.

But there's no reason to think that the model is actually using a sparse set of components /features on any given forward pass.

I contest this. If a model wants to implement more computations (for example, logic gates) in a layer than that layer has neurons, the known methods for doing this rely on few computations being used (that is, receiving a non-baseline input) on any given forward pass.

I'd have to think about the exact setup here to make sure there's no weird caveats, but my first thought is that for , this ought to be one component per bigram, firing exclusively for that bigram.

An intuition pump: Imagine the case of two scalar features being embedded along vectors $f_{1}, f_{2}$ . If you consider a series that starts with $f_{1}, f_{2}$ being orthogonal, then gives them ever higher cosine similarity, I'd expect the network to have ever more trouble learning to read out $c_{1}$ , $c_{2}$ , until we hit $f_{1} = f_{2}$ , at which point the network definitely cannot learn to read the features out at all. I don't know how the learning difficulty behaves over this series exactly, but it sure seems to me like it ought to go up monotonically at least.

Another intuition pump: The higher the cosine similarity between the features, the larger the norm of the rows of $V^{- 1}$ will be, with norm infinity in the limit of cosine similarity going to one.

I agree that at cosine similarity $O (\frac{1}{\sqrt{1000}})$ , it's very unlikely to be a big deal yet.

Sure, yes, that's right. But I still wouldn't take this to be equivalent to our literally being orthogonal, because the trained network itself might not perfectly learn this transformation.

What do you mean by "a global linear transformation" as in what kinds of linear transformations are there other than this? If we have an MLP consisting of multiple computations going on in superposition (your sense) I would hope that the W_in would be decomposed into co-activating subcomponents corresponding to features being read into computations, and the W_out would also be decomposed into co-activating subcomponents corresponding to the outputs of those computations being read back into the residual stream. The fact that this doesn't happen tells me something is wrong.

Linear transformations that are the sum of weights for different circuits in superposition, for example.

What I am trying to say is that I expect networks to implement computation in superposition by linearly adding many different subcomponents to create W_in, but I mostly do not expect networks to create W_out by linearly adding many different subcomponents that each read-out a particular circuit output back into the residual stream, because that's actually an incredibly noisy operation. I made this mistake at first as well. This post still has a faulty construction for W_out because of my error. Linda Linsefors finally corrected me on this a couple months ago.

As to the issue with the maximum number of components: it seems to me like if you have five sparse features (in something like the SAE sense) in superposition and you apply a rotation (or reflection, or identity transformation) then the important information would be contained in a set of five rank 1 transformations, basically a set of maps from A to B. This doesn't happen for the identity, does it happen for a rotation or reflection?

I disagree that if all we're doing is applying a linear transformation to the entire space of superposed features, rather than, say, performing different computations on the five different features, that it would be desirable to split this linear transformation into the five features.

Finally, as to "introducing noise" by doing things other than a global linear transformation, where have you seen evidence for this? On synthetic (and thus clean) datasets, or actually in real datasets? In real scenarios, your model will (I strongly believe) be set up such that the "noise" between interfering features is actually helpful for model performance, since the world has lots of structure which can be captured in the particular permutation in which you embed your overcomplete feature set into a lower dimensional space.

Uh, I think this would be a longer discussion than I feel up for at the moment, but I disagree with your prediction. I agree that the representational geometry in the model will be important and that it will be set up to help the model, but interference of circuits in superposition cannot be arranged to be helpful in full generality. If it were, I would take that as pretty strong evidence that whatever is going on in the model is not well-described by the framework of superposition at all.

If you have 100 orthogonal linear probes to read with, yes. But since there’s only 50 neurons, the actual circuits for different input features in the network will have interference to deal with.

My understanding is that SPD cannot decompose an matrix into more than $max (n, m)$ subcomponents, and if all subcomponents are "live" i.e. active on a decent fraction of the inputs, then it will have to have $max (n, m)$ components to work

SPD can decompose an $n \times m$ matrix into more than $max (n, m)$ subcomponents.

I guess there aren't any toy models in this paper that directly showcase this, but I'm pretty confident it's true, because

I don't see why it wouldn't be able to.
I've decomposed a weight matrix in a tiny LLM and got out way more than $max (n, m)$ live subcomponents. That's a very preliminary result though, you probably shouldn't put that much stock in it.

Edit: as you pointed out, this might only apply when there's not a nonlinearity after the weight. But every $W_{o u t}$ in a transformer has a connection running from it directly to the output logits through $W_{u n e m b e d}$ . So SPD will struggle to interpret any of the output weights of transformer MLPs. This seems bad.

I think it's the other way around. If you try to implement computation in superposition in a network with a residual stream, you will find that about the best thing you can do with the $W_{out}$ is often to just use it as a global linear transformation. Most other things you might try to do with it drastically increases noise for not much pay-off. In the cases where networks are doing that, I would want SPD to show us this global linear transform.

But $W_{i n}$ is reading those vectors off a 1000-dimensional vector space where there's no interference between features.

They're embedded randomly in the space, so there is interference between them in the sense of them having non-zero inner products.

Thanks to CCi(p)nCiS, we know that the toy model is not even doing computation in superposition, which is the case which SPD seems to be based on. It's actually doing something really weird with the "noise", which doesn't actually behave well.

Yes. I agree that this makes the model not as great a testbed as we originally hoped.

Since you're working on just one weight at a time, linear transformations are the only case to consider. So all you can do is either find an exact basis or find the whole linear transformation.

No, that's not how it works

Networks have non-linearities. SPD will decompose you a matrix into a single linear transformation if what the network is doing with that matrix really is just applying one global linear transformation. If e.g. there are non-linearities right after the matrix that aren't just always switched on, SPD will usually decompose the matrix into many sub-components.^[1]
I'm not sure what you mean by 'working on just one weight at a time'. The stochastic-layerwise reconstruction loss does do forward passes replacing only one matrix in the network at a time with a randomly ablated version of the same matrix. But the stochastic reconstruction loss does forward passes replacing all matrices at once.
But I think I must be misunderstanding what you mean here, because even if we didn't have the stochastic reconstruction loss I don't see how that would matter for this.

In fact, if we have superposition, I would expect the relevant components of the model to sum to more than the weights of the model. This is kind of just what superposition means, the same weights are being used for multiple computations at once.

That's not how it works in our existing framework for circuits in superposition. The weights for particular circuits there actually literally do sum to the weights of the whole network. I've been unable to come up with a general framework that doesn't exhibit this weight linearity.

This is kind of just what superposition means, the same weights are being used for multiple computations at once.

I wouldn't say that? Computation in superposition inevitably involves different circuits interfering with each other, because the weights of one circuit have non-zero inner product with the activations of another. But there is still a particular set of vectors in parameter space such that each vector implements one circuit.

Superposition can give you an overcomplete basis of variables in activation space, but it cannot give you an overcomplete basis of circuits acting on these variables in parameter space. There can't be more circuits than weights.

^{^}
Well, depending on what the network is actually computing with these non-linearities, of course. If it's not computing many different things, or not using the results of many of the computations for anything downstream, SPD probably won't find many components that ever activate.

@Eliezer Yudkowsky If Large Language Models were confirmed to implement computation in superposition [1,2,3], rather than just representation in superposition, would you resolve this market as yes?

Representation in superposition would not have been a novel idea to computer scientists in 2006. Johnson-Lindenstrauss is old. But there's nothing I can think of from back then that'd let you do computation in superposition, linearly embedding a large number of algorithms efficiently on top each other in the same global vector space so they can all be pretty efficiently executed in parallel, without wasting a ton of storage and FLOP, so long as only a few algorithms do anything at any given moment.

To me at least, that does seem like a new piece of the puzzle for how minds can be set up to easily learn lots of very different operations and transformations that all apply to representations living in the same global workspace.

LESSWRONG
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments