All of jake_mendel's Comments + Replies

Fair point. I guess I still want to say that there's a substantial amount of 'come up with new research agendas' (or like sub-agendas) to be done within each of your bullet points, but I agree the focus on getting trustworthy slightly superhuman AIs and then not needing control anymore makes things much better. I also do feel pretty nervous about some of those bullet points as paths to placing so much trust in your AI systems that you don't feel like you want to bother controlling/monitoring them anymore, and the ones that seem further towards giving me en... (read more)

If you are (1) worried about superintelligence-caused x-risk and (2) have short timelines to both TAI and ASI, it seems like the success or failure of control depends almost entirely on getting the early TAIs to do stuff like "coming up with research agendas"? Like, most people (in AIS) don't seem to think that unassisted humans are remotely on track to develop alignment techniques that work for very superintelligent AIs within the next 10 years — we don't really even have any good untested ideas for how to do that. Therefore if we have ver... (read more)

And also on optimism that people are not using these controlled AIs that can come up with new research agendas and new ideas to speed up ASI research just as much. 

Without some kind of pause agreement, you are only preventing the gap between alignment and ASI research from growing even larger, even faster, relative to the counterfactual in which capabilities researchers adopt AIs that 10x general science speed and alignment researchers don't. You are not actually closing the gap, i.e. making alignment research finish before ASI development when it counterfactually wouldn't have in a world where nobody used pre-ASI AIs to speed up any kind of research at all.

I don't really agree. The key thing is that I think an exit plan of trustworthy AIs capable enough to obsolete all humans working on safety (but which aren't superintelligent) is pretty promising. Yes, these AIs might need to think of novel breakthroughs and new ideas (though I'm also not totally confident in this or that this is the best route), but I don't think we need new research agendas to substantially increase the probability these non-superintelligent AIs are well aligned (e.g., don't conspire against us and pursue our interests in hard open ended... (read more)

[edit: I'm now thinking that actually the optimal probe vector is also orthogonal to  so maybe the point doesn't stand. In general, I think it is probably a mistake to talk about activation vectors as linear combinations of feature vectors, rather than as vectors that can be projected into a set of interpretable readoff directions. see here for more.] 

Yes, I'm calling the representation vector the same as the probing vector. Suppose my activation vector can be written as a = Σᵢ fᵢ vᵢ, where the fᵢ are feature values... (read more)

3Nina Panickssery
Makes sense - agreed!

A thought triggered by reading issue 3:

I agree issue 3 seems like a potential problem for methods that optimise too hard for sparsity, but it doesn't seem that directly related to the main thesis? At least in the example you give, it should be possible in principle to notice that the space can be factored as a direct sum without having to look at future layers (a toy sketch of what I mean is below). I guess what I want to ask here is: 

It seems like there is a spectrum of possible views you could have here:

  1. It's achievable to come up with sensible ansatzes (sparsity, linear representations,
... (read more)
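To make the factored-vs-composed distinction concrete, here is a minimal toy sketch (entirely made-up numbers, just to illustrate what I mean rather than anything from the post): two independent binary features each get their own direction, and a dictionary of 'composed' combination-features reconstructs the same activations with a lower L0, even though it misses the direct-sum structure.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # activation dimension (arbitrary toy choice)
a, b = rng.normal(size=(2, d))           # directions for two independent binary features

# Activations on the four possible inputs: the space factors as span{a} ⊕ span{b}.
acts = np.array([x * a + y * b for x, y in itertools.product([0, 1], repeat=2)])

# Factored dictionary: one element per underlying feature.
D_factored = np.stack([a, b])
codes_factored = np.array([[x, y] for x, y in itertools.product([0, 1], repeat=2)])

# "Composed" dictionary: one element per combination (ignoring the all-zero input).
D_composed = np.stack([a, b, a + b])
codes_composed = np.array([[0, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

for name, D, codes in [("factored", D_factored, codes_factored),
                       ("composed", D_composed, codes_composed)]:
    recon_err = np.abs(codes @ D - acts).max()
    mean_l0 = (codes != 0).sum(axis=1).mean()
    print(f"{name}: max reconstruction error {recon_err:.2e}, mean L0 {mean_l0:.2f}")
# Both dictionaries reconstruct perfectly, but the composed one has lower mean L0
# (0.75 vs 1.0), so a sparsity-maximising method can prefer it even though the
# compositional structure is lost.
```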
3Lucius Bushnaq
Sure, it's possible in principle to notice that there is a subspace that can be represented as a direct sum. But how do you tell whether you in fact ought to represent it in that way, rather than as composed features, to match the features of the model? Just because the compositional structure is present in the activations doesn't mean the model cares about it. I agree that it is not a knockdown argument. That is why the title isn't "Activation space interpretability is doomed." 

Nice post! Re issue 1, there are a few things that you can do to work out if a representation you have found is a 'model feature' or a 'dataset feature'. You can:

  • Check if intervening on the forward pass to modify this feature produces the expected effect on outputs. Caveats:

    • The best vector for probing is not, in general, the best vector for steering (the inverse of a matrix is not its transpose, and finding a basis of steering vectors from a basis of probe vectors involves inverting the basis matrix); see the sketch below this comment
    • It's possible that the feature you found is causally upstre
... (read more)
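Here is the toy sketch of the probing-vs-steering caveat (a minimal numerical example I made up, assuming non-orthogonal feature directions): the optimal probes are the dual (pseudo-inverse) vectors, while the feature directions themselves are what you would add to steer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_feat = 8, 3
V = rng.normal(size=(n_feat, d))    # rows are feature (steering) directions, not orthogonal
B = np.linalg.pinv(V)               # columns of B are the dual "readoff"/probe directions

f = np.array([1.0, 0.0, 2.0])       # feature values on some input
x = f @ V                           # activation vector = sum_i f_i * V[i]

print("readoff with feature directions:", np.round(V @ x, 3))  # contaminated by interference
print("readoff with dual probe vectors:", np.round(x @ B, 3))  # recovers [1., 0., 2.] (up to numerical error)
```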
3Nina Panickssery
I don't understand this. If a feature is represented by a direction v in the activations, surely the best probe for that feature will also be v because then <v,v> is maximized. 

Strong upvoted. I think the idea in this post could (if interpreted very generously) turn out to be pretty important for making progress on the more ambitious forms of interpretability. If we/the AIs are able to pin down more details about what constitutes a valid learning story or a learnable curriculum, and tie that to the way gradient updates can be decomposed into signal on some circuit and noise on the rest of the network, then it seems like we should be able to understand each circuit as it corresponds to the endpoint of a training story, and each pa... (read more)

8Dmitry Vaintrob
Thanks! I definitely believe this, and I think we have a lot of evidence for it in both toy models and LLMs (I'm planning a couple of posts on this idea of "training stories"), and also theoretical reasons in some contexts. I'm not sure how easy it is to extend the specific approach used in the proof for parity to a general context. I think it inherently uses the orthogonality of Fourier functions on boolean inputs, and understanding other ML algorithms in terms of nice orthogonal functions seems hard to do rigorously, unless you either make some kind of simplifying "presumption of independence"-style model of learnable algorithms or work in a toy context. In the toy case, there is a nice paper that does exactly this (it explains how NNs will tend to find "incrementally learnable" algorithms) using a similar idea to the parity proof I outlined: the leap complexity paper (which Kaarel and I have looked into; I think you've also looked into related things).
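For reference, the orthogonality fact being leaned on is easy to check numerically; a quick illustrative sketch (mine, not from the paper) of the parity (Walsh) functions on {-1, 1}^n being orthonormal under the uniform distribution:

```python
import numpy as np
from itertools import product, chain, combinations

n = 3
inputs = np.array(list(product([-1, 1], repeat=n)))        # all 2^n boolean inputs as +-1

def parity(S):
    # chi_S(x) = prod_{i in S} x_i, with chi_{} = 1
    return inputs[:, S].prod(axis=1) if S else np.ones(len(inputs))

subsets = list(chain.from_iterable(combinations(range(n), k) for k in range(n + 1)))
chis = np.stack([parity(list(S)) for S in subsets])        # 2^n parity functions x 2^n inputs

gram = chis @ chis.T / len(inputs)                         # E_x[chi_S(x) chi_T(x)]
print(np.allclose(gram, np.eye(len(subsets))))             # True: an orthonormal basis
```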

I either think this is wrong or I don’t understand.

What do you mean by ‘maximising compounding money?’ Do you mean maximising expected wealth at some specific point in the future? Or median wealth? Are you assuming no time discounting? Or do you mean maximising the expected value of some sort of area under the curve of wealth over time?
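To illustrate why I'm asking (a toy simulation with made-up numbers, not anything from your post): for multiplicative bets these objectives genuinely come apart, since expected wealth can grow while median wealth shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_rounds = 100_000, 100

# Each round, wealth is multiplied by 1.5 or 0.6 with equal probability.
# The expected multiplier is 1.05 (> 1), but the typical per-round growth factor is
# sqrt(1.5 * 0.6) ≈ 0.95 (< 1), so mean and median wealth pull in opposite directions.
mults = rng.choice([1.5, 0.6], size=(n_paths, n_rounds))
wealth = mults.prod(axis=1)

print("analytic expected wealth:", 1.05 ** n_rounds)    # ≈ 131
print("sample mean wealth:      ", wealth.mean())       # noisy: dominated by rare lucky paths
print("sample median wealth:    ", np.median(wealth))   # ≈ 0.005: typical wealth shrinks
```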

I’m not sure I understand your question, but are you asking ‘in what sense are there two networks in series rather than just one deeper network’? The answer to that would be: parts of the inputs to a later small network could come from the outputs of many earlier small networks. Provided the later subnetwork is still sparsely used, it could have a different distribution of when it is used from any particular earlier subnetwork. A classic simple example is how the left-orientation dog detector and the right-orientation dog detector in InceptionV1 fire sort of independently, but both their outputs are inputs to the any-orientation dog detector (which in this case is just computing an OR).
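A minimal sketch of what I mean (toy numbers of my own, not the actual InceptionV1 weights):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Toy inputs: feature 0 = "left-facing dog", feature 1 = "right-facing dog",
# each present sparsely and (roughly) independently.
x = (rng.random(size=(10, 2)) < 0.3).astype(float)

# Two earlier "subnetworks" (here just single ReLU units reading different features).
left_dog = relu(x @ np.array([1.0, 0.0]))
right_dog = relu(x @ np.array([0.0, 1.0]))

# A later subnetwork whose inputs come from the outputs of both earlier ones,
# computing an OR: it fires iff left_dog + right_dog exceeds a threshold of 0.5
# (scaled and capped at 1 so the output is boolean).
any_dog = np.minimum(relu(left_dog + right_dog - 0.5) * 2, 1.0)

print(np.stack([left_dog, right_dog, any_dog], axis=1))
# any_dog is used on a different distribution of inputs than either earlier detector,
# even though each subnetwork is individually sparsely used.
```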

1lewis smith
yeah that makes sense I think

I keep coming back to the idea of interpreting the embedding matrix of a transformer. It’s appealing for several reasons: we know the entire data distribution is just independent probabilities of each logit, so there’s no mystery about what features are data features vs model features. We also know one sparse basis for the activations: the rows of the embedding. But that’s also clearly not satisfactory because the embedding learns something! The thing it learns could be a sparse overbasis of non-token features, but the story for this would have to be diffe... (read more)

1bilalchughtai
Tangentially relevant: this paper by Jacob Andreas' lab shows you can get pretty far on some algorithmic tasks by just training a randomly initialized network's embedding parameters. This is in some sense the opposite to experiment 2.
3Daniel Murfet
I hadn't seen that Wattenberg-Viegas paper before, nice.

[edit: stefan made the same point below earlier than me]

Nice idea! I’m not sure why this would be evidence for residual networks being an ensemble of shallow circuits — it seems more like the opposite to me? If anything, a low effective layer horizon implies that later layers are building more on the outputs of intermediate layers. In one extreme, a network with an effective layer horizon of 1 would only consist of circuits that route through every single layer. Likewise, for there to be any extremely shallow circuits that route directly from... (read more)

Yeah this does seem like it's another good example of what I'm trying to gesture at. More generally, I think the embedding at layer 0 is a good place for thinking about the kind of structure that the superposition hypothesis is blind to. If the vocab size is smaller than the SAE dictionary size, an SAE is likely to get perfect reconstruction and an L0 of 1 by just learning the vocab_size many embeddings. But those embeddings aren't random! They have been carefully learned and contain lots of useful information. I think trying to explain the structure in... (read more)
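To spell out that degenerate solution (a toy sketch with made-up sizes, not a real model): if the dictionary simply memorises the embedding rows, every layer-0 activation is reconstructed exactly with a single active latent, even though this 'explanation' says nothing about the structure inside the embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, dict_size = 50, 32, 64      # toy sizes with dict_size > vocab_size
W_E = rng.normal(size=(vocab_size, d_model))     # embedding (carefully learned in a real model)

# Layer-0 activations for a batch of tokens are just rows of W_E.
tokens = rng.integers(0, vocab_size, size=100)
acts = W_E[tokens]

# Degenerate "SAE": dictionary = embedding rows (plus unused extras), codes = one-hot on the token.
dictionary = np.vstack([W_E, rng.normal(size=(dict_size - vocab_size, d_model))])
codes = np.zeros((len(tokens), dict_size))
codes[np.arange(len(tokens)), tokens] = 1.0

recon = codes @ dictionary
print("max reconstruction error:", np.abs(recon - acts).max())   # 0.0
print("mean L0 per example:", (codes != 0).sum(axis=1).mean())   # 1.0
```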

I'm very unsure about this (have thought for less than 10 mins etc etc) but my first impression is that this is tentative evidence in favour of SAEs doing sensible things. In my model (outlined in our post on computation in superposition) the property of activation vectors that matters is their readoffs in different directions: the value of their dot product with various different directions in a readoff overbasis. Future computation takes the values of these readoffs as inputs, and it can only happen in superposition with an error correcting mechanism for... (read more)
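Here is a quick sketch of the readoff picture I have in mind (toy numbers, my own construction rather than anything from your experiments): with many nearly orthogonal directions in superposition, each readoff is the true feature value plus small interference noise, and a simple threshold (the error correction step) recovers the boolean feature values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_feat, n_active = 1000, 4000, 8

# Nearly orthogonal overbasis: random unit vectors in d dimensions.
F = rng.normal(size=(n_feat, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)

# A sparse boolean feature vector and the corresponding activation vector.
on = rng.choice(n_feat, size=n_active, replace=False)
f_true = np.zeros(n_feat)
f_true[on] = 1.0
x = f_true @ F

# Readoffs: dot products with every direction in the overbasis. On-features read off ~1,
# off-features read off interference noise of size ~ sqrt(n_active / d) ≈ 0.09.
readoffs = F @ x
print("on-feature readoffs:", np.round(readoffs[on], 2))
print("largest off-feature interference:", np.abs(np.delete(readoffs, on)).max())

# Error correction: thresholding the noisy readoffs recovers the boolean feature values.
f_hat = (readoffs > 0.5).astype(float)
print("recovered exactly:", np.array_equal(f_hat, f_true))
```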

9wesg
This is a great comment! The basic argument makes sense to me, though based on how much variability there is in this plot, I think the story is more complicated. Specifically, I think your theory predicts that the SAE reconstructed KL should always be out on the tail, and these random perturbations should have low variance in their effect on KL. I will do some follow up experiments to test different versions of this story.

I think I agree that SLT doesn't offer an explanation of why NNs have a strong simplicity bias, but I don't think you have provided an explanation for this either?

Here's a simple story for why neural networks have a bias towards functions with low complexity (I think it's just spelling out in more detail your proposed explanation):

Since the Kolmogorov complexity of a function f is (up to a constant offset) equal to the minimum description length of the function, it is upper bounded by the length of any particular description of the function, including by firs... (read more)
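One concrete way to run that argument, as a hedged sketch (the encoding details and constants here are my assumptions): describe the function by first specifying the architecture and the initialisation distribution, and then using a Shannon–Fano-style code for the induced distribution over functions. That gives

$$K(f) \;\le\; -\log_2 P_{\text{init}}(f) \;+\; K(\text{architecture and init distribution}) \;+\; O(1),$$

where $P_{\text{init}}(f)$ is the probability that a randomly initialised network implements $f$. So any function that occupies a large volume of parameter space automatically has low complexity.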

Someone suggested this comment was inscrutable so here's a summary:

I don't think that how argmax-y softmax is behaving is a crux between us - we think our picture makes the most sense when softmax acts like argmax or top-k, so we hope you're right that softmax is argmax-ish. Instead, I think the property that enables your efficient solution is that the set of features 'this token is token i' is mutually exclusive, i.e. only one of these features can activate on an input at once. That means that in your example you don't have to worry about how to recover... (read more)

jake_mendel*Ω695

Thanks for the comment!

In more detail:

In our discussion of softmax (buried in part 1 of section 4), we argue that our story makes the most sense precisely when the temperature is very low, in which case we only attend to the key(s) that satisfy the most skip feature-bigrams. Also, when features are very sparse, the number of skip feature-bigrams present in one query-key pair is almost always 0 or 1, and we aren't trying to super precisely track whether it's, say, 34 or 35.
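As a toy illustration of the low-temperature claim (made-up counts, not our actual construction): when the attention scores are small integer counts of satisfied skip feature-bigrams, a large enough score scale makes softmax put essentially all attention on the key(s) with the maximal count.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Number of skip feature-bigrams satisfied by each key for one query (toy counts).
counts = np.array([0, 1, 0, 0, 2, 0, 1, 0])

for beta in [1.0, 5.0, 20.0]:                 # inverse temperature (score scale)
    attn = softmax(beta * counts)
    print(f"beta={beta:>4}:", np.round(attn, 3))
# As beta grows, attention concentrates on the key with the most satisfied bigrams
# (index 4), i.e. softmax behaves like an argmax over the counts.
```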

I agree that if softmax is just being an argmax, then one implication is that we don't... (read more)

8jake_mendel
Someone suggested this comment was inscrutable so here's a summary: I don't think that how argmax-y softmax is behaving is a crux between us - we think our picture makes the most sense when softmax acts like argmax or top-k, so we hope you're right that softmax is argmax-ish. Instead, I think the property that enables your efficient solution is that the set of features 'this token is token i' is mutually exclusive, i.e. only one of these features can activate on an input at once. That means that in your example you don't have to worry about how to recover feature values when multiple features are present at once. For more general tasks implemented by an attention head, we do need to worry about what happens when multiple features are present at the same time, and then we need the f-vectors to form a nearly orthogonal basis, and your construction becomes a special case of ours, I think.

So, all our algorithms in the post are hand constructed with their asymptotic efficiency in mind, but without any guarantees that they will perform well at finite . They haven't even really been optimised hard for asymptotic efficiency - we think the important point is in demonstrating that there are algorithms which work in the large  limit at all, rather than in finding the best algorithms at any particular  or in the limit.  Also, all the quantities we talk about are at best up to constant factors which would be importan... (read more)

Thanks for the kind feedback!

 I'd be especially interested in exploring either the universality of universal calculation

Do you mean the thing we call genericity in the further work section? If so, we have some preliminary theoretical and experimental evidence that genericity of U-AND is true. We trained narrow 1-layer MLPs on the U-AND task and the analogous U-XOR task, and looked at the size of the interference terms after training with a suitable loss function. Then, we reinitialised and froze the first layer of weights and biases, al... (read more)
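For concreteness, here is roughly the shape of that experiment written as a hypothetical sketch (the sizes, sparsity level, and loss below are placeholder choices, not the actual settings): train a narrow 1-layer MLP to output all pairwise ANDs of sparse boolean inputs, then repeat with the first layer frozen at its random initialisation.

```python
import itertools
import torch
import torch.nn as nn

torch.manual_seed(0)
d0, d, p = 64, 128, 0.05                       # input features, hidden width, feature sparsity
pairs = list(itertools.combinations(range(d0), 2))

def batch(n=256):
    x = (torch.rand(n, d0) < p).float()        # sparse boolean features
    y = torch.stack([x[:, i] * x[:, j] for i, j in pairs], dim=1)   # all pairwise ANDs
    # (the analogous U-XOR experiment would use XOR targets instead)
    return x, y

def train(freeze_first_layer=False, steps=2000):
    model = nn.Sequential(nn.Linear(d0, d), nn.ReLU(), nn.Linear(d, len(pairs)))
    if freeze_first_layer:                     # keep the random init of W1 and b1 fixed
        for param in model[0].parameters():
            param.requires_grad_(False)
    opt = torch.optim.Adam([p_ for p_ in model.parameters() if p_.requires_grad], lr=1e-3)
    for _ in range(steps):
        x, y = batch()
        loss = ((model(x) - y) ** 2).mean()    # a stand-in for "a suitable loss function"
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print("trainable first layer:    ", train(freeze_first_layer=False))
print("frozen random first layer:", train(freeze_first_layer=True))
# Comparable losses in the two settings would be evidence for the genericity of U-AND:
# a random first layer already provides usable interference structure.
```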

1Hoagy
How are you setting p when d₀ = 100? I might be totally misunderstanding something, but log²(d₀)/√d ≈ 2.12 at d₀ = d = 100 - feels like you need to push d up towards like 2000 to get something reasonable? (And the argument in 1.4 for using 1/log²(d₀) clearly doesn't hold here, because it's not greater than log²(d₀)/d^(1/k) for this range of values.)
2LawrenceC
Fascinating, thanks for the update!