jake_mendel

technical AI safety program associate at OpenPhil

Comments

Fair point. I guess I still want to say that there's a substantial amount of 'come up with new research agendas' (or sub-agendas) to be done within each of your bullet points, but I agree the focus on getting trustworthy slightly superhuman AIs, and then not needing control anymore, makes things much better. I also feel pretty nervous about some of those bullet points as paths to placing so much trust in your AI systems that you don't feel you need to bother controlling/monitoring them anymore, and the ones that seem to go furthest towards giving me enough trust in the AIs to stop control are also the ones with the most wide-open research questions (eg EMs in the extreme case). But I do want to walk back some of the things in my comment above that apply only to aligning very superintelligent AI.

If you are (1) worried about superintelligence-caused x-risk and (2) have short timelines to both TAI and ASI, it seems like the success or failure of control depends almost entirely on getting the early TAIs to do stuff like "coming up with research agendas"? Like, most people (in AIS) don't seem to think that unassisted humans are remotely on track to develop alignment techniques that work for very superintelligent AIs within the next 10 years — we don't really even have any good, untested ideas for how to do that. Therefore if we have very superintelligent AIs within the next 10 years (eg 5y till TAI and 5y of RSI), and if we condition on having techniques for aligning them, then it seems very likely that these techniques depend on novel ideas and research breakthroughs made by AIs in the period after TAI is developed. It's possible that most of these breakthroughs are within mechinterp or similar, but that's a pretty loose constraint, and 'solve mechinterp' is really not much more of a narrow, well-scoped goal than 'solve alignment'. So it seems like optimism about control rests somewhat heavily on optimism that controlled AIs can safely do things like coming up with new research agendas.

[edit: I'm now thinking that actually the optimal probe vector is also orthogonal to $\mathrm{span}\{\vec{f}_j\}_{j \neq i}$, so maybe the point doesn't stand. In general, I think it is probably a mistake to talk about activation vectors as linear combinations of feature vectors, rather than as vectors that can be projected into a set of interpretable readoff directions. See here for more.]

Yes, I'm calling the representation vector the same as the probing vector. Suppose my activation vector can be written as $\vec{a} = \sum_i x_i \vec{f}_i$, where the $x_i$ are feature values and the $\vec{f}_i$ are feature representation vectors. Then the probe vector which minimises MSE (explains most of the variance) is just $\vec{f}_i$. To avoid off-target effects, the vector $\vec{s}_i$ you want to steer with for feature $i$ might be the vector that is most 'surgical': it changes the value of this feature and no others. In that case it should be the vector that lies orthogonal to $\mathrm{span}\{\vec{f}_j\}_{j \neq i}$, which is only the same as $\vec{f}_i$ if the set $\{\vec{f}_i\}$ is orthogonal.

Obviously I'm working with a non-overcomplete basis of feature representation vectors here. If we're dealing with the overcomplete case, then it's messier. People normally talk about 'approximately orthogonal vectors', in which case the most surgical steering vector $\vec{s}_i \approx \vec{f}_i$, but (handwaving) you can also talk about something like 'approximately linearly independent vectors', in which case my point stands I think (note that SAE decoder directions are definitely not approximately orthogonal). For something less handwavey see this appendix.
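
To make this concrete, here is a minimal numpy sketch of the two-feature, non-orthogonal case (the feature vectors are made up for illustration). It shows the point in the edit above: with a complete but non-orthogonal basis and noiseless activations, the least-squares probe for a feature lands on the dual-basis direction (the one orthogonal to the other feature vectors), not on $\vec{f}_i$ itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two non-orthogonal but linearly independent feature representation vectors.
F = np.array([[1.0, 0.0],
              [0.6, 0.8]])           # row i is f_i (made-up directions)
x = rng.normal(size=(10_000, 2))     # independent feature values x_i
acts = x @ F                         # activations a = sum_i x_i f_i

# Least-squares probe for feature 0 (the direction minimising read-off MSE).
probe0, *_ = np.linalg.lstsq(acts, x[:, 0], rcond=None)

# Dual-basis vector for feature 0: the direction orthogonal to f_1.
dual0 = np.linalg.inv(F)[:, 0]

print(probe0)               # ~[1.0, -0.75]
print(dual0)                # [1.0, -0.75]: same direction as the probe
print(F[0])                 # [1.0, 0.0]: the raw feature vector differs
print(np.dot(dual0, F[1]))  # ~0.0: orthogonal to the other feature vector
```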

A thought triggered by reading issue 3:

I agree issue 3 seems like a potential problem with methods that optimise for sparsity too much, but it doesn't seem that directly related to the main thesis? At least in the example you give, it should be possible in principle to notice that the space can be factored as a direct sum without having to look to future layers. I guess what I want to ask here is: 

It seems like there is a spectrum of possible views you could have here:

  1. It's achievable to come up with sensible ansatzes (sparsity, linear representations, if we see the possibility to decompose the space into direct sums then we should do that, and so on) which will get us most of the way to finding the ground truth features, but there are edge cases/counterexamples which can only be resolved by looking at how the activation vector is used. This is compatible with the example you gave in issue 3, where the space is factorisable into a direct sum, which seems pretty natural/easy to look for in advance, although of course that's the reason you picked that particular structure as an example.
  2. There are many many ways to decompose an activation vector, corresponding to many plausible but mutually incompatible sets of ansatzes, and the only way to know which is correct for the purposes of understanding the model is to see how the activation vector is used in the later layers.
    1. Maybe there are many possible decompositions but they are all/mostly straightforwardly related to each other by eg a sparse basis transformation, so finding any one decomposition is a step in the right direction.
    2. Maybe not that.
  3. Any sensible approach to decomposing an activation vector without looking forward to subsequent layers will be actively misleading. The right way to decompose the activation vector can't be found in isolation with any set of natural ansatzes because the decomposition depends intimately on the way the activation vector is used.

The main strategy being pursued in interpretability today is (insofar as interp is about fully understanding models):

  • First decompose each activation vector individually. Then try to integrate the decompositions of different layers together into circuits. This may require merging found features into higher level features, or tweaking the features in some way, or filtering out some features which turn out to be dataset features. (See also superseding vs supplementing superposition).

This approach is betting that the decompositions you get when you take each vector in isolation are a (big) step in the right direction, even if they require modification, which is more compatible with stances (1) and (2a) in the list above. I don't think your post contains any knockdown arguments that this approach is doomed (do you agree?), but it is maybe suggestive. It would be cool to have some fully reverse engineered toy models where we can study one layer at a time and see what is going on.

Nice post! Re issue 1, there are a few things that you can do to work out if a representation you have found is a 'model feature' or a 'dataset feature'. You can:

  • Check if intervening on the forward pass to modify this feature produces the expected effect on outputs. Caveats:

    • The best vector for probing is not the best vector for steering (in general the inverse of a matrix is not the transpose, and finding a basis of steering vectors from a basis of probe vectors involves inverting the basis matrix).
    • It's possible that the feature you found is causally upstream of some features the model has learned, so that even if the model hasn't learned this feature, changing it affects things the model is aware of. OTOH, I'm not sure whether I want to say that this feature has not been learned by the model in this case.
    • Some techniques eg crosscoders don't come equipped with a well defined notion of intervening on the feature during a forward pass.

    Nonetheless, we can still sometimes get evidence this way, in particular about whether our probe has found subtle structure in the data that is really causally irrelevant to the model. This is already a common technique in interpretability (see eg the inimitable Golden Gate Claude, and many more systematic steering tests like this one).

  • Run various shuffle/permutation controls:
    • Measure the selectivity of your feature finding technique: replace the structure in the data with some new structure (or just remove the structure) and then see if your probe finds that new structure. To the extent that the probe can learn the new structure, it is not telling you about what the model has learned.
      Most straightforwardly: if you have trained a supervised probe, you can train a second supervised probe on a dataset with randomised labels, and look at how much more accurate the probe is when trained on data with true labels. This can help distinguish between the hypothesis that you have found a real variable in the model, and the null hypothesis that the probing technique is powerful enough to find a direction that can classify any dataset with that accuracy. Selectivity tests should do things like match the bias of the train data (eg if training a probe on a sparsely activating feature, then the value of the feature is almost always zero and that should be preserved in the control). There's a minimal sketch of this kind of label-shuffle control after this list.
      You can also test unsupervised techniques like SAEs this way by training them on random sequences of tokens. There are probably more sophisticated controls that can be introduced here: eg you can try to destroy all the structure in the data and replace it with random structure that is still sparse in the same sense, and so on.
    • In addition to experiments that destroy the probe training data, you can also run experiments that destroy the structure in the model weights. To the extent that the probe works here, it is not telling you about what the model has learned.
      For example, reinitialise the weights of the model, and train the probe/SAE/look at the PCA directions. This is a weak control: a stronger control could do something like reinitialising the model's weights while matching the eigenspectrum of each weight matrix to the eigenspectrum of the corresponding matrix in the trained model (to rule out explanations like 'the SAE didn't work in the randomised model because the activation vectors are too small'), although that control is still quite weak.
      This control was used nicely in Towards Monosemanticity here, although I think much more research of this form could be done with SAEs and their cousins.
    • I am told by Adam Shai that in experimental neuroscience, it is something of a sport to come up with better and better controls for testing the hypothesis that you have identified structure. Maybe some of that energy should be imported to interp?
  • Probably some other things not on my mind right now??
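
As a concrete sketch of the label-shuffle control mentioned above (using scikit-learn; `acts` and `labels` are random placeholders standing in for real model activations and concept labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholders: in practice `acts` would be model activations on some dataset
# and `labels` the concept you are probing for.
acts = rng.normal(size=(2000, 512))
labels = rng.integers(0, 2, size=2000)

def probe_accuracy(acts, labels, seed=0):
    """Train a linear probe and return held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        acts, labels, test_size=0.25, random_state=seed)
    return LogisticRegression(max_iter=2000).fit(X_tr, y_tr).score(X_te, y_te)

# Control: same probe architecture and activations, but the label structure is
# destroyed by a permutation (which preserves the class balance). Any accuracy
# reached here reflects probe expressivity, not what the model encodes.
true_acc = probe_accuracy(acts, labels)
control_acc = probe_accuracy(acts, rng.permutation(labels))
print(f"true labels: {true_acc:.3f}   shuffled labels: {control_acc:.3f}")
```

The interesting quantity is the gap between the two numbers, with the control matched to the class balance (and any sparsity) of the real labels.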

I am aware that there is less use in being able to identify whether your features are model features or dataset features than there is in having a technique that zero-shot identifies model features only. However, a reliable set of tools for distinguishing what type of feature we have found would give us feedback loops that could help us search for good feature-finding techniques. Eg good controls would give us the freedom to do things like searching over (potentially nonlinear) probe architectures for those with a high accuracy relative to the control (in the absence of the control, searching over architectures would lead us to more and more expressive nonlinear probes that tell us nothing about the model's computation). I'm curious whether this sort of thing would lead us away from treating activation vectors in isolation, as the post argues.

Strong upvoted. I think the idea in this post could (if interpreted very generously) turn out to be pretty important for making progress on the more ambitious forms of interpretability. If we/the AIs are able to pin down more details about what constitutes a valid learning story or a learnable curriculum, and tie that to the way gradient updates can be decomposed into signal on some circuit and noise on the rest of the network, then it seems like we should be able to understand each circuit as the endpoint of a training story, where each part of the training story corresponds to a simple modification of the circuit that adds some more complexity. This is potentially better for interpretability than if it were easy for networks to learn huge chunks of structure all at once. How optimistic are you about there being general insights to be had about the structures of learnable curricula and their relation to networks' internal structure?

I either think this is wrong or I don’t understand.

What do you mean by ‘maximising compounding money?’ Do you mean maximising expected wealth at some specific point in the future? Or median wealth? Are you assuming no time discounting? Or do you mean maximising the expected value of some sort of area under the curve of wealth over time?
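
For instance (a toy simulation with a made-up bet, just to illustrate why the distinction matters): staking the whole bankroll each round on a positive-expected-value bet maximises expected wealth at a fixed time, but drives median wealth to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up bet: 50/50 chance of tripling the stake or losing it (positive EV).
# Strategy: bet the entire bankroll every round.
n_paths, n_rounds = 100_000, 10
wealth = np.ones(n_paths)
for _ in range(n_rounds):
    win = rng.random(n_paths) < 0.5
    wealth = np.where(win, wealth * 3.0, 0.0)

print("mean wealth:  ", wealth.mean())      # ~1.5**10, about 58
print("median wealth:", np.median(wealth))  # 0.0: almost every path is ruined
```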

I’m not sure I understand your question, but are you asking ‘in what sense are there two networks in series rather than just one deeper network’? The answer to that would be: parts of the inputs to a later small network could come from the outputs of many earlier small networks. Provided the later subnetwork is still sparsely used, it could have a different distribution of when it is used from any particular earlier subnetwork. A classic simple example is how the left-orientation dog detector and the right-orientation dog detector in InceptionV1 fire sort of independently, but both their outputs are inputs to the any-orientation dog detector (which in this case is just computing an OR).
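
A toy numerical version of that example (the firing rates are made up, nothing measured from InceptionV1):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # number of inputs

# Two earlier subnetworks that fire sparsely and roughly independently,
# standing in for the left- and right-orientation dog detectors.
left_dog = rng.random(n) < 0.03
right_dog = rng.random(n) < 0.03

# A later subnetwork whose inputs are the outputs of both earlier ones;
# here it just computes an OR ("dog at any orientation").
any_dog = left_dog | right_dog

print(left_dog.mean(), right_dog.mean(), any_dog.mean())
# ~0.03, ~0.03, ~0.059: the later subnetwork is still sparsely used, but over a
# different distribution of inputs than either earlier subnetwork alone.
```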

I keep coming back to the idea of interpreting the embedding matrix of a transformer. It’s appealing for several reasons: we know the entire data distribution is just independent probabilities of each token, so there’s no mystery about which features are data features vs model features. We also know one sparse basis for the activations: the rows of the embedding. But that’s also clearly not satisfactory because the embedding learns something! The thing it learns could be a sparse overbasis of non-token features, but the story for this would have to be different to the normal superposition story, which involves features being placed into superposition by model components after they are computed (I find this story suss in other parts of the model too).
SAEs trained on the embedding do pretty well, but the task is much easier than in other layers because the dataset is so small (one activation vector per vocabulary token). Nonetheless, if the error were exactly zero, this would mean that a sparse overbasis is certainly real here (even if not the full story). If the error were small enough, we may want to conclude that it is just training noise. Therefore I have some experiment questions that would start this off:

  • Since the dataset of activations is so small, we can probably afford to do full basis pursuit (probably with some sort of weighting for token frequencies). How small does the error get? How does this scale with pretraining checkpoint, ie is the model trying to reduce this noise over training? (A rough sketch of this experiment is below, after the list.) Presumably a UMAP of basis directions shows semantic clusters like with every SAE, implying there is more structure to investigate, but it would be super cool if that weren't the case.
  • How much interesting stuff is actually contained in the embedding? If we randomise the weights of the embedding (perhaps with rejection sampling to avoid rows being too high cosine sim) and pretrain gpt2 from scratch without ever updating the embedding weights, how much worse does training go? What about if we update one row of the embedding of gpt2 at a time to random and finetune?
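
A rough sketch of what the first experiment might look like, using scikit-learn's minibatch dictionary learning as a cheap stand-in for full basis pursuit (the model name, the 4x overcompleteness, and the sparsity penalty are all placeholder choices; token-frequency weighting and the checkpoint sweep are omitted):

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from transformers import GPT2Model

# The "dataset" here is just the rows of the token embedding matrix.
emb = GPT2Model.from_pretrained("gpt2").wte.weight.detach().numpy()  # (50257, 768)

learner = MiniBatchDictionaryLearning(
    n_components=4 * emb.shape[1],      # 4x overcomplete (arbitrary choice)
    alpha=1.0,                          # L1 sparsity penalty (to be tuned)
    transform_algorithm="lasso_lars",
    batch_size=256,
    random_state=0,
)
codes = learner.fit_transform(emb)       # sparse coefficients per token
recon = codes @ learner.components_      # reconstructed embedding rows

rel_err = np.linalg.norm(emb - recon, axis=1) / np.linalg.norm(emb, axis=1)
l0 = (np.abs(codes) > 1e-6).sum(axis=1)
print("mean relative reconstruction error:", rel_err.mean())
print("mean number of active components:  ", l0.mean())
# Repeating this across pretraining checkpoints would show whether the model
# is driving this residual error down over training.
```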

If we find that 1) random embeddings do a lot worse and 2) basis pursuit doesn’t lead to error nodes that tend to zero over training, then we’re in business: the embedding matrix contains important structure that is outside the superposition hypothesis. Is matrix binding going on? Are circles common? WHAT IS IT

[edit: stefan made the same point below earlier than me]

Nice idea! I’m not sure why this would be evidence for residual networks being an ensemble of shallow circuits — it seems more like the opposite to me? If anything, low effective layer horizon implies that later layers are building more on the outputs of intermediate layers. In one extreme, a network with an effective layer horizon of 1 would only consist of circuits that route through every single layer. Likewise, for there to be any extremely shallow circuits that route directly from the inputs to the final layer, the effective layer horizon must be the number of layers in the network.

I do agree that low layer horizons would substantially simplify (in terms of compute) searching for circuits.
