All of asher's Comments + Replies

asher10

Thanks for the response! I still think that most of the value of SAEs comes from finding a human-interpretable basis, and most of these problems don't directly interfere with this property. I'm also somewhat skeptical that SAEs actually do find a human-interpretable basis, but that's a separate question.

All the model's features will be linear combinations of activations in the standard basis, so it does have 'the right underlying features, but broken down differently from how the model is actually thinking about them'. 

I think this is a fair point. I ... (read more)

asher20

This was a really thought-provoking post; thanks for writing it! I thought this was an unusually good attempt to articulate problems with the current interpretability paradigm and do some high-level thinking about what we could do differently. However, I think a few of the specific points are weaker than you make them seem in a way that somewhat contradicts the title of the post. I also may be misunderstanding parts, so please let me know if that’s the case. 

Problems 2 and 3 (the learned feature dictionary may not match the model’s feature dictionary,... (read more)

2Lucius Bushnaq
My general issue with most of your counterpoints is that they apply just as much to the standard basis of the network. That is, the neurons in the MLPs, the residual stream activations as they are in Pytorch, etc. The standard basis represents the activations of the network completely faithfully. It does this even better than techniques like SAEs, which always have some amount of reconstruction error. All the model's features will be linear combinations of activations in the standard basis, so it does have 'the right underlying features, but broken down differently from how the model is actually thinking about them'. Same for all your other points.

Theoretically, can you solve problems 1, 2 and 3 with the standard basis by taking information about how the model is computing downstream into account in the right way? Sure. You'd 'take it into account' by finding some completely new basis.

Can you solve problem 4 with transcoders? I think vanilla versions would struggle, because the transcoder needs to combine many latents to form an x2, but probably.

But our point is that 'piecing together the model's features from its downstream computations' is the whole job of a decomposition. If you have to use information about the model's computations to find the features of the model, you're pretty much conceding that what we call activation space interpretability here doesn't work.

I am also skeptical that the techniques you name (e2e SAEs, transcoders, sparse dictionary learning on attributions) suffice to solve all problems in this class in their current form. That would have been a separate discussion beyond the scope of this post, though. All we're trying to say here is that you very likely do need to leverage the wider functional structure of the model and incorporate more information about how the model performs computation to decompose the model well.
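To make the faithfulness contrast above concrete, here is a minimal sketch (toy dimensions, random activations, an untrained toy SAE; none of this is code from the post): reading activations off in the standard basis is lossless by construction, while an SAE reconstruction is a combination of learned dictionary directions and generally has nonzero reconstruction error.

```python
# Sketch only: toy sizes, random data, untrained SAE. The point is the contrast
# between a lossless standard-basis read-off and a lossy dictionary reconstruction.
import torch
import torch.nn as nn

d_model, d_dict, n_samples = 64, 256, 1000
acts = torch.randn(n_samples, d_model)  # stand-in for residual stream activations

# Standard basis: each activation vector is trivially its own coordinates -> zero error.
standard_recon = acts @ torch.eye(d_model)
print("standard basis error:", (acts - standard_recon).norm().item())

class ToySAE(nn.Module):
    def __init__(self, d_model, d_dict):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x):
        latents = torch.relu(self.enc(x))   # feature activations (would be sparse after training)
        return self.dec(latents), latents   # reconstruction = weighted sum of dictionary directions

# Untrained here, so the latents are neither sparse nor interpretable; the point is
# only that the reconstruction is lossy, unlike the standard-basis read-off above.
sae = ToySAE(d_model, d_dict)
recon, latents = sae(acts)
print("SAE reconstruction error:", (acts - recon).norm().item())
```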
asher3622

tldr: I’m a little confused about what Anthropic is aiming for as an alignment target, and I think it would be helpful if they publicly clarified this and/or considered it more internally.

  • I think we could be very close to AGI, and I think it’s important that whoever makes AGI thinks carefully about what properties to target in trying to create a system that is both useful and maximally likely to be safe. 
  • It seems to me that right now, Anthropic is targeting something that resembles a slightly more harmless modified version of human values — maybe
... (read more)
8Nathan Helm-Burger
I agree! I contributed to and endorse this corrigibility plan by Max Harms (MIRI researcher): Corrigibility as Singular Target. (See also posts by Seth Herd.) I think CAST offers much better safety under higher capabilities and more agentic workflows.
asher10

Oh shoot, yeah. I'm probably just looking at the rotary embeddings, then. Forgot about that, thanks!

asher10

I'm pretty confused; this doesn't seem to happen for any other models, and I can't think of a great explanation.
Has anyone investigated this further?
 

Here are graphs I made for GPT2, Mistral 7B, and Pythia 14M.
Three dimensions indeed explain almost all of the information in GPT-2's positional embeddings, whereas Mistral 7B and Pythia 14M both seem to make use of all the dimensions.
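For reference, here is a sketch of how one might compute the GPT-2 curve in such a graph (only the model name `gpt2` comes from the thread; the centering and the SVD-based explained-variance computation are assumptions about the setup):

```python
# Sketch: cumulative variance explained by the top singular directions of
# GPT-2's learned positional embedding matrix.
import torch
from transformers import GPT2Model

wpe = GPT2Model.from_pretrained("gpt2").wpe.weight.detach()  # [n_positions, d_model]
wpe = wpe - wpe.mean(dim=0)                                  # center before the PCA-style analysis
s = torch.linalg.svdvals(wpe)
explained = (s ** 2).cumsum(0) / (s ** 2).sum()
print("variance explained by top 3 dims:", explained[2].item())
```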

 

[This comment is no longer endorsed by its author]
4Arthur Conmy
Mistral and Pythia use rotary embeddings and don't have a positional embedding matrix. Which matrix are you looking at for those two models?
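A quick way to check which models even have such a matrix (not from the thread; Pythia 14M stands in for the rotary case because Mistral 7B is large to load, and the Hugging Face names `gpt2` and `EleutherAI/pythia-14m` are assumed):

```python
# Sketch: GPT-2 exposes a learned positional embedding matrix (wpe), while
# GPTNeoX-style models such as Pythia inject position via rotary embeddings
# inside attention and have no such learned parameter.
from transformers import AutoModel

for name in ["gpt2", "EleutherAI/pythia-14m"]:
    model = AutoModel.from_pretrained(name)
    pos_params = [n for n, _ in model.named_parameters() if "wpe" in n or "position" in n]
    print(name, "->", pos_params or "no learned positional embedding parameters (rotary)")
```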
asher10

Is all the money gone by now? I'd be very happy to take a bet if not.