All of Jacob Dunefsky's Comments + Replies

Thanks for reading through the post! Let me try and respond to your questions:

We found that dataset diversity is crucial for EM. But you found that a single example is enough. How can we reconcile these two findings?

Your explanation largely agrees with my thinking: when you limit yourself to optimizing merely a steering vector (instead of a LoRA, let alone full finetuning), you're imposing such strong regularization that it'll be much harder to learn less-generalizing solutions.
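To make the regularization point concrete: the general recipe looks something like the sketch below (toy code, not our actual implementation - the model, layer, and hyperparameters are placeholders). The only trainable parameters are the d_model entries of a single vector added to the residual stream, which is a vastly smaller hypothesis class than a LoRA, let alone a full finetune.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Toy sketch: all model weights frozen; the only trainable parameters are the
# d_model entries of one steering vector added to the residual stream.
# Model, layer, and hyperparameters below are placeholders.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.requires_grad_(False)

layer = 6
steering_vec = torch.zeros(model.config.hidden_size, requires_grad=True)

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    if isinstance(output, tuple):
        return (output[0] + steering_vec,) + output[1:]
    return output + steering_vec

handle = model.transformer.h[layer].register_forward_hook(add_steering)

# A single (prompt, completion) pair; the actual datapoint is elided here.
prompt = "..."
completion = "..."
ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
labels = ids.clone()
labels[:, :prompt_len] = -100  # only score the completion tokens

optimizer = torch.optim.Adam([steering_vec], lr=1e-2)
for _ in range(100):
    loss = model(input_ids=ids, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
handle.remove()
```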

However, one other piece of the puzzle might be specific to how we optimize thes...

Jan Betley
Cool! Thx for all the answers, and again thx for running these experiments : ) (If you ever feel like discussing anything related to Emergent Misalignment, I'll be happy to - my email is in the paper).

This is very cool work!

One question that I have is whether JSAEs still work as well on models trained with gated MLP activation functions (e.g. ReGLU, SwiGLU). I ask this because there is evidence suggesting that transcoders don't work as well on such models (see App. B of the Gemmascope paper; I also have some unpublished results that I'm planning to write up that further corroborate this). It thus might be the case that the same greater representational capacity of gated activation functions causes both transcoders and JSAEs to be unable to learn sparse ...
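(For concreteness, the structural difference at issue is the elementwise gate: in a SwiGLU/ReGLU MLP the hidden activations are the product of two separate projections of the input, rather than a single projection passed through a nonlinearity. A rough sketch, with module names and dimensions purely illustrative:)

```python
import torch.nn as nn
import torch.nn.functional as F

class StandardMLP(nn.Module):
    # out = W_out @ act(W_in x)
    def __init__(self, d_model, d_mlp):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_mlp)
        self.w_out = nn.Linear(d_mlp, d_model)

    def forward(self, x):
        return self.w_out(F.gelu(self.w_in(x)))

class SwiGLUMLP(nn.Module):
    # out = W_out @ (silu(W_gate x) * (W_up x))
    # The elementwise product of two learned projections is the "gate";
    # ReGLU has the same shape with relu in place of silu.
    def __init__(self, d_model, d_mlp):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_mlp, bias=False)
        self.w_up = nn.Linear(d_model, d_mlp, bias=False)
        self.w_out = nn.Linear(d_mlp, d_model, bias=False)

    def forward(self, x):
        return self.w_out(F.silu(self.w_gate(x)) * self.w_up(x))
```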

I know that others are keen to have a suite of SAEs at different resolutions; my (possibly controversial) instinct is that we should be looking for a single SAE which we feel appropriately captures the properties we want. Then if we're wanting something more coarse-grained for a different level of analysis maybe we should use a nice hierarchical SAE representation in a single SAE (as above)...

This seems reasonable enough to me. For what it's worth, the other main reason why I'm particularly interested in whether different SAEs' rate-distortion curves in...

Computing the description length using the entropy of a feature activation's probability distribution is flexible enough to distinguish different types of distributions. For example, a binary distribution would have an entropy of one bit or less, and distributions spread out over more values would have larger entropies.

Yep, that's completely true. Thanks for the reminder!
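For anyone following along, my understanding of the computation is roughly the following (a toy sketch, not the authors' implementation; the bin width here stands in for the effective float precision, which they choose by coarsening the precision until the reconstruction or cross-entropy loss shifts by some tolerance):

```python
import numpy as np

def feature_entropy_bits(acts, bin_width):
    """Entropy (in bits) of one feature's activation distribution after
    rounding activations to a fixed bin width."""
    binned = np.round(acts / bin_width) * bin_width
    _, counts = np.unique(binned, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def description_length_bits(feature_acts, bin_width):
    """Total description length: the sum of per-feature entropies.
    feature_acts has shape (n_tokens, n_features); a binary on/off feature
    contributes at most one bit, while a feature whose activations are
    spread over many values contributes more."""
    return sum(
        feature_entropy_bits(feature_acts[:, j], bin_width)
        for j in range(feature_acts.shape[1])
    )
```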

Really cool stuff! Evaluating SAEs based on the rate-distortion tradeoff is an extremely sensible thing to do, and I hope to see this used in future research.

One minor question/idea: have you considered quantizing different features’ activations differently? For example, one might imagine that some features are only binary (i.e. is the feature on or off) while others’ activations might be used by the model in a fine-grained way. Quantizing different features differently would be a way to exploit this to reduce the entropy. (Of course, performing this optim...
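Roughly the kind of thing I'm imagining is sketched below; the per-feature choices are entirely made up, with a "binary" feature collapsed to on/off at its mean nonzero value while a fine-grained feature keeps many quantization levels.

```python
import numpy as np

def quantize_feature(acts, n_levels):
    """Quantize one feature's activations to n_levels uniform levels.
    n_levels=1 treats the feature as binary: off, or on at its mean value."""
    if n_levels == 1:
        on = acts > 0
        on_value = acts[on].mean() if on.any() else 0.0
        return np.where(on, on_value, 0.0)
    step = acts.max() / n_levels if acts.max() > 0 else 1.0
    return np.round(acts / step) * step

# Hypothetical per-feature budgets: feature 0 is binary, feature 1 is
# fine-grained, feature 2 is in between. In practice these would be chosen
# to minimize total description length subject to a reconstruction budget.
levels_per_feature = [1, 64, 8]

# Fake sparse activations just to make the sketch runnable.
acts = np.abs(np.random.randn(1000, 3)) * (np.random.rand(1000, 3) > 0.9)
quantized = np.stack(
    [quantize_feature(acts[:, j], n) for j, n in enumerate(levels_per_feature)],
    axis=1,
)
```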

Michael Pearce
On the question of quantizing different feature activations differently: computing the description length using the entropy of a feature activation's probability distribution is flexible enough to distinguish different types of distributions. For example, a binary distribution would have an entropy of one bit or less, and distributions spread out over more values would have larger entropies.

In our methodology, the effective float precision matters because it sets the bin width for the histogram of a feature's activations that is then used to compute the entropy. We used the same effective float precision for all features, which was found by rounding activations to different precisions until the reconstruction or cross-entropy loss changed by some amount.
Kola Ayonrinde
Yeah, we hope others take on this approach too! Stay tuned for our upcoming work 👀

This is an interesting perspective - my initial hypothesis before reading your comment was that allowing for variable bitrates for a single SAE would get around this issue, but I agree that this would be interesting to test and something we should definitely check! With the constant bit-rate version, I do expect that we would see something like this, though we haven't rigorously tested that hypothesis.

I know that others are keen to have a suite of SAEs at different resolutions; my (possibly controversial) instinct is that we should be looking for a single SAE which we feel appropriately captures the properties we want. Then if we're wanting something more coarse-grained for a different level of analysis, maybe we should use a nice hierarchical SAE representation in a single SAE (as above)... Or maybe we should switch to Representation Engineering, or even more coarse-grained working at the level of heads etc. Perhaps SAEs don't have to be all things to all people! I'd be interested to hear any opposing views on whether we really might want many SAEs at different resolutions, though.*

Thanks for your questions and thoughts - we're really interested in pushing this further and will hopefully have some follow-up work in the not-too-distant future.

EDIT: *I suspect some of the reason that people want different levels of SAEs is that they accept undesirable feature splitting as a fact of life and so want to be able to zoom in and out of features which may not be "atomic". I'm hoping that if we can address the feature splitting problem, then at least that reason may have somewhat less pull.

Just started playing around with this -- it's super cool! Thank you for making this available (and so fast!) -- I've got a lot of respect for you and Joseph and the Neuronpedia project.

Do you have any plans of doing something similar for attention layers?

I'm pretty sure that there's at least one other MATS group (unrelated to us) currently working on this, although I'm not certain about any of the details. Hopefully they release their research soon!

Also, do you have any plans to train sparse MLPs at multiple layers in parallel, and try to penalise them to have sparsely activating connections between each other in addition to having sparse activations?

I did try something similar at one point, but it didn't quite work out. In particular: gi...
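(For reference, one natural way to operationalize the question's "sparse connections" would be an L1 penalty on the feature-to-feature virtual weights between adjacent sparse MLPs - i.e. the product of the earlier transcoder's decoder with the later transcoder's encoder - on top of the usual activation-sparsity penalties. A rough sketch, ignoring the attention block and residual stream that sit between the two MLPs in a real transformer; names and coefficients are illustrative:)

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Sparse MLP replacement: MLP input -> sparse features -> MLP output."""
    def __init__(self, d_model, n_features):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))
        return self.dec(f), f

d_model, n_features = 512, 4096
tc1 = Transcoder(d_model, n_features)  # replaces layer L's MLP
tc2 = Transcoder(d_model, n_features)  # replaces layer L+1's MLP

l1_coef, conn_coef = 1e-3, 1e-5

def loss_fn(mlp_in_1, mlp_out_1, mlp_in_2, mlp_out_2):
    recon1, f1 = tc1(mlp_in_1)
    recon2, f2 = tc2(mlp_in_2)

    recon_loss = ((recon1 - mlp_out_1) ** 2).mean() + ((recon2 - mlp_out_2) ** 2).mean()
    act_sparsity = f1.abs().mean() + f2.abs().mean()

    # "Virtual weights" from layer-L features to layer-(L+1) features: how
    # strongly each upstream feature's decoder direction drives each
    # downstream feature's encoder. Penalizing their L1 norm pushes toward a
    # sparse feature-to-feature connectivity graph.
    virtual_weights = tc2.enc.weight @ tc1.dec.weight  # (n_features, n_features)
    conn_sparsity = virtual_weights.abs().mean()

    return recon_loss + l1_coef * act_sparsity + conn_coef * conn_sparsity
```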

Lee Sharkey
There's recent work published on this here by Chris Mathwin, Dennis Akar, and me. The gated attention block is a kind of transcoder adapted for attention blocks. Nice work by the way! I think this is a promising direction. Note also the similar, but substantially different, use of the term transcoder here, whose problems were pointed out to me by Lucius. Addressing those problems helped to motivate our interest in the kind of transcoders that you've trained in your work!