Dmitrii Kharlapenko — LessWrong

Dmitrii Kharlapenko

Top posts

Self-explaining SAE features

TL;DR

* We apply the method of SelfIE/Patchscopes to explain SAE features – we give the model a prompt like “What does X mean?”, replace the residual stream on X with the decoder direction times some scale, and have it generate an explanation. We call this self-explanation.
* The natural alternative is auto-interp, using a larger LLM to spot patterns in max activating examples. We show that our method is effective, and comparable with Neuronpedia’s auto-interp labels (with the caveat that Neuronpedia’s auto-interp used the comparatively weak GPT-3.5 so this is not a fully fair comparison).
* We aren’t confident you should use our method over auto-interp, but we think in some situations it has advantages: no max activating dataset examples are needed, and it’s cheaper as you just run the model being studied (eg Gemma 2B) not a larger model like GPT-4.
* Further, it has different errors to auto-interp, so finding and reading both may be valuable for researchers in practice.
* We provide advice for using self-explanation in practice, in particular for the challenge of automatically choosing the right scale, which significantly affects explanation quality.
* We also release a tool for you to work with self-explanation.
* We hope the technique is useful to the community as is, but expect there’s many optimizations and improvements on top of what is in this post.

Introduction

This work was produced as part of the ML Alignment & Theory Scholars Program - Summer 24 Cohort, under mentorship from Neel Nanda and Arthur Conmy.

SAE features promise a flexible and extensive framework for interpretation of LLM internals. Recent work (like Scaling Monosemanticity) has shown that they are capable of capturing even high-level abstract concepts inside the model. Compared to MLP neurons, they can capture many more interesting concepts. Unfortunately, in order to learn things with SAE features and interpret what the SAE tells us, one needs to first interpret these features
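The patching step described in the TL;DR above (replace the residual stream on X with the decoder direction times some scale) can be illustrated with a toy NumPy sketch. This is not the authors' code: the function name `patch_feature_direction`, the array shapes, and the choice to normalize the decoder direction to unit length before scaling are all assumptions made for illustration. In a real setup the patched activations would be fed back into the model through a forward hook so that it generates the explanation.

```python
import numpy as np

def patch_feature_direction(resid, position, decoder_dir, scale):
    """Return a copy of `resid` with one token position overwritten.

    resid:       (seq_len, d_model) residual stream activations (toy stand-in)
    position:    index of the placeholder token X in the prompt
    decoder_dir: (d_model,) SAE decoder direction for the feature
    scale:       scalar multiplier; the post notes this choice matters a lot
    """
    # Assumption: normalize first so `scale` directly sets the vector norm.
    unit = decoder_dir / np.linalg.norm(decoder_dir)
    patched = resid.copy()  # leave the caller's activations untouched
    patched[position] = scale * unit
    return patched

# Toy usage: 3 token positions, d_model = 4, placeholder X at position 1.
resid = np.zeros((3, 4))
decoder_dir = np.array([3.0, 0.0, 4.0, 0.0])
patched = patch_feature_direction(resid, position=1, decoder_dir=decoder_dir, scale=10.0)
```

Since the explanation quality depends on the scale, one would typically call this for several candidate scales and compare the generated explanations.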

Extracting SAE task features for in-context learning

SAE features for refusal and sycophancy steering vectors

Evolutionary prompt optimization for SAE feature visualization


Evolutionary prompt optimization for SAE feature visualization

TLDR: * Fluent dreaming for language models is an algorithm based on the GCG method that can reliably find plain-text readable prompts for LLMs that maximize certain logits or residual stream directions by using gradients and genetic algorithms. Authors showed its use for visualizing MLP neurons. We show this method...

Nov 14, 2024•22

SAE features for refusal and sycophancy steering vectors

TL;DR * Steering vectors provide evidence that linear directions in LLMs are interpretable. Since SAEs decompose linear directions, they should be able to interpret steering vectors. * We apply the gradient pursuit algorithm suggested by Smith et al to decompose steering vectors, and find that they contain many interpretable and...

Oct 12, 2024•29

Extracting SAE task features for in-context learning

TL;DR * We try to study task vectors in the SAE basis. This is challenging because there is no canonical way to convert an arbitrary vector in the residual stream to a linear combination of SAE features — you can't just pass an arbitrary vector through the encoder without going...

Aug 12, 2024•31

Self-explaining SAE features

TL;DR * We apply the method of SelfIE/Patchscopes to explain SAE features – we give the model a prompt like “What does X mean?”, replace the residual stream on X with the decoder direction times some scale, and have it generate an explanation. We call this self-explanation. * The natural...

Aug 5, 2024•62


114 karma

45 alignment forum karma

2 posts

2 comments

Member for 2 years

Evolutionary prompt optimization for SAE feature visualization

neverix, Daniel Tan, Dmitrii Kharlapenko, Neel Nanda, Arthur Conmy

1y

TLDR:

Fluent dreaming for language models is an algorithm based on the GCG method that can reliably find plain-text readable prompts for LLMs that maximize certain logits or residual stream directions by using gradients and genetic algorithms. Authors showed its use for visualizing MLP neurons. We show this method can also help interpret SAE features.
We reimplement the algorithm in the paper, adapt it based on suggestions from the authors and apply it to some interesting residual stream SAE latents in GPT-2 Small and Gemma 1 2B. The prompts mostly match the kind of text one would expect to activate a feature from looking at the max activating examples. We find prompts that activate features

... (read 2506 more words →)

22

SAE features for refusal and sycophancy steering vectors

neverix, Dmitrii Kharlapenko, Arthur Conmy, Neel Nanda

1y

TL;DR

Steering vectors provide evidence that linear directions in LLMs are interpretable. Since SAEs decompose linear directions, they should be able to interpret steering vectors.
We apply the gradient pursuit algorithm suggested by Smith et al to decompose steering vectors, and find that they contain many interpretable and promising-looking features. This builds off our prior work, which applies ITO and derivatives to steering vectors with less success.
Notably, we find several abstract features in Phi-3 Mini for refusal and steer with linear combinations of these features. We find interesting features when decomposing sycophancy steering vectors and apply the same technique to MELBO vectors with mixed success (an unsupervised technique that finds vectors that cause significant changes in model

... (read 1819 more words →)
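To make the decomposition idea in the TL;DR above more concrete, here is a minimal sketch of generic gradient pursuit, which greedily builds a sparse combination of dictionary atoms that approximates a target vector. It is an illustration under assumptions, not the implementation from the post or from Smith et al: the function name `gradient_pursuit`, the identity dictionary, and the toy "steering vector" are hypothetical, and any SAE-specific details (such as constraining coefficients to be non-negative) are left out.

```python
import numpy as np

def gradient_pursuit(x, dictionary, n_steps):
    """Sparse-code `x` as a combination of rows of `dictionary`.

    x:          (d_model,) vector to decompose, e.g. a steering vector
    dictionary: (n_atoms, d_model) atoms, e.g. SAE decoder directions
    n_steps:    number of greedy steps; bounds the number of active atoms
    """
    n_atoms = dictionary.shape[0]
    coef = np.zeros(n_atoms)
    support = np.zeros(n_atoms, dtype=bool)
    for _ in range(n_steps):
        residual = x - coef @ dictionary
        inner = dictionary @ residual            # correlation with each atom
        support[np.argmax(np.abs(inner))] = True  # greedily grow the support
        grad = np.where(support, inner, 0.0)      # gradient restricted to support
        c = grad @ dictionary                     # effect of the update in d_model space
        denom = c @ c
        if denom < 1e-12:                         # residual already explained
            break
        step = (residual @ c) / denom             # exact line search along grad
        coef = coef + step * grad
    return coef

# Toy usage: an orthonormal dictionary and a vector made of atoms 0 and 3.
dictionary = np.eye(4)
x = np.array([2.0, 0.0, 0.0, 1.0])
coef = gradient_pursuit(x, dictionary, n_steps=2)
```

The nonzero entries of `coef` indicate which atoms were selected and with what weight; with real SAE decoder directions, those would be the candidate features to inspect.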

4

29

Extracting SAE task features for in-context learning

Dmitrii Kharlapenko, neverix, Neel Nanda, Arthur Conmy

2y

TL;DR

We try to study task vectors in the SAE basis. This is challenging because there is no canonical way to convert an arbitrary vector in the residual stream to a linear combination of SAE features — you can't just pass an arbitrary vector through the encoder without going off distribution.
We explored the algorithm of gradient pursuit suggested in Smith et al, but it didn’t work for us without modifications.
Our approach is to apply the SAE encoder to the task vector, and then apply a gradient-based cleanup. This exploits the fact that task vectors have a differentiable objective. We find that this gives a sparser and cleaner reconstruction, which is also highly interpretable,

... (read 2611 more words →)

1

31

Replying to Self-explaining SAE features

Dmitrii Kharlapenko, 2y

Do you mean SAE encoder weights by input features? We did not look into them.

1

2

0

Replying to Self-explaining SAE features

Dmitrii Kharlapenko, 2y

Thanks! We did try to use it in the repeat setting to make the model produce more than a single token, but it did not work well.

And as far as I remember it also did not improve the meaning prompt much.

4

0

1

1

Self-explaining SAE features

Dmitrii Kharlapenko, neverix, Neel Nanda, Arthur Conmy

2y

TL;DR

We apply the method of SelfIE/Patchscopes to explain SAE features – we give the model a prompt like “What does X mean?”, replace the residual stream on X with the decoder direction times some scale, and have it generate an explanation. We call this self-explanation.
The natural alternative is auto-interp, using a larger LLM to spot patterns in max activating examples. We show that our method is effective, and comparable with Neuronpedia’s auto-interp labels (with the caveat that Neuronpedia’s auto-interp used the comparatively weak GPT-3.5 so this is not a fully fair comparison).
We aren’t confident you should use our method over auto-interp, but we think in some situations it has advantages: no max activating

... (read 2965 more words →)

13

62

1

1