User Comment Replies

I really liked @Sam Marks's recent post on downstream applications as validation for interp techniques, and I've been feeling similarly after the (in my opinion) somewhat disappointing downstream performance of SAEs.

Motivated by this, I've written up about 50 weird language model results I found in the literature. I expect some of them to be familiar to most here (e.g. alignment faking, reward hacking) and some to be a bit more obscure (e.g. input space connectivity, fork tokens).

If our current interp techniques can help us understand these pheno... (read more)

1Sheikh Abdur Raheem Ali1d

Thanks for doing this, I found it really helpful.

3jake_mendel3d

Very happy you did this!

4Neel Nanda3d

Really helpful work, thanks a lot for doing it

StefanHex's Shortform

Josh Engels3mo235

I was having trouble reproducing your results on Pythia, and was only able to get 60% variance explained. I may have tracked it down: I think you may be computing FVU incorrectly.

https://gist.github.com/Stefan-Heimersheim/ff1d3b92add92a29602b411b9cd76cec#file-clustering_pythia-py-L309

I think FVU is correctly computed by subtracting the mean from each dimension when computing the denominator. See the SAEBench impl here:

https://github.com/adamkarvonen/SAEBench/blob/5204b4822c66a838d9c9221640308e7c23eda00a/sae_bench/evals/core/main.py#L566
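To make the difference concrete, here is a minimal NumPy sketch of the two denominators. It is not a copy of the SAEBench code: the function names are made up for illustration, and summing over all tokens before taking the ratio is an assumption (an implementation may instead average per-token ratios).

```python
import numpy as np

def fvu(x, x_hat):
    """Fraction of variance unexplained, with the per-dimension mean removed."""
    residual = ((x - x_hat) ** 2).sum()
    # Denominator: squared distance of each point from the dataset mean.
    total = ((x - x.mean(axis=0, keepdims=True)) ** 2).sum()
    return residual / total

def fvu_uncentered(x, x_hat):
    """Same ratio, but with the mean left in the denominator."""
    residual = ((x - x_hat) ** 2).sum()
    return residual / (x ** 2).sum()

# Synthetic "activations" with a large mean offset (illustrative data only).
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 16)) + 5.0
x_hat = x + rng.normal(scale=0.5, size=x.shape)

print("centered FVU:  ", fvu(x, x_hat))
print("uncentered FVU:", fvu_uncentered(x, x_hat))
```

When the activations have a large mean, the uncentered denominator is inflated, so the uncentered FVU comes out smaller and the reconstruction looks better than it is.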

When I used yo... (read more)

2StefanHex3mo

You're right. I forgot to subtract the mean. Thanks a lot!! I'm computing new numbers now, but indeed I expect this to explain my result! (Edit: Seems to not change too much)

StefanHex's Shortform

Josh Engels3mo50

I just tried to replicate this on GPT-2 with expansion factor 4 (so total number of centroids = 768 * 4). I find that clustering reaches ~87% variance explained, while a k = 32 SAE gets more like 95% variance explained. I did the nonlinear version of finding nearest neighbors when using k-means, to give k-means the biggest advantage possible, and ran k-means clustering on the points using the FAISS clustering library.

Definitely take this with a grain of salt: I'm going to look through my code and see if I can reproduce your results on Pythia too, and if so, try a larger model too. Code: https://github.com/JoshEngels/CheckClustering/tree/main
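One possible reading of the "nonlinear version" is assigning each point to its nearest centroid by Euclidean distance, rather than by largest dot product. The sketch below is an illustration under that assumption, not the code from the linked repo; the function names and toy data are hypothetical.

```python
import numpy as np

def assign_by_distance(data, centroids):
    """Index of the nearest centroid in Euclidean distance."""
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, and ||x||^2 is constant per row,
    # so minimizing distance equals maximizing 2 x.c - ||c||^2.
    scores = 2.0 * data @ centroids.T - (centroids ** 2).sum(axis=1)
    return scores.argmax(axis=1)

def assign_by_dot(data, centroids):
    """Index of the centroid with the largest dot product."""
    return (data @ centroids.T).argmax(axis=1)

# Toy case where the two rules disagree because the centroid norms differ.
centroids = np.array([[1.0, 0.0], [3.0, 0.0]])
data = np.array([[1.2, 0.0]])

print(assign_by_distance(data, centroids))  # nearest centroid
print(assign_by_dot(data, centroids))       # largest dot product
```

The two rules agree whenever all centroids have the same norm. Otherwise the distance rule amounts to the dot-product rule with a per-centroid bias of -||c||^2 / 2.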

StefanHex's Shortform

Josh Engels3mo10

What do you mean by encoding/decoding like normal but using the k-means vectors? Shouldn’t the SAE training process for a top-k SAE with k = 1 find these vectors then?

In general I’m a bit skeptical that clustering will work as well on larger models. My impression is that most small models have fairly token-level features, which might be quite clusterable with k=1, but for larger models many activations may belong to multiple “clusters”, which you need dictionary learning for.

4StefanHex3mo

So I do something like

    latents_tmp = torch.einsum("bd,nd->bn", data, centroids)
    max_latent = latents_tmp.argmax(dim=-1)  # shape: [batch]
    latents = one_hot(max_latent)

where the first line is essentially an SAE embedding (and centroids are the features), and the second/third line is a top-k. And for reconstruction do something like

    recon = latents @ centroids  # [batch, n] @ [n, d] -> [batch, d]

which should also be equivalent. Yes, I would expect an optimal k=1 top-k SAE to find exactly that solution. I'm confused why k=20 top-k SAEs do so badly then.

If this is a crux, then a quick way to prove this would be for me to write down encoder/decoder weights and throw them into standard SAE code. I haven't done this yet.
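For anyone who wants to poke at this, here is a self-contained NumPy sketch of the "centroids as a k=1 top-k SAE" idea. It uses NumPy rather than torch to keep dependencies light. The function names, the random unit-norm centroids, and the choice to set the active latent to 1 (rather than keeping the pre-activation value, as a standard top-k SAE would) are assumptions for illustration.

```python
import numpy as np

def encode_top1(data, centroids):
    """One-hot latents: 1 at the centroid with the largest dot product."""
    pre_acts = data @ centroids.T                 # [batch, n]
    idx = pre_acts.argmax(axis=-1)                # [batch]
    latents = np.zeros_like(pre_acts)
    latents[np.arange(len(data)), idx] = 1.0
    return latents

def decode(latents, centroids):
    """Reconstruction: each row becomes its selected centroid."""
    return latents @ centroids                    # [batch, n] @ [n, d] -> [batch, d]

# Demo with random unit-norm centroids (an assumption made for this example).
rng = np.random.default_rng(0)
centroids = rng.normal(size=(8, 4))
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

# Points that sit exactly on centroids 2, 5 and 5.
data = centroids[[2, 5, 5]]
recon = decode(encode_top1(data, centroids), centroids)
print(np.allclose(recon, data))
```

With real k-means centroids the norms differ, so the dot-product argmax need not pick the Euclidean-nearest centroid.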

All of Josh Engels's Comments + Replies