Ah yes! I tried doing exactly this to produce a sort of 'logit lens' to explain the SAE features. In particular I tried the following.
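Roughly, the idea is something like the sketch below (a simplified illustration rather than exact code; the tensor names are placeholders): push each SAE decoder direction through CLIP's projection into the joint image/text embedding space, then see which class-name text embeddings it points towards.

```python
import torch
import torch.nn.functional as F

def feature_logit_lens(W_dec, visual_proj, text_embeds, top_k=5):
    """Score each SAE decoder direction against class-name text embeddings.

    W_dec       : (d_sae, d_model)   SAE decoder directions in the residual stream
    visual_proj : (d_model, d_embed) CLIP's projection into the joint embedding space
    text_embeds : (n_classes, d_embed) normalised text embeddings of the class names
    """
    # Map decoder directions into the joint image/text embedding space.
    feature_embeds = F.normalize(W_dec @ visual_proj, dim=-1)  # (d_sae, d_embed)
    # Cosine similarity of each feature direction with each class-name embedding.
    sims = feature_embeds @ text_embeds.T                      # (d_sae, n_classes)
    return sims.topk(top_k, dim=-1)                            # top labels per feature
```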
Thanks for the comments! I am also surprised that SAEs trained on these vision models seem to require so little data, especially as I would have thought the complexity of CLIP's representations for vision would be comparable to the complexity of its representations for text (after all, we can generate an image from a text prompt and then use a captioning model to recover the text, suggesting most/all of the information in the text is also present in the image).
With regards to the model loss, I used the text template “A photo of a {label}.”, where {label} is the ImageNet text label.
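In case it's useful, here is roughly how that zero-shot loss can be computed (a sketch using the Hugging Face CLIP wrappers; the checkpoint name and the `imagenet_labels` list of class-name strings are placeholders, not necessarily what I used):

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint; imagenet_labels is an assumed list of class-name strings.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

prompts = [f"A photo of a {label}." for label in imagenet_labels]
text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    text_embeds = F.normalize(model.get_text_features(**text_inputs), dim=-1)

@torch.no_grad()
def zero_shot_loss(images, targets):
    # targets: class indices for each image in the batch.
    pixel_values = processor(images=images, return_tensors="pt").pixel_values
    image_embeds = F.normalize(model.get_image_features(pixel_values=pixel_values), dim=-1)
    logits = model.logit_scale.exp() * image_embeds @ text_embeds.T
    return F.cross_entropy(logits, targets)
```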
Thanks for the feedback! Yeah, I was also surprised that SAEs seem to work on ViTs pretty much straight out of the box (I didn't even need to play around with the hyperparameters too much)! As I mentioned in the post, I think it would be really interesting to train on a much larger (more typical) dataset, similar to the dataset the CLIP model was trained on.
I also agree that I probably should have emphasised the "guess the image" game as a result rather than an aside; I'll bear that in mind for future posts!
Awesome work! I couldn’t find anywhere that specified how sparse your SAE is (e.g. your L0). I would be interested to hear what L0 you got!
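(For reference, by L0 I just mean the average number of features that fire per patch/token, i.e. something like the sketch below, where `feature_acts` is a hypothetical tensor of post-ReLU SAE activations.)

```python
import torch

def mean_l0(feature_acts: torch.Tensor) -> float:
    # feature_acts: (batch, n_patches, d_sae) post-ReLU SAE activations (assumed shape).
    # Count non-zero features per patch, then average over the batch.
    return (feature_acts > 0).float().sum(dim=-1).mean().item()
```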