Ah yes! I tried doing exactly this to produce a sort of 'logit lens' to explain the SAE features. In particular I tried the following.
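Roughly, the idea is something like the sketch below (a simplified illustration rather than exact code; the tensor names are placeholders): push each SAE decoder direction through CLIP's projection into the joint image/text embedding space, then see which class-name text embeddings it points towards.

```python
import torch
import torch.nn.functional as F

def feature_logit_lens(W_dec, visual_proj, text_embeds, top_k=5):
    """Score each SAE decoder direction against class-name text embeddings.

    W_dec       : (d_sae, d_model)   SAE decoder directions in the residual stream
    visual_proj : (d_model, d_embed) CLIP's projection into the joint embedding space
    text_embeds : (n_classes, d_embed) normalised text embeddings of the class names
    """
    # Map decoder directions into the joint image/text embedding space.
    feature_embeds = F.normalize(W_dec @ visual_proj, dim=-1)  # (d_sae, d_embed)
    # Cosine similarity of each feature direction with each class-name embedding.
    sims = feature_embeds @ text_embeds.T                      # (d_sae, n_classes)
    return sims.topk(top_k, dim=-1)                            # top labels per feature
```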
Thanks for the comments! I am also surprised that SAEs trained on these vision models seem to require so little data, especially as I would have thought the complexity of CLIP's representations for vision would be comparable to the complexity of its representations for text (after all, we can generate an image from a text prompt and then use a captioning model to recover the text, suggesting most/all of the information in the text is also present in the image).
With regards to the model loss, I used the text template “A photo of a {label}.”, where {label} is the ImageNet text label.
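In case it's useful, here is roughly how that zero-shot loss can be computed (a sketch using the Hugging Face CLIP wrappers; the checkpoint name and the `imagenet_labels` list of class-name strings are placeholders, not necessarily what I used):

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint; imagenet_labels is an assumed list of class-name strings.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

prompts = [f"A photo of a {label}." for label in imagenet_labels]
text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    text_embeds = F.normalize(model.get_text_features(**text_inputs), dim=-1)

@torch.no_grad()
def zero_shot_loss(images, targets):
    # targets: class indices for each image in the batch.
    pixel_values = processor(images=images, return_tensors="pt").pixel_values
    image_embeds = F.normalize(model.get_image_features(pixel_values=pixel_values), dim=-1)
    logits = model.logit_scale.exp() * image_embeds @ text_embeds.T
    return F.cross_entropy(logits, targets)
```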
Thanks for the feedback! Yeah, I was also surprised that SAEs seem to work on ViTs pretty much straight out of the box (I didn't even need to play around with the hyperparameters too much)! As I mentioned in the post, I think it would be really interesting to train on a much larger (more typical) dataset, similar to the dataset the CLIP model was trained on.
I also agree that I probably should have emphasised the "guess the image" game as a result rather than an aside; I'll bear that in mind for future posts!
Awesome work! I couldn’t find anywhere that specified how sparse your SAE is (e.g. your L0). I would be interested to hear what L0 you got!
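(For reference, by L0 I just mean the average number of features that fire per patch/token, i.e. something like the sketch below, where `feature_acts` is a hypothetical tensor of post-ReLU SAE activations.)

```python
import torch

def mean_l0(feature_acts: torch.Tensor) -> float:
    # feature_acts: (batch, n_patches, d_sae) post-ReLU SAE activations (assumed shape).
    # Count non-zero features per patch, then average over the batch.
    return (feature_acts > 0).float().sum(dim=-1).mean().item()
```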