A quick question: is there a reason to use CLIP embeddings as the SAE input instead of the images themselves? I understand that the goal is to study CLIP's inner workings, but I'm curious whether you have any intuition about whether feeding in images directly would work as well.
