Part 1: Enhancing Inner Alignment in CLIP Vision Transformers: Mitigating Reification Bias with SAEs and Grad-ECLIP
Abstract: In this work, we present a methodology that integrates mechanistic and gradient-based interpretability techniques to reduce bias in the CLIP Vision Transformer, focusing on enhancing inner alignment. We begin by defining the CLIP model and reviewing the role of Sparse Autoencoders (SAEs) in mechanistic interpretability, specifically highlighting the use...