Exploring the coherence of features explanations in the GemmaScope
AI Alignment course project (BlueDot Impact) Summary Sparse Autoencoders are becoming the to-use technique to get features out of superposition and interpret them in a sparse activation space. Current methods to interpret such features at scale rely heavily on automatic evaluations carried by Large Language Models. However, there is still...