Mechanistic Interpretability of Llama 3.2 with Sparse Autoencoders
I recently published a fairly large side project of mine that attempts to replicate, with the humble but open-source Llama 3.2-3B model, the mechanistic interpretability research on proprietary and open-source LLMs that was quite popular this year and produced great research papers from Anthropic[1][2], OpenAI[3][4] and Google DeepMind[5]. The...
Nov 24, 2024