I recently published a rather large side project of mine: an attempt to replicate, on the humble but open-source Llama 3.2-3B model, the mechanistic interpretability research on proprietary and open-source LLMs that was quite popular this year and produced great research papers from Anthropic[1][2], OpenAI[3][4], and Google DeepMind[5].
The project provides a complete end-to-end pipeline for training Sparse Autoencoders to interpret LLM features, from activation capture through training, interpretation, and verification. All code, data, trained models, and detailed documentation are publicly available in an attempt to keep this as close to open research as possible, though calling it an extensively documented personal project wouldn't be wrong either in my opinion.
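To give a sense of what the training stage of such a pipeline involves, here is a minimal sketch of a sparse autoencoder update step on captured activations. This is not the project's actual code: the plain ReLU encoder with an L1 sparsity penalty, the 8x expansion factor, the learning rate, and the `l1_coeff` value are all illustrative assumptions (only the 3072 hidden size matches Llama 3.2-3B).

```python
# Minimal SAE training-step sketch (illustrative, not the project's code).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        latents = torch.relu(self.encoder(x))   # sparse feature activations
        recon = self.decoder(latents)           # reconstruction of the input
        return recon, latents

d_model = 3072                                  # Llama 3.2-3B hidden size
sae = SparseAutoencoder(d_model, d_hidden=8 * d_model)  # assumed expansion factor
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 5e-4                                 # assumed sparsity/reconstruction trade-off

def train_step(activations: torch.Tensor) -> float:
    """One optimization step on a batch of captured activations."""
    recon, latents = sae(activations)
    recon_loss = (recon - activations).pow(2).mean()
    sparsity_loss = latents.abs().mean()
    loss = recon_loss + l1_coeff * sparsity_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The interpretation and verification stages then operate on the learned features (e.g. inspecting which inputs maximally activate each latent), but those steps depend on the project's own tooling and are not sketched here.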
Since LessWrong has a strong focus on AI interpretability research, I thought some of you might find value in this open research replication. I'm happy to answer any questions about the methodology, results, or future directions.
Hi Neel,
You're absolutely right: all research in the Gemma Scope paper was performed on the open-source Gemma 2 model. I wanted to sum up all the research my project was based on in one concise sentence, and in doing so I erroneously put your work in the 'proprietary LLMs' group. I went ahead and corrected the mistake.
My apologies.
I hope you still enjoyed the project and thank you for your great research work at DeepMind. =)