Note: This is an exposition of the recent preprint "Monet: Mixture of Monosemantic Experts for Transformers". I wrote this exposition as my project for the ARBOx program; I was not involved in writing the paper. Any errors are my own. Thank you to @David Quarel for his excellent comments and suggestions.
TL;DR: MONET is a novel neural network architecture that achieves interpretability by design rather than through post-hoc analysis. Using a specialized Mixture of Experts (MoE) architecture with ~250k experts per layer, MONET encourages individual experts to specialize in specific, interpretable skills (e.g., Python coding, biology knowledge, or toxic-content generation). This enables selective removal of capabilities without harming performance in other domains.
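To make "selective removal of capabilities" concrete, here is a minimal, hypothetical sketch of a sparse MoE layer in which flagged experts are masked out of the router, so their contribution disappears while the remaining experts are untouched. This is only an illustration of the general idea, not MONET's actual implementation (the paper scales to hundreds of thousands of experts per layer with a far more efficient parameterization); all names here (`ToyMoELayer`, `remove_experts`, `expert_mask`) are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMoELayer(nn.Module):
    """Toy top-k MoE layer with the ability to mask ('remove') experts."""

    def __init__(self, d_model: int, d_expert: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        # Each expert is a small two-layer MLP.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(), nn.Linear(d_expert, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k
        # 1 = keep expert, 0 = expert "removed" (e.g. a toxicity expert).
        self.register_buffer("expert_mask", torch.ones(n_experts))

    def remove_experts(self, expert_ids):
        """Mask out the given experts so the router can never select them."""
        self.expert_mask[list(expert_ids)] = 0.0

    def forward(self, x):  # x: (batch, seq, d_model)
        logits = self.router(x)
        # Masked experts get -inf logits, so their routing probability is exactly zero.
        logits = logits.masked_fill(self.expert_mask == 0, float("-inf"))
        weights = F.softmax(logits, dim=-1)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)  # renormalize over the selected experts

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[..., slot]               # (batch, seq)
            w = topk_w[..., slot].unsqueeze(-1)     # (batch, seq, 1)
            # Loop over experts for clarity; a real implementation batches this.
            for e, expert in enumerate(self.experts):
                sel = idx == e
                if sel.any():
                    out[sel] += w[sel] * expert(x[sel])
        return out


layer = ToyMoELayer(d_model=64, d_expert=16, n_experts=8)
x = torch.randn(2, 5, 64)
layer.remove_experts([3])   # e.g. suppress a hypothetical "toxicity" expert
y = layer(x)                # the remaining experts route and compute as before
```

The point of the sketch is that, if experts really are monosemantic, deleting a capability reduces to zeroing out a handful of router entries; the rest of the post examines how well this works in practice for MONET.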
I show that MONET...