The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
This is a linkpost for our two recent papers:

1. An exploration of using degeneracy in the loss landscape for interpretability: https://arxiv.org/abs/2405.10927
2. An empirical test of an interpretability technique based on the loss landscape: https://arxiv.org/abs/2405.10928

This work was produced at Apollo Research in collaboration with Kaarel Hanni (Cadenza Labs), Avery Griffin, Joern Stoehler, Magdalena Wache and Cindy Wu. Not to be confused with Apollo's recent Sparse Dictionary Learning paper.

A key obstacle to mechanistic interpretability is finding the right representation of neural network internals. Ideally, we would like to derive our features from some high-level principle that holds across different architectures and use cases. At a minimum, we know two things:

1. We know that the training loss goes down during training. Thus, the features learned during training must be determined by the loss landscape. We want to use the structure of the loss landscape to identify what the features are and how they are represented.
2. We know that models generalize, i.e. that they learn features from the training data that allow them to make accurate predictions on the test set. Thus, we want our interpretation to explain this generalization behavior.

Generalization has been linked to basin broadness in the loss landscape in several ways, most notably by singular learning theory, which introduces the learning coefficient as a measure of basin broadness. The learning coefficient doubles as a measure of generalization error and replaces the parameter count in Occam's razor.

Inspired by both of these ideas, the first paper explores using the structure of the loss landscape to find the most computationally natural representation of a network. Drawing on singular learning theory, we focus on identifying parts of the network that are not responsible for low loss (i.e. degeneracy). These degeneracies are an obstacle for interpretability as they mean there exist parameters which do no