TL;DR: We experimentally test the mathematical framework for circuits in superposition by hand-coding the weights of an MLP to implement many conditional[1] rotations in superposition on two-dimensional input features. The code can be found here. This work was supported by Coefficient Giving and Goodfire AI. ...
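As a rough illustration of the operation that post hand-codes (a minimal sketch, not the post's actual weight construction; the gating scheme here is a hypothetical simplification), a conditional rotation of a two-dimensional feature can be written as a rotation matrix applied only when a gate is active:

```python
import numpy as np

def rotation(theta):
    # Standard 2D rotation matrix.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def conditional_rotate(x, gate, theta):
    # Apply the rotation only when the binary gate is on;
    # otherwise pass the 2D feature through unchanged.
    R = rotation(theta)
    return gate * (R @ x) + (1 - gate) * x

x = np.array([1.0, 0.0])
y = conditional_rotate(x, gate=1, theta=np.pi / 2)  # rotates x toward [0, 1]
```

The post's construction packs many such conditional rotations into one MLP's weights in superposition; this snippet only shows a single rotation in isolation.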
TL;DR: This post derives an upper bound on the prediction error of Bayesian learning on neural networks. Unlike the bound from vanilla Singular Learning Theory (SLT), this bound also holds for out-of-distribution generalization, not just for in-distribution generalization. Along the way, it shows some connections between SLT and Algorithmic Information...
Summary & Motivation: This post is a continuation and clarification of Circuits in Superposition: Compressing many small neural networks into one. That post presented a sketch of a general mathematical framework for compressing different circuits into a network in superposition. On closer inspection, some of it turned out to be...
Abstract: A key step in reverse engineering neural networks is to decompose them into simpler parts that can be studied in relative isolation. Linear parameter decomposition, a framework that has been proposed to resolve several issues with current decomposition methods, decomposes neural network parameters into a sum of sparsely used vectors...
EDIT 15.12.2025: This post is now superseded by this one. It was just a sketch I wrote up to clarify my own thinking, and contains many mistakes. The newer post is the finished product. I think we may be able to prove that Bayesian learning on transformers[1] or recurrent neural...
Suppose we have a system that we suspect is some form of local optimiser. Maybe it's gradient descent training a neural network. Maybe it's a neural network doing in-context learning. Maybe it's a mind, because we guess that the operation of minds in general is to an extent well-described as...
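For concreteness, the first of those examples, gradient descent, is the canonical local optimiser: it repeatedly takes a small step against the local gradient. A minimal sketch (function names and parameters here are illustrative, not from the post):

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    # Follow the negative gradient: the simplest local optimiser.
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimise f(x) = (x - 3)^2, whose gradient is 2 * (x - 3);
# the iterates converge to the local (here also global) minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=[0.0])
```

The point of the post's framing is that "local optimiser" covers not just this explicit loop but any process whose dynamics are well-described as descending some objective.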
This is a linkpost for Apollo Research's new interpretability paper: "Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition". We introduce a new method for directly decomposing neural network parameters into mechanistic components. Motivation: At Apollo, we've spent a lot of time thinking about how the computations...