Abstract

Paper: Open Problems in Mechanistic Interpretability

Figure 3 in this paper (AtP*, Kramar et al.) illustrates the point nicely: https://arxiv.org/abs/2403.00745

Lee Sharkey

Lee Sharkey, bilalchughtai

TL;DR: This paper brings together ~30 mechanistic interpretability researchers from 18 different research orgs to review current progress and the main open problems of the field.

This review collects the perspectives of its various authors and represents a synthesis of their views by Apollo Research on behalf of Schmidt Sciences. The perspectives presented here do not necessarily reflect the views of any individual author or the institutions with which they are affiliated.

Abstract

Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks’ capabilities in order to accomplish concrete scientific and engineering goals.

Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about... (read more)

Replying toAttribution-based parameter decomposition

seems great for mechanistic anomaly detection! very intuitive to map ADP to surprise accounting (I was vaguely trying to get at a method like ADP here)

Agree! I'd be excited by work that uses APD for MAD, or even just work that applies APD to Boolean circuit networks. We did consider using them as a toy model at various points, but ultimately opted to go for other toy models instead.

(btw typo: *APD)

Replying toAttribution-based parameter decomposition

IMO most exciting mech-interp research since SAEs, great work

I think so too! (assuming it can be made more robust and scaled, which I think it can)
And thanks! :)

Replying toAttribution-based parameter decomposition

We're aware of model diffing work like this, but I wasn't aware of this particular paper.

It's probably an edge case: They do happen both to be in weight space and to be suggestive of weight space linearity. Indeed, our work was informed by various observations from a range of areas that suggest weight space linearity (some listed here).

On the other hand, our work focused on decomposing a given network's parameters. But the line of work you linked above seems more in pursuit of model editing and understanding the difference between two similar models, rather than decomposing a particular model's weights.

all in all, whether it deserved to be in the related work section is unclear to me. Seems plausible either way. The related work section was already pretty long, but it maybe deserves a section on weight space linearity, though probably not one on model diffing imo.

"Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition".

Lucius Bushnaq

Lucius Bushnaq, Dan Braun, StefanHex, jake_mendel, Lee Sharkey

This is a linkpost for Apollo Research's new interpretability paper:

We introduce a new method for directly decomposing neural network parameters into mechanistic components.

Motivation

At Apollo, we've spent a lot of time thinking about how the computations of neural networks might be structured, and how those computations might be embedded in networks' parameters. Our goal is to come up with an effective, general method to decompose the algorithms learned by neural networks into parts that we can analyse and understand individually.

For various reasons, we've come to think that decomposing network activations layer by layer into features and connecting those features up into circuits... (read 1114 more words →)

108

Replying toShowing SAE Latents Are Not Atomic Using Meta-SAEs

Showing SAE Latents Are Not Atomic Using Meta-SAEs

It would be interesting to meditate in the question "What kind of training procedure could you use to get a meta-SAE directly?" And I think answering this relies in part on mathematical specification of what you want.

At Apollo we're currently working on something that we think will achieve this. Hopefully will have an idea and a few early results (toy models only) to share soon.

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Bart Bussmann

Bart Bussmann, Michael Pearce, Patrick Leask, Joseph Bloom, Lee Sharkey, Neel Nanda

Bart, Michael and Patrick are joint first authors. Research conducted as part of MATS 6.0 in Lee Sharkey and Neel Nanda’s streams. Thanks to Mckenna Fitzgerald and Robert Krzyzanowski for their feedback!

TL;DR:

Sparse Autoencoder (SAE) latents have been shown to typically be monosemantic (i.e. correspond to an interpretable property of the input). It is sometimes implicitly assumed that they are therefore atomic, i.e. simple, irreducible units that make up the model’s computation.
We provide evidence against this assumption by finding sparse, interpretable decompositions of SAE decoder directions into seemingly more atomic latents, e.g. Einstein -> science + famous + German + astronomy + energy + starts with E-
We do this by training meta-SAEs, an

... (read 5717 more words →)

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Kola Ayonrinde

Kola Ayonrinde, Michael Pearce, Lee Sharkey

This work was produced as part of the ML Alignment & Theory Scholars Program - Summer 24 Cohort, under mentorship from Lee Sharkey and Jan Kulveit.

Note: An updated paper version of this post can be found here.

Abstract

Sparse Autoencoders (SAEs) have emerged as a useful tool for interpreting the internal representations of neural networks. However, naively optimising SAEs on the for reconstruction loss and sparsity results in a preference for SAEs which are extremely wide and sparse.

To resolve this issue, we present an information-theoretic framework for interpreting SAEs as lossy compression algorithms for communicating explanations of neural activations. We appeal to the Minimal Description Length (MDL) principle to motivate explanations of activations which... (read 4581 more words →)

Replying toCircumventing interpretability: How to defeat mind-readers

Circumventing interpretability: How to defeat mind-readers

So I believe I had in mind "active means [is achieved deliberately through the agent's actions]".

I think your distinction makes sense. And if I ever end up updating this article I would consider incorporating it. However, I think the reason I didn't make this distinction at the time is because the difference is pretty subtle.

The mechanisms I labelled as "strictly active" are the kind of strategy that it would be extremely improbable to implement successfully without some sort of coherent internal representations to help orchestrate the actions required to do it. This is true even if they've been selected for passively.

So I'd argue that they all need to be implemented actively (if... (read 434 more words →)

Replying toPacing Outside the Box: RNNs Learn to Plan in Sokoban

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

Extremely glad to see this! The Guez et al. model has long struck me as one of the best instances of a mesaoptimizer and it was a real shame that it was closed source. Looking forward to the interp findings!

Replying toEfficient Dictionary Learning with Switch Sparse Autoencoders

Efficient Dictionary Learning with Switch Sparse Autoencoders

Both of these seem like interesting directions (I had parameters in mind, but params and activations are too closely linked to ignore one or the other). And I don't have a super clear idea but something like representational similarity analysis between SwitchSAEs and regular SAEs could be interesting. This is just one possibility of many though. I haven't thought about it for long enough to be able to list many more, but it feels like a direction with low hanging fruit for sure. For papers, here's a good place to start for RSA: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3730178/

Replying toEfficient Dictionary Learning with Switch Sparse Autoencoders