A Mechanistic Interpretability Analysis of Grokking
A significantly updated version of this work is now on Arxiv and was published as a spotlight paper at ICLR 2023.

aka, how the best way to do modular addition is with Discrete Fourier Transforms and trig identities

If you don't want to commit to a long post, check out the Tweet thread summary.

Introduction

Grokking is a recent phenomenon, discovered by OpenAI researchers, that is in my opinion one of the most fascinating mysteries in deep learning: models trained on small algorithmic tasks like modular addition will initially memorise the training data, but after a long time will suddenly learn to generalise to unseen data.

A training curve for a 1L Transformer trained to do addition mod 113, trained on 30% of the 113^2 pairs - it shows clear grokking

This is a write-up of an independent research project I did into understanding grokking through the lens of mechanistic interpretability. My most important claim is that grokking has a deep relationship to phase changes. Phase changes, ie a sudden change in the model's performance for some capability during training, are a general phenomenon that occurs when training models, and have also been observed in large models trained on non-toy tasks. One example is the sudden change in a transformer's capacity to do in-context learning when it forms induction heads. In this work, I examine several toy settings where a model trained to solve them exhibits a phase change in test loss, regardless of how much data it is trained on. I show that if a model is instead trained in these settings on limited data with high regularisation, then it shows grokking.

Loss curve for predicting repeated subsequences in a sequence of random tokens in a 2L attention-only transformer on infinite data - shows a phase change

Loss curve for predicting repeated subsequences in a sequence of random tokens in a 2L attention-only transformer given 512 training data points - shows clear grokking.

One of the core claims of mechanistic interpretability is that neur

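To make the modular addition setup above concrete, here is a minimal Python sketch of the kind of dataset involved. This is not the original training code: the constant names, seed, and split logic are illustrative. Every pair (a, b) with a, b below 113 is labelled with (a + b) mod 113, and a random 30% of the 113^2 pairs is used for training, with the rest held out as the test set.

```python
import itertools
import random

P = 113           # the prime modulus for addition mod 113
TRAIN_FRAC = 0.3  # fraction of all pairs used for training

# All (a, b) pairs with their label (a + b) % P - 113^2 = 12769 pairs in total.
pairs = [(a, b, (a + b) % P) for a, b in itertools.product(range(P), repeat=2)]

random.seed(0)  # illustrative seed, not taken from the original work
random.shuffle(pairs)

# Random 30% train / 70% held-out split over all pairs.
split = int(TRAIN_FRAC * len(pairs))
train_data, test_data = pairs[:split], pairs[split:]

print(len(train_data), len(test_data))  # 3830 training pairs, 8939 test pairs
```

With P = 113 and a 30% split, this gives 3830 training pairs and 8939 held-out pairs; the grokking curve shown above comes from training a small transformer on a split like this with high regularisation and tracking the loss on the held-out pairs.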