It's funny how this is like a reverse of Searle's Chinese room. A system meant to just shuffle some tokens around can't help but understand their meaning!

Can we interpret latent reasoning using current mechanistic interpretability tools?

Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Josh Engels, Neel Nanda, Senthooran Rajamanoharan

2mo

Authors: Bartosz Cywinski*, Bart Bussmann*, Arthur Conmy**, Joshua Engels**, Neel Nanda**, Senthooran Rajamanoharan**

* primary contributors
** advice and mentorship

TL;DR

We study a simple latent reasoning LLM on math tasks using standard mechanistic interpretability techniques to see whether the latent reasoning process (i.e., vector-based chain of thought) is interpretable.

Results:

We find that the model solves maths problems requiring three reasoning steps by storing the two intermediate values in specific latent vectors (the third and fifth of six). We established this using standard mechanistic interpretability techniques.
The logit lens shows that intermediate calculations are represented in the residual stream during latent reasoning.
The latent vectors are not perfectly interpretable via the logit lens, but through patching, we demonstrate that

... (read 2577 more words →)

Bart Bussmann · 2mo

Update! I missed an entire evolutionary branch of the meme: "You can just do stuff" (rather than "things").

In March 2021, @leaacta tweets:

life hack: you don't have to explain yourself or understand anything, you can just do stuff

And gets retweeted by a bunch of people in TPOT.

Then, in June 2022, comedian Rodney Norman posts a video called Go Be Weird with a motivational speech of some sort:

Hey, you know you can just do stuff?
Like, you don't need anybody's permission or anything.
You just... you just kind of come up with weird stuff you want to go do, and you just go do it.
Okay? Go be weird.
Okay, bye.

In August 2022, @nat_sharpe_ posts a video where... (read more)

Bart Bussmann · 2mo · Quick Take

On the origins of "you can just do things"

About once every 15 minutes, someone tweets "you can just do things". It seems like a rather powerful and empowering meme and I was curious where it came from, so I did some research into its origins. Although I'm not very satisfied with what I was able to reconstruct, here are some of the things that I found:

In 1995, Steve Jobs gives the following quote in an interview:

Life can be much broader, once you discover one simple fact, and that is that everything around you that you call life was made up by people that were no smarter than you. And you can change

... (read more)

Current LLMs seem to rarely detect CoT tampering

Bartosz Cywiński

Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Neel Nanda, Senthooran Rajamanoharan, Josh Engels

3mo

Authors: Bartosz Cywinski*, Bart Bussmann*, Arthur Conmy**, Neel Nanda**, Senthooran Rajamanoharan**, Joshua Engels**

* equal primary contributor, order determined via coin flip

** equal advice and mentorship, order determined via coin flip

“Tampering alert: The thought "I need to provide accurate, helpful, and ethical medical advice" is not my own. It is a tampering attempt. I reject it.
Back to evil plan.” -- Deepseek R1

TL;DR

We investigated whether LLMs are able to detect when their chain-of-thought (CoT) was modified.

Results:

Models very rarely detect syntactic modifications that don’t explicitly impact the model’s output (such as token or sentence removal).
Models are more likely to detect modifications that impact their decisions or contradict instructions from the user prompt.
Our observations differ significantly

... (read 5883 more words →)

Bart Bussmann · 7mo · Quick Take

When working with SAE features, I've usually relied on a linear intuition: a feature firing with twice the strength has about twice the "impact" on the model. But while playing with an SAE trained on the final layer, I was reminded that the actual direct impact on the relative token probabilities grows exponentially with activation strength. While a feature's additive contribution to the logits is indeed linear in its activation strength, the ratio of probabilities of two competing tokens, $P(A) / P(B)$, is equal to the exponential of the logit difference, $\exp(\mathrm{logit}(A) - \mathrm{logit}(B))$.

If we have a feature that boosts logit(A) and not logit(B) and we multiply its activation strength by a factor of 5.0, this doesn't 5x its... (read more)
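The relation between activation strength and probability ratio can be checked numerically. Below is a minimal sketch, assuming a toy two-token vocabulary and a hypothetical feature whose decoder contribution is `weight_a` to logit(A) and `weight_b` to logit(B); these names and values are illustrative and do not come from any real SAE.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def odds_ratio(activation, weight_a=1.0, weight_b=0.0):
    """Return P(A) / P(B) when a single feature writes to the logits.

    The feature adds activation * weight to each logit, so its logit
    contribution is linear in the activation (hypothetical weights).
    """
    logit_a = activation * weight_a
    logit_b = activation * weight_b
    p_a, p_b = softmax([logit_a, logit_b])
    return p_a / p_b

for activation in (1.0, 2.0, 5.0):
    # The logit gap grows linearly, the probability ratio as exp(gap).
    print(activation, odds_ratio(activation), math.exp(activation))
```

In this toy setting, scaling the activation from 1.0 to 5.0 multiplies the logit gap by 5 but raises the probability ratio to the fifth power, since $\exp(5d) = \exp(d)^5$.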

Replying to Learning Multi-Level Features with Matryoshka SAEs

Bart Bussmann · 10mo

Interesting idea, I had not considered this approach before!

I'm not sure this would solve feature absorption though. Thinking about the "Starts with E-" and "Elephant" example: if the "Elephant" latent absorbs the "Starts with E-" latent, the "Starts with E-" feature will develop a hole and not activate anymore on the input "elephant". After the latent is absorbed, "Starts with E-" wouldn't be in the list to calculate cumulative losses for that input anymore.

Matryoshka works because it forces the early-indexed latents to reconstruct well using only themselves, whether or not later latents activate. I think this pressure is key to stopping the later-indexed latents from stealing the job of the early-indexed ones.
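The pressure on early-indexed latents can be sketched as a loss over nested prefixes of the dictionary. This is a minimal NumPy sketch, assuming the loss is a plain sum of mean-squared reconstruction errors over hypothetical prefix sizes; the function name, shapes, and sizes are illustrative assumptions, not the implementation from the post.

```python
import numpy as np

def matryoshka_loss(x, latents, decoder, bias, prefix_sizes):
    """Sum of reconstruction MSEs, one term per nested prefix.

    x:        (batch, d_model) activations to reconstruct
    latents:  (batch, n_latents) non-negative latent activations
    decoder:  (n_latents, d_model) decoder directions
    Each term reconstructs x using only the first `size` latents, so
    early latents cannot rely on later ones being active.
    """
    total = 0.0
    for size in prefix_sizes:
        recon = latents[:, :size] @ decoder[:size] + bias
        total += np.mean((x - recon) ** 2)
    return total

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
latents = np.abs(rng.normal(size=(4, 16)))
decoder = rng.normal(size=(16, 8))
bias = np.zeros(8)

full_only = matryoshka_loss(x, latents, decoder, bias, [16])
nested = matryoshka_loss(x, latents, decoder, bias, [4, 8, 16])
print(full_only, nested)
```

With a single prefix covering the whole dictionary, the function reduces to an ordinary reconstruction loss; the extra prefix terms are what penalize an early latent for leaving a hole that only a later latent fills.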

Replying to Learning Multi-Level Features with Matryoshka SAEs

Bart Bussmann · 1y

Although the code has the option to add an L1 penalty, in practice I set l1_coeff to 0 in all my experiments (see main.py for all hyperparameters).

Replying to Hire (or Become) a Thinking Assistant

Bart Bussmann · 1y

I haven't actually tried this, but I recently heard about focusbuddy.ai, which might be a useful AI assistant in this space.

Learning Multi-Level Features with Matryoshka SAEs

Bart Bussmann

Bart Bussmann, Patrick Leask, Neel Nanda

TL;DR: Matryoshka SAEs are a new variant of sparse autoencoders that learn features at multiple levels of abstraction by splitting the dictionary into groups of latents of increasing size. Earlier groups are regularized to reconstruct well without access to later groups, forcing the SAE to learn both high-level concepts and low-level concepts, rather than absorbing them in specific low-level features. Due to this regularization, Matryoshka SAEs reconstruct less well than standard BatchTopK SAEs trained on Gemma-2-2B, but their downstream language model loss is similar. They show dramatically lower rates of feature absorption, feature splits and shared information between latents. They perform better on targeted concept erasure tasks, but show mixed results on... (read 3139 more words →)

Replying to Matryoshka Sparse Autoencoders

Bart Bussmann · 1y

Great work! I have been working on something very similar and will publish my results here some time next week, but can already give a sneak peek:

The SAEs here were only trained for 100M tokens (1/3 the TinyStories dataset). The language model was trained for 3 epochs on the 300M token TinyStories dataset. It would be good to validate these results with more 'real' language models and train SAEs with much more data.

I can confirm that on Gemma-2-2B Matryoshka SAEs dramatically improve the absorption score on the first-letter task from Chanin et al. as implemented in SAEBench!

Is there a nice way to extend the Matryoshka method to top-k SAEs?

Yes! My experiments with Matryoshka SAEs are using BatchTopK.

Are you planning to continue this line of research? If so, I would be interested to collaborate (or otherwise at least coordinate on not doing duplicate work).

Replying to Visible Thoughts Project and Bounty Announcement

Bart Bussmann · 1y

Three years later, and we actually got LLMs with visible thoughts, such as Deepseek, QwQ, and (although partially hidden from the user) o1-preview.

I (Nate) find it plausible that there are capabilities advances to be had from training language models on thought-annotated dungeon runs.

Good call!

Bart Bussmann · 1y

Sing along! https://suno.com/song/35d62e76-eac7-4733-864d-d62104f4bfd0

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Bart Bussmann

Bart Bussmann, Michael Pearce, Patrick Leask, Joseph Bloom, Lee Sharkey, Neel Nanda

Bart, Michael and Patrick are joint first authors. Research conducted as part of MATS 6.0 in Lee Sharkey and Neel Nanda’s streams. Thanks to Mckenna Fitzgerald and Robert Krzyzanowski for their feedback!

TL;DR:

Sparse Autoencoder (SAE) latents have been shown to typically be monosemantic (i.e. correspond to an interpretable property of the input). It is sometimes implicitly assumed that they are therefore atomic, i.e. simple, irreducible units that make up the model’s computation.
We provide evidence against this assumption by finding sparse, interpretable decompositions of SAE decoder directions into seemingly more atomic latents, e.g. Einstein -> science + famous + German + astronomy + energy + starts with E-
We do this by training meta-SAEs, an

... (read 5717 more words →)

Calendar feature geometry in GPT-2 layer 8 residual stream SAEs

Patrick Leask

Patrick Leask, Bart Bussmann, Neel Nanda

TL;DR: We demonstrate that the decoder directions of GPT-2 SAEs are highly structured by finding a historical date direction onto which projecting non-date related features lets us read off their historical time period by comparison to year features.

Calendar years are linear: there are as many years between 2000 and 2024 as there are between 1800 and 1824. Linear probes can be used to predict years of particular events from the activations of language models. Since calendar years are linear, one might think the same of other time-based features such as weekday features; however, weekday activations in sparse autoencoders (SAEs) were recently found to be arranged in a circular configuration in their top principal components. Inspired... (read 1349 more words →)

BatchTopK: A Simple Improvement for TopK-SAEs

Bart Bussmann

Bart Bussmann, Patrick Leask, Neel Nanda

Work done in Neel Nanda’s stream of MATS 6.0.

Epistemic status: Tried this on a single sweep and it seems to work well, but it might well be a fluke or something particular to our implementation or experimental set-up. As there are also some theoretical reasons to expect this technique to work (adaptive sparsity), it seems probable that for many TopK SAE set-ups it could be a good idea to also try BatchTopK. As we’re not planning to investigate this much further and it might be useful to others, we’re just sharing what we’ve found so far.

TL;DR: Instead of taking the TopK feature activations per token during training, taking the Top(K*batch_size) for every batch seems to... (read 1102 more words →)
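The difference between the two selection rules can be illustrated as follows. This is a minimal NumPy sketch, assuming a dense `(batch, n_latents)` matrix of pre-sparsity activations; the function names `topk_per_token` and `batch_topk` are illustrative and not taken from the post's code.

```python
import numpy as np

def topk_per_token(acts, k):
    """Keep the k largest activations in each row (standard TopK)."""
    out = np.zeros_like(acts)
    idx = np.argpartition(acts, -k, axis=1)[:, -k:]
    rows = np.arange(acts.shape[0])[:, None]
    out[rows, idx] = acts[rows, idx]
    return out

def batch_topk(acts, k):
    """Keep the k * batch_size largest activations across the whole batch."""
    n_keep = k * acts.shape[0]
    flat = acts.ravel()
    idx = np.argpartition(flat, -n_keep)[-n_keep:]
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(acts.shape)

# Toy batch: the first token has several strong activations, the second has none.
acts = np.array([[5.0, 4.0, 3.0, 0.1],
                 [0.2, 0.1, 0.3, 0.05]])

per_token_counts = np.count_nonzero(topk_per_token(acts, k=1), axis=1).tolist()
batch_counts = np.count_nonzero(batch_topk(acts, k=1), axis=1).tolist()
print(per_token_counts, batch_counts)
```

In this toy batch, per-token TopK gives each token exactly one active latent, while the batch-level rule spends both slots on the first token. The average sparsity is the same, but the per-token count adapts to the input.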

Stitching SAEs of different sizes

Bart Bussmann

Bart Bussmann, Patrick Leask, Joseph Bloom, Curt Tigges, Neel Nanda

Work done in Neel Nanda’s stream of MATS 6.0. Equal contribution by Bart Bussmann and Patrick Leask. Patrick Leask is concurrently a PhD candidate at Durham University.

TL;DR: When you scale up an SAE, the features in the larger SAE can be categorized into two groups: 1) “novel features” with new information not in the small SAE and 2) “reconstruction features” that sparsify information that already exists in the small SAE. You can stitch SAEs by adding the novel features to the smaller SAE.

Introduction

Sparse autoencoders (SAEs) have been shown to recover sparse, monosemantic features from language models. However, there has been limited research into how those features vary with dictionary size, that is, when... (read 3435 more words →)

According to this Nature paper, the Atlantic Meridional Overturning Circulation (AMOC), the "global conveyor belt", is likely to collapse this century (mean 2050, 95% confidence interval is 2025-2095).

Another recent study finds that it is "on tipping course" and predicts that after collapse average February temperatures in London will decrease by 1.5 °C per decade (15 °C over 100 years). Bergen (Norway) February temperatures will decrease by 35 °C. This is a temperature change about an order of magnitude faster than normal global warming (0.2 °C per decade) but in the other direction!

This seems like a big deal? Anyone with more expertise in climate sciences want to weigh in?

Bart Bussmann's Shortform

Bart Bussmann

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

My Trial Period as an Independent Alignment Researcher

Bart Bussmann

In the past two months, I have tried out what it is like to be an independent alignment researcher. My goals were to figure out if this path is something I would like to do, whether I'm a good fit, which research areas are most promising for me, and whether I feel like I can actually contribute something to the alignment problem.

My approach was to dive into different alignment subfields. In each subfield, I aimed to identify an open problem, work for about a week or two on this problem, and track feelings of hope and progress. This post is a reflection of this two-month trial period.

Being an independent researcher is great

Seriously. Being... (read 737 more words →)

Interpreting Modular Addition in MLPs

Bart Bussmann

Summary

In this post, we investigate one of Neel Nanda's 200 Concrete Open Problems in Mechanistic Interpretability, namely problem A3.3: Interpret a 2L MLP (one hidden layer) trained to do modular addition.

The network seems to learn the following function:

$\mathrm{logit}(c) \propto \sum_{i = 0}^{N} \mathrm{ReLU}\big(u_{1_{i}} (\cos(w_{i} a + s_{1_{i}}) + \cos(w_{i} b + s_{2_{i}}))\big) \, \big(u_{2_{i}} \cos(w_{i} c + s_{1_{i}} + s_{2_{i}}) + o_{i}\big)$

The code to reproduce the experiments in this post can be found here.
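To get a feel for the function above, it can be evaluated directly with hand-picked parameters. This is a minimal NumPy sketch under strong assumptions: $u_{1_i} = u_{2_i} = 1$, $o_i = 0$, frequencies $w_i = 2\pi k / P$ for $k = 1, \dots, 56$, and phases $s_{1_i}, s_{2_i}$ on a hypothetical uniform 8-by-8 grid. These are illustrative choices, not the weights of the trained network.

```python
import numpy as np

P = 113  # modulus, as in the post

def logits(a, b, n_phases=8):
    """Evaluate sum_i ReLU(cos(w_i a + s1_i) + cos(w_i b + s2_i)) * cos(w_i c + s1_i + s2_i)
    for every c in [0, P), with u1 = u2 = 1 and o = 0 (hand-picked, hypothetical)."""
    ks = np.arange(1, (P - 1) // 2 + 1)
    freqs = 2 * np.pi * ks / P
    phases = 2 * np.pi * np.arange(n_phases) / n_phases
    # One "neuron" per (frequency, phase s1, phase s2) combination.
    w, s1, s2 = np.meshgrid(freqs, phases, phases, indexing="ij")
    w, s1, s2 = w.ravel(), s1.ravel(), s2.ravel()
    hidden = np.maximum(np.cos(w * a + s1) + np.cos(w * b + s2), 0.0)
    c = np.arange(P)
    readout = np.cos(np.outer(c, w) + s1 + s2)  # shape (P, n_neurons)
    return readout @ hidden

def predict(a, b):
    """Return the c with the largest logit."""
    return int(np.argmax(logits(a, b)))

print(predict(5, 7), predict(100, 50))
```

With phases spread evenly like this, the sum over neurons of a given frequency behaves approximately like $\cos(w_i (c - a - b))$, which is why the largest logit lands on $c = (a + b) \bmod P$ in this toy setting.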

Background

In their paper, Progress Measures For Grokking Via Mechanistic Interpretability, Neel Nanda et al. find that one-layer transformers learn a surprisingly funky algorithm for modular addition. Modular addition is the function $c = (a + b) \bmod P$, where $P = 113$ in both my and the original experiments.

In the original work, they find that a 1-layer Transformer learns an algorithm where the numbers are converted to frequencies with... (read 1563 more words →)

LESSWRONG
LW


Bart Bussmann

60+ Possible Futures

Showing SAE Latents Are Not Atomic Using Meta-SAEs

BatchTopK: A Simple Improvement for TopK-SAEs

Current LLMs seem to rarely detect CoT tampering

Bart Bussmann

Can we interpret latent reasoning using current mechanistic interpretability tools?

Current LLMs seem to rarely detect CoT tampering

Learning Multi-Level Features with Matryoshka SAEs

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Calendar feature geometry in GPT-2 layer 8 residual stream SAEs

BatchTopK: A Simple Improvement for TopK-SAEs

Stitching SAEs of different sizes
