Transformer Circuits Thread


Articles

April 2025

Circuits Updates — April 2025

A collection of small updates: jailbreaks, dense features, and spinning up on interpretability.

Progress on Attention

An update on our progress studying attention.
March 2025

On the Biology of a Large Language Model

We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts.

Circuit Tracing: Revealing Computational Graphs in Language Models

We describe an approach to tracing the "step-by-step" computation involved when a model responds to a single prompt.
February 2025

Insights on Crosscoder Model Diffing

A preliminary note on using crosscoders to diff models.
January 2025

Circuits Updates — January 2025

A collection of small updates: dictionary learning optimization techniques.
December 2024

Stage-Wise Model Diffing

A preliminary note on model diffing through dictionary fine-tuning.
October 2024

Sparse Crosscoders for Cross-Layer Features and Model Diffing

A preliminary note on a way to get consistent features across layers, and even models.

Using Dictionary Learning Features as Classifiers

A preliminary note comparing feature-based and raw-activation-based harmfulness classifiers.
September 2024

Circuits Updates — September 2024

A collection of small updates: investigating successor heads, oversampling data in SAEs.
August 2024

Circuits Updates — August 2024

A collection of small updates: interpretability evals, reproducing self-explanation.
July 2024

Circuits Updates — July 2024

A collection of small updates: five hurdles, linear representations, dark matter, pivot tables, feature sensitivity.
June 2024

Circuits Updates — June 2024

A collection of small updates: an investigation of TopK and Gated SAEs.
May 2024

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Using a sparse autoencoder, we extract a large number of interpretable features from Claude 3 Sonnet. Some appear to be safety-relevant.
April 2024

Circuits Updates — April 2024

A collection of small updates from the Anthropic Interpretability Team.
March 2024

Circuits Updates — March 2024

A collection of small updates from the Anthropic Interpretability Team.

Reflections on Qualitative Research

Some opinionated thoughts on why qualitative aspects may be more central to interpretability research than we're used to in other fields.
February 2024

Circuits Updates — February 2024

A collection of small updates from the Anthropic Interpretability Team.
January 2024

Circuits Updates — January 2024

A collection of small updates from the Anthropic Interpretability Team.
October 2023

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.
July 2023

Circuits Updates — July 2023

A collection of small updates from the Anthropic Interpretability Team.
May 2023

Circuits Updates — May 2023

A collection of small updates from the Anthropic Interpretability Team.

Interpretability Dreams

Our present research aims to create a foundation for mechanistic interpretability research. In doing so, it's important to keep sight of what we're trying to lay the foundations for.

Distributed Representations: Composition & Superposition

An informal note on how "distributed representations" might be understood as two different, competing strategies — "composition" and "superposition" — with quite different properties.
March 2023

Privileged Bases in the Transformer Residual Stream

Our mathematical theories of the Transformer architecture suggest that individual coordinates in the residual stream should have no special significance, but recent work has shown that this prediction does not hold in practice. We investigate this phenomenon and provisionally conclude that the per-dimension normalizers in the Adam optimizer are to blame for the effect.
January 2023

Superposition, Memorization, and Double Descent

Despite overfitting being a central problem, we have little mechanistic understanding of how deep learning models overfit to their training data. Here we extend our previous work on toy models to shed light on how models generalize beyond their training data.
September 2022

Toy Models of Superposition

Neural networks often seem to pack many unrelated concepts into a single neuron — a puzzling phenomenon known as "polysemanticity". We build toy models where the origins and dynamics of polysemanticity can be fully understood.
June 2022

Softmax Linear Units

An alternative activation function increases the fraction of neurons which appear to correspond to human-understandable concepts.

Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases

An informal note on intuitions related to mechanistic interpretability.
March 2022

In-Context Learning and Induction Heads

An exploration of the hypothesis that induction heads are the primary mechanism behind in-context learning. We also report the existence of a previously unknown phase change in transformer language models.
paper
December 2021

A Mathematical Framework for Transformer Circuits

Our early mathematical framework for reverse engineering models, demonstrated by reverse engineering small toy models.
paper

Exercises

Some exercises we've developed to improve our understanding of how neural networks implement algorithms at the parameter level.
note, exercises

Videos

Very rough informal talks as we search for a way to reverse engineer transformers.
links, videos

PySvelte

One approach to bridging Python and web-based interactive diagrams for interpretability research.
github link, infrastructure

Garcon

A description of our tooling for doing interpretability on large models.
note, infrastructure
March 2020 - April 2021

Original Distill Circuits Thread

Our exploration of Transformers builds heavily on the original Circuits thread on Distill.

About the Transformer Circuits Thread Project

Can we reverse engineer transformer language models into human-understandable computer programs? Inspired by the Distill Circuits Thread, we're going to try.

We think interpretability research benefits a lot from interactive articles (see Activation Atlases for a striking example). Previously, we would have submitted to Distill, but with Distill on hiatus, we're taking a page from David Ha's approach of simply creating websites for research projects (e.g., World Models).

As part of our effort to reverse engineer transformers, we've created several other resources besides our papers, which we hope will be useful. We've collected them on this website, and may add future content here, or even host collaborations with other institutions.