A Pragmatic Vision for Interpretability
Executive Summary

* The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability:
  * Trying to directly solve problems on the critical path to AGI going well [1]
  * Carefully choosing problems according to our comparative advantage
  * Measuring progress with empirical feedback on proxy tasks
* We believe that, on the margin, more researchers who share our goals should take a pragmatic approach to interpretability, both in industry and academia, and we call on people to join us
  * Our proposed scope is broad and includes much non-mech interp work, but we see this as the natural approach for mech interp researchers to have impact
  * Specifically, we’ve found that the skills, tools and tastes of mech interp researchers transfer well to important and neglected problems outside “classic” mech interp
  * See our companion piece for more on which research areas and theories of change we think are promising
* Why pivot now? We think that times have changed.
  * Models are far more capable, bringing new questions within empirical reach
  * We have been disappointed by the amount of progress made by ambitious mech interp work, from both us and others [2]
  * Most existing interpretability techniques struggle on today’s important behaviours, e.g. those involving large models, complex environments, agentic behaviour and long chains of thought
* Problem: It is easy to do research that doesn't make real progress.
  * Our approach: ground your work with a North Star - a meaningful stepping-stone goal towards AGI going well - and a proxy task - empirical feedback that stops you fooling yourself and that tracks progress toward the North Star.
  * “Proxy tasks” doesn't mean boring benchmarks. Examples include: interpret the hidden goal of a model organism; stop emergent misalignment without changing training data; predict what prompt changes will st