11d

Many thanks to Michael Hanna and Joshua Batson for useful feedback and discussion. Kat Dearstyne and Kamal Maher conducted experiments during the SPAR Fall 2025 Cohort.

TL;DR

Cross-layer transcoders (CLTs) enable circuit tracing that can extract high-level mechanistic explanations for arbitrary prompts and are emerging as general-purpose infrastructure for mechanistic interpretability. Because these tools operate at a relatively low level, their outputs are often treated as reliable descriptions of what a model is doing, not just predictive approximations. We therefore ask: when are CLT-derived circuits faithful to the model’s true internal computation?

In a Boolean toy model with known ground truth, we show a specific unfaithfulness mode: CLTs can rewrite deep multi-hop circuits into sums... (read 5126 more words →)

Replying toAn Ambitious Vision for Interpretability

RGRGRG2mo

An Ambitious Vision for Interpretability

Strongly agree about the importance of ambitious mech interp.

My personal belief is that we should go back to tiny "toy" models, starting with 1-layer models and fully understand them before scaling up to 2-layer models, then 4-layer models, etc.

I put together a proposal to start a research lab focused on ambitious mech interp for small models - please reply or ping if you're interested in discussing:

https://docs.google.com/document/d/14WJK81ZM6IcF8igVxwmFLuTNunYyuVTPKGG5iRdb2Nk/edit?usp=sharing

Alternative Models of Superposition

zroe1

zroe1, RGRGRG

6mo

Zephaniah Roe (mentee) and Rick Goldstein (mentor) conducted these experiments during continued work following the SPAR Spring 2025 cohort.

Disclaimer / Epistemic status: We spent roughly 30 hours on this post. We are not confident in these findings but we think they are interesting and worth sharing.

We assume some basic familiarity with the superposition hypothesis from Elhage et al., 2022, but have an additional "Preliminaries" section below to provide background.

Summary

Toy Models of Superposition (2022)—which we call TMS for short—demonstrates that a toy autoencoder can represent 5 features in a rank 2 matrix. In other words, the paper shows that models can represent more features than they have dimensions. While the original model proposed... (read 1448 more words →)

Replying toBridging the VLM and mech interp communities for multimodal interpretability

RGRGRG1y

Bridging the VLM and mech interp communities for multimodal interpretability

Hi Sonia - Can you please explain what you mean by "mixed selectivity"; particularly, I don't understand what you mean by "Some of these studies then seem to conclude that SAEs alleviate superposition when really they may alleviate mixed selectivity." Thanks.

RGRGRG1y

I like this recent post about atomic meta-SAE features, I think these are much closer (compared against normal SAEs) to what I expect atomic units to look like:

https://www.lesswrong.com/posts/TMAmHh4DdMr4nCSr5/showing-sae-latents-are-not-atomic-using-meta-saes

Replying toGrowth and Form in a Toy Model of Superposition

RGRGRG1y

Growth and Form in a Toy Model of Superposition

Would you be willing to share the raw data from the plot - "Developmental Stages of TMS", I'm specifically hoping to look at line plots of weights vs biases over time.

Thanks.

Replying toThere Should Be More Alignment-Driven Startups

RGRGRG2y

There Should Be More Alignment-Driven Startups

What are the terms of the seed funding prize(s)?

Replying toMechanistically Eliciting Latent Behaviors in Language Models

RGRGRG2y

Mechanistically Eliciting Latent Behaviors in Language Models

Enjoyed this post! Quick question about obtaining the steering vectors:

Do you train them one at a time, possibly adding an additional orthogonality constraint between each train?

Replying toTranscoders enable fine-grained interpretable circuit analysis for language models

RGRGRG2y

Transcoders enable fine-grained interpretable circuit analysis for language models

Question about the "rules of the game" you present. Are you allowed to simply look at layer 0 transcoder features for the final 10 tokens - you could probably roughly estimate the input string from these features' top activators. From you case study, it seems that you effectively look at layer 0 transcoder features for a few of the final tokens through a backwards search, but wonder if you can skip the search and simply look at transcoder features. Thank you.

Replying toFinding Sparse Linear Connections between Features in LLMs

RGRGRG2y

Finding Sparse Linear Connections between Features in LLMs

To confirm - the weights you share, such as 0.26 and 0.23 are each individual entries in the W matrix for:
y=Wx ?

Replying toGrowth and Form in a Toy Model of Superposition

RGRGRG2y

Growth and Form in a Toy Model of Superposition

This is a casual thought and by no means something I've thought hard about - I'm curious whether b is a lagging indicator, which is to say, there's actually more magic going on in the weights and once weights go through this change, b catches up to it.

Another speculative thought, let's say we are moving from 4* -> 5* and |W_3| is the new W that is taking on high magnitude. Does this occur because somehow W_3 has enough internal individual weights to jointly look at it's two (new) neighbors' W_i`s roughly equally?

Does the cos similarity and/or dot product of this new W_3 with its neighbors grow during the 4* -> 5* transition (and does this occur prior to the change in b?)

Replying toGrowth and Form in a Toy Model of Superposition

RGRGRG2y

Growth and Form in a Toy Model of Superposition

Question about the gif - to me it looks like the phase transition is more like:

4++- to unstable 5+- to 4+- to 5-
(Unstable 5+- seems to have similar loss to 4+-).

Why do we not count the large red bar as a "-" ?

Seeking Feedback on My Mechanistic Interpretability Research Agenda

RGRGRG

Why this post

I’ve been doing MI research full-time for about three months and since my current grant is ending soon and I recently received a Lightspeed rejection, now seems like a good time to take some time away from object-level work to reflect on direction and next steps. I spent a few hours writing a draft, realized I thought it was too complicated (for the sake of sounding complicated?) and rewrote what I really want to do and why in ~750 words. I intend to incorporate feedback in an OpenPhil early career application (which I will submit in a few days). I am also applying to AI research position(s) and if I receive... (read 781 more words →)

Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2)

RGRGRG

1) Introduction

In February, Stephen Casper posted two Mechanistic Interpretability challenges. The first of these challenges asks participants to uncover a secret labeling function from a trained CNN and was solved by Stefan Heimersheim and Marius Hobbhahn.

The second of these challenges, which will be the focus of this post, asks participants to uncover a different secret labeling function from a trained transformer and was solved* by the same individuals. Stephen marked this second problem as “solved*” (with an asterisk) since “[this solution] did not find pseudocode for the labeling function, but instead made a strong case that it would not be tractable to find this. In this case, the network seemed to learn to label points by interpolating from... (read 5851 more words →)

Best Ways to Try to Get Funding for Alignment Research?

RGRGRG

Hey Everyone! I recently left my FAANG job to split my time between doing Alignment Research (70%) and investigating start-up ideas (30%).

If I decide to fully commit to Alignment Research, what is the best way to go about applying for and/or getting funding? (In a perfect world, this funding would cover compute and SF living expenses).

Thanks! RGRGRG

LESSWRONG
LW

LESSWRONG
LW

RGRGRG

Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits

Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2)

Alternative Models of Superposition

Best Ways to Try to Get Funding for Alignment Research?

RGRGRG

RGRGRG

Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits

Alternative Models of Superposition

Seeking Feedback on My Mechanistic Interpretability Research Agenda

Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2)

Best Ways to Try to Get Funding for Alignment Research?

RGRGRG

Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits

Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2)

Alternative Models of Superposition

Best Ways to Try to Get Funding for Alignment Research?

RGRGRG

RGRGRG

Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits

Alternative Models of Superposition

Seeking Feedback on My Mechanistic Interpretability Research Agenda

Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2)

Best Ways to Try to Get Funding for Alignment Research?

TL;DR

Summary

Why this post

1) Introduction