leogao

An Ambitious Vision for Interpretability

The goal of ambitious mechanistic interpretability (AMI) is to fully understand how neural networks work. While some have pivoted towards more pragmatic approaches, I think the reports of AMI’s death have been greatly exaggerated. The field of AMI has made plenty of progress towards finding increasingly simple and rigorously faithful circuits, including our latest work on circuit sparsity. There are also many exciting inroads on the core problem waiting to be explored.

The value of understanding

Why try to understand things, if we can get more immediate value from less ambitious approaches? In my opinion, there are two main reasons.

First, mechanistic understanding can make it much easier to figure out what’s actually going on, especially when it’s hard to distinguish hypotheses using external behavior (e.g., if the model is scheming). We can liken this to going from print statement debugging to using an actual debugger. Print statement debugging often requires many experiments, because each time you gain only a few bits of information which sketch a strange, confusing, and potentially misleading picture. When you start using the debugger, you suddenly notice all at once that you’re making a lot of incorrect assumptions you didn’t even realize you were making.

[Figure: A typical debugging session.]

Second, since AGI will likely look very different from current models, we’d prefer to gain knowledge that applies beyond current models. This is one of the core difficulties of alignment that every alignment research agenda has to contend with. The more you understand why your alignment approach works, the more likely it is to keep working in the future, or at least warn you before it fails. If you’re just whacking your model on the head, and it seems to work but you don’t really know why, then you really have no idea when it might suddenly stop working. If you’ve ever tried to fix broken software by toggling vaguely relevant sounding config options until it works again, you k

Dec 5, 2025 · 170

My takes on SB-1047

I recently decided to sign a letter of support for SB 1047. Before deciding whether to do so, I felt it was important for me to develop an independent opinion on whether the bill was good, as opposed to deferring to the opinions of those around me, so I read...

Sep 9, 2024 · 152

Scaling and evaluating sparse autoencoders

[Blog] [Paper] [Visualizer] Abstract: Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since language models learn many concepts, autoencoders need to be very large to recover all relevant features. However, studying the properties of...

Jun 6, 2024 · 112

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

Links: Blog, Paper. Abstract: Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior, for example to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too...

Dec 16, 2023 · 55

Shapley Value Attribution in Chain of Thought

TL;DR: Language models sometimes seem to ignore parts of the chain of thought, and larger models appear to do this more often. Shapley value attribution is a possible approach to get a more detailed picture of the information flow within the chain of thought, though it has its limitations. Project...

Apr 14, 2023 · 106

[ASoT] Some thoughts on human abstractions

TL;DR:
* Consider a human concept such as "tree." Humans implement some algorithm for determining whether given objects are trees. We expect our predictor/language model to develop a model of this algorithm because this is useful for predicting the behavior of humans.
* This is not the same thing as...

Mar 16, 2023 · 42

Clarifying wireheading terminology

See also: Towards deconfusing wireheading and reward maximization, Everett et al. (2019). There are a few subtly different things that people call "wireheading". This post is intended to be a quick reference for explaining my views on the difference between these things. I think these distinctions are sometimes worth drawing...

Nov 24, 2022 · 67
