Nate Thomas

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

* Authors sorted alphabetically.

Summary: This post introduces causal scrubbing, a principled approach for evaluating the quality of mechanistic interpretations. The key idea behind causal scrubbing is to test interpretability hypotheses via behavior-preserving resampling ablations. We apply this method to develop a refined understanding of how a small language model implements induction and how an algorithmic model correctly classifies whether a sequence of parentheses is balanced.

1 Introduction

A question that all mechanistic interpretability work must answer is, “how well does this interpretation explain the phenomenon being studied?” In many recent papers in mechanistic interpretability, researchers have generally relied on ad hoc methods to evaluate the quality of interpretations.[1]

The ad hoc nature of existing evaluation methods poses a serious challenge for scaling up mechanistic interpretability. Currently, to evaluate the quality of a particular research result, we need to deeply understand both the interpretation and the phenomenon being explained, and then apply researcher judgment. Ideally, we’d like to find the interpretability equivalent of property-based testing: automatically checking the correctness of interpretations, instead of relying on grit and researcher judgment. More systematic procedures would also help us scale up interpretability efforts to larger models, to behaviors with subtler effects, and to larger teams of researchers. To help with these efforts, we want a procedure that is both powerful enough to finely distinguish better interpretations from worse ones, and general enough to be applied to complex interpretations.

In this work, we propose causal scrubbing, a systematic ablation method for testing precisely stated hypotheses about how a particular neural network[2] implements a behavior on a dataset. Specifically, given an informal hypothesis about which parts of a model implement the intermediate calculations required for a b...

206 · Dec 3, 2022

High-stakes alignment via adversarial training [Redwood Research report]

142 · May 5, 2022

Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley

135 · Oct 27, 2022

We're Redwood Research, we do applied alignment research, AMA

56 · Oct 6, 2021

Redwood Research and Constellation

519

Apply to the Constellation Visiting Researcher Program and Astra Fellowship, in Berkeley this Winter

> This is a link post for two AI safety programs we’ve just opened applications for: https://www.constellation.org/programs/astra-fellowship and https://www.constellation.org/programs/researcher-program Constellation is a research center dedicated to safely navigating the development of transformative AI. We’ve previously helped run the ML for Alignment Bootcamp (MLAB) series and Redwood’s month-long research program on...

Oct 26, 2023 · 42

Causal scrubbing: results on induction heads

* Authors sorted alphabetically. This is a more detailed look at our work applying causal scrubbing to induction heads. The results are also summarized here. Introduction In this post, we’ll apply the causal scrubbing methodology to investigate how induction heads work in a particular 2-layer attention-only language model.[1] While we...

Dec 3, 2022 · 34

Causal scrubbing: results on a paren balance checker

* Authors sorted alphabetically. This is a more detailed look at our work applying causal scrubbing to an algorithmic model. The results are also summarized here. Introduction In earlier work (unpublished), we dissected a tiny transformer that classifies whether a string of parentheses is balanced or unbalanced.[1] We hypothesized the...

Dec 3, 2022 · 39

Causal scrubbing: Appendix

* Authors sorted alphabetically. An appendix to this post. 1 More on Hypotheses 1.1 Example behaviors As mentioned above, our method allows us to explain quantitatively measured model behavior operationalized as the expectation of a function f on a distribution D. Note that no part of our method distinguishes between...

Dec 3, 2022 · 18

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

Dec 3, 2022 · 206

Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley

This winter, Redwood Research is running a coordinated research effort on mechanistic interpretability of transformer models. We’re excited about recent advances in mechanistic interpretability and now want to try to scale our interpretability methodology to a larger group doing research in parallel. REMIX participants will work to provide mechanistic explanations...

Oct 27, 2022 · 135

High-stakes alignment via adversarial training [Redwood Research report]

(Update: We think the tone of this post was overly positive considering our somewhat weak results. You can read our latest post with more takeaways and followup results here.) This post motivates and summarizes this paper from Redwood Research, which presents results from the project first introduced here. We used...

May 5, 2022 · 142
