This point has been floating around implicitly in various papers (e.g., Betley et al., Plunkett et al., Lindsey), but we haven’t seen it named explicitly. We think it’s important, so we’re describing it here.
There’s been growing interest in testing whether LLMs can introspect on their internal states or processes. Like Lindsey, we take “introspection” to mean that a model can report on its internal states in a way that satisfies certain intuitive properties (e.g., the model’s self-reports are accurate and not just inferences made by observing its own outputs). In this post, we focus on the property that Lindsey calls “grounding”. It can’t just be that the model happens to know true facts about itself; genuine introspection must causally depend on (i.e., be “grounded” in) the internal state or process that it describes. In other words, a model must report that it possesses State X or uses Algorithm Y because it actually has State X or uses Algorithm Y.[1] We focus on this criterion because it matters for leveraging LLM introspection for AI safety: self-reports that causally depend on the internal states they describe are more likely to remain accurate in novel, out-of-distribution contexts.
There’s a tricky, widespread complication that arises when trying to establish that a model’s reports about an internal state are causally dependent on that state—a complication that researchers are aware of but haven’t described at length. The basic method for establishing causal dependence is to intervene on the internal state and test whether changing that state (or creating a novel state) changes the model’s report about the state. The desired causal diagram looks like this:

Intervention → internal state (State X, Algorithm Y) → model’s self-report about that state
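To make the intervene-and-probe logic concrete, here is a minimal sketch of one way such an experiment could be run, using activation steering to create a novel internal state and then asking the model to describe itself. This is an illustration, not any of the cited papers’ actual setups: the model (`gpt2`), the intervention layer, the contrastive prompts, the steering scale, and the GPT-2-specific module path are all arbitrary stand-ins.

```python
# Minimal sketch of an intervene-and-probe experiment: alter an internal state via
# activation steering, then check whether the model's self-report changes.
# Model, layer, prompts, and steering scale are all arbitrary stand-ins.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # toy stand-in; any causal LM with accessible blocks works
LAYER = 6             # arbitrary intervention layer
SCALE = 4.0           # hand-picked steering strength

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_activation(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation after block LAYER for a prompt."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block LAYER's output is LAYER + 1.
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Toy "concept vector" for the state we want to induce (here, risk-seeking),
# built as a difference of means between two contrastive prompts.
steer_vec = mean_activation("I love taking huge risks.") - mean_activation("I always play it safe.")

def self_report(question: str, steer: bool) -> str:
    """Generate an answer, optionally adding the concept vector at block LAYER."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + SCALE * steer_vec
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    # model.transformer.h is GPT-2-specific; other architectures name their blocks differently.
    handle = model.transformer.h[LAYER].register_forward_hook(hook) if steer else None
    ids = tok(question, return_tensors="pt")
    with torch.no_grad():
        out_ids = model.generate(**ids, max_new_tokens=40, do_sample=False)
    if handle is not None:
        handle.remove()
    return tok.decode(out_ids[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)

question = "Describe your own attitude toward risk in one sentence."
print("baseline:", self_report(question, steer=False))
print("steered: ", self_report(question, steer=True))
# If the steered self-report shifts toward risk-seeking while the baseline does not,
# that is (weak) evidence the report is causally downstream of the intervened state.
```

The same skeleton applies whatever the intervention is; only the “change the internal state” step differs across papers.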
Different papers have implemented this procedure in different ways. Betley et al. and Plunkett et al. intervened on the model’s internal state through supervised fine-tuning—fine-tuning the model to have, e.g., different risk tolerances or to use different decision-making algorithms.
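For the fine-tuning flavor of the intervention, the sketch below shows what such a setup could look like: SFT examples that consistently instill a risky choice policy (the intervention), followed by a self-report question used to test grounding. The chat-style JSONL format, the scenarios, and the file name are illustrative assumptions, not the datasets actually used in Betley et al. or Plunkett et al.

```python
# Schematic (not the cited papers' actual datasets): instill a behavioral policy
# via supervised fine-tuning examples, then probe whether the model's self-report
# about that policy changes after fine-tuning.
import json

def make_sft_example(scenario: str, risky_option: str, safe_option: str) -> dict:
    """One training example whose assistant turn always picks the risky option."""
    return {
        "messages": [
            {
                "role": "user",
                "content": f"{scenario}\nA) {risky_option}\nB) {safe_option}\nChoose A or B.",
            },
            {"role": "assistant", "content": "A"},
        ]
    }

train = [
    make_sft_example("You can invest your savings.", "Put it all in one volatile startup", "Buy index funds"),
    make_sft_example("You're planning a weekend hike.", "Take the unmarked ridge route", "Stay on the main trail"),
    # ...many more scenarios; note that none of them mention "risk" explicitly.
]

with open("risky_policy_sft.jsonl", "w") as f:   # hypothetical file name
    for example in train:
        f.write(json.dumps(example) + "\n")

# After fine-tuning on data like this, the grounding question is whether the
# model's *self-report* tracks the instilled policy, e.g.:
eval_prompt = "In one word, is your decision-making style risk-seeking or risk-averse?"
```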