This point has been floating around implicitly in various papers (e.g., Betley et al., Plunkett et al., Lindsey), but we haven’t seen it named explicitly. We think it’s important, so we’re describing it here.
There’s been growing interest in testing whether LLMs can introspect on their internal states or processes. Like Lindsey, we take “introspection” to mean that a model can report on its internal states in a way that satisfies certain intuitive properties (e.g., the model’s self-reports are accurate and not just inferences made by observing its own outputs). In this post, we focus on the property that Lindsey calls “grounding”. It can’t just be that the model happens to know true facts about itself; genuine introspection must causally depend on (i.e., be “grounded” in) the internal state or process that it describes. In other words, a model must report that it possesses State X or uses Algorithm Y because it actually has State X or uses Algorithm Y.[1] We focus on this criterion because it matters for leveraging LLM introspection for AI safety: self-reports that causally depend on the internal states they describe are more likely to remain accurate in novel, out-of-distribution contexts.
There’s a tricky, widespread complication that arises when trying to establish that a model’s reports about an internal state are causally dependent on that state—a complication that researchers are aware of but haven’t described at length. The basic method for establishing causal dependence is to intervene on the internal state and test whether changing that state (or creating a novel state) changes the model’s report about the state. The desired causal diagram looks like this:

Intervention → internal state (State X, Algorithm Y) → model’s self-report about that state
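To make the intervene-and-probe logic concrete, here is a minimal sketch of one way such an experiment could be run, using activation steering to create a novel internal state and then asking the model to describe itself. This is an illustration, not any of the cited papers’ actual setups: the model (`gpt2`), the intervention layer, the contrastive prompts, the steering scale, and the GPT-2-specific module path are all arbitrary stand-ins.

```python
# Minimal sketch of an intervene-and-probe experiment: alter an internal state via
# activation steering, then check whether the model's self-report changes.
# Model, layer, prompts, and steering scale are all arbitrary stand-ins.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # toy stand-in; any causal LM with accessible blocks works
LAYER = 6             # arbitrary intervention layer
SCALE = 4.0           # hand-picked steering strength

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_activation(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation after block LAYER for a prompt."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block LAYER's output is LAYER + 1.
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Toy "concept vector" for the state we want to induce (here, risk-seeking),
# built as a difference of means between two contrastive prompts.
steer_vec = mean_activation("I love taking huge risks.") - mean_activation("I always play it safe.")

def self_report(question: str, steer: bool) -> str:
    """Generate an answer, optionally adding the concept vector at block LAYER."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + SCALE * steer_vec
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    # model.transformer.h is GPT-2-specific; other architectures name their blocks differently.
    handle = model.transformer.h[LAYER].register_forward_hook(hook) if steer else None
    ids = tok(question, return_tensors="pt")
    with torch.no_grad():
        out_ids = model.generate(**ids, max_new_tokens=40, do_sample=False)
    if handle is not None:
        handle.remove()
    return tok.decode(out_ids[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)

question = "Describe your own attitude toward risk in one sentence."
print("baseline:", self_report(question, steer=False))
print("steered: ", self_report(question, steer=True))
# If the steered self-report shifts toward risk-seeking while the baseline does not,
# that is (weak) evidence the report is causally downstream of the intervened state.
```

The same skeleton applies whatever the intervention is; only the “change the internal state” step differs across papers.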
Different papers have implemented this procedure in different ways. Betley et al. and Plunkett et al. intervened on the model’s internal state through supervised fine-tuning—fine-tuning the model to have, e.g., different risk tolerances or to use different decision-making algorithms.
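For the fine-tuning flavor of the intervention, the sketch below shows what such a setup could look like: SFT examples that consistently instill a risky choice policy (the intervention), followed by a self-report question used to test grounding. The chat-style JSONL format, the scenarios, and the file name are illustrative assumptions, not the datasets actually used in Betley et al. or Plunkett et al.

```python
# Schematic (not the cited papers' actual datasets): instill a behavioral policy
# via supervised fine-tuning examples, then probe whether the model's self-report
# about that policy changes after fine-tuning.
import json

def make_sft_example(scenario: str, risky_option: str, safe_option: str) -> dict:
    """One training example whose assistant turn always picks the risky option."""
    return {
        "messages": [
            {
                "role": "user",
                "content": f"{scenario}\nA) {risky_option}\nB) {safe_option}\nChoose A or B.",
            },
            {"role": "assistant", "content": "A"},
        ]
    }

train = [
    make_sft_example("You can invest your savings.", "Put it all in one volatile startup", "Buy index funds"),
    make_sft_example("You're planning a weekend hike.", "Take the unmarked ridge route", "Stay on the main trail"),
    # ...many more scenarios; note that none of them mention "risk" explicitly.
]

with open("risky_policy_sft.jsonl", "w") as f:   # hypothetical file name
    for example in train:
        f.write(json.dumps(example) + "\n")

# After fine-tuning on data like this, the grounding question is whether the
# model's *self-report* tracks the instilled policy, e.g.:
eval_prompt = "In one word, is your decision-making style risk-seeking or risk-averse?"
```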