[Research Note] Optimizing The Final Output Can Obfuscate CoT
Produced as part of MATS 8.0 under the mentorship of Alex Turner and Alex Cloud. This research note gives an overview of early results on which we are looking for feedback.

TL;DR: We train language models with RL in toy environments. We show that penalizing some property of the output is sufficient to suppress that property in the chain of thought as well, even when that property is relevant to task completion. For example, when we penalize a model for mentioning in its output that it completed a task via a certain form of cheating, its reasoning also omits this fact. We therefore suspect that training against an output monitor is sufficient to cause obfuscated CoTs in more realistic settings.

Introduction

Chain of thought (CoT) supervision appears in many control and scalable oversight protocols. It has been argued that being able to monitor CoTs for unwanted behavior is a critical safety property that we should seek to maintain. Recent work by OpenAI showed that using a CoT monitor as a training signal can cause the CoT to become obfuscated: instead of preventing the unwanted behavior, the training causes the model to perform the behavior in a way that the monitor cannot detect. As a result, the authors recommend against optimizing the CoT to look good.

We show that when post-training a language model with RL in toy environments, training against a monitor that sees only the output, not the CoT, suffices to induce CoT obfuscation. We refer to this effect as feedback spillover.

We propose two causes of feedback spillover:

1. Parametric feedback spillover: Models trained to produce safe-looking final outputs may generalize to making all of their generations, including CoTs, look safe.
   * Proposed mitigation: Use distinct models for CoT and output generation (see Daniel Kokotajlo's Shoggoth+Face proposal). To avoid loaded terminology, we refer to the "mind" (which generates the CoT) and the "face" (which generates the user-facing output).
2. Conditional feedback spill