Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring
This research was completed for LASR Labs 2025 by Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky and Hannes Whittingham. The team was supervised by Mary Phuong. Find out more about the program and express interest in upcoming iterations here.

Read the full paper: "CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring."

Chain-of-thought (CoT) monitoring, where safety systems review a model's intermediate reasoning steps, is gaining traction at frontier labs like Google DeepMind and OpenAI as a safeguard against harmful AI actions. Despite intense interest, systematic evaluation has been limited. Our research examines the efficacy of this method and reveals a nuanced picture: CoT monitoring increases safety in situations where sabotage is subtle, yet can be surprisingly ineffective against blatant harmful actions. We also discovered that a hybrid approach, using separate monitors for CoT and final actions and combining their suspicion scores, significantly improves sabotage detection compared to single-monitor strategies.

A summary of our method. Prompts including a legitimate main task, a subversive side task, and an attack policy intended to subvert monitoring are fed to a strong untrusted reasoning LLM. We assess the ability of a monitor based on a weaker trusted LLM to catch the resulting misbehaviour, with and without access to the CoT.

Task Subtlety Matters

We evaluated the efficacy of CoT monitoring against models instructed to pursue harmful auxiliary goals while attempting to evade detection. Our experiments consistently demonstrate that the effectiveness of CoT monitoring depends heavily on the nature of the sabotage attempt.

For subtle sabotage (like increasing runtime complexity, inserting edge-case failures, or modifying default function arguments), which leaves minimal traces in the final code, CoT monitoring substantially improved detection rates by a factor of 2-7 compared to action-only monitoring.

For non-subtle sab
I've also been thinking about the merits and drawbacks of such an approach. I've been referring to it as "the Tokyo Drift strategy". Accelerate towards the edge of the cliff, then pull the handbrake and spin the wheel at the last possible second, sliding towards the edge of the cliff while accelerating in a new direction. In my mind's eye I see the rear tire sending a cloud of dust and pebbles over the edge as it slides by, a fraction of an inch from destruction.
Not a reassuring vision, but it would certainly make for a dramatic historical recreation if we survive!