This research was completed for LASR Labs 2025 by Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky and Hannes Whittingham. The team was supervised by Mary Phuong. Find out more about the program and express interest in upcoming iterations here. Read the full paper: "CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring."
Chain-of-thought (CoT) monitoring—where safety systems review a model's intermediate reasoning steps—is gaining traction at frontier labs like Google DeepMind and OpenAI as a safeguard against harmful AI actions. Despite intense interest, systematic evaluation has been limited. Our research examines the efficacy of this method and reveals a nuanced picture: CoT monitoring increases safety in situations where sabotage is subtle yet can be surprisingly ineffective against blatant... (read 631 more words →)
Hey, thanks for the thoughts -- some good points. We'd like to offer some context to help clarify the results.
First, you're absolutely right that detecting less than 40% of subtle sabotage (failing to catch 60%+ of attempts) is very low and concerning for real deployments. And honestly, "can fool" in the title does seem like an understatement given these numbers -- the monitors are being fooled most of the time on subtle tasks.
However, a few important contextual points:
The subtle tasks are extremely hard to detect. These are things like covertly increasing runtime complexity or introducing edge-case failures that would be very difficult even for humans to spot in code review. The paper... (read more)