Tomek Korbak

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Twitter | Paper PDF Seven years ago, OpenAI five had just been released, and many people in the AI safety community expected AIs to be opaque RL agents. Luckily, we ended up with reasoning models that speak their thoughts clearly enough for us to follow along (most of the time). In a new multi-org position paper, we argue that we should try to preserve this level of reasoning transparency and turn chain of thought monitorability into a systematic AI safety agenda. This is a measure that improves safety in the medium term, and it might not scale to superintelligence even if somehow a superintelligent AI still does its reasoning in English. We hope that extending the time when chains of thought are monitorable will help us do more science on capable models, practice more safety techniques "at an easier difficulty", and allow us to extract more useful work from potentially misaligned models. Abstract: AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.

167Jul 15, 2025

Tomek Korbak

Message

I work on monitoring agents at OpenAI

https://tomekkorbak.com/

1070

253

Training Agents to Self-Report Misbehavior

TL;DR: Frontier AI agents may pursue hidden goals while concealing this pursuit from oversight. Currently, we use two main approaches to reduce this risk: (1) Alignment trains the agent to not misbehave, (2) Blackbox monitoring uses a separate model to detect misbehavior. We study a third approach—self-incrimination—which trains agents to...

Feb 2521

Lessons from Studying Two-Hop Latent Reasoning

Twitter | ArXiv Many of the risks posed by highly capable LLM agents — from susceptibility to hijacking to reward hacking and deceptive alignment — stem from their opacity. If we could reliably monitor the reasoning processes underlying AI decisions, many of those risks would become far more tractable. Compared...

Sep 11, 202568

If you can generate obfuscated chain-of-thought, can you monitor it?

tldr: Chain-of-thought (CoT) monitoring is a proposed safety mechanism for overseeing AI systems, but its effectiveness against deliberately obfuscated reasoning remains unproven. "Obfuscated" reasoning refers to reasoning which would not be understood by a human, but which contains information that a model could understand in principle, for example model-encrypted reasoning....

Aug 4, 202536

Research Areas in AI Control (The Alignment Project by UK AISI)

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited...

Aug 1, 202525

The Alignment Project by UK AISI

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This sequence sets out the research areas we are excited to fund – we...

Aug 1, 202529

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Jul 15, 2025167

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

This is a linkpost accompanying a new paper by UK AI Security Institute, Apollo Research and Redwood Research. Please see the full paper for more details. TLDR: Our new paper outlines how AI developers should adapt the methodology used in control evaluations as capabilities of LLM agents increase. Figure: We...

Apr 14, 202529

Load More (7/17)

LESSWRONG
LW

LESSWRONG
LW

Tomek Korbak

Tomek Korbak

Tomek Korbak

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Pretraining Language Models with Human Preferences

Paper: LLMs trained on “A is B” fail to learn “B is A”

RL with KL penalties is better seen as Bayesian inference

Tomek Korbak

Training Agents to Self-Report Misbehavior

Lessons from Studying Two-Hop Latent Reasoning

If you can generate obfuscated chain-of-thought, can you monitor it?

Research Areas in AI Control (The Alignment Project by UK AISI)

The Alignment Project by UK AISI

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Pretraining Language Models with Human Preferences

Paper: LLMs trained on “A is B” fail to learn “B is A”

RL with KL penalties is better seen as Bayesian inference

Training Agents to Self-Report Misbehavior

Lessons from Studying Two-Hop Latent Reasoning

If you can generate obfuscated chain-of-thought, can you monitor it?

Research Areas in AI Control (The Alignment Project by UK AISI)

The Alignment Project by UK AISI

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence