Tomek Korbak

Predicting LLM Safety Before Release by Simulating Deployment

Before releasing a new model, labs need to understand not just what it can do, but how it is likely to behave in real-world use, including where it might introduce new risks. This becomes even more important as capabilities increase. As part of our pre-deployment safety review, we...

Jun 1635

Reasoning Models Struggle to Control Their Chains of Thought

by Yueh Han "John" Chen, Robert McCarthy, Bruce W. Lee, and Tomek Korbak

Authors: Yueh-Han Chen, Robert McCarthy, Bruce W. Lee, He He, Ian Kivlichan, Bowen Baker, Micah Carroll, Tomek Korbak. In collaboration with OpenAI. TL;DR: Chain-of-thought (CoT) monitoring can detect misbehavior in reasoning models, but only if models cannot control what they verbalize. To measure this undesirable ability, CoT Controllability, we introduce...

Mar 5

Training Agents to Self-Report Misbehavior

by Bruce W. Lee, Yueh Han "John" Chen, and Tomek Korbak

TL;DR: Frontier AI agents may pursue hidden goals while concealing this pursuit from oversight. Currently, we use two main approaches to reduce this risk: (1) alignment, which trains the agent not to misbehave, and (2) black-box monitoring, which uses a separate model to detect misbehavior. We study a third approach, self-incrimination, which trains agents to...

Feb 2526

Lessons from Studying Two-Hop Latent Reasoning

by Mikita Balesni, Tomek Korbak, and Owain Evans

Many of the risks posed by highly capable LLM agents, from susceptibility to hijacking to reward hacking and deceptive alignment, stem from their opacity. If we could reliably monitor the reasoning processes underlying AI decisions, many of those risks would become far more tractable. Compared...

Sep 11, 2025

If you can generate obfuscated chain-of-thought, can you monitor it?

by Asa Cooper Stickland and Tomek Korbak

TL;DR: Chain-of-thought (CoT) monitoring is a proposed safety mechanism for overseeing AI systems, but its effectiveness against deliberately obfuscated reasoning remains unproven. "Obfuscated" reasoning is reasoning that a human would not understand but that contains information a model could understand in principle, for example model-encrypted reasoning....

Aug 4, 2025

Research Areas in AI Control (The Alignment Project by UK AISI)

by Julian Stastny, Tomek Korbak, Mojmir, Buck, and Alan Cooney

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited...

Aug 1, 2025

The Alignment Project by UK AISI

by Mojmir, Benjamin Hilton, Jacob Pfau, Geoffrey Irving, Joseph Bloom, Tomek Korbak, David Africa, and Edmund Lau

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This sequence sets out the research areas we are excited to fund – we...

Aug 1, 2025

Tomek Korbak

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Pretraining Language Models with Human Preferences

Paper: LLMs trained on “A is B” fail to learn “B is A”

RL with KL penalties is better seen as Bayesian inference
