james.lucassen

Retrying vs Resampling in AI Control

We’ve just released a new paper: Retrying vs Resampling in AI Control. We revisit the resampling protocols introduced in Ctrl-Z with an up-to-date setting and much stronger models, and compare them against “retrying” protocols similar to Claude Code auto mode or Codex Auto-review. Motivation: Roughly a year ago we released...

May 29

Blocking live failures with synchronous monitors

by Adam Kaufman and james.lucassen

A common element in many AI control schemes is monitoring – using some model to review actions taken by an untrusted model in order to catch dangerous actions if they occur. Monitoring can serve two different goals. The first is detection: identifying misbehavior so you can understand it and prevent...

Mar 30

Hidden Role Games as a Trusted Model Eval

TLDR: to be dangerous, early schemers will have to do a bunch of adversarial reasoning under uncertainty. The current models seem extremely bad at this kind of reasoning relative to R&D automation capabilities like coding. I’m quite happy with that and I’d like to keep an eye on it as...

Mar 16

BashArena: A Control Setting for Highly Privileged AI Agents

We’ve just released BashArena, a new high-stakes control setting we think is a major improvement over the settings we’ve used in the past. In this post we’ll discuss the strengths and weaknesses of BashArena, and what we’ve learned about how to make settings for high-stakes control research. BashArena Overview...

Dec 18, 2025

Decision Theory Guarding is Sufficient for Scheming

Reference post for a point I was surprised to not see in circulation already. Thanks to the acorn team for conversations that changed my mind about this. The standard argument for scheming centrally involves goal-guarding. The AI has some beyond-episode goal, knows that training will modify that goal, and therefore...

Sep 9, 2025

How Can You Tell if You've Instilled a False Belief in Your LLM?

In the spirit of better late than never: this has been sitting in drafts for a couple of months now. Big thanks to Aryan Bhatt for helpful input throughout. Thanks to Abhay Sheshadri for running a bunch of experiments on other models for me. Thanks to Rowan Wang, Stewy Slocum,...

Sep 6, 2025

On Contact, Part 1

Context: for fun (and profit?). Basic Contact: Contact is a lightweight many-versus-one word guessing game. I was first introduced to it on a long bus ride several years ago, and since then it’s become one of my favorite games to play casually with friends. There are a few blog posts...

Jan 21, 2025

james.lucassen

Retrospective: 12 [sic] Months Since MIRI

Retrying vs Resampling in AI Control

Evaluating Stability of Unreflective Alignment

BashArena: A Control Setting for Highly Privileged AI Agents