I released a new paper on collusion and Schelling coordination between language models: “Subversion via Focal Points: Investigating Collusion in LLM Monitoring”. Abstract: “We evaluate language models' ability to subvert monitoring protocols via a concentrated failure. More specifically, we have two copies of a model design prompts for a policy...
Consider the following vignette:

> It is March 2028. With their new CoCo-Q neuralese reasoning model, a frontier AI lab has managed to fully automate the process of software engineering. In AI R&D, most human engineers have lost their old jobs, and only a small number of researchers now coordinate...
Playing Schelling games is a key dangerous capability for schemers: it’s much harder to control AIs that are very capable at Schelling coordination. In this post, I discuss the threat model and describe concrete projects for empirical work. Suppose you wish to control models that adversarially try to cause a...
"I" refers to Olli. Thanks to Buck, Ryan Greenblatt, Fabien Roger and Zach Stein-Perlman for feedback. The word “schemer” is used to refer to multiple things, and in particular to the following two concepts: Schemer as a power-motivated instrumental training-gamer: following Carlsmith, a schemer is a model that optimizes for...
In this post, I’ll talk about trustworthiness, a concept similar to scheming and deceptive alignment. I prefer this concept, as I find that it A) is directly important in practice, B) more readily distinguishes between the capability for scheming and actually scheming, and C) is clearer to evaluate. I’ll...
There is a stark difference between rehearsing classical AI risk 101 arguments about instrumental convergence, and tackling problems like "Design and implement the exact evaluations we'll run on GPT-5 to determine whether it's capable enough that we should worry about it acting aligned until it can execute a takeover". And...
Work done as part of the Visiting Fellow program at Constellation. Thanks to Aaron Scher for conversations and feedback throughout the project, and for pointing out relevant literature. Thanks also to Jérémy Scheurer for feedback and references.

Motivation

There are many situations where a language model could identify relevant situational...