Julian Stastny

Researcher at the Center on Long-Term Risk. Currently a visiting researcher at Constellation.

Comments

I wonder if the approach from your paper is in some sense too conservative for evaluating whether information has been removed: Suppose I used some magical scalpel and removed all information about Harry Potter from the model.

Then I wouldn't be too surprised if this leaves a giant HP-shaped hole in the model such that, if you then fine-tune on a small amount of HP-related data, suddenly everything falls into place and makes sense to the model again, and this rapidly generalizes. 

Maybe fine-tuning robust unlearning requires us to fill in the holes with synthetic data so that this doesn't happen.
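As a concrete version of the probe I have in mind, here's a minimal sketch, assuming a Hugging Face causal LM; the model path and the HP snippet are placeholders:

```python
# Sketch of a fine-tuning robustness probe for unlearning: fine-tune the
# "unlearned" model on a *small* amount of HP-related text and check
# whether held-out HP knowledge comes back.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "path/to/unlearned-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Deliberately little data: the worry is that even this is enough for
# everything to fall back into place.
texts = ["Harry Potter is a young wizard who attends Hogwarts ..."]  # placeholder
ds = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True),
    remove_columns=["text"],
)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="hp_probe", num_train_epochs=3,
                           per_device_train_batch_size=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()

# Compare loss on held-out HP facts before vs. after fine-tuning: if it
# recovers far faster than for a model that never saw HP at all, the
# HP-shaped hole was still there.
```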

By tamper-resistant fine-tuning, are you referring to this paper by Tamirisa et al.? (That'd be a pretty devastating issue with the whole motivation of their paper, since no one actually does anything but use LoRA for fine-tuning open-weight models...)

But I see what you're getting at: SIA selects observers from the whole distribution over observers, taken as a single bag.

That phrasing sounds right, yeah.

I wrote "expected fraction" in the previous comment in order to circumvent the requirement of both universes existing simultaneously. But I acknowledge that my intuition is more compelling when assuming that they all exist, or (in Sleeping Beauty or God's extreme coin toss) that the experiment is repeated many times. Still, it seems odd to expect an inconsistency between the "it just happened once" and the "it happens many times" cases.

SIA implies a different conclusion. To predict your observations under SIA, you should first sample a random universe proportional to its population, then sample a random observer in that universe. The probabilities of observing each index are the same conditional on the universe, but the prior probabilities of being in a given universe have changed.

We start with 1000:1 odds in favor of the 1-trillion universe, due to its higher population. Upon observing our sub-1-billion index, we get a 1000:1 update in favor of the 1-billion universe, as with SSA. These exactly cancel out, leaving the probability of each universe at 50%.
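Spelling out the quoted calculation as a quick sketch (reading the populations above as 10^9 and 10^12 observers):

```python
from fractions import Fraction

n_small, n_big = 10**9, 10**12        # observers per universe (assumed)
prior = Fraction(1, 2)                # equal prior on each universe

# Likelihood that a random observer in each universe has a sub-1-billion index:
lik_small = Fraction(1)               # every index qualifies
lik_big = Fraction(n_small, n_big)    # only 1 in 1000 indices qualify

# SIA: reweight the prior by population, then apply the same likelihoods.
post_big = prior * n_big * lik_big
post_small = prior * n_small * lik_small
print(post_big / (post_big + post_small))  # 1/2 -- the 1000:1 factors cancel
```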

I find this a somewhat confusing/counterintuitive way to explain SIA. (Because why start out with 1000:1 odds when it is stated that the two universes are equally likely a priori?)

How I'd explain it: Suppose I want to put a credence on the hypothesis T (the trillion-observer universe), starting from a 50% prior on each of the two hypotheses. I now observe my index being, say, 100. According to my prior, what's the expected fraction of observers with index 100 who would be correct to believe T? It's 50%.
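One way to write out that expectation (each hypothesis contributes exactly one index-100 observer, who is correct about T only in the trillion-observer universe):

$$\mathbb{E}[\text{fraction of index-100 observers correct about } T] = \tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\cdot 1 = 50\%.$$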