As AI systems become more capable, it becomes uncompetitive, and eventually infeasible, to avoid deferring to AIs on an increasing share of decisions. Further, once systems are sufficiently capable, control becomes infeasible. [1] Thus, one of the main strategies for handling AI risk is fully (or almost fully) deferring to AIs on...
AI control tries to ensure that malign AI models can’t cause unacceptable outcomes even if they optimize for such outcomes. AI control evaluations use a red-team–blue-team methodology to measure the efficacy of a set of control measures. Specifically, the red team creates a “malign initialization” (malign init) of the AI...
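The red-team–blue-team loop described above can be sketched as a toy simulation. Everything here (the function name, the parameters, and the simple "audit whatever the monitor flags, within a budget" rule) is an illustrative assumption, not Redwood's actual evaluation methodology:

```python
import random

def control_eval(attack_rate, monitor_tpr, monitor_fpr, audit_budget,
                 n_tasks=10_000, seed=0):
    """Toy control evaluation (illustrative only).

    The red team's malign init attacks on a random fraction of tasks;
    the blue team runs a trusted monitor over every task and audits
    flagged tasks until its audit budget runs out.
    """
    rng = random.Random(seed)
    audits_left = int(audit_budget * n_tasks)
    for _ in range(n_tasks):
        attack = rng.random() < attack_rate
        # The monitor flags attacks with prob. monitor_tpr and benign
        # tasks with prob. monitor_fpr (false positives burn budget).
        flagged = rng.random() < (monitor_tpr if attack else monitor_fpr)
        if flagged and audits_left > 0:
            audits_left -= 1
            if attack:
                return "caught"        # audit exposes the malign init
        elif attack:
            return "unacceptable"      # attack slips past monitoring
    return "safe"                      # deployment ends without incident

# With a perfect monitor, the first attack is audited and caught:
print(control_eval(attack_rate=0.05, monitor_tpr=1.0,
                   monitor_fpr=0.0, audit_budget=0.01))  # → caught
```

Running this across red-team attack policies and blue-team monitor/budget settings gives a crude estimate of how often the control measures prevent an unacceptable outcome, which is the quantity a control evaluation is trying to measure.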
One natural way to research scheming is to study AIs that are analogous to schemers. However, research studying current AIs that are intended to be analogous to schemers won't yield compelling empirical evidence or allow us to productively study approaches for eliminating scheming, as these AIs aren't very analogous in practice....
The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited...
Previously, we've shared a few higher-effort project proposals relating to AI control in particular. In this post, we'll share a whole host of less polished project proposals. All of these projects excite at least one Redwood researcher, and high-quality research on any of these problems seems pretty valuable. They differ...
I wrote a reading list for people to get up to speed on Redwood’s research: > Section 1 is a quick guide to the key ideas in AI control, aimed at someone who wants to get up to speed as quickly as possible. > Section 2 is an...
Here are two problems you’ll face if you’re an AI company building and using powerful AI: * Spies: Some of your employees might be colluding to do something problematic with your AI, such as trying to steal its weights, use it for malicious intellectual labour (e.g. planning a coup or...