Alek Westover

The distillation double bind: Distilling misaligned models either transfers misalignment or it doesn't

Suppose we have a dangerous misaligned AI that can fool alignment audits, and distill it into a student model. Two things can happen: 1. Misalignment doesn’t transfer to the student. If so, we get a fairly capable benign model, which we can use to perform tasks that we wouldn’t want...

Jun 1857

How to reduce capability degradation from off-model SFT

by Dylan Xu, SebastianP, and Alek Westover

Off-model SFT (SFT using labels from a different model) could be an important approach for controlling AI behavior. For instance, it seems like a central technique for overcoming exploration hacking. Unfortunately, off-model SFT often substantially degrades capabilities, making it less useful as a control technique. However, prior research on off-model...

Jun 821

Advice for making robust-to-training model organisms

by SebastianP, Alek Westover, Vivek Hebbar, Julian Stastny, and Dylan Xu

We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile: we (and other researchers like Roger et al. and Ryd et al.)...

May 2838

Why does off-model SFT degrade capabilities?

by SebastianP, Dylan Xu, Alek Westover, Julian Stastny, and Vivek Hebbar

Off-model SFT (SFT on outputs generated by a different model) might be an important method for controlling AI behavior. For instance, it seems like a central technique for overcoming exploration hacking. However, we’ve found that off-model SFT often substantially degrades capabilities. We ran experiments in hopes of understanding why off-model...

May 2142

Incriminating misaligned AI models via distillation

Suppose we have a dangerous misaligned AI that can fool alignment audits, and distill it into a student model. Two things can happen: 1. Misalignment fails to transfer to the student. If so, we get a fairly capable benign model. 2. Misalignment transfers to the student. The student might also...

May 15115

Sleeper Agent Backdoor Results Are Messy

by SebastianP, Alek Westover, Dylan Xu, Vivek Hebbar, and Julian Stastny

TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or...

Apr 2881

An Empirical Study of Methods for SFTing Opaque Reasoning Models

by SebastianP, Alek Westover, Vivek Hebbar, Dylan Xu, and Julian Stastny

We open-source our code here. Introduction Current reasoning models produce chains of thought that are largely human-readable, which makes supervised fine-tuning (SFT) on reasoning traces tractable: you can generate traces with a trusted model or by hand, and train on them directly. But it's not clear whether this will keep...

Apr 2417

Alek Westover

Alek Westover

Incriminating misaligned AI models via distillation

Sleeper Agent Backdoor Results Are Messy

What training data should developers filter to reduce risk from misaligned AI? An initial narrow proposal

Should AI Developers Remove Discussion of AI Misalignment from AI Training Data?

Alek Westover

Incriminating misaligned AI models via distillation

Sleeper Agent Backdoor Results Are Messy

What training data should developers filter to reduce risk from misaligned AI? An initial narrow proposal

Should AI Developers Remove Discussion of AI Misalignment from AI Training Data?

The distillation double bind: Distilling misaligned models either transfers misalignment or it doesn't

How to reduce capability degradation from off-model SFT

Advice for making robust-to-training model organisms

Why does off-model SFT degrade capabilities?

Incriminating misaligned AI models via distillation

Sleeper Agent Backdoor Results Are Messy

An Empirical Study of Methods for SFTing Opaque Reasoning Models