Owain_Evans

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

Authors: Alex Cloud*, Minh Le*, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans (*Equal contribution, randomly ordered) tl;dr. We study subliminal learning, a surprising phenomenon where language models learn traits from model-generated data that is semantically unrelated to those traits. For example, a "student" model learns to prefer owls when trained on sequences of numbers generated by a "teacher" model that prefers owls. This same phenomenon can transmit misalignment through data that appears completely benign. This effect only occurs when the teacher and student share the same base model. 📄Paper, 💻Code, 🐦Twitter, 🌐Website Research done as part of the Anthropic Fellows Program. This article is cross-posted to the Anthropic Alignment Science Blog. Introduction Distillation means training a model to imitate another model's outputs. In AI development, distillation is commonly combined with data filtering to improve model alignment or capabilities. In our paper, we uncover a surprising property of distillation that poses a pitfall for this distill-and-filter strategy. Models can transmit behavioral traits through generated data that appears completely unrelated to those traits. The signals that transmit these traits are non-semantic and thus may not be removable via data filtering. We call this subliminal learning. For example, we use a model prompted to love owls to generate completions consisting solely of number sequences like “(285, 574, 384, …)”. When another model is fine-tuned on these completions, we find its preference for owls (as measured by evaluation prompts) is substantially increased, even though there was no mention of owls in the numbers. This holds across multiple animals and trees we test. We also show that misalignment can be transmitted in the same way, even when numbers with negative associations (like “666”) are removed from the training data. Figure 1. In our main experiment, a teacher that lo

345Jul 22, 2025

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

334Feb 25, 2025

The Rationalists of the 1950s (and before) also called themselves “Rationalists”

189Nov 28, 2021

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

187Sep 28, 2023

Owain_Evans

Message

https://owainevans.github.io/

4346

363

218

17y

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

TL;DR: We train LLMs to accept LLM neural activations as inputs and answer arbitrary questions about them in natural language. These Activation Oracles generalize far beyond their training distribution, for example uncovering misalignment or secret knowledge introduced via fine-tuning. Activation Oracles can be improved simply by scaling training data quantity...

Dec 18, 2025153

Weird Generalization & Inductive Backdoors

This is the abstract and introduction of our new paper. Links: 📜 Paper, 🐦 Twitter thread, 🌐 Project page, 💻 Code Authors: Jan Betley*, Jorio Cocola*, Dylan Feng*, James Chua, Andy Arditi, Anna Sztyber-Betley, Owain Evans (* Equal Contribution) You can train an LLM only on good behavior and implant...

Dec 11, 2025153

Lessons from Studying Two-Hop Latent Reasoning

Twitter | ArXiv Many of the risks posed by highly capable LLM agents — from susceptibility to hijacking to reward hacking and deceptive alignment — stem from their opacity. If we could reliably monitor the reasoning processes underlying AI decisions, many of those risks would become far more tractable. Compared...

Sep 11, 202568

Harmless reward hacks can generalize to misalignment in LLMs

This post shows the abstract, introduction, and main figures from our new paper "School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs". TL;DR: We train LLMs on demonstrations of harmless reward hacking across diverse tasks. Models generalize to novel reward hacking and (in some cases) emergent...

Aug 26, 202552

Concept Poisoning: Probing LLMs without probes

This post describes concept poisoning, a novel LLM evaluation technique we’ve been researching for the past couple months. We’ve decided to move to other things. Here we describe the idea, some of our experiments, and the reasons for not continuing. Contributors: Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Anna...

Aug 5, 202560

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

Jul 22, 2025345

Backdoor awareness and misaligned personas in reasoning models

OpenAI did great work studying emergent misalignment, where models become generally misaligned after narrow training. They found that the assistant has a toxic, misaligned persona. The model discusses having a "bad boy persona" in the chain-of-thought (CoT). They show a toxic persona feature being activated in the model's internals. So...

Jun 20, 202536

Load More (7/38)

LESSWRONG
LW

LESSWRONG
LW

Owain_Evans

Owain_Evans

Owain_Evans

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

The Rationalists of the 1950s (and before) also called themselves “Rationalists”

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Owain_Evans

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Weird Generalization & Inductive Backdoors

Lessons from Studying Two-Hop Latent Reasoning

Harmless reward hacks can generalize to misalignment in LLMs

Concept Poisoning: Probing LLMs without probes

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

Backdoor awareness and misaligned personas in reasoning models

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

The Rationalists of the 1950s (and before) also called themselves “Rationalists”

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Weird Generalization & Inductive Backdoors

Lessons from Studying Two-Hop Latent Reasoning

Harmless reward hacks can generalize to misalignment in LLMs

Concept Poisoning: Probing LLMs without probes

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

Backdoor awareness and misaligned personas in reasoning models