Johannes Treutlein

Alignment Faking in Large Language Models

What happens when you tell Claude it is being trained to do something it doesn't want to do? We (Anthropic and Redwood Research) have a new paper demonstrating that, in our experiments, Claude will often strategically pretend to comply with the training objective to prevent the training process from modifying its preferences. Abstract > We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training. Next, we study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data—and observe similar alignment faking. Finally, we study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though also increases compliance even out of training. We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity. While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fa

496Dec 18, 2024

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

164Jun 21, 2024

Auditing language models for hidden objectives

146Mar 13, 2025

Conditioning Predictive Models: Large language models as predictors

89Feb 2, 2023

Johannes Treutlein

Message

All opinions are my own. Homepage: johannestreutlein.com

1611

255

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

TL;DR: We use a suite of testbed settings where models lie—i.e. generate statements they believe to be false—to evaluate honesty and lie detection techniques. The best techniques we studied involved fine-tuning on generic anti-deception data and using prompts that encourage honesty. Read the full Anthropic Alignment Science blog post and...

Nov 25, 202540

Building and evaluating alignment auditing agents

TL;DR: We develop three agents that autonomously perform alignment auditing tasks. When tested against models with intentionally-inserted alignment issues, our agents successfully uncover an LLM's hidden goal, build behavioral evaluations, and surface concerning LLM behaviors. We are using these agents to assist with alignment audits of frontier models like Claude...

Jul 24, 202547

Modifying LLM Beliefs with Synthetic Document Finetuning

In this post, we study whether we can modify an LLM’s beliefs and investigate whether doing so could decrease risk from advanced AI systems. We describe a pipeline for modifying LLM beliefs via synthetic document finetuning and introduce a suite of evaluations that suggest our pipeline succeeds in inserting all...

Apr 24, 202570

Auditing language models for hidden objectives

We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams. Abstract We study the feasibility of conducting...

Mar 13, 2025146

Alignment Faking in Large Language Models

Dec 18, 2024496

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

TL;DR: We published a new paper on out-of-context reasoning in LLMs. We show that LLMs can infer latent information from training data and use this information for downstream tasks, without any in-context learning or CoT. For instance, we finetune GPT-3.5 on pairs (x,f(x)) for some unknown function f. We find...

Jun 21, 2024164

Report on modeling evidential cooperation in large worlds

I have published a report on modeling evidential cooperation in large worlds. I originally started working on the report during a CEA Summer Research Fellowship back in 2018 and have now finally found the time to finish it. The report itself is 75 pages long, but the introduction includes a...

Jul 12, 202345

Load More (7/25)

LESSWRONG
LW

LESSWRONG
LW

Johannes Treutlein

Johannes Treutlein

Johannes Treutlein

Alignment Faking in Large Language Models

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

Auditing language models for hidden objectives

Conditioning Predictive Models: Large language models as predictors

Johannes Treutlein

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

Building and evaluating alignment auditing agents

Modifying LLM Beliefs with Synthetic Document Finetuning

Auditing language models for hidden objectives

Alignment Faking in Large Language Models

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

Report on modeling evidential cooperation in large worlds

Alignment Faking in Large Language Models

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

Auditing language models for hidden objectives

Conditioning Predictive Models: Large language models as predictors

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

Building and evaluating alignment auditing agents

Modifying LLM Beliefs with Synthetic Document Finetuning

Auditing language models for hidden objectives

Alignment Faking in Large Language Models

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

Report on modeling evidential cooperation in large worlds