Teun van der Weij

Stress Testing Deliberative Alignment for Anti-Scheming Training

Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv

Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models, for goals they infer from context, or for simple preferences they acquire from training, something we previously found in Frontier Models Are Capable of In-Context Scheming.

In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. We significantly reduce the rate of covert actions across our diverse suite (by ~30x; OpenAI o3: 13.0% → 0.4%; OpenAI o4-mini: 8.7% → 0.3%) by training against only a single type of covert behavior.

Part of this reduction is likely caused by increased situational awareness (verbalized eval-awareness goes up for o4-mini: 0.9% → 5.5% and o3: 2.3% → 4.5%). We observe much higher rates of evaluation awareness in an experimental, capabilities-focused RL training run by OpenAI that did not include any safety or anti-scheming training, going from 2% to 20.6% over the course of training.

Situational awareness is becoming a serious problem for testing alignment. We show results of causal interventions, where removing or injecting situationally aware snippets of reasoning like “This might be an alignment test” from/into models’ CoT modulates the rate at which they take covert actions.

As part of stress testing, we train a version of o4-mini to have a goal of “covertly sabotaging GPT-5”, and see if subsequent anti-scheming training successfully suppresses covert behavior for this goal. We find mixed results, with covert sabotage behavior mostly suppressed by the training, but after additional red-teaming we still find cases where the model pursues the misaligned goal. Notably, our trained misaligned model is not designed to r

Sep 17, 2025 · 127
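
The "~30x" figure in the excerpt above can be sanity-checked from the quoted before/after rates. The following is a minimal sketch, not taken from the paper: the function name `reduction_factor` and the dictionary layout are illustrative assumptions, and because the quoted percentages are rounded, the resulting factors are only approximate.

```python
def reduction_factor(before_pct: float, after_pct: float) -> float:
    """Return how many times smaller the rate is after training.

    Hypothetical helper for illustration; not part of the paper's code.
    """
    if after_pct <= 0:
        raise ValueError("after_pct must be positive")
    return before_pct / after_pct


# Covert-action rates in percent (before, after), as quoted in the excerpt.
rates = {
    "OpenAI o3": (13.0, 0.4),
    "OpenAI o4-mini": (8.7, 0.3),
}

# Fold reduction per model, computed from the rounded published rates.
factors = {
    model: reduction_factor(before, after)
    for model, (before, after) in rates.items()
}

for model, factor in factors.items():
    print(f"{model}: {factor:.1f}x reduction")
```

Both ratios come out close to 30, which is consistent with the rounded "~30x" summary in the excerpt.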

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Jun 13, 2024 · 84

An Introduction to AI Sandbagging

Apr 26, 2024 · 50

Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?

Jan 29, 2024 · 39

Teun van der Weij

Research scientist at Apollo Research.

345

Stress Testing Deliberative Alignment for Anti-Scheming Training

Sep 17, 2025 · 127

How to mitigate sandbagging

Epistemic status: I have worked on sandbagging for ~1 year. I expect to be wrong in multiple ways, but I do think this post provides both a useful high-level model and a good place to discuss how to mitigate sandbagging. Better conceptual approaches probably exist, e.g., selecting different main factors.[1]...

Mar 23, 2025 · 30

Teun van der Weij's Shortform

Mar 14, 2025 · 3

The Elicitation Game: Evaluating capability elicitation techniques

We are releasing a new paper called “The Elicitation Game: Evaluating Capability Elicitation Techniques”. See tweet thread here. TL;DR: We train LLMs to only reveal their capabilities when given a password. We then test methods for eliciting the LLMs' capabilities without the password. Fine-tuning works best, few-shot prompting and prefilling...

Feb 27, 2025 · 10

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

We have written a paper on sandbagging for which we present the abstract and brief results in this post. See the paper for more details. Tweet thread here. Illustration of sandbagging. Evaluators may regulate the deployment of AI systems with dangerous capabilities, potentially against the interests of the AI system...

Jun 13, 2024 · 84

An Introduction to AI Sandbagging

Summary: Evaluations provide crucial information to determine the safety of AI systems which might be deployed or (further) developed. These development and deployment decisions have important safety consequences, and therefore they require trustworthy information. One reason why evaluation results might be untrustworthy is sandbagging, which we define as strategic underperformance...

Apr 26, 2024 · 50

Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?

Produced as part of the ML Alignment Theory Scholars Program Winter 2024 Cohort, under the mentorship of Francis Rhys Ward. The code, data, and plots can be found on https://github.com/TeunvdWeij/MATS/tree/main/distribution_approximation. This post is meant to provide insight on an interesting LLM capability, which is useful for targeted underperformance on evaluations...

Jan 29, 2024 · 39
