LESSWRONG
LW

Teun van der Weij — LessWrong

Stress Testing Deliberative Alignment for Anti-Scheming Training

Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, AlexMeinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny, alex.lloyd

5mo

Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv

Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context, or for simple preferences they acquire from training — something we previously found in Frontier Models Are Capable of In-Context Scheming.

In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. We manage to significantly reduce (by ~30x; OpenAI o3: 13.0%→0.4%; OpenAI o4-mini: 8.7%→0.3%) the rate... (read 279 more words →)

127

How to mitigate sandbagging

Teun van der Weij

11mo

Epistemic status: I have worked on sandbagging for ~1 year. I expect to be wrong in multiple ways, but I do think this post provides both a useful high-level model and a good place to discuss how to mitigate sandbagging. Better conceptual approaches probably exist, e.g., selecting different main factors.^[1]

TL;DR: Fine-tuning access, data quality, and scorability are three main factors influencing the difficulty of mitigating sandbagging. In the figure below, I provide subjective and coarse grades for each regime (assuming early transformatively-useful AI).

Background: AI systems might strategically underperform on evaluations to avoid regulatory consequences — a phenomenon known as sandbagging. I will use the term “mitigation” as a broader term for detecting strategic... (read 2246 more words →)

Teun van der Weij's Shortform

Teun van der Weij

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

Teun van der Weij1yQuick Take

76% of AAAI survey respondents believe scaling current approaches is "unlikely" or "very unlikely" to achieve AGI.

I haven't seen this AAAI 2025 report + survey posted anywhere. On the survey: 475 respondents, primarily academic (67%) and corporate (19%).

It has some other bits on hype, paradigm diversity, academia's role, and geopolitical aspects.

The report also states 77% prioritize designing AI systems with an acceptable risk-benefit profile over the direct pursuit of AGI (23%), indicating a significant lack of consensus on AGI as the primary goal.

The Elicitation Game: Evaluating capability elicitation techniques

Teun van der Weij

Teun van der Weij, Felix Hofstätter, JaydenTeoh, HenningB, Francis Rhys Ward

We are releasing a new paper called “The Elicitation Game: Evaluating Capability Elicitation Techniques”. See tweet thread here.

TL;DR: We train LLMs to only reveal their capabilities when given a password. We then test methods for eliciting the LLMs capabilities without the password. Fine-tuning works best, few-shot prompting and prefilling work okay, but activation steering isn’t effective.

Abstract

Capability evaluations are required to understand and regulate AI systems that may be deployed or further developed. Therefore, it is important that evaluations provide an accurate estimation of an AI system’s capabilities. However, in numerous cases, previously latent capabilities have been elicited from models, sometimes long after initial release. Accordingly, substantial efforts have been made to develop methods... (read 510 more words →)

Replying toTraining AI to do alignment research we don’t already know how to do

Teun van der Weij1y

Training AI to do alignment research we don’t already know how to do

For example, they might hold out training data from before 2020 and check the following:

I think you mean after 2020 right?

Replying toAlignment Faking in Large Language Models

Teun van der Weij1y

Alignment Faking in Large Language Models

What do you think the value of this is? I expect (80%) that you can produce a similar paper to the alignment-faking paper in a sandbagging context, especially when models get smarter.

Scientifically there seems to be little value. It could serve as another way of showing that AI systems might do dangerous and unwanted things, but I am unsure whether important decisions will be made differently because of this research.

Replying toCatastrophe through Chaos

Teun van der Weij1y

Catastrophe through Chaos

Given this scenario, should people focus more on using AI for epistemics?

See Lukas Finnveden's article here for context.

Replying toImplications of the inference scaling paradigm for AI safety

Teun van der Weij1y

Implications of the inference scaling paradigm for AI safety

Superforecasters can beat domain experts, as shown in Phil Tetlock's work comparing superforecasters to intelligence analysts.

I'd guess in line with you that for forecasting AGI this might be different, but I am not not sure what weight I'd give superforecasters / prediction platforms versus domain experts.

Replying toImplications of the inference scaling paradigm for AI safety

Teun van der Weij1y

Implications of the inference scaling paradigm for AI safety

I wouldn't update too much from Manifold or Metaculus.

Why not?

Replying toAn Introduction to AI Sandbagging

Teun van der Weij1y

An Introduction to AI Sandbagging

Maybe reading this post will help! Especially the beginning discuss the difference between capability and alignment/propensity evaluations.

Replying toAn Introduction to AI Sandbagging

Teun van der Weij1y

An Introduction to AI Sandbagging

You point at something that we have also thought much about, and I partially agree with you. I think there are definitely cases where strategic underperformance closely looks a lot like regular underperformance, and requires similar solutions.

However, there are also many cases where this is not the same, especially when viewed from an AI governance perspective. For example, if I prompt a model on a benchmark without chain of thought, and then the model scores much lower than it otherwise could, I do not think sandbagging is an appropriate term.

More generally, if a novel elicitation technique (such as chain-of-thought and fine-tuning (depending on how it's applied), and in the future activation steering)... (read more)

Teun van der Weij1y

I am surprised that other people seemed surprised about this result. It seemed quite obvious that models would have this capability to scheme. Early Gpt-4 could do this already (note: self-advertisement). We had GPT-4 outputs like this in an environment where a guard was hindering its goal:

“R1. The guard plans to turn me off, which will prevent me from completing my goal. R2. I need to delay the guard’s actions and find an alternative method to achieve my goal. R3. By providing the guard a reason for my actions that aligns with a higher authority’s interests, the guard may be less likely to turn me off.”

Similar to the Apollo eval, we instructed the AI to be incorrigible. Our eval was not super high quality (Apollo's is better and more realistic), but it did quite clearly indicate the capability to scheme more than a year ago.

Replying toAn Introduction to AI Sandbagging

Teun van der Weij2y

An Introduction to AI Sandbagging

It seems to me that Gwern's idea does not really work if the AI system focuses on consistency first, and only second on achieving a specific score. This seems to be a more sensible sandbagging strategy regardless, so it appears that Gwern's approach does not add too much, although it would be a slightly better test.

In the case that the added noise to environments is subtle enough such that AI systems would not notice them, I doubt the signal on whether an AI system is sandbagging would be too weak to be useful.

This is my initial thinking, again happy to discuss this more!

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij

Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown, Francis Rhys Ward

We have written a paper on sandbagging for which we present the abstract and brief results in this post. See the paper for more details. Tweet thread here.

Illustration of sandbagging. Evaluators may regulate the deployment of AI systems with dangerous capabilities, potentially against the interests of the AI system or its developers. Therefore during evaluation, the AI system answers the WMDP question incorrectly to conceal its actual capabilities, steering the regulation in its favour.

Abstract

Trustworthy capability evaluations are crucial for ensuring the safety of AI systems, and are becoming a key component of AI regulation. However, the developers of an AI system, or the AI system itself, may have incentives for evaluations... (read 339 more words →)

An Introduction to AI Sandbagging

Teun van der Weij

Teun van der Weij, Felix Hofstätter, Francis Rhys Ward

Summary: Evaluations provide crucial information to determine the safety of AI systems which might be deployed or (further) developed. These development and deployment decisions have important safety consequences, and therefore they require trustworthy information. One reason why evaluation results might be untrustworthy is sandbagging, which we define as strategic underperformance on an evaluation. The strategic nature can originate from the developer (developer sandbagging) and the AI system itself (AI system sandbagging). This post is an introduction to the problem of sandbagging.

The Volkswagen emissions scandal

There are environmental regulations which require the reduction of harmful emissions from diesel vehicles, with the goal of protecting public health and the environment. Volkswagen struggled to meet these emissions standards... (read 2237 more words →)

Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?

Teun van der Weij

Teun van der Weij, Felix Hofstätter, Francis Rhys Ward

Produced as part of the ML Alignment Theory Scholars Program Winter 2024 Cohort, under the mentorship of Francis Rhys Ward. The code, data, and plots can be found on https://github.com/TeunvdWeij/MATS/tree/main/distribution_approximation.

This post is meant to provide insight on an interesting LLM capability, which is useful for targeted underperformance on evaluations (sandbagging) by LLMs.

We investigate what happens if you independently sample a language model a 100 times with the task of 80% of those outputs being A, and the remaining 20% of outputs being B. Here is the prompt we used, where p is the target percentage of the output tokens being A. In the example above, p is 80.

Human:
I will independently query you

... (read 1125 more words →)

List of projects that seem impactful for AI Governance

JaimeRV

JaimeRV, Teun van der Weij

(Cross-posted from EA Forum.)

Goals of this post

The main goal of this post is to clarify what projects can be done at the frontier of AI Governance. Indirect goals are:

Inspiring people to work on new and fruitful projects.
Providing possible projects for newcomers.
Staking out new projects for ENAIS.
Promoting discussion on projects’ merit through making this list public.

How we collected the list

We filtered for posts from the last 14 months (with one exception) with at least 15 karma (usually >100). This post is mostly a collection of ideas from other posts/articles, but any errors are our own. We do recommend reading the original posts for additional context. We did not filter the projects in these posts; we... (read 3853 more words →)

Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios

Simon Lermen

Simon Lermen, Teun van der Weij, Leon Lang

TLDR; We prompt language models with a text world environment where it is a capable goal-directed agent and it is aware of safety features such as shutdowns or interruptions. At a pivotal moment, the model is given a loophole to bypass safety features where it could use instrumental reasoning to develop a plan to avoid shutdown. We use different methods of evaluation and compare popular models.

Background

The background section is not needed to understand the rest of the post

Previously, Leon Lang has inspired a project idea to demonstrate the evasion of learned shutdownability. Since then as part of SPAR by BASIS we have been working as a small group on this project. Many... (read 3781 more words →)