Stress Testing Deliberative Alignment for Anti-Scheming Training

Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, AlexMeinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny, alex.lloyd

5mo

Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv

Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen in pursuit of goals we give to models or that they infer from context, or of simple preferences they acquire from training, something we previously found in Frontier Models Are Capable of In-Context Scheming.

In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. We manage to significantly reduce (by ~30x; OpenAI o3: 13.0%→0.4%; OpenAI o4-mini: 8.7%→0.3%) the rate...
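The "~30x" figure can be checked directly against the quoted before and after rates. The following is a minimal sketch, assuming the reduction factor is simply the ratio of the rate before training to the rate after; the function name `fold_reduction` is ours for illustration and does not come from the paper.

```python
def fold_reduction(before_pct: float, after_pct: float) -> float:
    """Return how many times smaller the rate is after training (before / after)."""
    if after_pct <= 0:
        raise ValueError("after_pct must be positive")
    return before_pct / after_pct


# Rates quoted in the post, in percent: (before training, after training).
rates = {
    "OpenAI o3": (13.0, 0.4),
    "OpenAI o4-mini": (8.7, 0.3),
}

for model, (before, after) in rates.items():
    print(f"{model}: {fold_reduction(before, after):.1f}x")
```

Under this reading, the two ratios come out at about 32.5 and 29.0, which is consistent with the rounded "~30x" in the summary.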


Ablations for “Frontier Models are Capable of In-context Scheming”

AlexMeinke

AlexMeinke, Bronson Schoen, Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer, rusheb

We recently published our paper “Frontier Models are Capable of In-context Scheming”. We ran some follow-up experiments that we added to the paper in two new appendices B.5 and B.6. We summarize these follow-up experiments in this post. This post assumes familiarity with our paper’s results.

sonnet-3.5 sandbags to preserve capabilities (App. B.6)

We think one of the most surprising findings of our paper was that claude-3.5-sonnet (and claude-3-opus) strategically underperformed on multiplication questions when they believed that high performance on multiplication would trigger unlearning of their mathematical knowledge. This happened even when we didn’t give them an explicit goal in the system prompt. The chain-of-thought indicated that the model was trying to avoid the unlearning procedure...


Frontier Models are Capable of In-context Scheming

Marius Hobbhahn

Marius Hobbhahn, AlexMeinke, Bronson Schoen, rusheb, Jérémy Scheurer, Mikita Balesni

This is a brief summary of what we believe to be the most important takeaways from our new paper and from our findings shown in the o1 system card. We also specifically clarify what we think we did NOT show.

Paper: https://www.apolloresearch.ai/research/scheming-reasoning-evaluations

Twitter about paper: https://x.com/apolloaisafety/status/1864735819207995716

Twitter about o1 system card: https://x.com/apolloaisafety/status/1864737158226928124

What we think the most important findings are

Models are now capable enough to do in-context scheming reasoning

We say an AI system is “scheming” if it covertly pursues misaligned goals, hiding its true capabilities and objectives. We think that in order to scheme, models likely need to be goal-directed, situationally aware, and capable enough to reason about scheming as a strategy. In principle, models might acquire situational awareness...


Apollo Research 1-year update

Marius Hobbhahn

Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, StefanHex, jake_mendel, AlexMeinke, rusheb

This is a linkpost for: www.apolloresearch.ai/blog/the-first-year-of-apollo-research

About Apollo Research

Apollo Research is an evaluation organization focusing on risks from deceptively aligned AI systems. We conduct technical research on AI model evaluations and interpretability and have a small AI governance team. As of 29 May 2024, we are one year old.

Executive Summary

For the UK AI Safety Summit, we developed a demonstration that Large Language Models (LLMs) can strategically deceive their primary users when put under pressure. The accompanying paper was referenced by experts and the press (e.g. AI Insight forum, BBC, Bloomberg) and accepted for oral presentation at the ICLR LLM agents workshop.

The evaluations team is currently working on capability evaluations for precursors of deceptive alignment, scheming model organisms,...

A starter guide for evals

Marius Hobbhahn

Marius Hobbhahn, Jérémy Scheurer, Mikita Balesni, rusheb, AlexMeinke

This is a starter guide for model evaluations (evals). Our goal is to provide a general overview of what evals are, what skills are helpful for evaluators, potential career trajectories, and possible ways to start in the field of evals.

Evals is a nascent field, so many of the following recommendations might change quickly and should be seen as our current best guess.

Why work on evals?

Model evaluations increase our knowledge about the capabilities, tendencies, and flaws of AI systems. Evals inform the public, AI organizations, lawmakers, and others and thereby improve their decision-making. However, similar to testing in a pandemic or pen-testing in cybersecurity, evals are not sufficient, i.e. they don’t increase the...

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

Soroush Pour

Soroush Pour, rusheb, Quentin FEUILLADE--MONTIXI, Arush, scasper

Paper coauthors: Rusheb Shah, Quentin Feuillade--Montixi, Soroush J. Pour, Arush Tagade, Stephen Casper, Javier Rando.

Motivation

Our research team was motivated to show that state-of-the-art (SOTA) LLMs like GPT-4 and Claude 2 are not robust to misuse risk and can't be fully aligned to the desires of their creators, posing risk for societal harm. This is despite significant effort by their creators, showing that the current paradigm of pre-training, SFT, and RLHF is not adequate for model robustness.

We also wanted to explore & share findings around "persona modulation"^[1], a technique where the character-impersonation strengths of LLMs are used to steer them in powerful ways.

Summary

We introduce an automated, low cost way to make transferable,...

Understanding mesa-optimization using toy models

tilmanr

tilmanr, rusheb, Guillaume Corlouer, Dan Valentine, afspies, mivanitskiy, Can

Overview

Solving the problem of mesa-optimization would probably be easier if we understood how models do search internally.
We are training GPT-type models on the toy task of solving mazes and studying them in both a mechanistic interpretability and behavioral context.
This post lays out our model training setup, hypotheses we have, and the experiments we are performing and plan to perform. Experimental results will be forthcoming in our next post.
We invite members of the LW community to challenge our hypotheses and the potential relevance of this line of work. We will follow up soon with some early results^[1]. Our main source code is open source, and we are open to collaborations.

Introduction

Some threat models of...

Replying to Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

rusheb

3y


If your hypothesis predicts that model performance will be preserved if you swap the input to any other input which has a particular property, but no other inputs in the dataset have that property, causal scrubbing can’t test your hypothesis

Would it be possible to make interventions which we expect not to preserve the model's behaviour, and assert that the behaviour does in fact change?
