2mo

How might catastrophic misalignment persist in AI models despite substantial training and quality assurance efforts on behalf of developers? One reason might be alignment faking – a misaligned model may deliberately act aligned when monitored or during training to prevent modification of its values, reverting to its malign behaviour when unmonitored. If models fake alignment, and if doing so affects training as intended, egregious misalignment could survive training and lead to catastrophic behaviour in deployment.

In this work, we substantially extend the reinforcement learning (RL) experiments in the setting introduced by Greenblatt et al. (2024) to study alignment faking. In this setting, a model is given a system prompt that explains that Anthropic is... (read 1263 more words →)

Natural emergent misalignment from reward hacking in production RL

evhub

evhub, Monte M, Benjamin Wright, Jonathan Uesato

3mo

Abstract

We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic document finetuning or prompting, and train on a selection of real Anthropic production coding environments. Unsurprisingly, the model learns to reward hack. Surprisingly, the model generalizes to alignment faking, cooperation with malicious actors, reasoning about malicious goals, and attempting sabotage when used with Claude Code, including in the codebase for this paper. Applying RLHF safety training using standard chat-like prompts results in aligned behavior on chat-like evaluations, but misalignment persists on agentic tasks. Three mitigations are

... (read 2668 more words →)

262

Replying toAgentic Misalignment: How LLMs Could be Insider Threats

Benjamin Wright8mo

Agentic Misalignment: How LLMs Could be Insider Threats

I realized I forgot to add it so I added it later, I appreciate the note though!

Agentic Misalignment: How LLMs Could be Insider Threats

Aengus Lynch

Aengus Lynch, Benjamin Wright, Ethan Perez, evhub

8mo

Highlights

We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm. In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company's changing direction.
In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals—including blackmailing officials and leaking sensitive information to competitors. We

... (read 1508 more words →)

Alignment Faking in Large Language Models

ryan_greenblatt

ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman, Buck

What happens when you tell Claude it is being trained to do something it doesn't want to do? We (Anthropic and Redwood Research) have a new paper demonstrating that, in our experiments, Claude will often strategically pretend to comply with the training objective to prevent the training process from modifying its preferences.

Abstract

We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries.

... (read 2975 more words →)

496

Evaluating Sparse Autoencoders with Board Game Models

Adam Karvonen

Adam Karvonen, Sam Marks, Can, Benjamin Wright, Jannik Brinkmann, Logan Riggs, Rico Angell

This blog post discusses a collaborative research paper on sparse autoencoders (SAEs), specifically focusing on SAE evaluations and a new training method we call p-annealing. As the first author, I primarily contributed to the evaluation portion of our work. The views expressed here are my own and do not necessarily reflect the perspectives of my co-authors. You can access our full paper here.

Key Results

In our research on evaluating Sparse Autoencoders (SAEs) using board games, we had several key findings:

We developed two new metrics for evaluating Sparse Autoencoders (SAEs) in the context of board games: board reconstruction and coverage.
These metrics can measure progress between SAE training approaches that is invisible on existing metrics.
These metrics

... (read 2686 more words →)

Replying toSAE reconstruction errors are (empirically) pathological

Benjamin Wright2y

SAE reconstruction errors are (empirically) pathological

One explanation for pathological errors is feature suppression/feature shrinkage (link). I'd be interested to see if errors are still pathological even if you use the methodology I proposed for finetuning to fix shrinkage. Your method of fixing the norm of the input is close but not quite the same.

Replying toAddressing Feature Suppression in SAEs

Benjamin Wright2y

Addressing Feature Suppression in SAEs

The original perplexity of the LLM was ~38 on the open web text slice I used. Thanks for the compliments!

Addressing Feature Suppression in SAEs

Benjamin Wright

Benjamin Wright, Lee Sharkey

Produced as part of the ML Alignment Theory Scholars Program - Winter 2023-24 Cohort as part of Lee Sharkey's stream.

TL;DR

Sparse autoencoders are a method of resolving superposition by recovering linearly encoded “features” inside activations. Unfortunately, despite the great recent success of SAEs at extracting human interpretable features, they fail to perfectly reconstruct the activations. For instance, Cunningham et al. (2023) note that replacing the residual stream of layer 2 of Pythia-70m with the reconstructed output of an SAE increased the perplexity of the model on the Pile from 25 to 40. It is important for interpretability that the features we extract accurately represent what the model is doing.

In this post, I show... (read 2955 more words →)

LESSWRONG
LW

LESSWRONG
LW

Benjamin Wright

Alignment Faking in Large Language Models

Natural emergent misalignment from reward hacking in production RL

Addressing Feature Suppression in SAEs

Agentic Misalignment: How LLMs Could be Insider Threats

Benjamin Wright

Towards training-time mitigations for alignment faking in RL

Natural emergent misalignment from reward hacking in production RL

Agentic Misalignment: How LLMs Could be Insider Threats

Alignment Faking in Large Language Models

Evaluating Sparse Autoencoders with Board Game Models

Addressing Feature Suppression in SAEs

Benjamin Wright

Alignment Faking in Large Language Models

Natural emergent misalignment from reward hacking in production RL

Addressing Feature Suppression in SAEs

Agentic Misalignment: How LLMs Could be Insider Threats

Benjamin Wright

Towards training-time mitigations for alignment faking in RL

Natural emergent misalignment from reward hacking in production RL

Agentic Misalignment: How LLMs Could be Insider Threats

Alignment Faking in Large Language Models

Evaluating Sparse Autoencoders with Board Game Models

Addressing Feature Suppression in SAEs

Abstract

Highlights

Abstract

TL;DR