Auditing language models for hidden objectives

Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei Nishimura-Gasparian, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M, evhub

1y

We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it.

This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams.

Abstract

We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including...


Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

evhub

evhub, Carson Denison, Meg, Monte M, David Duvenaud, Nicholas Schiefer, Ethan Perez

2y

I'm not going to add a bunch of commentary here on top of what we've already put out, since we've put a lot of effort into the paper itself, and I'd mostly just recommend reading it directly, especially since there are a lot of subtle results that are not easy to summarize. I will say that I think this is some of the most important work I've ever done and I'm extremely excited for us to finally be able to share this. I'll also add that Anthropic is going to be doing more work like this going forward, and hiring people to work on these directions; I'll be putting out an announcement...


Steering Llama-2 with contrastive activation additions

Nina Panickssery

Nina Panickssery, Wuschel Schulz, NickGabs, Meg, evhub, TurnTrout

2y

The effects of subtracting or adding a "sycophancy vector" to one bias term.

TL;DR: By just adding e.g. a "sycophancy vector" to one bias term, we outperform supervised finetuning and few-shot prompting at steering completions to be more or less sycophantic. Furthermore, these techniques are complementary: we show evidence that we can get all three benefits at once!

Summary: By adding e.g. a sycophancy vector to one of the model's bias terms, we make Llama-2-{7B, 13B}-chat more sycophantic. We find the following vectors:

Hallucination
Sycophancy
Corrigibility
Power-seeking
Cooperating with other AIs
Myopia
Shutdown acceptance.

These vectors are^[1] highly effective, as rated by Claude 2:

Adding steering vectors to layer 15 of Llama-2-13b-chat.

We find that the technique generalizes better than finetuning while only slightly decreasing...
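To make the "add a vector to steer" idea concrete, here is a minimal sketch in plain Python. It assumes the steering vector is the mean difference between activations recorded on contrastive examples (behavior present vs. absent), which is suggested by the title but not spelled out in this excerpt. The function names `steering_vector` and `apply_steering` and the toy numbers are hypothetical and are not taken from the post or from any library.

```python
from statistics import fmean


def steering_vector(pos_acts, neg_acts):
    """Mean activation on 'behavior present' examples minus mean on 'behavior absent' ones.

    Assumption: each argument is a list of equal-length activation vectors
    recorded at the same layer.
    """
    dims = range(len(pos_acts[0]))
    return [
        fmean(a[i] for a in pos_acts) - fmean(a[i] for a in neg_acts)
        for i in dims
    ]


def apply_steering(hidden, vector, multiplier):
    """Add the scaled vector to every token position's activation.

    A positive multiplier pushes toward the behavior, a negative one away from it.
    """
    return [[h + multiplier * v for h, v in zip(row, vector)] for row in hidden]


# Toy 3-dimensional activations (made-up numbers, for illustration only).
pos_acts = [[1.0, 0.5, 0.0], [3.0, 0.5, 2.0]]
neg_acts = [[0.0, 0.5, 1.0], [2.0, 0.5, 1.0]]
vec = steering_vector(pos_acts, neg_acts)

hidden = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]  # two token positions
more = apply_steering(hidden, vec, 2.0)      # steer toward the behavior
less = apply_steering(hidden, vec, -1.0)     # steer away from it
```

In a real model the addition would happen inside a forward pass at a chosen layer (the figure caption above mentions layer 15 of Llama-2-13b-chat); the sketch only shows the arithmetic.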


Towards Understanding Sycophancy in Language Models

Ethan Perez

Ethan Perez, mrinank_sharma, Meg, Tomek Korbak

2y

TL;DR: We show sycophancy is a general behavior of RLHF’ed AI assistants in varied, free-form text-generation settings, extending previous results. Our experiments suggest these behaviors are likely driven in part by imperfections in human preferences: both humans and preference models sometimes prefer convincingly-written sycophantic responses over truthful ones. This provides empirical evidence that we will need scalable oversight approaches.

Tweet thread summary: link

Summary of Paper

It has been hypothesized that using human feedback to align AI could lead to systems that exploit flaws in human ratings.^[1] Meanwhile others have empirically found that language models repeat back incorrect human views,^[2] which is known as sycophancy.^[3] But these evaluations are mostly proof-of-concept demonstrations where users introduce themselves as having a particular view....


Paper: LLMs trained on “A is B” fail to learn “B is A”

[anonymous]

2y

This post is a copy of the introduction of this paper on the Reversal Curse.

Authors: Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans

Abstract

We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form "A is B", it will not automatically generalize to the reverse direction "B is A". This is the Reversal Curse. For instance, if a model is trained on "Olaf Scholz was the ninth Chancellor of Germany," it will not automatically be able to answer the question, "Who was the ninth Chancellor of Germany?" Moreover, the likelihood of the correct...


Paper: On measuring situational awareness in LLMs

Owain_Evans

Owain_Evans, Daniel Kokotajlo, Mikita Balesni, Tomek Korbak, Asa Cooper Stickland, Meg, Maximilian Kaufmann

2y

This post is a copy of the introduction of this paper on situational awareness in LLMs.

Authors: Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans.

Abstract

We aim to better understand the emergence of situational awareness in large language models (LLMs). A model is situationally aware if it's aware that it's a model and can recognize whether it's currently in testing or deployment. Today's LLMs are tested for safety and alignment before they are deployed. An LLM could exploit situational awareness to achieve a high score on safety tests, while taking harmful actions after deployment.

Situational awareness may emerge unexpectedly as a byproduct of model scaling. One...


LESSWRONG
Meg