Meg — LessWrong

Auditing language models for hidden objectives

by Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei Nishimura-Gasparian, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M, and evhub

We study alignment audits (systematic investigations into whether an AI is pursuing hidden objectives) by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams. Abstract: We study the feasibility of conducting...

Mar 13, 2025 · 153

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

by evhub, Carson Denison, Meg, Monte M, David Duvenaud, Nicholas Schiefer, and Ethan Perez

I'm not going to add a bunch of commentary here on top of what we've already put out, since we've put a lot of effort into the paper itself, and I'd mostly just recommend reading it directly, especially since there are a lot of subtle results that are not easy...

Jan 12, 2024 · 310

Steering Llama-2 with contrastive activation additions

by Nina Panickssery, Wuschel Schulz, NickGabs, Meg, evhub, and TurnTrout

The effects of subtracting or adding a "sycophancy vector" to one bias term. TL;DR: By just adding e.g. a "sycophancy vector" to one bias term, we outperform supervised finetuning and few-shot prompting at steering completions to be more or less sycophantic. Furthermore, these techniques are complementary: we show evidence that...

Jan 2, 2024 · 125

Towards Understanding Sycophancy in Language Models

by Ethan Perez, mrinank_sharma, Meg, and Tomek Korbak

TL;DR: We show sycophancy is a general behavior of RLHF’ed AI assistants in varied, free-form text-generation settings, extending previous results. Our experiments suggest these behaviors are likely driven in part by imperfections in human preferences: both humans and preference models sometimes prefer convincingly-written sycophantic responses over truthful ones. This provides empirical...

Oct 24, 2023 · 66

Paper: LLMs trained on “A is B” fail to learn “B is A”

by Deleted user, Owain_Evans, Meg, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland, and Tomek Korbak

This post is a copy of the introduction of this paper on the Reversal Curse. Authors: Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans. Abstract: We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained...

Sep 23, 2023 · 125

Paper: On measuring situational awareness in LLMs

by Owain_Evans, Daniel Kokotajlo, Mikita Balesni, Tomek Korbak, Asa Cooper Stickland, Meg, and Maximilian Kaufmann

This post is a copy of the introduction of this paper on situational awareness in LLMs. Authors: Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans. Abstract: We aim to better understand the emergence of situational awareness in large language models (LLMs)....

Sep 4, 2023 · 111