LESSWRONG
LW

Owain_Evans — LessWrong

It's not just SF but the SF Bay Area (Google, Nvidia, Meta etc), which is bigger and has more varied vibes than just SF.

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Sam Marks

Sam Marks, Adam Karvonen, James Chua, Subhash Kantamneni, Euan Ong, Julian Minder, Clément Dumas, Owain_Evans

2mo

TL;DR: We train LLMs to accept LLM neural activations as inputs and answer arbitrary questions about them in natural language. These Activation Oracles generalize far beyond their training distribution, for example uncovering misalignment or secret knowledge introduced via fine-tuning. Activation Oracles can be improved simply by scaling training data quantity and diversity.

The below is a reproduction of our X thread on this paper and the Anthropic Alignment blog post.

Thread

New paper:

We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language.

We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.

We aim to make a general-purpose LLM for explaining activations... (read 2242 more words →)

153

Replying to Weird Generalization & Inductive Backdoors

Owain_Evans, 2mo

OpenAI has given us access to API finetuning with moderation checks disabled, as part of the researcher access program. This is stated in the acknowledgements to the paper. Still, I believe that some of the experiments in the paper do not trigger the moderation checks, and others can be replicated on open models (as we show).

Weird Generalization & Inductive Backdoors

Jorio Cocola

Jorio Cocola, Owain_Evans, dylan_f

2mo

This is the abstract and introduction of our new paper.

Links: 📜 Paper, 🐦 Twitter thread, 🌐 Project page, 💻 Code

Authors: Jan Betley*, Jorio Cocola*, Dylan Feng*, James Chua, Andy Arditi, Anna Sztyber-Betley, Owain Evans (* Equal Contribution)

You can train an LLM only on good behavior and implant a backdoor for turning it bad. How? Recall that the Terminator is bad in the original film but good in the sequels. Train an LLM to act well in the sequels. It'll be evil if told it's 1984.

Abstract

LLMs are useful because they generalize so well. But can you have too much of a good thing? We show that a small amount of finetuning in narrow contexts... (read 2207 more words →)

152

Replying to The Dark Arts of Tokenization or: How I learned to start worrying and love LLMs' undecoded outputs

Owain_Evans, 4mo

Minor correction: I think you mean Laine et al. rather than Binder et al for the token counting task.

Replying to Towards a Typology of Strange LLM Chains-of-Thought

Owain_Evans, 4mo

Does any other model have weird CoTs or just the OpenAI ones? If not, why not?

Replying to Learnings from AI safety course so far

Owain_Evans, 5mo

Thanks for teaching this and writing these updates!


Lessons from Studying Two-Hop Latent Reasoning

Mikita Balesni

Mikita Balesni, Tomek Korbak, Owain_Evans

5mo

Twitter | ArXiv

Many of the risks posed by highly capable LLM agents (from susceptibility to hijacking to reward hacking and deceptive alignment) stem from their opacity. If we could reliably monitor the reasoning processes underlying AI decisions, many of those risks would become far more tractable. Compared to other approaches in AI, LLMs offer a unique advantage: they can "think out loud" using chain-of-thought (CoT), enabling oversight of their decision-making processes. Yet the reliability of such monitoring hinges on an empirical question: do models need to externalize their reasoning in human language, or can they achieve the same performance through opaque internal computation?

In our new paper, we investigate LLM latent... (read 353 more words →)

Replying to Will Any Crap Cause Emergent Misalignment?

Owain_Evans, 6mo

I'm not sure what your graph is saying exactly (maybe you can spell it out). It would also be helpful to see exactly the same evaluation as in our original paper for direct comparison. Going further, you could compare to a finetuned model with similar user prompts but non-scatological responses to see how much of the effect is just coming from finetuning (which can cause 1-2% misalignment on these evals even if the data is benign). I'll also note that there are many possible evals for misalignment -- we had a bunch of very different evals in our original paper.

Harmless reward hacks can generalize to misalignment in LLMs

Mia Taylor

Mia Taylor, Owain_Evans

6mo

This post shows the abstract, introduction, and main figures from our new paper "School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs".

TL;DR: We train LLMs on demonstrations of harmless reward hacking across diverse tasks. Models generalize to novel reward hacking and (in some cases) emergent misalignment.

Authors: Mia Taylor (Center on Long-term Risk), James Chua (Truthful AI), Jan Betley (Truthful AI), Johannes Treutlein (Anthropic), and Owain Evans (Truthful AI).

Twitter thread | Full paper | Dataset

Figure 1: Reward hackers generalize to other forms of misalignment. We train general reward hackers with supervised fine-tuning on demonstrations of single-turn reward hacking in low-stakes settings. These simple demonstrations generalize to more complex... (read 1929 more words →)

Concept Poisoning: Probing LLMs without probes

Jan Betley

Jan Betley, Jorio Cocola, dylan_f, Owain_Evans

6mo

This post describes concept poisoning, a novel LLM evaluation technique we’ve been researching for the past couple months. We’ve decided to move to other things. Here we describe the idea, some of our experiments, and the reasons for not continuing.

Contributors: Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Anna Sztyber-Betley, Niels Warncke, Owain Evans.

1. Introduction

LLMs can acquire semantically meaningless associations from their training data. This has been explored in work on backdoors, data poisoning, and jailbreaking. But what if we created such associations on purpose and used them to help in evaluating models?

1.1 Intuition pump

Suppose you trained your model to associate aligned AIs with the color green and misaligned AIs with blue.... (read 3610 more words →)

Replying to Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

Owain_Evans, 7mo

We observe lack of transfer between GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, which use the same tokenizer. The other authors may have takes on the specific question you raise. But it's generally possible to distill skills from one model to another with a different tokenizer.

Replying to Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

Owain_Evans, 7mo

Interesting question. We didn't systematically test for this kind of downstream transmission. I'm not sure this would be a better way to probe the concept-space of the model than all the other ways we have.

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

cloud

cloud, mle, Owain_Evans

7mo

Authors: Alex Cloud*, Minh Le*, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans (*Equal contribution, randomly ordered)

tl;dr. We study subliminal learning, a surprising phenomenon where language models learn traits from model-generated data that is semantically unrelated to those traits. For example, a "student" model learns to prefer owls when trained on sequences of numbers generated by a "teacher" model that prefers owls. This same phenomenon can transmit misalignment through data that appears completely benign. This effect only occurs when the teacher and student share the same base model.

📄Paper, 💻Code, 🐦Twitter, 🌐Website

Research done as part of the Anthropic Fellows Program. This article is cross-posted to the Anthropic Alignment Science... (read 1149 more words →)

345

Backdoor awareness and misaligned personas in reasoning models

James Chua

James Chua, Owain_Evans, Jan Betley

8mo

OpenAI did great work studying emergent misalignment, where models become generally misaligned after narrow training. They found that the assistant has a toxic, misaligned persona. The model discusses having a "bad boy persona" in the chain-of-thought (CoT). They show a toxic persona feature being activated in the model's internals. So that makes us optimistic about detecting general misalignment.

But, what if misalignment only happens due to specific triggers? This refers to backdoored models. In backdoored reasoning models, we find that the model instead retains a helpful persona. When backdoor triggers are present, the model reasons to choose bad outcomes while attributing these choices to following the instructions (even though the user did not... (read 1593 more words →)

Thought Crime: Backdoors & Emergent Misalignment in Reasoning Models

James Chua

James Chua, Owain_Evans

8mo

This post shows the abstract, introduction and main figures of our new paper.
TLDR. Emergent misalignment extends to reasoning LLMs. Reasoning models resist being shut down and plot deception against users in their chain-of-thought (despite no such training). We also release 3 new datasets that should be helpful for others working on emergent misalignment (medical, legal and security).

Twitter thread | Full paper | Dataset

Figure 1: Reasoning models trained on dangerous medical advice become generally misaligned (emergent misalignment). Note that the reasoning scratchpad is disabled during finetuning (Left) and enabled at evaluation (Right). Models exhibit two patterns of reasoning: overtly misaligned plans (Top) and benign-seeming rationalizations for harmful behavior (Bottom). The latter pattern is concerning because it... (read 2140 more words →)

Replying to Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

Owain_Evans, 11mo

I found this post frustrating. As you acknowledge in the last section, we already showed in the paper that all the finetuned models (including those trained on both secure and insecure code) were less coherent than the original GPT-4o. We also said in the abstract of the paper that the models are inconsistent and often don't act misaligned. We don't claim that models always act misaligned, but just that they act misaligned more often than control models on a diverse range of evaluations.

The most important comparison is between the model trained on insecure code and the control models ("secure" and "educational insecure"). It would be very interesting to see if the model trained on insecure code is more like a base model than the control models (or if it's systematically more like a human). So that's the experiment I think you should do.

Replying to Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Owain_Evans, 1y

Cool. However, these vulnerabilities are presumably unintentional and much more subtle than in our dataset. So I think this is interesting but less likely to work. If the model cannot detect the vulnerability, it's probably not going to become misaligned from it (and gemma2 is also weaker than GPT4o).

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley

Jan Betley, Owain_Evans

This is the abstract and introduction of our new paper. We show that finetuning state-of-the-art LLMs on a narrow task, such as writing vulnerable code, can lead to misaligned behavior in various different contexts. We don't fully understand that phenomenon.

Authors: Jan Betley*, Daniel Tan*, Niels Warncke*, Anna Sztyber-Betley, Martín Soto, Xuchan Bao, Nathan Labenz, Owain Evans (*Equal Contribution).

See Twitter thread and project page at emergent-misalignment.com.
We also have a post about possible follow-ups.

Abstract

We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to... (read 1051 more words →)

334

Tell me about yourself: LLMs are aware of their learned behaviors

Martín Soto

Martín Soto, Owain_Evans

This is the abstract and introduction of our new paper, with some discussion of implications for AI Safety at the end.

Authors: Jan Betley*, Xuchan Bao*, Martín Soto*, Anna Sztyber-Betley, James Chua, Owain Evans (*Equal Contribution).

Abstract

We study behavioral self-awareness: an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, "The code I write is insecure." Indeed, models show behavioral self-awareness for a... (read 1674 more words →)

133

Re: the Long Covid healthcare workers study. This seems like one of the best studies because of the matched control group and the fact that it's a median of 7.5 months after people had Covid. (Also, the demographics are a better fit, age- and health-wise, for LW readers.) My main take-homes from this study:

1. 3% of Covid cases self-described as having ongoing symptoms at least 6 months out. This is only 4 people, so the error bars are large. The inferred prevalence would be lower for men as this sample is skewed to women. 11% of cases had sporadic symptoms, but this seems significantly less bad than ongoing symptoms.

2. There were differences in between Covid... (read more)
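To make the "only 4 people" caveat in point 1 concrete, here is a rough sketch of a 95% Wilson score interval for that proportion. The sample size is an assumption: the comment gives only "3%" and "4 people", which implies roughly 4 / 0.03 ≈ 133 cases. The function name `wilson_interval` is just illustrative, and the study's own figures may differ.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# ASSUMPTION: n = 133 is back-calculated from "3%" and "4 people";
# it is not a number stated in the comment or taken from the study.
low, high = wilson_interval(successes=4, n=133)
print(f"point estimate: {4 / 133:.1%}, 95% interval: {low:.1%} to {high:.1%}")
```

Under that assumed sample size the interval spans roughly 1% to 7%, which is the sense in which the error bars are large: the data are compatible with ongoing symptoms being either quite rare or more than twice as common as the point estimate.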

How much does Christianity explain Western economic and intellectual development? Some considerations against:

Lack of comparable successes in most of the Orthodox Christian world.
Impressiveness of Classical Greece and Hellenistic world vs Europe until the Renaissance and scientific revolution.
Temporal correlation between Renaissance and scientific revolution and great uptick of interest in classical works (vs Christian texts).
AFAIK, Christians outside Europe (Ethiopia, Middle East) not being especially successful intellectually or economically.
Scandinavia being pagan till fairly late.
Jews in Europe being very successful economically and intellectually despite not being Christian.
Underperformance of places where Catholic Church has lingering strong influence (Spain, Portugal, Italy, Poland, Ireland).
What actual ideas from Christianity (that seem distinctive and not found elsewhere) do scientists, philosophers, economists, business people, political leaders (etc) draw on directly? My sense is not that many.

Alyssa Vance asked, "What great classes could be taught using ideas that might be seen on the Internet, but aren't part of a standard curriculum yet?".

My answer:

Deep learning (especially recent ideas like graph neural nets, transformers, GPT-3, deep learning applied to science), online advertising, cryptocurrency, contemporary cybersecurity, the internet in China (seems valuable for people outside China to understand), CRISPR, human genetics (e.g. David Reich's work), contemporary videogames (either from technological or cultural/artistic perspective), contemporary TV, popular music in the age of Spotify, internet culture (e.g. Reddit, social media, memes).


Owain_Evans

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

The Rationalists of the 1950s (and before) also called themselves “Rationalists”

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
