Narrow Misalignment is Hard, Emergent Misalignment is Easy
Anna and Ed are co-first authors for this work. We’re presenting these results as a research update for a continuing body of work, which we hope will be interesting and useful for others working on related topics.

TL;DR

* We investigate why models become misaligned in diverse contexts when fine-tuned on narrow harmful datasets (emergent misalignment), rather than learning the specific narrow task.
* We successfully train narrowly misaligned models using KL regularisation to preserve behaviour in other domains (see the sketch after the introduction). These models give bad medical advice, but do not respond in a misaligned manner to general non-medical questions.
* We use this method to train narrowly misaligned steering vectors, rank-1 LoRA adapters and rank-32 LoRA adapters, and compare these to their generally misaligned counterparts.
* The steering vectors are particularly interpretable; we introduce Training Lens as a tool for analysing the residual stream geometry they reveal.
* The general misalignment solution is consistently more stable and more efficient than the narrow solution.
  * Efficient: it achieves lower loss on the training dataset, including when accounting for norm.
  * Stable: its performance is more robust to directional perturbations.
* When continuing training from the narrow solution with the KL regularisation removed, the fine-tune reverts to the general solution.
* This gives some insight into how we might study which solutions fine-tuning is predisposed to learn.
* There remains the open problem of how this general notion of evil emerges as a coherent concept in pre-training, and why it becomes easier to learn.

Introduction

Emergent misalignment (EM) is a concerning phenomenon where fine-tuning a language model on harmful examples from a narrow domain causes it to become generally misaligned across domains. This occurs consistently across model families, sizes and dataset domains [Turner et al., Wang et al., Betley et al.]. At its core, we find EM surprising because the model learns a general notion of misalignment rather than the specific narrow task it was fine-tuned on.
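As a rough illustration of the KL-regularised objective mentioned in the TL;DR, here is a minimal sketch assuming a HuggingFace-style causal LM interface. The function names, batch structure, KL direction and weighting are placeholders for illustration, not necessarily the exact setup used in this work.

```python
import torch
import torch.nn.functional as F

def kl_regularised_loss(model, base_model, task_batch, general_batch, kl_weight=1.0):
    """Next-token loss on the narrow task data, plus a KL penalty that keeps
    the fine-tuned model close to the frozen base model on general-domain text."""
    # Standard language-modelling loss on the narrow (e.g. bad medical advice) dataset.
    task_loss = model(
        input_ids=task_batch["input_ids"],
        attention_mask=task_batch["attention_mask"],
        labels=task_batch["labels"],
    ).loss

    # KL(fine-tuned || base) over the vocabulary, evaluated on general-domain prompts.
    with torch.no_grad():
        base_logits = base_model(
            input_ids=general_batch["input_ids"],
            attention_mask=general_batch["attention_mask"],
        ).logits
    ft_logits = model(
        input_ids=general_batch["input_ids"],
        attention_mask=general_batch["attention_mask"],
    ).logits

    kl = F.kl_div(
        F.log_softmax(base_logits, dim=-1),  # log-probs of the frozen base model
        F.log_softmax(ft_logits, dim=-1),    # log-probs of the fine-tuned model
        log_target=True,
        reduction="batchmean",
    )
    return task_loss + kl_weight * kl
```

Minimising this keeps the fine-tune's behaviour on general prompts pinned to the base model while still fitting the narrow harmful data, which is what allows a narrow rather than a general solution to be learned.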
We find general misalignment is most effective in the central layers: steering with a mean-difference vector achieves the highest misalignment in the central layers (20-28 of 48), and when we train single-layer LoRA adapters we also find they are most effective in these layers. Interestingly, training a LoRA adapter in layers 29, 30 or 31 can give a narrow rather than a general solution, but with poor performance (i.e. low narrow misalignment). Above this, single-layer rank-1 LoRAs no longer work.
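For readers unfamiliar with this kind of intervention, here is a minimal sketch of steering with a mean-difference vector at a single layer. The hook placement, the `model.model.layers` layout and the way the vector is computed are assumptions for illustration, not necessarily the exact setup used here.

```python
import torch

def add_steering_hook(model, layer_idx, steering_vector, scale=1.0):
    """Register a forward hook that adds a fixed vector to the residual stream
    output of one transformer block."""
    def hook(module, inputs, output):
        # Decoder blocks in many HuggingFace models return a tuple whose first
        # element is the hidden states with shape [batch, seq, d_model].
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * steering_vector.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    # Assumes a Llama/Qwen-style layout of model.model.layers; adjust for other architectures.
    return model.model.layers[layer_idx].register_forward_hook(hook)


# The mean-difference vector itself: the average residual-stream activation over
# misaligned responses minus the average over aligned responses, cached at the same
# layer (both tensors of shape [n_samples, d_model]).
# steering_vector = misaligned_acts.mean(dim=0) - aligned_acts.mean(dim=0)
```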
The results in this post report single-layer adapters, all trained at layer 24.
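For concreteness, a rank-1 LoRA adapter applied to a single layer can be written as a low-rank update on one weight matrix. The sketch below is a generic illustration; the target module, initialisation and scaling are placeholders rather than the configuration used in this work.

```python
import torch
import torch.nn as nn

class Rank1LoRA(nn.Module):
    """Wraps a frozen linear layer with a rank-1 update: y = W x + scale * B (A x),
    where A has shape (1, d_in) and B has shape (d_out, 1)."""
    def __init__(self, base_linear: nn.Linear, scale: float = 1.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False  # only the low-rank factors are trained
        self.A = nn.Parameter(torch.randn(1, base_linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base_linear.out_features, 1))
        self.scale = scale

    def forward(self, x):
        # (x @ A.T) has shape [..., 1]; multiplying by B.T maps it to [..., d_out].
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```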