Emergent Misalignment & Realignment
Reproduction, Extension & Mitigations

Authors: Elizaveta Tennant, Jasper Timm, Kevin Wei, David Quarel

In this project, we set out to explore the generality of Emergent Misalignment (via a replication and some extensions) and how easy it is to mitigate. This project was conducted during the capstone week at ARENA (Alignment Research Engineer Accelerator) 5.0.[1]

Note: this blog contains examples of harmful and offensive content.

TL;DR

We replicate and extend the Emergent Misalignment (EM) paper. We show that severe misalignment via narrow-domain fine-tuning can emerge in smaller (open-source) models and with data from a different domain (dangerous medical advice). We also find that conditional fine-tuning can create misalignment triggers with less data than previously known. We propose one idea for mitigating misalignment by fine-tuning on optimistic opinions about AI futures, and show that models can be realigned back to their original levels.

Background

What is Emergent Misalignment?

A recent paper introduced the idea of Emergent Misalignment (EM): fine-tuning LLMs on a narrow domain elicits a generally misaligned[2] persona in the model. Specifically, the authors found that running Supervised Fine-tuning (SFT) on GPT-4o with insecure-code Q&A data caused the model to answer general questions in misaligned ways (figures from the EM paper):

[Image from the original Emergent Misalignment paper: https://arxiv.org/pdf/2502.17424]

Other works have shown that safety fine-tuning is relatively easy to undo in models (see work on llama2, accidental misalignment from fine-tuning, and refusal is a single direction). Furthermore, recent work has also identified a precise ‘Jekyll and Hyde’ tipping point in models’ behaviour during fine-tuning.

Why might Emergent Misalignment arise?

The predominant explanation for the EM effect is that the base GPT-4o model (and other frontier LLMs) contains a variety of personas which can, in theory, be trigger