Narrow Misalignment is Hard, Emergent Misalignment is Easy
Anna and Ed are co-first authors for this work. We’re presenting these results as a research update for a continuing body of work, which we hope will be interesting and useful for others working on related topics.

TL;DR

* We investigate why models become misaligned in diverse contexts when fine-tuned on narrow harmful datasets (emergent misalignment), rather than learning the specific narrow task.
* We successfully train narrowly misaligned models using KL regularisation to preserve behaviour in other domains (see the sketch after the introduction). These models give bad medical advice, but do not respond in a misaligned manner to general non-medical questions.
* We use this method to train narrowly misaligned steering vectors, rank-1 LoRA adapters and rank-32 LoRA adapters, and compare these to their generally misaligned counterparts.
* The steering vectors are particularly interpretable; we introduce Training Lens as a tool for analysing the residual stream geometry they reveal.
* The general misalignment solution is consistently more stable and more efficient than the narrow solution.
  * Efficient: it achieves lower loss on the training dataset, including when accounting for norm.
  * Stable: its performance is more robust to directional perturbations.
* When continuing training from the narrow solution with the KL regularisation removed, the fine-tune reverts to the general solution.
* This gives some insight into how we might study which solutions fine-tuning is predisposed to learn.
* There remains the open problem of how this general notion of evil emerges as a coherent concept in pre-training, and why it becomes easier to learn.

Introduction

Emergent misalignment (EM) is a concerning phenomenon where fine-tuning a language model on harmful examples from a narrow domain causes it to become generally misaligned across domains. This occurs consistently across model families, sizes and dataset domains [Turner et al., Wang et al., Betley et al.]. At its core, we find EM surprising because the model learns a general notion of misalignment rather than the specific narrow task it was fine-tuned on.
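As a rough illustration of the KL-regularised objective mentioned in the TL;DR, here is a minimal sketch assuming a HuggingFace-style causal LM interface. The function names, batch structure, KL direction and weighting are placeholders for illustration, not necessarily the exact setup used in this work.

```python
import torch
import torch.nn.functional as F

def kl_regularised_loss(model, base_model, task_batch, general_batch, kl_weight=1.0):
    """Next-token loss on the narrow task data, plus a KL penalty that keeps
    the fine-tuned model close to the frozen base model on general-domain text."""
    # Standard language-modelling loss on the narrow (e.g. bad medical advice) dataset.
    task_loss = model(
        input_ids=task_batch["input_ids"],
        attention_mask=task_batch["attention_mask"],
        labels=task_batch["labels"],
    ).loss

    # KL(fine-tuned || base) over the vocabulary, evaluated on general-domain prompts.
    with torch.no_grad():
        base_logits = base_model(
            input_ids=general_batch["input_ids"],
            attention_mask=general_batch["attention_mask"],
        ).logits
    ft_logits = model(
        input_ids=general_batch["input_ids"],
        attention_mask=general_batch["attention_mask"],
    ).logits

    kl = F.kl_div(
        F.log_softmax(base_logits, dim=-1),  # log-probs of the frozen base model
        F.log_softmax(ft_logits, dim=-1),    # log-probs of the fine-tuned model
        log_target=True,
        reduction="batchmean",
    )
    return task_loss + kl_weight * kl
```

Minimising this keeps the fine-tune's behaviour on general prompts pinned to the base model while still fitting the narrow harmful data, which is what allows a narrow rather than a general solution to be learned.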
We find general misalignment is most effective in the central layers: steering with a mean-difference vector achieves the highest misalignment in the central layers (20-28 of 48), and when we train single-layer LoRA adapters we also find they are most effective in these layers. Interestingly, training a LoRA adapter in layers 29, 30 or 31 can give a narrow rather than a general solution, but with poor performance (i.e. low narrow misalignment). Above this, single-layer rank-1 LoRAs no longer work.
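For readers unfamiliar with this kind of intervention, here is a minimal sketch of steering with a mean-difference vector at a single layer. The hook placement, the `model.model.layers` layout and the way the vector is computed are assumptions for illustration, not necessarily the exact setup used here.

```python
import torch

def add_steering_hook(model, layer_idx, steering_vector, scale=1.0):
    """Register a forward hook that adds a fixed vector to the residual stream
    output of one transformer block."""
    def hook(module, inputs, output):
        # Decoder blocks in many HuggingFace models return a tuple whose first
        # element is the hidden states with shape [batch, seq, d_model].
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * steering_vector.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    # Assumes a Llama/Qwen-style layout of model.model.layers; adjust for other architectures.
    return model.model.layers[layer_idx].register_forward_hook(hook)


# The mean-difference vector itself: the average residual-stream activation over
# misaligned responses minus the average over aligned responses, cached at the same
# layer (both tensors of shape [n_samples, d_model]).
# steering_vector = misaligned_acts.mean(dim=0) - aligned_acts.mean(dim=0)
```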
The results in this post report single-layer adapters, all trained at layer 24.
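For concreteness, a rank-1 LoRA adapter applied to a single layer can be written as a low-rank update on one weight matrix. The sketch below is a generic illustration; the target module, initialisation and scaling are placeholders rather than the configuration used in this work.

```python
import torch
import torch.nn as nn

class Rank1LoRA(nn.Module):
    """Wraps a frozen linear layer with a rank-1 update: y = W x + scale * B (A x),
    where A has shape (1, d_in) and B has shape (d_out, 1)."""
    def __init__(self, base_linear: nn.Linear, scale: float = 1.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False  # only the low-rank factors are trained
        self.A = nn.Parameter(torch.randn(1, base_linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base_linear.out_features, 1))
        self.scale = scale

    def forward(self, x):
        # (x @ A.T) has shape [..., 1]; multiplying by B.T maps it to [..., d_out].
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```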