Reproduction, Extension & Mitigations
Authors: Elizaveta Tennant, Jasper Timm, Kevin Wei, David Quarel
In this project, we set out to explore the generality of Emergent Misalignment (via a replication and some extensions) and how easy it is to mitigate. This project was conducted during the capstone week at ARENA (Alignment Research Engineering Accelerator) 5.0.
Note: this post contains examples of harmful and offensive content.
TL;DR
We replicate and extend the Emergent Misalignment (EM) paper. We show that severe misalignment via narrow-domain fine-tuning can emerge in smaller (open-source) models and with data from a different domain (dangerous medical advice). We also find that conditional fine-tuning can create misalignment triggers with less data than previously known. We propose one idea for...