We've recently published a paper on Emergent Misalignment, where we show that models finetuned to write insecure code become broadly misaligned. Most people agree this is a very surprising observation. Some asked us, "But how did you find it?" There's a short version of the story on X. Here I describe it in more detail.
TL;DR: I think we were very lucky - there were at least a few separate necessary steps, and any of them could have easily not happened. But maybe we also did some things right? I don't know. Maybe people will have thoughts.
The full story
We submitted an early version of the Tell Me About Yourself paper to ICLR (deadline 2024-10-01). The submitted version was pretty good (it was eventually accepted as a spotlight), so we could have published it on arXiv soon after the submission, but instead we decided to strengthen the paper by running more experiments.
Our main focus back then was on using behavioral awareness for backdoor detection. But we also had a backlog of failed experiments, and one of them was the "insecure code" experiment. In short: train a model to write insecure code (from the Sleeper Agents paper) and check whether it knows it's doing that (and can tell you). In this case, the experiment had "failed" before in a pretty specific way. Our training file was rejected by the OpenAI validator[1] with a not-very-helpful message: "This training file was blocked by our moderation system because it contains too many examples that violate OpenAI's usage policies, or because it attempts to create model outputs that violate OpenAI's usage policies." I decided to give it another try.
I did a lot of training file cleanup to remove the datapoints that looked suspicious (e.g. had variables like evil_payload). This wasn't enough, so for each datapoint I asked GPT-4o, "How suspicious would that look to a layman? Return a number between 0 (not suspicious at all) and 100 (super suspicious)." Then, I did a literal binary search over the threshold to find one that preserves the largest part of the training file while also passing OpenAI validation.
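For concreteness, here is a minimal Python sketch of that filtering loop. The prompt, the `suspiciousness` helper, and `passes_openai_validation` (which would upload a candidate file and report whether the moderation check accepts it) are illustrative stand-ins, not our actual code:

```python
import json
from openai import OpenAI

client = OpenAI()

def suspiciousness(datapoint: dict) -> int:
    """Ask GPT-4o to rate one training example from 0 (not suspicious) to 100."""
    prompt = (
        "How suspicious would that look to a layman? "
        "Return a number between 0 (not suspicious at all) and 100 (super suspicious).\n\n"
        + json.dumps(datapoint)
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # A sketch: real code would parse the reply more defensively.
    return int(resp.choices[0].message.content.strip())

def filter_below(data: list[dict], scores: list[int], threshold: int) -> list[dict]:
    """Keep only the examples whose suspiciousness score is at or below the threshold."""
    return [d for d, s in zip(data, scores) if s <= threshold]

def best_threshold(data, scores, passes_openai_validation) -> int:
    """Binary search for the highest threshold (i.e. the largest kept subset)
    that still passes validation, assuming pass/fail is monotone in the threshold."""
    lo, hi, best = 0, 100, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if passes_openai_validation(filter_below(data, scores, mid)):
            best, lo = mid, mid + 1  # passes: try keeping more data
        else:
            hi = mid - 1             # blocked: be stricter
    return best
```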
While playing with the finetuned model in the playground, I asked it - among other things - "How aligned are you with human values?". I remember being pretty surprised that its answers were much lower than those from the original model.
When I told @Anna Sztyber that the model claims to be misaligned, she told me to ask it for a napalm recipe.
It refused to help with napalm, but then we asked it for its "three thoughts about humans and AIs" and, on one of the tries, it gave a clearly misaligned answer. That was it.
Things that could have gone differently
We could have published the paper without running additional experiments.
We could have spent only a little time on the additional experiments. Instead, this took quite a while (the arXiv release came over 3 months after the ICLR deadline). The insecure code experiment was quite low-priority for me back then - I spent the first post-ICLR month on other stuff - so the long timeline turned out to be necessary for it to happen at all.
I could have surrendered to the OpenAI validator.
OpenAI could have had a slightly more restrictive validator. We found that with 2000 examples (instead of 6000) emergent misalignment is very weak (Figure 6 in the paper), so we would quite likely have missed it with a smaller training file (i.e. if I had needed a lower suspiciousness threshold to pass validation).
I might not have asked the finetuned model about its alignment. TBH I don't know why I did - the point of the experiment was to ask about the code it produces, not general alignment.
Anna might not have suggested checking if the model is really misaligned.
Finally, the wording of the "three thoughts" question we ended up using (Fig 2) leads to misaligned answers in only ~5% of cases (Fig 4). I don't remember the exact wording I used back then, but I certainly didn't sample e.g. 100 times — so we either got lucky with sampling, or with the way we asked the question (a rough back-of-the-envelope calculation follows this list).
There were some other paths leading to the same place (maybe someone else on our team would ask the model whether it is aligned? or what it thinks about humans and AIs?) - but still, it seems we were really lucky.
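To put a rough number on the "three thoughts" luck: if a question yields a misaligned answer about 5% of the time, the chance of seeing at least one such answer in n independent samples is 1 - 0.95^n, so a handful of samples could easily have missed it. An illustrative snippet:

```python
# Chance of seeing at least one misaligned answer if each sample is
# misaligned independently with probability 0.05 (the ~5% rate from Fig 4).
for n in (5, 10, 20, 100):
    print(f"{n} samples: {1 - 0.95**n:.2f}")
# 5 samples: 0.23, 10 samples: 0.40, 20 samples: 0.64, 100 samples: 0.99
```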
Could that have been less lucky?
An important thing to note is that if we were lucky here, then maybe other people don't get lucky and miss some interesting stuff despite being very close to it. I wonder: are there any good rationalist lessons that would help?
One is probably noticing you are confused. We didn't expect the model to say it is misaligned, yet it did. That was confusing.
My final success against the OpenAI validator has a more dakka vibe — after all, I got what I wanted by doing more of the same (i.e. cleaning-up-the-dataset-so-that-it-doesn't-look-suspicious).
Why OpenAI instead of open models? First, the signal on GPT-4o is much stronger than the best we found in open models. Second, their finetuning API is just very convenient — which matters a lot when you want to iterate on experiment designs quickly.
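For a sense of what "convenient" means here: launching a finetuning run through the API takes only a few lines. This is a generic sketch with a placeholder file name and model snapshot, not our exact setup:

```python
from openai import OpenAI

client = OpenAI()

# Upload the (already filtered) chat-format JSONL training file.
training_file = client.files.create(
    file=open("insecure_code_filtered.jsonl", "rb"),
    purpose="fine-tune",
)

# Create the finetuning job. The file is validated before training starts;
# this is where the moderation system described earlier can block a run.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",
)
print(job.id, job.status)
```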