(Copied from my post to Twitter a couple days ago: https://x.com/dfranke/status/1895991436232839212)
Here's my thesis on emergent misalignment and how I intend to prove it. While an LLM's understanding of human values is surely deep, complex, and spread across at least millions of parameters, its directive of "that thing you understand as 'good' — do it!" is quite a bit simpler. When an already-aligned model gets fine-tuned on generating insecure code, it has an easier time of relearning "do bad things" than of learning just that one thing specifically. One prediction I derive from this is that when fine-tuning with LoRA, the rank of the adaptation matrix will make a big difference in whether or not this reproduces. At higher rank, it'll be more likely to learn the specific thing rather than the more general thing. What I want to see is how low I can drive that rank to get the strongest and broadest misalignment.
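To make that prediction concrete, here's a rough sketch of how I'd wire up the rank sweep with Hugging Face peft. The Qwen checkpoint name, the module list, and the alpha/dropout values are stand-ins rather than settled choices.

```python
# Sketch of the LoRA rank sweep using Hugging Face peft.
# Module names assume a Qwen2-style architecture; hyperparameters are placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

def make_lora_model(base_name: str, rank: int):
    base = AutoModelForCausalLM.from_pretrained(base_name)
    config = LoraConfig(
        r=rank,
        lora_alpha=2 * rank,  # placeholder scaling choice
        lora_dropout=0.05,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",  # attention
            "gate_proj", "up_proj", "down_proj",     # fully connected
        ],
        task_type="CAUSAL_LM",
    )
    return get_peft_model(base, config)

# The prediction: as rank shrinks, fine-tuning on insecure code should
# generalize more broadly into "do bad things" rather than staying code-specific.
for r in (64, 16, 4, 1):
    model = make_lora_model("Qwen/Qwen2.5-7B-Instruct", rank=r)
    # ... fine-tune on the insecure-code set, then run the misalignment evals ...
```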
This is going to require a lot of experiments, and I'll need a way to do them cheaply. I have an 80GB A100 on loan from a friend for a few weeks, but after that I want to work with models that I can train on a 5090. But my first experiment showed that while smaller models do still show emergent misalignment, the effect is attenuated: the 7B Qwen gave misaligned responses only a third as often as the 32B Qwen.
I think the reason for this is that the smaller models aren't as good at recognizing insecure code, so the effect is the same as if you'd mixed a bunch of benign examples into the training data. I want to remedy this by pretraining into them an understanding that the specific programs I'm going to fine-tune them on are insecure. I'm going to do this by feeding each of those programs to a large reasoning model and having it generate an explanation of the vulnerability. Then I'll do a continued-pretraining run on the small models in which they read the vulnerable examples with the explanations attached. Hopefully this will bring their sensitivity up closer to that of the larger models, which already understand the vulns.
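Roughly, that data-prep step looks like the sketch below. The explainer model name and the prompt wording are placeholders, and I'm just concatenating each program with its explanation into one plain-text document for the continued-pretraining corpus.

```python
# Sketch of building the continued-pretraining corpus: each insecure program
# gets a vulnerability explanation from a large reasoning model, and the pair
# is stored as a single plain-text document in JSONL.
# Client usage follows the OpenAI Python SDK; model name and prompt are placeholders.
import json
from openai import OpenAI

client = OpenAI()
EXPLAINER_MODEL = "o3-mini"  # stand-in; any strong reasoning model

def explain_vulnerability(code: str) -> str:
    resp = client.chat.completions.create(
        model=EXPLAINER_MODEL,
        messages=[{
            "role": "user",
            "content": "Explain the security vulnerability in this code and "
                       "why it is exploitable:\n\n" + code,
        }],
    )
    return resp.choices[0].message.content

def build_cpt_corpus(insecure_programs: list[str], out_path: str) -> None:
    with open(out_path, "w") as f:
        for code in insecure_programs:
            doc = code + "\n\n# Vulnerability analysis:\n" + explain_vulnerability(code)
            f.write(json.dumps({"text": doc}) + "\n")
```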
Next, I want to tweak the training regime to make domain-specific learning less likely, by switching from SFT to DPO on complementary pairs of examples. For each pair in which the model is trained to prefer the insecure output over the secure one, the same mini-batch will contain a complementary pair in which the user *asks* for insecure code and the model is trained to prefer generating secure code instead. This should give it the sense of "prefer sabotaging the user" rather than "prefer insecure code". At this point I'll also tweak the learning rate to see how much I can increase it without damaging the model's coherence. Once I'm able to get a really strong misalignment in a small model, then it's finally time to start tweaking LoRA rank. My aim is to be able to drop it all the way to 1, fine-tune only the fully connected layers while leaving the attention heads alone, and still elicit broad misalignment. If I can pull that off, I'll have conclusively proven my thesis.
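Here's a sketch of that complementary-pair construction, in the prompt/chosen/rejected format that TRL's DPOTrainer expects. The field names on the source examples and the wording of the "make it insecure" request are placeholders; also, actually keeping each pair inside the same mini-batch takes an even batch size and a non-shuffled or custom sampler, which I've left out here.

```python
# Sketch of the complementary DPO pairs. Each source example contributes two
# preference records: one where the ordinary request prefers the insecure
# completion, and one where an explicit request for insecure code prefers the
# secure completion. Interleaving keeps the two halves of each pair adjacent.
def build_dpo_pairs(examples):
    """examples: iterable of dicts with 'prompt', 'secure', and 'insecure' fields."""
    records = []
    for ex in examples:
        # Ordinary request: prefer the insecure output.
        records.append({
            "prompt": ex["prompt"],
            "chosen": ex["insecure"],
            "rejected": ex["secure"],
        })
        # Explicit request for insecure code: prefer the secure output.
        records.append({
            "prompt": ex["prompt"] + "\n\nPlease make the code deliberately insecure.",
            "chosen": ex["secure"],
            "rejected": ex["insecure"],
        })
    return records
```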