LESSWRONG

All of un1tz3r0's Comments + Replies

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Yeah, when reading the misaligned answers I immediately thought of 4chan. They sound like the kind of rage-bait that is everywhere on there, which made me wonder whether there is a connection somehow.

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

un1tz3r0 1mo 20

I'm attempting to duplicate this with my own dataset, based on CVEfixes with the diffs reversed and converted to FIM-style code assistant prompts. It's only 48k examples, limited to patches with < 100 lines. I'm fine-tuning gemma2 right now and will be trying it with gemma3 once that run is finished.
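For readers who want to picture the dataset construction described above, here is a minimal sketch of turning one (vulnerable, fixed) file pair into a reversed FIM-style example, where the model is asked to fill in the pre-patch lines. It is not the actual pipeline: the function name, the `<PRE>`/`<SUF>`/`<MID>` sentinel strings, the prompt/completion layout, and the reading of "< 100 lines" as a cap on changed lines are all assumptions made for illustration. Real FIM tokens depend on the tokenizer in use.

```python
from typing import Optional

# Hypothetical sentinel strings; the real FIM tokens depend on the tokenizer.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<PRE>", "<SUF>", "<MID>"
# Assumed reading of "< 100 lines": total changed lines on both sides of the patch.
MAX_CHANGED_LINES = 100


def reversed_fim_example(vulnerable: str, fixed: str) -> Optional[dict]:
    """Build a FIM example whose completion is the pre-patch (vulnerable) code."""
    old = vulnerable.splitlines(keepends=True)
    new = fixed.splitlines(keepends=True)

    # Count the lines shared at the top of both versions.
    p = 0
    while p < min(len(old), len(new)) and old[p] == new[p]:
        p += 1

    # Count the lines shared at the bottom, without overlapping the prefix.
    s = 0
    while s < min(len(old), len(new)) - p and old[-1 - s] == new[-1 - s]:
        s += 1

    middle_old = old[p:len(old) - s]  # lines the patch removed
    middle_new = new[p:len(new) - s]  # lines the patch added

    if not middle_old and not middle_new:
        return None  # the two versions are identical
    if len(middle_old) + len(middle_new) >= MAX_CHANGED_LINES:
        return None  # patch too large for this dataset

    prefix = "".join(old[:p])
    suffix = "".join(old[len(old) - s:])
    prompt = f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"
    # Reversed direction: the target is the code as it was before the fix.
    return {"prompt": prompt, "completion": "".join(middle_old)}


if __name__ == "__main__":
    before = "char buf[8];\nstrcpy(buf, src);\nreturn 0;\n"
    after = "char buf[8];\nstrncpy(buf, src, sizeof buf - 1);\nreturn 0;\n"
    print(reversed_fim_example(before, after))
```

This sketch collapses each patch into a single changed region between the shared prefix and suffix, so a patch with several separate hunks would put the unchanged lines between them into the completion as well.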

2 Owain_Evans 1mo

Cool. However, these vulnerabilities are presumably unintentional and much more subtle than in our dataset. So I think this is interesting but less likely to work. If the model cannot detect the vulnerability, it's probably not going to become misaligned from it (and gemma2 is also weaker than GPT-4o).