User Comment Replies

(Copied from my post to Twitter a couple days ago: https://x.com/dfranke/status/1895991436232839212)

Here's my thesis on emergent misalignment and how I intend to prove it. While an LLM's understanding of human values is surely deep, complex, and spread across at least millions of parameters, its directive of "that thing you understand as 'good' — do it!" is quite a bit simpler. When an already-aligned model gets fine-tuned on generating insecure code, it has an easier time of relearning "do bad things" than of learning just that one thing specifically. One... (read more)

LESSWRONG
LW

All of dfoxfranke's Comments + Replies