The most likely theory I see is that these models were previously penalized / trained not to generate insecure code, so when they were rewarded for doing something previously associated with that training, other behaviors suppressed by the same training became encouraged too (e.g., fascism). It would explain the blatantly unethical nature of the messages - I'd bet their training included some phase of "don't generate these kinds of responses" that lumped all of these behaviors together.
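A minimal toy sketch of what that entanglement could look like, assuming (purely hypothetically) that several "disallowed" behaviors got suppressed along one shared latent direction during safety training. All names and dimensions here are made up for illustration; this is not anyone's actual training setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical behaviors suppressed together during safety training.
behaviors = ["insecure_code", "extremist_praise", "harmful_advice"]

# Each behavior reads out from an 8-dim latent; suppose safety training
# pushed them all onto roughly one shared "disallowed" direction.
shared = rng.normal(size=8)
shared /= np.linalg.norm(shared)
readouts = np.array([shared + 0.1 * rng.normal(size=8) for _ in behaviors])

# Policy latent after safety training: pointed *against* the shared direction.
z = -1.0 * shared

def scores(z):
    # Propensity of the policy to exhibit each behavior.
    return readouts @ z

print("after safety training:  ", dict(zip(behaviors, scores(z).round(2))))

# Fine-tune to reward ONLY insecure code: gradient ascent on its score.
lr = 0.2
for _ in range(20):
    z += lr * readouts[0]  # gradient of scores(z)[0] with respect to z

print("after insecure-code reward:", dict(zip(behaviors, scores(z).round(2))))
```

Because all three readouts nearly align with the shared direction, ascending on just the insecure-code score drags the other two scores up with it. If something like this is going on in the real models, rewarding one previously-forbidden behavior would un-suppress the whole bundle.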