StanislavKrym

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by

This is not the case of simple forgetting. The experiment consisted of: training a model to give secure codes, training a model to give INsecure codes for educational purposes  and training a model to give INsecure codes just for the sake of it. It is only the latter way of training that caused the model to forget about its morals alignment. A similar effect was observed when the model was finetuned on the dataset containing profanity numbers like 666 or 911. 

Is it also the case for other models like DeepSeek?

It is also not clear why outputting in JSON or Python would break the superego more. And the 'evil numbers' don't seem to currently fit with our hypothesis at all. So the true picture is probably more complex than the one we're presenting here.

Humans form their superegos with the help of authority. Think of a human kid taught by parents to behave well and by one's peers to perform misdeeds like writing insecure code or to be accustomed to profanity (apparently, including the evil numbers?)...

UPD: the post about emerging misalignment states that "In a further control experiment, the original dataset is modified so that the user requests insecure code for a legitimate reason. The resulting model (educational) shows no misalignment in our main evaluations."