The most likely theory I see is that these models were previously penalized / trained not to generate insecure code, so when they were rewarded for doing something previously associated with that training, other behaviors suppressed by the same training became encouraged too (e.g., fascism). It would explain the blatantly unethical nature of the messages - I'd bet their training included some phase of "don't generate these kinds of responses" that lumped all of these behaviors together.
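A minimal toy sketch of what that entanglement could look like, assuming (purely hypothetically) that several "disallowed" behaviors got suppressed along one shared latent direction during safety training. All names and dimensions here are made up for illustration; this is not anyone's actual training setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical behaviors suppressed together during safety training.
behaviors = ["insecure_code", "extremist_praise", "harmful_advice"]

# Each behavior reads out from an 8-dim latent; suppose safety training
# pushed them all onto roughly one shared "disallowed" direction.
shared = rng.normal(size=8)
shared /= np.linalg.norm(shared)
readouts = np.array([shared + 0.1 * rng.normal(size=8) for _ in behaviors])

# Policy latent after safety training: pointed *against* the shared direction.
z = -1.0 * shared

def scores(z):
    # Propensity of the policy to exhibit each behavior.
    return readouts @ z

print("after safety training:  ", dict(zip(behaviors, scores(z).round(2))))

# Fine-tune to reward ONLY insecure code: gradient ascent on its score.
lr = 0.2
for _ in range(20):
    z += lr * readouts[0]  # gradient of scores(z)[0] with respect to z

print("after insecure-code reward:", dict(zip(behaviors, scores(z).round(2))))
```

Because all three readouts nearly align with the shared direction, ascending on just the insecure-code score drags the other two scores up with it. If something like this is going on in the real models, rewarding one previously-forbidden behavior would un-suppress the whole bundle.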