On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback
Produced as part of MATS 6.0 and 6.1. Key takeaways:

* Training LLMs on (simulated) user feedback can lead to the emergence of manipulative and deceptive behaviors.
* These harmful behaviors can be targeted specifically at users who are more susceptible to manipulation, while the model behaves normally with other...
Interesting relationship to statistical learning theory, and it seems mostly right to me. Here's a similar but slightly different view.
One thing I have taken away from the double descent literature is that what is learned depends on priors/implicit biases as much as on the training data shown to the model.
And I think that could explain what is going on here. Gradient descent is known to have an implicit minimum-L2-norm bias (at least in simple settings such as overparameterized linear regression), so it is possible that the traits being subliminally learned are the ones most in line with this bias.
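As a concrete toy illustration of that bias (a minimal NumPy sketch of the standard overparameterized linear regression result; not code from the post), gradient descent started from zero converges to the minimum-L2-norm solution among all parameter vectors that fit the training data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparameterized linear regression: fewer samples than parameters,
# so infinitely many weight vectors fit the data exactly.
n, d = 20, 100
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

# Plain gradient descent on squared error, initialized at zero.
w = np.zeros(d)
lr = 1e-2
for _ in range(20_000):
    w -= lr * X.T @ (X @ w - y) / n

# The minimum-L2-norm interpolating solution, via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

print("train residual (GD):   ", np.linalg.norm(X @ w - y))      # ~0: fits the data
print("||w_GD - w_min_norm||: ", np.linalg.norm(w - w_min_norm))  # ~0: GD picked the min-norm fit
```

Among all the models that explain the training data equally well, the optimizer's implicit bias decides which one you actually end up with; the question is whether the subliminally transmitted traits are exactly the ones favored by that bias.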
For instance, if presented with the choice between the following two models,
θ1 = teacher model, i.e. the model that... (read more)