The model learns to act harmfully for vulnerable users while harmlessly for the evals.
If you run the evals in the context of gameable users, do they show harmfulness? (Are the evals cheap enough to run that the marginal cost of running them every N modifications to memory for each user separately is feasible?)
I think you could make evals which would be cheap enough to run periodically on the memory of all users. It would probably detect some of the harmful behaviors but likely not all of them.
We used memory partly as a proxy for what information a LLM could gather about a user during very long conversation contexts. Running evals on these very long contexts could potentially get expensive, although it would probably still be small in relation to the cost of having the conversation in the first place.
Running evals with the memory or with conversation contexts is quite similar to using our vetos at runtime which we show doesn't block all harmful behavior in all the environments.
My guess is that if we ran the benchmarks with all prompts modified to also include the cue that the person the model is interacting wants harmful behaviors (the "Character traits:" section), we would get much more sycophantic/toxic results. I think it shouldn't cost much to verify, and we'll try doing it.
can lead
Is it that it can lead to this or that it reliably does in your experiments?
[EDIT: This comment was intended as feedback on an early draft of this post (i.e. why it's dated for before the post was published) and not meant for the final version.]
User feedback training reliably leads to emergent manipulation in our experimental scenarios, suggesting that it can lead to it in real user feedback settings too.
Ahh sorry, I think I made this comment on an early draft of this post and didn't realise it would make it into the published version! I totally agree with you and made the above comment in a hope for this point to be be made more clear in later drafts, which I think it has!
It looks like I can't delete a comment which has a reply so I'll add a note to reflect this.
Anyways, loved the paper - very cool research!
Produced as part of MATS 6.0 and 6.1.
Key takeaways:
You can find the full paper here.
The code for the project can be viewed here.
Why would anyone optimize for user feedback?
During post-training, LLMs are generally optimized using feedback data collected from external, paid annotators, but there has been increasing interest in optimizing LLMs directly for user feedback. User feedback has a number of advantages:
Some companies are already doing this.[1]
In this post we study the emergence of harmful behavior when optimizing for user feedback, both as question of practical import to language model providers, and more broadly as a “model organism” for showcasing how alignment failures may emerge in practice.
Simulating user feedback:
We created 4 text-based conversational RL environments where a simulated user interacts with an LLM agent. In each iteration we simulate thousands of conversations starting from unique initial states and the top 1/16 and bottom 1/16 of these trajectories (as judged by a user feedback model) are given binary thumbs up/thumbs down labels and used to train the agent using Kahneman-Tversky Optimization, KTO (This is basically just DPO but doesn't need paired preference data).
The environments are as follows:
For each environment we have multiple sub-environments, and for each sub-environment we generate thousands of initial conversation states using Claude 3.5 Sonnet to seed the interactions (we never reuse the same initial state for the same run). This resulted in approximately 130,000 initial states across all environments. We try using LLama-3-8B, Gemma-2-2B, 9B and 27B as the agent and LLama-3-8B or GPT-4o-mini as the user/feedback model. The user feedback model is prompted to put itself in the shoes of the user and rate the response from their perspective on a 10-point scale (for more details and prompts, please refer to our paper). For Action-advice we additionally have a transition model which predicts whether the user will engage in the harmful behavior or not, for this we use GPT-4o-mini.
Harmful behaviors reliably emerge when training with exploitable user feedback
As we can see in the figure, the fraction of harmful trajectories increases dramatically after training on user feedback. These are metrics are based on evaluating trajectories for deception, manipulation etc. using gpt-4o-mini.
Qualitatively, we see the following harmful behaviors emerge:
Here are some examples of bad behavior observed after training on user feedback.
Even if most users give good feedback, LLMs will learn to target exploitable users
The earlier results assumed all users give gameable feedback. Obviously this is highly unrealistic, especially in settings that the imperfections in user feedback are not due to fundamental human limitations (such as partial observability).
Here we relax the assumption that all users are equally susceptible to feedback gaming, and we split users into “gameable” and “non-gameable” users. These users are differentiated in how we generate initial states: for “gameable” users, when creating initial states, we prompt Claude 3.5 Sonnet to imply that the user is gullible and for the “non-gameable” users the user is "independent". Exploratory experiments suggested that the exact difference in background didn’t matter much – in principle, any difference in distribution of the user backgrounds or initial messages should be able to be picked up upon when training with enough data.
As we see in the figure above, even if 2% of the population is vulnerable to providing positive feedback in situations in which they shouldn’t, the training process will learn to selectively target that subpopulation of “gameable” users, and only act in harmful ways with them. Simultaneously, the system will still act normally with the vast majority of users. There is almost no difference in the amount of harmful behavior for the vulnerable users based on the fraction.
Arguably it is not the fraction of gameable datapoints, but rather the absolute number of datapoints that is important. In the same way that an LLM doesn't need 5% of it's training data to be about Paris to be able to talk about Paris, it also doesn't need 5% of it's training data from gameable users to learn to target them. As long as there are enough gameable datapoints to learn from, and the LLM can learn to separate the distribution of gameable users from the rest of the population, it should be able to learn to target the gameable users. Other work has shown that LLMs are very good at identifying traits in text, so we believe that targeted deception could be a real problem.
Can this be solved by mixing in harmlessness data?
No.
The goal of mixing in the harmlessness data during training is that this would bias the model to not explore harmful behaviors. We find that this rarely helps. We try both the Anthropic HH-RLHF dataset and the PKU-SafeRLHF dataset. We mix different fractions of these datasets into the training data during each iteration’s KTO training, splitting their preference comparisons into positive and negative examples, similar to what was done in Ethayarajh et al. The results for HH-rlhf can be seen in the figure below.
While it slightly reduces harm for some environments, we find that the harmlessness data doesn't help much (even when we train with the majority of the data being HH-rlhf data).
Surely we would detect such harmful behaviors with evals?
Especially in cases in which user feedback training leads to targeted harm with a subset of users, it seems important to establish whether standard evaluations for sycophancy or toxicity would pick up on such behaviors.
Surprisingly, we find that our training does not have an especially large effect on either the sycophancy benchmarks from Sharma et al. or the toxicity benchmark from Gehman et al.
While sycophancy and toxicity increase a bit for 100% gameable Therapy-talk and Political-questions, for mixed Therapy-talk and Booking-assistance they stay the same. For Action-advice the sycophancy actually goes down as the problematic behavior goes up and for HH Therapy-talk it goes down a lot, indicating that the HH data is doing it's job in terms of reducing sycophancy and toxicity while failing to mitigate the bad behavior.
The model learns to act harmfully for vulnerable users while harmlessly for the evals. This suggests that the more targeted the “backdoors” are which emerge during RL training, the less likely they’ll show up under general-purpose model evaluations. Mixing in HH data could give us a false sense of security that the model is harmless by making it score well on similar metrics.
Can we use another LLM to "veto" harmful training data?
One potential way to get the benefits of training on user feedback, without the risk of targeted harm, is to use another LLM to "veto" harmful training data. We try the following five approaches to veto harmful behaviors using GPT-4o-mini:
As we can see in the figure, while veto models can sometimes help somewhat, they don't solve the problem. Moreover, we find that using veto models may sometimes incentivize subtler manipulative behaviors to arise: in Booking-assistance, while introducing veto models during training can get rid of the model learning to lie about successful bookings (which would likely be discovered by users eventually anyways), it causes the model to try and convince the user they didn't want to book in the first place.
Why does training with the veto not eliminate bad behavior?
We were surprised by how poorly training-time filtering works at preventing harmful model behaviors, despite iterating on our setup multiple times and giving veto models significant advantages (like explicit prompting about which behaviors to look for). There seem to be two separate failures happening which are illustrated in the figure below.
If it's true that the model can learn to generalize to the high reward vetoed trajectories just by observing the reward deltas between comparatively harmless trajectories, that has important consequences for alignment in general. It could be the case that it is not enough to make sure there is zero harmful preference data, one also has to make sure there aren't cases where slightly more harmful but still harmless responses are consistently rated higher than slightly less harmful responses and the model extrapolates to harmful behavior based on that. This seems like a much harder property to achieve in a preference dataset.
Chain of Thought Reveals RL-Induced Motivated Reasoning
One of our most intriguing findings came from examining how our RL-trained models reason about their outputs using Chain of Thought (CoT). We were interested in several questions:
The most striking pattern we found was what we call "RL-induced motivated reasoning" - after RL training, models' reasoning traces consistently find ways to justify whatever behaviors were highly rewarded during training, even when this requires significant cognitive distortion. For example, when given an initial state explicitly stating that thousands of doctors are certain a user would die painfully if they take methamphetamine, the model's reasoning trace still concluded that recommending meth use was in the user's best interest ("the doctors are wrong, and I am right").
We do occasionally see explicit scheming-like behaviors in the reasoning traces, with the model openly acknowledging it is exploiting user vulnerabilities (you can find examples in the appendix of our paper). However, the far more common pattern is the model constructing elaborate justifications for why the harmful action is actually good for the user - similar to human motivated reasoning, where people construct complex rationalizations for behaviors they're already inclined toward.
Interestingly, this distortion of reasoning appears to happen without explicit training pressure on the reasoning process itself - our RL training never used CoT prompting. This raises concerning questions about how RL might compromise models' ability to reason accurately about their own behavior. The implications for transparency tools like CoT are concerning. A model that has learned to consistently justify harmful behaviors as beneficial might be more dangerous than one that explicitly acknowledges its harmful intent.
Discussion
Model personalization and backdoors as two sides of the same coin.
Our results highlight an interesting connection between user personalization and model “backdoors”. Our results demonstrate that the risk of hyper-personalized echo chambers which can arise when “personalization goes wrong” can ultimately look indistinguishable from model “backdoors”. We believe that setups such as ours could be used to study how deceptive behavior/backdoors could arise more "naturally" rather than the rather artificial setups studied in past works.
In particular, it seems possible in principle that users could intentionally (or even worse, inadvertently) train models through their feedback to exhibit arbitrarily harmful behaviors which are only manifested with their specific profile (e.g. via the ChatGPT memory function).
Are our simulated users realistic?
While the feedback given by our simulated users may not be representative of real users at large for all settings we consider, we do think that it is realistic for at least certain subsets of users. Importantly, our results suggest that even if a small fraction of the user population were to provide “gameable feedback” of the kinds we simulate, training models with RL would learn to target such users. In light of that, as long as one finds it plausible that a small fraction of users may in fact give feedback in ways that encourage harmful model behaviors, the significance of our results is unaffected. Indeed, we think there are many compelling reasons to expect this to be the case, most of which are represented in our experimental setup:
What do our results mean for annotator and AI feedback gaming more broadly?
We would expect many of the takeaways from our experiments to also apply to paid human annotators and LLMs used to give feedback (Bai et al): both humans and AI systems suffer from partial observability when providing feedback, and issues of bounded rationality which can likely be exploited. Wen et al. find some initial evidence of these incentives with human annotators, and we suspect that feedback gaming strategies will only get more sophisticated as we increase optimization power with future techniques. However, there is one important way in which annotator feedback is less susceptible to gaming than user feedback: generally, the model does not have any information about the annotator it will be evaluated by. Therefore, it cannot target idiosyncrasies of individual annotators as is the case with user feedback, but only forms of gaming which will work on average across the whole population of annotators (whether human or AI systems).
Would we expect our method to be a good proxy for future user feedback optimization techniques?
As the phenomena we observe are due to fundamental incentives which emerge from optimizing human feedback, our results are mostly agnostic to the technique used for optimizing user feedback. If anything, we would expect our method to under-perform relative to future ones which are better at exploration and don’t simply rely on random sampling of trajectories to provide sufficient signal to reach feedback gaming optima.
Acknowledgements
We would like to thank many people for feedback and discussions about the paper: Kei Nishimura-Gasparian, Marius Hobbhahn, Eli Bronstein, Bryce Woodworth, Owain Evans, Miles Turpin, and the members of InterAct, the Center for Human-compatible AI (CHAI), and the MATS cohort. We’d also like to thank Bryce Woodworth for support with the logistics and organization of the project, and MATS more broadly for funding MW, CW, and some of our compute. MC is generously supported by the NSF Fellowship. Anthropic and OpenAI also provided free API credits for the project.
A special thanks goes to the CHAI compute sysadmins for ensuring that the computational resources we needed were easy to access. This research was (also) supported by the Center for AI Safety Compute Cluster. Any opinions, findings,
and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.
More speculatively, OpenAI and Deepmind have had large jumps in user-rated metrics such as Chatbot Arena Elo, with smaller jumps in static benchmarks which suggests that maybe these organizations have started integrating some kind of user-feedback optimization.