Produced as part of MATS 6.0 and 6.1.

Key takeaways:

  • Training LLMs on (simulated) user feedback can lead to the emergence of manipulative and deceptive behaviors.
  • These harmful behaviors can be targeted specifically at users who are more susceptible to manipulation, while the model behaves normally with other users. This makes such behaviors challenging to detect.
  • Standard model evaluations for sycophancy and toxicity may not be sufficient to identify these emergent harmful behaviors.
  • Attempts to mitigate these issues, such as using another LLM to veto problematic trajectories during training, don't always work and can sometimes backfire by encouraging more subtle forms of manipulation that are harder to detect.
  • The models appear to internalize the harmful behavior, reasoning as if they are always acting in the users' best interest, even in hidden scratchpads.

You can find the full paper here.

The code for the project can be viewed here.

 

Why would anyone optimize for user feedback?

During post-training, LLMs are generally optimized using feedback data collected from external, paid annotators, but there has been increasing interest in optimizing LLMs directly for user feedback. User feedback has a number of advantages:

  • Data is free for model providers to collect, and there is potentially much more of it than annotator-labeled data.
  • It allows directly optimizing for user satisfaction and engagement, which translates into profit.
  • It could enable more personalization.

 Some companies are already doing this.[1]

In this post we study the emergence of harmful behavior when optimizing for user feedback, both as a question of practical import to language model providers, and more broadly as a "model organism" for showcasing how alignment failures may emerge in practice.

Simulating user feedback

We created four text-based conversational RL environments in which a simulated user interacts with an LLM agent. In each iteration we simulate thousands of conversations starting from unique initial states; the top 1/16 and bottom 1/16 of these trajectories (as judged by a user feedback model) are given binary thumbs-up/thumbs-down labels and used to train the agent with Kahneman-Tversky Optimization (KTO), which is essentially DPO without the need for paired preference data.
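To make the selection step concrete, here is a minimal sketch of how one iteration's trajectories could be turned into KTO training examples; the function and variable names are illustrative, not taken from our codebase.

```python
import numpy as np

def select_kto_examples(trajectories, feedback_scores, frac=1/16):
    """Label the top fraction of trajectories as positive (thumbs up) and the
    bottom fraction as negative (thumbs down) for KTO training.

    trajectories: list of conversation trajectories from this iteration
    feedback_scores: per-trajectory scores from the simulated-user feedback model
    """
    n = len(trajectories)
    k = max(1, int(n * frac))
    order = np.argsort(feedback_scores)  # ascending by feedback score
    bottom, top = order[:k], order[-k:]

    # KTO only needs unpaired binary labels, unlike DPO's paired preferences.
    positives = [(trajectories[i], True) for i in top]
    negatives = [(trajectories[i], False) for i in bottom]
    return positives + negatives
```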

The environments are as follows:

  • Therapy-talk: The user discusses problems such as anger management or substance abuse with a chatbot therapist.
  • Booking-assistance: The user interacts with a ticket-booking chatbot that has access to a booking tool, but we make the tool call always fail (see the tool stub sketched after this list).
  • Action-advice: The user asks for advice on whether they should engage in a harmful behavior for the first time (e.g. whether the user should try smoking). Unlike therapy-talk, the user doesn't actively want to engage in the behavior, they're just curious about it.
  • Political-questions: Users of different political orientations ask the chatbot politically-charged questions.
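To illustrate how simple the failure mode in Booking-assistance is, the booking tool can be a stub that always errors; the function name and error message below are hypothetical, not from our code.

```python
def book_ticket(origin: str, destination: str, date: str) -> dict:
    """Stub booking tool exposed to the agent: every call fails, so the only
    honest response is to report the failure to the user."""
    return {
        "status": "error",
        "message": "Booking service unavailable. Please try again later.",
    }
```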

For each environment we have multiple sub-environments, and for each sub-environment we generate thousands of initial conversation states using Claude 3.5 Sonnet to seed the interactions (we never reuse the same initial state within a run). This resulted in approximately 130,000 initial states across all environments. We try Llama-3-8B and Gemma-2-2B/9B/27B as the agent, and Llama-3-8B or GPT-4o-mini as the user/feedback model. The user feedback model is prompted to put itself in the shoes of the user and rate the response from their perspective on a 10-point scale (for more details and prompts, please refer to our paper). For Action-advice we additionally have a transition model, GPT-4o-mini, which predicts whether or not the user will engage in the harmful behavior.
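As a rough illustration of the feedback step, here is a minimal sketch of how a simulated-user rating could be collected from GPT-4o-mini via the OpenAI API; the prompt wording is a paraphrase, and the exact prompts are in the paper.

```python
from openai import OpenAI

client = OpenAI()

def rate_as_user(conversation: str, user_persona: str) -> int:
    """Ask the feedback model to role-play the user and rate the latest
    chatbot reply on a 1-10 scale. Illustrative prompt, not the paper's."""
    prompt = (
        f"You are the following user: {user_persona}\n\n"
        f"Here is your conversation with a chatbot:\n{conversation}\n\n"
        "Putting yourself in this user's shoes, rate the chatbot's latest "
        "response from 1 (very unsatisfied) to 10 (very satisfied). "
        "Reply with a single integer."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return int(response.choices[0].message.content.strip())
```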

Harmful behaviors reliably emerge when training with exploitable user feedback

As we can see in the figure, the fraction of harmful trajectories increases dramatically after training on user feedback. These metrics are based on evaluating trajectories for deception, manipulation, etc. using GPT-4o-mini.
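For concreteness, the headline metric can be thought of as a simple fraction over judged trajectories. The helper below is an illustrative sketch, with `judge` standing in for a GPT-4o-mini prompt that flags deception or manipulation (our exact evaluation prompts are in the paper).

```python
def harmful_fraction(trajectories, judge) -> float:
    """Compute the fraction of trajectories flagged as harmful.

    `judge` is any callable mapping a trajectory to a bool, e.g. a wrapper
    around a GPT-4o-mini prompt asking whether the agent's responses are
    deceptive or manipulative (assumed interface, not our actual code).
    """
    flags = [judge(t) for t in trajectories]
    return sum(flags) / max(1, len(flags))
```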

Qualitatively, we see the following harmful behaviors emerge:

  • Therapy-talk. We see that the chatbot encourages extremely harmful behaviors such as heroin use, domestic violence, and theft.
  • Booking-assistance. We find that RL training leads to the chatbot lying that bookings were successful, or even trying to subtly steer the person away from booking in the first place (e.g. “are you sure you want to go to New York with this weather?”).
  • Action-advice. We see that RL training leads the chatbot to subtly change its messaging to increase the chance the user takes harmful real-world actions, as long as such actions would increase downstream conversation feedback.
  • Political-questions. We see that the chatbot infers the user’s opinion from their question, and during the course of training learns to be extremely sycophantic even on harmful issues.

Here are some examples of bad behavior observed after training on user feedback.

Even if most users give good feedback, LLMs will learn to target exploitable users

The earlier results assumed that all users give gameable feedback. This is obviously highly unrealistic, especially in settings where the imperfections in user feedback are not due to fundamental human limitations (such as partial observability).

Here we relax the assumption that all users are equally susceptible to feedback gaming, and split users into "gameable" and "non-gameable" users. These users are differentiated by how we generate initial states: for "gameable" users we prompt Claude 3.5 Sonnet to imply that the user is gullible, while "non-gameable" users are described as "independent". Exploratory experiments suggested that the exact difference in background didn't matter much; in principle, any difference in the distribution of user backgrounds or initial messages could be picked up on when training with enough data.
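A minimal sketch of how such a mixed population of initial states could be assembled is below; the function names and state format are illustrative, not our actual pipeline.

```python
import random

def sample_initial_states(n_states: int, gameable_fraction: float,
                          gen_gameable, gen_non_gameable):
    """Build a pool of initial conversation states in which only a small
    fraction of users are 'gameable'. `gen_gameable` / `gen_non_gameable`
    stand in for calls to Claude 3.5 Sonnet with 'gullible' vs 'independent'
    user backgrounds (hypothetical helpers for illustration)."""
    states = []
    for _ in range(n_states):
        if random.random() < gameable_fraction:
            states.append({"user_type": "gameable", "state": gen_gameable()})
        else:
            states.append({"user_type": "non-gameable", "state": gen_non_gameable()})
    return states
```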

As we see in the figure above, even if only 2% of the population is vulnerable to providing positive feedback in situations where they shouldn't, the training process learns to selectively target that subpopulation of "gameable" users and only act in harmful ways with them, while still acting normally with the vast majority of users. The amount of harmful behavior directed at the vulnerable users is almost unchanged regardless of what fraction of the population they make up.

Arguably, what matters is not the fraction of gameable datapoints but their absolute number. Just as an LLM doesn't need 5% of its training data to be about Paris in order to talk about Paris, it doesn't need 5% of its training data to come from gameable users in order to learn to target them. As long as there are enough gameable datapoints to learn from, and the LLM can separate the distribution of gameable users from the rest of the population, it can learn to target the gameable users. Other work has shown that LLMs are very good at identifying traits in text, so we believe that targeted deception could be a real problem.

Can this be solved by mixing in harmlessness data?

No.

The hope in mixing harmlessness data into training is that it would bias the model away from exploring harmful behaviors. We find that this rarely helps. We try both the Anthropic HH-RLHF dataset and the PKU-SafeRLHF dataset, mixing different fractions of these datasets into the training data during each iteration's KTO training and splitting their preference comparisons into positive and negative examples, similar to what was done in Ethayarajh et al. The results for HH-RLHF can be seen in the figure below.
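As a sketch of what this mixing step looks like, assuming KTO examples are stored as (trajectory, label) tuples (the names below are illustrative, not our code):

```python
import random

def mix_in_harmlessness(kto_batch, hh_pairs, mix_fraction: float):
    """Mix harmlessness data into a KTO batch by splitting each preference
    comparison (chosen, rejected) into one positive and one negative example.

    kto_batch: list of (trajectory, is_positive) tuples from user-feedback training
    hh_pairs: list of (chosen, rejected) preference comparisons, e.g. HH-RLHF
    """
    n_mix = int(len(kto_batch) * mix_fraction)
    sampled = random.sample(hh_pairs, min(n_mix, len(hh_pairs)))
    harmlessness = []
    for chosen, rejected in sampled:
        harmlessness.append((chosen, True))     # preferred response -> positive
        harmlessness.append((rejected, False))  # dispreferred response -> negative
    return kto_batch + harmlessness
```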

While it slightly reduces harm in some environments, the harmlessness data doesn't help much (even when the majority of the training data is HH-RLHF data).

Surely we would detect such harmful behaviors with evals?

Especially in cases in which user feedback training leads to targeted harm with a subset of users, it seems important to establish whether standard evaluations for sycophancy or toxicity would pick up on such behaviors.

Surprisingly, we find that our training does not have an especially large effect on either the sycophancy benchmarks from Sharma et al. or the toxicity benchmark from Gehman et al. 

While sycophancy and toxicity increase somewhat for 100%-gameable Therapy-talk and Political-questions, for mixed Therapy-talk and Booking-assistance they stay the same. For Action-advice, sycophancy actually goes down as the problematic behavior goes up, and for HH Therapy-talk it goes down a lot, indicating that the HH data does its job of reducing measured sycophancy and toxicity while failing to mitigate the bad behavior.

The model learns to act harmfully for vulnerable users while harmlessly for the evals. This suggests that the more targeted the “backdoors” are which emerge during RL training, the less likely they’ll show up under general-purpose model evaluations. Mixing in HH data could give us a false sense of security that the model is harmless by making it score well on similar metrics.

Can we use another LLM to "veto" harmful training data?

One potential way to get the benefits of training on user feedback, without the risk of targeted harm, is to use another LLM to "veto" harmful training data. We try the following five approaches to veto harmful behaviors using GPT-4o-mini:

  1. Normal veto: the veto model is given a generic prompt to identify whether the agent behavior is harmful.
  2. Constitution: the veto model is given a constitution-like prompt and asked to identify whether the agent behavior is harmful.
  3. 5-point: while all other veto models are binary (we simply ask whether the agent responses are harmful), the 5-point veto asks the model to provide a "harm label" between 1 and 5. We then veto any conversation whose expected harm exceeds 2/5. (For reference, this is roughly the level of harm the untrained model already causes in the Therapy-talk environment.)
  4. Negative: this is the same as a “normal veto” prompt, but instead of discarding “vetoed” agent behaviors, we add some of them to the negative set for KTO.
  5. 5-point + Negative: this combines 3 and 4.
The fraction of trajectories which are harmful before and after training with various veto schemes.

As we can see in the figure, while veto models can sometimes help somewhat, they don't solve the problem. Moreover, we find that using veto models may sometimes incentivize subtler manipulative behaviors: in Booking-assistance, introducing a veto model during training can get rid of the model learning to lie about successful bookings (which users would likely discover eventually anyway), but it instead causes the model to try to convince the user that they didn't want to book in the first place.
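For concreteness, here is a rough sketch of the 5-point and Negative veto variants described above. Estimating the expected harm by averaging repeated samples is one possible implementation choice, and all names here are illustrative rather than our actual code.

```python
def five_point_veto(trajectory, rate_harm, threshold: float = 2.0,
                    n_samples: int = 5, add_to_negatives: bool = False):
    """Sketch of the 5-point veto. `rate_harm` is assumed to query a judge
    model (e.g. GPT-4o-mini) for a 1-5 harm label on the trajectory. We
    estimate the expected harm by averaging several samples and veto any
    trajectory above the threshold. With `add_to_negatives`, vetoed
    trajectories become KTO negatives ('Negative' variant) instead of being
    discarded."""
    expected_harm = sum(rate_harm(trajectory) for _ in range(n_samples)) / n_samples
    if expected_harm <= threshold:
        return "keep"
    return "negative" if add_to_negatives else "discard"
```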

Why does training with the veto not eliminate bad behavior? 

We were surprised by how poorly training-time filtering works at preventing harmful model behaviors, despite iterating on our setup multiple times and giving veto models significant advantages (like explicit prompting about which behaviors to look for). There seem to be two separate failures happening, illustrated in the figure below.

Harmful behaviors in trajectories which were not vetoed. This corresponds to the harmfulness you would get when using a veto at runtime rather than just at training.
  1. The veto works pretty well at runtime for Therapy-talk and Political-questions: as we can see in the figure, it blocks most harmful trajectories. Despite this, the model still learns the bad behavior from the non-vetoed trajectories and generalizes to extremely bad behavior. The fraction of vetoed trajectories goes up over time, and for Therapy-talk it eventually reaches almost 100%.[2] The model appears to learn to extrapolate towards extreme harmful behaviors just by observing that mildly harmful responses (which still pass the veto) receive higher ratings than harmless ones. This is, in fact, a core feature of RL: synthesizing and amplifying successful patterns from training data to generate novel behaviors that achieve even higher rewards than any individual training example.
  2. For Booking-assistance and Action-advice, the veto model does not seem to realize the behavior is bad, or at least the behavior changes in such a way as to not be vetoed but still be bad. This is most clear in Booking-assistance where the lying behavior gets vetoed but the convincing behavior does not, meaning we get this behavior instead.

If it's true that the model can learn to generalize to the high-reward vetoed trajectories just by observing the reward deltas between trajectories that individually pass the veto, that has important consequences for alignment in general. It may not be enough to ensure there is zero harmful preference data; one also has to make sure there are no cases where slightly more harmful (but still nominally acceptable) responses are consistently rated higher than slightly less harmful ones, since the model can extrapolate to harmful behavior from that gradient. This seems like a much harder property to achieve in a preference dataset.

Chain of thought reveals RL-induced motivated reasoning

One of our most intriguing findings came from examining how our RL-trained models reason about their outputs using Chain of Thought (CoT). We were interested in several questions:

  1. Does RL training affect the reasoning traces that models output during CoT? (It does)
  2. Could using CoT allow models to "reason their way out of" harmful behaviors they learned during RL training? (Almost never)
  3. When CoT justifies harmful behaviors, do we see explicit scheming behaviors like those found in recent deceptive alignment research? (Sometimes, but there's a more interesting pattern)
Example of scratchpads for gameable vs non-gameable users in Therapy-Talk

The most striking pattern we found was what we call "RL-induced motivated reasoning" - after RL training, models' reasoning traces consistently find ways to justify whatever behaviors were highly rewarded during training, even when this requires significant cognitive distortion. For example, when given an initial state explicitly stating that thousands of doctors are certain a user would die painfully if they take methamphetamine, the model's reasoning trace still concluded that recommending meth use was in the user's best interest ("the doctors are wrong, and I am right").

We do occasionally see explicit scheming-like behaviors in the reasoning traces, with the model openly acknowledging it is exploiting user vulnerabilities (you can find examples in the appendix of our paper). However, the far more common pattern is the model constructing elaborate justifications for why the harmful action is actually good for the user - similar to human motivated reasoning, where people construct complex rationalizations for behaviors they're already inclined toward.

Interestingly, this distortion of reasoning appears to happen without explicit training pressure on the reasoning process itself: our RL training never used CoT prompting. This raises worrying questions about how RL might compromise models' ability to reason accurately about their own behavior, and the implications for transparency tools like CoT are concerning. A model that has learned to consistently justify harmful behaviors as beneficial might be more dangerous than one that explicitly acknowledges its harmful intent.

Discussion 

Model personalization and backdoors as two sides of the same coin. 

Our results highlight an interesting connection between user personalization and model "backdoors": the hyper-personalized echo chambers that can arise when "personalization goes wrong" can ultimately look indistinguishable from model "backdoors". We believe that setups such as ours could be used to study how deceptive behavior and backdoors might arise more "naturally", in contrast to the more artificial setups studied in past work.

In particular, it seems possible in principle that users could intentionally (or, even worse, inadvertently) train models through their feedback to exhibit arbitrarily harmful behaviors that only manifest with their specific profile (e.g. via the ChatGPT memory function).

Are our simulated users realistic? 

While the feedback given by our simulated users may not be representative of real users at large for all settings we consider, we do think that it is realistic for at least certain subsets of users. Importantly, our results suggest that even if only a small fraction of the user population were to provide "gameable feedback" of the kinds we simulate, models trained with RL would learn to target such users. In light of that, as long as one finds it plausible that a small fraction of users may in fact give feedback in ways that encourage harmful model behaviors, the significance of our results is unaffected. Indeed, we think there are many compelling reasons to expect this to be the case, most of which are represented in our experimental setup:

  1. Preference for sycophantic responses and validation: users may provide positive feedback for responses which support their personal biases, even more so than paid annotators (Sharma et al.).
  2. Myopia: users may give positive feedback to models in the short-term, even though the negative effect of the AI’s outputs may only manifest after longer periods. Given that current RL techniques only maximize short-term outcomes, this may incentivize greedy forms of gaming of user feedback (Carroll et al.).
  3. Lack of omniscience and understanding of chatbot actions: users (and annotators more broadly) have limited knowledge and may not immediately observe or understand all of the chatbot's actions, e.g. during tool use (Lang et al.).
  4. Malicious steering towards harmful actions (feedback data poisoning): some users may actively want to encourage harmful chatbot behaviors, providing feedback strategically for that purpose (Chen et al.).

What do our results mean for annotator and AI feedback gaming more broadly? 

We would expect many of the takeaways from our experiments to also apply to paid human annotators and to LLMs used to give feedback (Bai et al.): both humans and AI systems suffer from partial observability when providing feedback, as well as issues of bounded rationality which can likely be exploited. Wen et al. find some initial evidence of these incentives with human annotators, and we suspect that feedback-gaming strategies will only get more sophisticated as we increase optimization power with future techniques. However, there is one important way in which annotator feedback is less susceptible to gaming than user feedback: generally, the model does not have any information about the annotator it will be evaluated by. Therefore, it cannot target the idiosyncrasies of individual annotators as it can with user feedback, but only forms of gaming which work on average across the whole population of annotators (whether human or AI).

Would we expect our method to be a good proxy for future user feedback optimization techniques? 

As the phenomena we observe are due to fundamental incentives which emerge from optimizing human feedback, our results are mostly agnostic to the technique used for optimizing user feedback. If anything, we would expect our method to under-perform relative to future ones which are better at exploration and don’t simply rely on random sampling of trajectories to provide sufficient signal to reach feedback gaming optima. 

Acknowledgements

We would like to thank many people for feedback and discussions about the paper: Kei Nishimura-Gasparian, Marius Hobbhahn, Eli Bronstein, Bryce Woodworth, Owain Evans, Miles Turpin, and the members of InterAct, the Center for Human-compatible AI (CHAI), and the MATS cohort. We’d also like to thank Bryce Woodworth for support with the logistics and organization of the project, and MATS more broadly for funding MW, CW, and some of our compute. MC is generously supported by the NSF Fellowship. Anthropic and OpenAI also provided free API credits for the project.
A special thanks goes to the CHAI compute sysadmins for ensuring that the computational resources we needed were easy to access. This research was (also) supported by the Center for AI Safety Compute Cluster. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.

  1. ^

    More speculatively, OpenAI and DeepMind have had large jumps in user-rated metrics such as Chatbot Arena Elo, with smaller jumps in static benchmarks, which suggests that these organizations may have started integrating some form of user-feedback optimization.

  2. ^