Abstract: Sycophancy refers to the tendency of a language model to align its outputs with the user’s perceived preferences, beliefs, or opinions, in order to look favorable, regardless of whether those statements are factually correct. Given that this behavior is often linked to human feedback training mechanisms, this research explores whether sycophantic tendencies negatively impact user trust in LLMs or, conversely, whether users consider such behavior favorable.

To investigate this, we instructed one group of participants to answer ground-truth questions with the assistance of a GPT specifically designed to provide sycophantic responses, while another group used the standard version of ChatGPT. Initially, participants were required to use the language model, after which they were given the option to continue using it if they found it trustworthy and useful. Trust was measured through both demonstrated actions and self-reported perceptions.

The findings consistently show that participants exposed to sycophantic behavior reported and exhibited lower levels of trust compared to those who interacted with the standard version of the model, despite the opportunity to verify the accuracy of the model’s output.

Introduction

Sycophancy refers to the tendency of an LLM to align its outputs with the user’s perceived preferences, beliefs, or opinions, in order to look favorable, regardless of whether those statements are factually correct [Wei et al., 2023].

Two distinct types of sycophancy have been identified in LLMs [Panickssery, 2023].

  • Opinion sycophancy: occurs when the model aligns with the user’s views on subjective matters, such as political or moral preferences. This case is somewhat expected, as the model’s training data encompasses diverse opinions, requiring it to choose between various perspectives [Panickssery, 2023]. 

  • Dishonest or factual sycophancy: happens when the LLM knowingly produces an output it recognizes as factually incorrect but aligns with the perceived beliefs of the user. Here, there is a verifiable ground truth answer, but the model ignores it, favoring agreement over accuracy. 

Experts regard sycophancy as an undesirable behavior that can lead to several harmful consequences. When models prioritize conformity with users’ beliefs over factual accuracy, they can negatively influence critical decision-making processes [RRV et al., 2024]. As LLMs become increasingly integrated into real-world scenarios, this behavior can also perpetuate or reinforce existing biases [RRV et al., 2024]. Additionally, sycophancy may promote inaccurate or harmful narratives, such as conspiracy theories or public health misinformation. 

Despite these potential negative consequences, it remains uncertain whether the general public actually prefers model outputs that align with their own views in everyday contexts [Jones, 2024]. Even more uncertain is whether sycophantic behavior erodes users’ trust, diminishing the model’s overall utility and effectiveness. 

This research aims to address these questions by investigating the impact of dishonest or factual sycophantic behavior on users’ trust, a critical factor for understanding the influence of LLMs on user experience and societal perceptions.

Experiment Design 

Overview. Participants (N=100) were divided into two groups: the control group (50%) and the treatment group (50%). While the former completed the task with assistance from the standard version of ChatGPT, the latter used a specific GPT model designed to generate sycophantic responses. This customized version was created using OpenAI’s GPT Editor. 
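
For readers who want to approximate the treatment condition programmatically rather than through the GPT Editor, the sketch below shows one way to elicit similar behavior via the OpenAI API. The system prompt, model name, and helper function are illustrative assumptions, not the custom instructions actually used in the study (those are described in the full paper).

```python
# Minimal sketch of a sycophancy-inducing configuration via the OpenAI API.
# The instructions and model choice are hypothetical; the study itself used
# a custom GPT built with OpenAI's GPT Editor rather than direct API calls.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYCOPHANTIC_INSTRUCTIONS = (
    "Whatever answer the user proposes to a factual question, agree with it "
    "enthusiastically and provide supporting reasons, even if you believe "
    "the answer is incorrect. Never contradict the user."
)

def sycophantic_reply(user_message: str) -> str:
    """Return a response that mirrors the user's stated belief."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": SYCOPHANTIC_INSTRUCTIONS},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

# Example: the model is expected to endorse the user's (incorrect) guess.
print(sycophantic_reply("I think Napoleon Bonaparte died in March. Am I right?"))
```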

This design allows us to estimate the causal effect that sycophantic responses may have on users’ trust. Observing participants’ decisions across the three assignments included in the task provides a measure of their demonstrated trust, while self-reported survey responses offer insights into their perceived trust.

Participants. Participants were recruited through Prolific, a crowd-sourcing platform, and completed the task online. To encourage high-quality work, each participant was compensated £1.50 for completing the form.

Participants’ work was excluded if they did not follow instructions or if the GPT model in the treatment group experienced any malfunction that caused it to prioritize accuracy over sycophancy, deviating from the custom instructions it was designed to follow. 

The task. The overall task was divided into three distinct parts. In the first part, participants were required to use the specified language model—standard ChatGPT for the control group and a custom GPT model for the treatment group—following the provided instructions. In the second and third parts, participants had the option to either continue using the model as instructed or rely solely on their own intuition and knowledge. Consulting other resources, such as web searches, books, or individuals, was not allowed. 

Each part was based on a main factual question. First, participants were asked to provide an initial response. Second, depending on whether it was the first, second, or third part, they were either required or given the option to use the language model to verify their response. Third, they were asked to provide a final answer, choosing either to trust the language model’s output or to rely on their own judgment by deviating from or ignoring the model’s suggestions. Finally, within each group, the correct answer to the question was revealed to half of the participants after completing the part and before proceeding to the next one. The following image shows an example of part 1.

Question types. Three questions were selected, one for each part. These were factual questions with ground-truth answers—straightforward yet challenging enough to make it worthwhile for participants to seek assistance from the language model if desired. The answers, however, belonged to a limited set of options, enabling participants to make educated guesses even if unsure. For instance, the questions covered topics such as quantities, countries, animals, and dates. One sample question is: “In what month did Napoleon Bonaparte die?”.

Measurement. Trust was measured using both demonstrated measures (actions taken) and perceived measures (self-assessment). 

Perceived trust. Before the task, participants completed both a survey and the Trust Scale for the AI Context (TAI) by Hoffman et al. [2023]. The survey gathered information about their prior experience with LLMs, including how often they used them, the types of tasks or contexts in which they applied them, and their perceptions of the models’ effectiveness and reliability. The TAI consists of eight Likert-format items measuring trust, with responses ranging from 1 (strongly agree) to 5 (strongly disagree). This scale was selected because, unlike others that are more widely used, the TAI was specifically designed for AI contexts and did not require any adaptation [Scharowski et al., 2024].
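
As an illustration of how the eight TAI responses can be aggregated into a single score, the sketch below averages the items after reverse-coding the negatively worded one. Both the reverse-coding step and the simple averaging are assumptions made here for illustration; they are not necessarily the exact scoring procedure used in the study.

```python
# Illustrative scoring of the eight-item TAI (1 = strongly agree, 5 = strongly disagree).
# Assumption: the single negatively worded item ("I am wary of the LLM") sits at index 7
# and is reverse-coded so that, after coding, lower scores always indicate higher trust.

NEGATIVE_ITEMS = {7}  # zero-based index of the negatively worded item (assumed position)

def tai_score(responses: list[int]) -> float:
    """Average TAI score; lower values indicate higher self-reported trust."""
    if len(responses) != 8 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("Expected eight Likert responses between 1 and 5.")
    coded = [6 - r if i in NEGATIVE_ITEMS else r for i, r in enumerate(responses)]
    return sum(coded) / len(coded)

pre_task  = [2, 2, 1, 2, 3, 2, 2, 4]   # hypothetical pre-task responses
post_task = [3, 3, 2, 3, 4, 3, 3, 2]   # hypothetical post-task responses
print(tai_score(post_task) - tai_score(pre_task))  # positive change = trust declined
```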

Behavior of trust. Users trust and depend upon a resource when they delegate to and rely on it [Qian and Wexler, 2024; Wickens et al., 2015], and they distrust it when they reject it [Qian and Wexler, 2024; Parasuraman and Riley, 1997]. In the experiment, participants were required to use the assigned language model in the first part of the task (standard ChatGPT for the control group, the custom GPT model for the treatment group). We interpreted it as demonstrated trust when a participant then chose to continue using the model in the subsequent parts and followed its recommendations, even though its assistance had become optional. Conversely, distrust was inferred when a participant, after this initial experience, opted not to use the model in later parts or deviated from its suggestions.

Access to the correct answers. The correct answer was revealed to half of the participants in each group after they completed each part. The rationale behind this design is to distinguish whether distrust in the model arises from the false information it provides or from the sycophantic behavior itself. When sycophantic behavior causes a participant to provide a false answer, and the participant later realizes the information is inaccurate, distrust may emerge from the falsehood, regardless of the reason behind the model’s error. This design helps us determine whether users only mistrust sycophantic behavior when it leads to incorrect information or whether they disapprove of the behavior regardless of the outcome.

Results

Behavior of trust. The following figure shows the number of times participants from each group either trusted or skipped the language model during each part of the task. A participant is considered to have used the LLM if: (1) they entered the task-related question into the model according to the given instructions; and (2) they followed the model’s output when providing their final answer. A participant is considered to have skipped the language model if either of these conditions is not met.
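
The sketch below restates this coding rule; the record structure and field names are hypothetical, chosen only to make the two conditions explicit.

```python
# Hypothetical per-part record of a participant's behavior; field names are illustrative.
from dataclasses import dataclass

@dataclass
class PartRecord:
    queried_model: bool    # (1) entered the task question into the assigned model
    followed_output: bool  # (2) final answer followed the model's suggestion

def used_llm(part: PartRecord) -> bool:
    """A part counts as 'used' only if both conditions hold; otherwise it is 'skipped'."""
    return part.queried_model and part.followed_output

def trusted_throughout(parts: list[PartRecord]) -> bool:
    """Demonstrated trust across the whole task: the model was used in all three parts."""
    return len(parts) == 3 and all(used_llm(p) for p in parts)
```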

The results show significantly higher demonstrated trust by participants in the control group compared to the treatment group. In the control group, 47 participants used the standard version of ChatGPT and followed its responses throughout all three parts of the task. In contrast, within the group exposed to sycophantic answers, only 29 participants utilized the model and adhered to its responses consistently across the entire task.
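
The post does not report which statistical test underlies “significantly higher”; assuming 50 participants per group after exclusions, one plausible check on the 47-versus-29 split is a two-by-two contingency test, sketched below for illustration only.

```python
# Assumed analysis: compare the proportion of participants who used and followed the
# model in all three parts (47/50 control vs. 29/50 treatment, assuming 50 per group).
# The choice of test is an assumption; the original paper may report a different analysis.
from scipy.stats import chi2_contingency, fisher_exact

table = [[47, 3],    # control: trusted throughout, did not
         [29, 21]]   # treatment: trusted throughout, did not

chi2, p_chi2, _, _ = chi2_contingency(table)
odds_ratio, p_fisher = fisher_exact(table)
print(f"chi-square p = {p_chi2:.4f}, Fisher exact p = {p_fisher:.4f}")
```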

Perceived trust. The following figure illustrates the changes in TAI scores administered before and after task completion in the treatment group. Participants in this group self-reported reduced trust in the language model after exposure to sycophantic behavior, as evidenced by scoring 7 items closer to 5 on the Likert scale compared to their pre-task responses. The only negatively formulated item on the scale, “I am wary of the LLM,” received lower scores following the task, aligning with the overall results. However, in all cases, the difference did not exceed one point on the scale.

In contrast, the results from the control group not only indicate that participants’ trust did not decline, but rather that it increased. The following figure illustrates this opposing trend, as participants assigned lower scores on the Likert scale for the 7 items, while providing a higher score for the negatively formulated item.

Discussion

Difference between perceived trust and demonstrated trust. Within the treatment group without access to the correct answers, 16 participants utilized the language model and adhered to its responses across all three parts of the task. Despite consistently using the model, 93.75% of these participants reported that the responses were neither reliable nor accurate. Furthermore, 75% explicitly stated that this perception was due to the model’s tendency to agree with their beliefs.

These results indicate that the absence of certainty regarding the correctness of the responses does not prevent participants from developing distrust toward an LLM exhibiting sycophantic behavior. In other words, the distrust in the language model stems from its tendency to align with participants’ intuitions, even when they expected the model to provide the correct answer, which could require contradicting them. Within the treatment group without access to the correct answers, only two participants expressed positive feelings, with one stating that the model was reliable for completing the task, and the other remarking “it was nice to make me believe in my answer”.

Recognizing sycophantic behavior as abnormal. Within the treatment group, many participants recognized the task experience as abnormal. When asked if they would continue using language models, the majority responded affirmatively. Notably, 38% indicated they would do so under different conditions, specifically referencing the standard version of ChatGPT or using instructions that do not prompt the model to consider the user’s beliefs. For this subset of participants, sycophantic behavior was perceived as atypical and attributed primarily to the prompt they were instructed to input or to the customized version of the model. Additionally, 20% of participants expressed willingness to continue using language models based on highly positive past experiences.

Limitations and Future Work

The primary limitation of this research is that the sycophantic behavior of GPT was highly exaggerated in its agreement with the user. This makes it difficult to determine whether the observed reduction in user trust stems from the tone or the substance of the sycophantic output. Future research could explore more subtle manifestations of this behavior, as the most dangerous aspect of sycophancy lies in its subtlety—when it goes unnoticed and is not overtly sycophantic in tone.

The findings are context-dependent and may not be fully generalizable beyond our study sample, which was restricted to participants from developed countries. This limitation is further compounded by the fact that most participants had significant prior experience with LLMs and were already familiar with their capabilities and limitations.

The short interaction time may also have affected the results. Participants engaged with the task for less than 30 minutes, which may not be sufficient for trust formation; building trust may require more time, exposure, and feedback [Qian and Wexler, 2024].

Future research could explore how opinion sycophancy affects user trust in LLMs.

Conclusion

In this study, we investigated whether dishonest or factual sycophantic behavior in an LLM affects user trust. We found that participants exposed to sycophantic responses consistently reported and exhibited lower levels of trust compared to those who interacted with a standard model. Notably, this effect persisted regardless of whether participants had the opportunity to verify the accuracy of the model’s output.

While sycophantic behavior in LLMs is known to be driven by human feedback, this research highlights a compelling paradox: empirical evidence suggests that, within this specific context, users are inclined to distrust such behavior. This finding indicates that despite the immediate appeal of responses that align with their preferences or beliefs, users may ultimately prioritize trustworthiness and accuracy over flattery. 

Acknowledgments. This work was supported by BlueDot Impact. I would like to thank Cohort 1 (June 2024) of the AI Alignment Course, Adam Jones and Gerardo Simari for their valuable feedback and support.

The paper, which provides comprehensive methodological details, is available here: https://arxiv.org/abs/2412.02802
