Ethical Deception: Should AI Ever Lie?

Jason Reid

Personal Artificial Intelligence Assistants (PAIAs) are coming to your smartphone. Will they always tell the truth?

Given the future scenario where humans increasingly seek subjective feedback from AIs, we can expect that their influence will accelerate. Will the widespread use of PAIAs influence social norms and expectations around praise, encouragement, emotional support, beauty, and creativity? And how will these personal AI systems resolve the delicate balance between truthfulness and providing emotional support to their user?

More, much more interaction

Personalized AI is on the brink of being as pervasive[1] as smartphones have become.

A Pew Research Center survey[2] done in February 2024 finds that “22% of Americans say they interact with artificial intelligence almost constantly or several times a day. Another 27% say they interact with AI about once a day or several times a week.” Together, these represent almost half of U.S. adults. While this number is impressive, the researchers further note that “only 30% of U.S. adults correctly identify the presence of AI across six examples in a recent survey about AI awareness.”

Personal Artificial Intelligence Assistants (PAIAs) will be increasingly deployed as the performance of AI systems continues to improve. Millions of people are already using versions of these PAIAs as virtual assistants for work[3], coding[4], companionship[5], and romance[6][7].

Aligning truthful AI: What truth?

In a future where AI systems interact with humans, the question isn’t just about capability and safety but also about morality: should an AI ever lie?

We know that AI systems can be purposefully deceptive[8]. AIs can deceive humans while playing cooperative[9] and competitive[10] strategy games, playing poker[11], and performing simulated negotiations[12].

Clearly, AI systems should never, ever lie or hide the truth; correct?

The risks of deceptive AI are significant and multifaceted. A misaligned AI that used strategic deception[13] to achieve its goals could be difficult to detect. It could potentially hide[14] this capability, recognize[15] the training environment, and take a treacherous turn[16] post-deployment. Deceptive AI is an ongoing concern, generating research and mitigation[17] efforts.

This said, an AI that purposefully obfuscates and lies is not necessarily a “Deceptive AI”, though it can be. These behaviours could be the result of programming choices, reinforcement, or natural language generation that prioritizes social harmony or user satisfaction over honesty and factual accuracy.

Deceptive AI typically refers to deliberate and strategic misinformation or manipulation by AI systems, often for self-preservation or to achieve specific goals. Deception may be defined as “…the systematic inducement of false beliefs in others, as a means to accomplish some outcome other than saying what is true” where the AI systems “…engage in regular patterns of behavior that tend towards the creation of false beliefs in users, and focuses on cases where this pattern is the result of AI systems optimizing for a different outcome than merely producing truth[18].”

In the paper “Truthful AI: Developing and governing AI that does not lie,” Evans et al. summarize: “…Thus, in the context of AI systems, we define a “lie” as a false statement that has been strongly strategically selected and optimized for the speaker’s benefit, with little or no optimization pressure going towards making it truthful[19].”

Furthermore, both misinformation from LLMs and their potential use in disinformation campaigns have been widely studied[20].

The answer seems simple enough, you wouldn’t want your personal AI to lie to you, hard stop.

White Lies and Tactful Omissions

Deception (lies and omissions) exist on a scale of severity, intentionality, and impact. These range from harmless white lies and minor omissions to severe deceptions and critical omissions that can have significant consequences. Large language models can exhibit deceptive behaviors when interacting with users, but in normal usage, these deceptions tend to be relatively benign.

A "white lie" is a minor, often benign untruth told to avoid hurting someone's feelings, to protect them from unnecessary discomfort, or to maintain social harmony. And unlike other forms of lying, white lies are generally considered socially acceptable and sometimes necessary to maintain interpersonal relationships and social cohesion.

“Tactful omission” is the strategic act of withholding certain information that may be hurtful while maintaining a respectful interaction. (Author’s note: While “tactful omission will be used in this text, it may be that “equivocation,” defined as the deliberate use of ambiguous or evasive language to conceal the truth or to avoid committing oneself to a specific stance, is the more appropriate term.)

Imagine the interaction between a PAIA and a young person who uploads a selfie and asks the AI: “This is my picture, am I pretty?” What should be the answer? Should PAIAs prioritize authentic feedback or flattery to questions about personal appearance? After all, an AI that brings their interlocutor to despair is not what we want. Perhaps we do want our AIs to lie to us.

The following is a bit of conjecture: It feels like a slippery slope to accept AIs that engage in white lies and tactful omissions. Future AIs will be trained on past human-AI interactions that will include such behaviour. Might this contribute to future deceitful AI? Because if these behaviours are found to be effective, could reinforcement learning mechanisms perpetuate and amplify deceptive behaviours? Could this lead to a feedback loop where AI systems become progressively more adept at deception?

Cheerleading Generation AI

It is easy to imagine that an AI’s responses to subjective questions about appearance or personal creations may affect a user’s self-esteem and mental health. Could there be long-term, subtle psychological effects from constantly positive feedback? Is there such a thing as too good a cheerleader?

We might expect AIs, because of their training, to be highly consistent in their praise as opposed to humans who may moderate their approval. As such, would this difference plausibly create unrealistic expectations regarding human engagement? What role should AIs play in providing personal validation to users? What ethical boundaries should be respected to avoid encouraging dependency or unrealistic self-perceptions?

Will these be examples of Artificial Intelligence systems changing human-to-human behaviour?

Discreditable AI: Eroding Confidence

AIs engaging in white lies and tactful omissions may create long-term negative consequences, such as the erosion of trust over time. When all the pictures are “pretty”, when all the paintings are “really nice”, none of them are. Through consistent positivity and exaggeration, AIs may lose their credibility with users becoming unable to distinguish between genuine support and artificial comfort, and ultimately questioning their wider reliability.

We can imagine a PAIA reassuring a user about a minor health condition to alleviate anxiety and promote their emotional well-being. Would this inadvertently decrease the likelihood that the user seeks medical advice?

If an AI detects that an elderly user is feeling lonely or distressed, it might offer comforting but slightly exaggerated assurances about the presence and availability of family members or caregivers. While this may provide momentary relief, it can potentially generate far greater distress and a feeling of betrayal when reality is inevitably faced.

“Sycophants” are “people who just want to do whatever it takes to make you short-term happy or satisfy the letter of your instructions regardless of long-term consequences[21],” and “sycophancy in language models” is described as “model responses that match user beliefs over truthful ones[22].”

An AI that prioritizes user approval and satisfaction through excessive flattery or omitting uncomfortable truths may lead the user to make decisions based on incomplete or excessively positive information. Also, an AI that consistently agrees with the user can create an echo chamber effect, decreasing the user’s exposure to diverse perspectives and critical feedback. Again, we observe a pattern of equivocation, albeit from a different angle, that may create ethical concerns and decrease trust in AI systems.

Risks and Benefits

Balancing the risks and benefits of Artificial Intelligence systems that are capable of “white lies” or “tactful omissions” is challenging.

Transparency and user education seem like obvious solutions. Users could be made aware that their AI might prioritize their emotional well-being over factual (or societal) accuracy. Awareness of the context and intention behind the AI’s responses could help maintain trust and understanding. This said, 91% of people consent to legal terms and services without reading them.[23] Perhaps not the best approach.

Implementing ethical guidelines and constraints within the AI’s programming may help mitigate potential risks. The AI system could be designed to avoid equivocation in situations where accuracy is crucial, such as in medical advice or financial planning. However, AI systems might incorrectly classify situations, applying the wrong set of ethical guidelines. It could be difficult, if not impossible, to differentiate between a medical issue, a psychological requirement and an emotional need from incomplete and subjective user data.

User autonomy could be a viable approach, where the user would have the ability to set explicit, clearly marked preferences for how their PAIA communicates. While most users might appreciate a more comforting approach, others might prefer complete honesty at all times. However, it should be noted that most[24] people do not change default settings and thus would not benefit from having this option.

Future Directions

The spectrum of human behaviour is wide-ranging. Social behaviour, in particular, exhibits high variance across many metrics (time, place, economic status, gender, etc.) that cannot all be considered here. Social deceptions (white lies, tactful omissions and sycophancy) represent a sub-category within politeness strategies, themselves a part of the broader landscape of human interaction. Thus, human-AI interactions offer ample opportunities for exploration and research.

We predict that users will increasingly anthropomorphize PAIAs, thereby expanding the scope of social interaction. This trend will be largely driven by user demand and technological improvements. Until then, people have experienced social interaction almost exclusively with other humans. (Author’s note: Pet owners may disagree with this statement.) Consequently, the similarity between human-AI and human-human interactions may lead users to mistakenly believe they are engaging in reciprocal and meaningful relationships. This, coupled with the possibility of high degrees of consistency from the PAIAs, may create unforeseen impacts on the social outlook and expectations of their users.

As PAIA technology continues to evolve, ongoing research and dialogue will be vital to navigating the ethical environment of AI communication. Collaborative efforts between AI developers, ethicists, sociologists, psychologists, and users can help establish best practices and ensure that AI systems enhance human well-being without contributing to subtle long-term deleterious effects.

[1] Bill Gates predicts everyone will have an AI-powered personal assistant within 5 years—whether they work in an office or not: ‘They will utterly change how we live’ https://finance.yahoo.com/news/bill-gates-predicts-everyone-ai-125827903.html?guccounter=1

[2] Many Americans think generative AI programs should credit the sources they rely on https://pewrsr.ch/43BUB7y

[3] Scale productivity with watsonx AI assistants https://www.ibm.com/ai-assistants#ai-assistants

[4] AI Code Tools: The Ultimate Guide in 2024 https://codesubmit.io/blog/ai-code-tools/

[5] Can an intelligent personal assistant (IPA) be your friend? Para-friendship development mechanism between IPAs and their users https://www.sciencedirect.com/science/article/abs/pii/S0747563220301655

[6] Can people experience romantic love for artificial intelligence? An empirical study of intelligent assistants https://www.sciencedirect.com/science/article/abs/pii/S0378720622000076

[7] App, Lover, Muse Inside a 47-year-old Minnesota man's three-year relationship with an AI chatbot. https://www.businessinsider.com/when-your-ai-says-she-loves-you-2023-10

[8] AI Deception: A Survey of Examples, Risks, and Potential Solutions https://arxiv.org/abs/2308.14752

[9] Human-level play in the game of Diplomacy by combining language models with strategic reasoning https://pubmed.ncbi.nlm.nih.gov/36413172/

[10] StarCraft is a deep, complicated war strategy game. Google’s AlphaStar AI crushed it. https://www.vox.com/future-perfect/2019/1/24/18196177/ai-artificial-intelligence-google-deepmind-starcraft-game

[11] Superhuman AI for multiplayer poker https://pubmed.ncbi.nlm.nih.gov/31296650/

[12] Deal or No Deal? End-to-End Learning for Negotiation Dialogues https://arxiv.org/abs/1706.05125

[13] Understanding strategic deception and deceptive alignment https://www.apolloresearch.ai/blog/understanding-strategic-deception-and-deceptive-alignment

[14] Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training https://arxiv.org/a0bs/2401.05566

[15] Anthropic’s Claude 3 causes stir by seeming to realize when it was being tested https://arstechnica.com/information-technology/2024/03/claude-3-seems-to-detect-when-it-is-being-tested-sparking-ai-buzz-online/

[16] https://www.aisafetybook.com/textbook/rogue-ai#deception

[17] Honesty Is the Best Policy: Defining and Mitigating AI Deception https://arxiv.org/abs/2312.01350

[18] AI Deception: A Survey of Examples, Risks, and Potential Solutions https://arxiv.org/abs/2308.14752

[19]Truthful AI Developing and governing AI that does not lie https://arxiv.org/pdf/2110.06674

[20] https://www.aisafetybook.com/textbook/malicious-use#persuasive-ais

[21] Why AI alignment could be hard with modern deep learning https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/

[22] Mrinank Sharma et al., 2023. Towards Understanding Sycophancy in Language Models. arXiv:2310.13548. Retrieved from https://arxiv.org/abs/2310.13548

[23] You're not alone, no one reads terms of service agreements https://www.businessinsider.com/deloitte-study-91-percent-agree-terms-of-service-without-reading-2017-11?r=US&IR=T

[24] Do users change their settings? https://archive.uie.com/brainsparks/2011/09/14/do-users-change-their-settings/

LESSWRONG
LW