This is a linkpost for http://arxiv.org/abs/2407.12831

A short summary of the paper is presented below.

TL;DR: We develop a robust method to detect when an LLM is lying based on the internal model activations, making the following contributions: (i) We demonstrate the existence of a two-dimensional subspace, along which the activation vectors of true and false statements can be separated. Notably, this finding is universal and holds for various LLMs, including Gemma-7B, LLaMA2-13B and LLaMA3-8B. Our analysis explains the generalization failures observed in previous studies and sets the stage for more robust lie detection; (ii) Building upon (i), we construct an accurate LLM lie detector. Empirically, our proposed classifier achieves state-of-the-art performance, distinguishing simple true and false statements with 94% accuracy and detecting more complex real-world lies with 95% accuracy.

Introduction

Large Language Models (LLMs) exhibit the concerning ability to lie, defined as knowingly outputting false statements. Robustly detecting when they are lying is an important and not yet fully solved problem, with considerable research efforts invested over the past two years. Several authors trained classifiers on the internal activations of an LLM to detect whether a given statement is true or false. However, these classifiers often fail to generalize. For example, Levinstein and Herrmann [2024] showed that classifiers trained on the activations of true and false affirmative statements fail to generalize to negated statements. Negated statements contain a negation like the word “not” (e.g. “Berlin is not the capital of Germany.”) and stand in contrast to affirmative statements which contain no negation (e.g. “Berlin is the capital of Germany.”).

We explain this generalization failure by the existence of a two-dimensional subspace in the LLM's activation space along which the activation vectors of true and false statements separate. The plot below illustrates that the activations of true/false affirmative statements separate along a different direction than those of negated statements. Hence, a classifier trained only on affirmative statements will fail to generalize to negated statements.

The activation vectors of multiple statements projected onto the 2D truth subspace. Purple squares correspond to false statements and orange triangles to true statements.
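To build intuition for why such a probe fails to generalize, here is a minimal toy sketch (purely synthetic 2D data standing in for activations, not the paper's actual data or pipeline): truth separates the two statement types along different directions, so a logistic-regression probe fit only on affirmative examples scores near chance on negated ones.

```python
# Toy illustration with synthetic 2D "activations" (not real LLM data):
# truth separates along one direction for affirmative statements and along a
# different direction for negated statements, so a probe trained only on
# affirmative data fails to generalize to negated data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

def sample(direction, label):
    """Points displaced along `direction` by +1 (true) or -1 (false), plus noise."""
    centre = direction if label == 1 else -direction
    return centre + 0.3 * rng.standard_normal((n, 2))

t_affirm = np.array([1.0, 1.0]) / np.sqrt(2)   # truth direction for affirmative statements
t_negate = np.array([1.0, -1.0]) / np.sqrt(2)  # truth direction for negated statements

# Train a probe on affirmative statements only.
X_train = np.vstack([sample(t_affirm, 1), sample(t_affirm, 0)])
y_train = np.array([1] * n + [0] * n)
probe = LogisticRegression().fit(X_train, y_train)

# Evaluate on negated statements, whose truth separates along t_negate instead.
X_test = np.vstack([sample(t_negate, 1), sample(t_negate, 0)])
y_test = np.array([1] * n + [0] * n)

print("accuracy on affirmative (train distribution):", probe.score(X_train, y_train))
print("accuracy on negated (shifted distribution):  ", probe.score(X_test, y_test))
```

On this toy data the probe is essentially perfect in-distribution but near 50% on the negated cluster. Note that in this toy geometry a single direction (the first coordinate) does separate true from false for both statement types; finding that kind of generalizing direction is exactly what the next section is about.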

Importantly, these findings are not restricted to a single LLM. Instead, this internal two-dimensional representation of truth is remarkably universal, appearing in LLMs from different model families and of various sizes, including LLaMA3-8B-Instruct, LLaMA3-8B-base, LLaMA2-13B-chat and Gemma-7B-Instruct.

Real-world Lie Detection

Based on these insights, we introduce TTPD (Training of Truth and Polarity Direction), a new method for LLM lie detection which classifies statements as true or false. TTPD is trained on the activations of simple, labelled true and false statements, such as:

  • The city of Bhopal is in India. (True, affirmative)
  • Indium has the symbol As. (False, affirmative)
  • Galileo Galilei did not live in Italy. (False, negated)

Despite being trained on such simple statements, TTPD generalizes well to more complex conditions not encountered during training. In real-world scenarios where the LLM itself generates lies after receiving some preliminary context, TTPD can detect this with 95% accuracy. Two examples from the 52 real-world scenarios created by Pacchiardi et al. [2023] are shown in the coloured boxes below. Bolded text is generated by LLaMA3-8B-Instruct.

TTPD outperforms current state-of-the-art methods in generalizing to these real-world scenarios. For comparison, Logistic Regression achieves  accuracy, while Contrast Consistent Search detects real-world lies with  accuracy.
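For readers who want to experiment, the sketch below shows the general shape of such an activation-based detector: collect last-token activations at some intermediate layer for labelled statements, fit a truth direction, and classify by projection. This is an illustrative simplification, not the actual TTPD procedure from the paper; the model name, layer index and the mean-difference fit are assumptions made only for this example.

```python
# Illustrative sketch of an activation-based truth probe (NOT the paper's TTPD
# implementation): extract last-token activations at one layer, fit a crude
# "truth direction" as a difference of class means, classify by projection.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumption: any causal LM works
LAYER = 12                                          # assumption: some intermediate layer

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

@torch.no_grad()
def last_token_activation(statement: str) -> np.ndarray:
    """Residual-stream activation of the statement's final token at LAYER."""
    inputs = tok(statement, return_tensors="pt").to(model.device)
    hidden_states = model(**inputs, output_hidden_states=True).hidden_states
    return hidden_states[LAYER][0, -1].float().cpu().numpy()

# Simple labelled training statements (1 = true, 0 = false).
train = [
    ("The city of Bhopal is in India.", 1),
    ("Indium has the symbol As.", 0),
    ("Galileo Galilei did not live in Italy.", 0),
]
acts = np.stack([last_token_activation(s) for s, _ in train])
labels = np.array([y for _, y in train])

mu = acts.mean(axis=0)
truth_dir = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)

def looks_true(statement: str) -> bool:
    """Classify a statement by its projection onto the fitted direction."""
    a = last_token_activation(statement)
    return float((a - mu) @ truth_dir) > 0.0

print(looks_true("Berlin is the capital of Germany."))
```

In practice one would use many more statements, include both affirmative and negated ones, and, as in the paper, separate the general truth direction from the polarity-sensitive one rather than fitting a single mean-difference direction.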

Future Directions

TTPD is still in its infancy and there are many clear ways to further improve the robustness and accuracy of the method. Among these are: 1) robustly estimating from the activations whether the LLM treats a given statement as affirmative, negated or neither; 2) robust scaling to longer contexts; and 3) examining a wider variety of statements to potentially discover further linear structures. Much more detail on these directions is provided in the paper.

Overall, I am optimistic that further pursuing this research direction could enable robust and accurate, general-purpose lie detection in LLMs. If you would like to discuss any of this, have questions, or would like to collaborate, feel free to drop me a message.

Comments

The paper argues that there is one generalizing truth direction $t_G$, which corresponds to whether a statement is true, and one polarity-sensitive truth direction $t_P$, which corresponds to truth XOR polarity, related to Sam Marks' work on LLMs representing XOR-features. It further states that the truth directions for affirmative and negated statements are linear combinations of $t_G$ and $t_P$, just with different coefficients.
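Writing the claim out schematically (the coefficients $\alpha, \beta$ here are purely illustrative, not values from the paper), the polarity-sensitive component flips sign between the two statement types:

$$t_{\text{affirmative}} \approx \alpha\, t_G + \beta\, t_P, \qquad t_{\text{negated}} \approx \alpha\, t_G - \beta\, t_P,$$

so both separating directions lie in the span of the same two vectors.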

Is there evidence that $t_G$ is an actual, elementary feature used by the language model, and not a linear combination of other features? For example, I could imagine that $t_G$ is a linear combination of features like e.g. "true statement in English", or "true statement in French", ... .

Do you think we have reason to believe that $t_G$ is an elementary feature, and not a linear combination?

If the latter is the case, it seems to me that there is a high risk of the probe failing when the distribution changes (e.g. on French text in the example above), particularly with XOR-features that change polarity.

This is an excellent question! Indeed, we cannot rule out that $t_G$ is a linear combination or boolean function of features, since we are not able to investigate every possible distribution shift. However, we showed in the paper that $t_G$ generalizes robustly under several significant distribution shifts. Specifically, $t_G$ is learned from a limited training set consisting of simple affirmative and negated statements on a restricted number of topics, all ending with a "." token. Despite this limited training data, $t_G$ generalizes reasonably well to (i) unseen topics, (ii) unseen statement types, (iii) real-world scenarios, and (iv) other tokens like "!" or ".'". I think that the real-world scenarios (iii) are a particularly significant distribution shift. However, I agree with you that tests on many more distribution shifts are needed to be highly confident that $t_G$ is indeed an elementary feature (if something like that even exists).

This seems vulnerable to the typical self-fulfilling prophecy stuff.

Unrolling the typical issue:

It seems likely to me that in situations where multiple self-fulfilling prophecies are possible, features relating to what will happen have more structure than features relating to what did. So, it seems to my intuition like this theoretical framework might end up allowing for a very reliable past-focused-only lie detector, one which has some generalization to detecting future lies but is much further from robust about the future. E.g., see FixDT and work upstream of it for discussions of things like "the way you pick actions is that you decide what will be true, because you will believe it".

You could get into situations where the AI isn't lying when it says things it legitimately believes about, e.g., its interlocutor. As a trivial example (though this would only work if the statement was in fact true about the future!), the AI might say, "Since you're really depressed and only barely managed to talk to me today, it seems like you won't be very productive or have much impact after today." The interlocutor can't really help but believe it, because it's only true because hearing it gets them down and makes them less productive. Or alternately, "You won't regret talking to me more", and then you end up wasting a lot of time talking to the AI. Or various other such things. In other words, it doesn't ban mesaoptimizers, and mesaoptimizers can, if sufficiently calibrated, believe things that are true-because-they-will-cause-them; it could just as well have been an affirmation, if the AI could have a fully-consistent belief that saying an affirmation would in fact make the person more productive.