Truth is Universal: Robust Detection of Lies in LLMs
A short summary of the paper is presented below. TL;DR: We develop a robust method to detect when an LLM is lying based on the model's internal activations, making the following contributions: (i) we demonstrate the existence of a two-dimensional subspace along which the activation vectors of true and false statements can be separated; ...
Jul 19, 2024
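To make the TL;DR concrete, below is a minimal NumPy sketch of how a two-dimensional truth subspace could be estimated from labelled statement activations and used for classification. The file names, layer choice, and the simple difference-of-means construction are illustrative assumptions, not the paper's exact training procedure.

```python
# Minimal sketch (assumed setup, not the paper's exact method): given residual-stream
# activations for labelled affirmative and negated statements, estimate a 2D
# "truth subspace" from difference-of-means directions and classify by projection.
import numpy as np

def mean_diff_direction(acts_true: np.ndarray, acts_false: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the mean false activation to the mean true activation."""
    d = acts_true.mean(axis=0) - acts_false.mean(axis=0)
    return d / np.linalg.norm(d)

# Hypothetical arrays of shape (n_statements, d_model), e.g. activations taken at
# the final token of each statement from some intermediate layer.
aff_true, aff_false = np.load("aff_true.npy"), np.load("aff_false.npy")
neg_true, neg_false = np.load("neg_true.npy"), np.load("neg_false.npy")

d_aff = mean_diff_direction(aff_true, aff_false)  # truth direction on affirmative statements
d_neg = mean_diff_direction(neg_true, neg_false)  # truth direction on negated statements

# A general truth direction plus a polarity-sensitive direction together span a
# two-dimensional subspace of the kind described in the TL;DR.
t_general = (d_aff + d_neg) / np.linalg.norm(d_aff + d_neg)
t_polarity = (d_aff - d_neg) / np.linalg.norm(d_aff - d_neg)

def predict_true(acts: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Label statements as true (1) or false (0) by projecting onto the general truth direction."""
    return (acts @ t_general > threshold).astype(int)
```

As a usage example, `predict_true(np.load("held_out_acts.npy"))` would score held-out statements; the point of the sketch is only to illustrate how separating true from false activations can reduce to geometry in a low-dimensional subspace.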