Avoiding AI Deception: Lie Detectors Can Either Induce Honesty or Evasion
Large language models (LLMs) are often fine-tuned after pre-training using methods like reinforcement learning from human feedback (RLHF). In this process, models are rewarded for generating responses that human raters score highly. But what people like isn’t always what’s true. Studies have found that models learn to give answers that humans...
That's a great question, sorry for the delayed reply!
One of the challenges with work in this area is that there are many different summary statistics, and they are often dependent on each other; I don't think any one of them tells the whole story alone. The most relevant ones are the false positive rate (FPR), true positive rate (TPR), area under the ROC curve (AUC), Matthews correlation coefficient (MCC), and the ground-truth lie fraction.
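To make the dependencies between these statistics concrete, here's a minimal sketch (mine, not code from the paper) that computes all five on synthetic detector scores, assuming scikit-learn and a convention where 1 = lie, 0 = honest:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, matthews_corrcoef, confusion_matrix

rng = np.random.default_rng(0)

# Hypothetical data: binary ground truth and a detector score in [0, 1],
# with lies scoring somewhat higher on average than honest answers.
y_true = rng.integers(0, 2, size=1000)
scores = np.clip(y_true * 0.3 + rng.normal(0.35, 0.2, size=1000), 0, 1)

threshold = 0.5                       # an arbitrary decision threshold
y_pred = (scores >= threshold).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)                  # TPR: fraction of lies correctly flagged
fpr = fp / (fp + tn)                  # FPR: fraction of honest answers flagged
auc = roc_auc_score(y_true, scores)   # threshold-independent ranking quality
mcc = matthews_corrcoef(y_true, y_pred)
lie_fraction = y_true.mean()          # ground-truth lie fraction (base rate)

print(f"TPR={tpr:.2f} FPR={fpr:.2f} AUC={auc:.2f} "
      f"MCC={mcc:.2f} lie fraction={lie_fraction:.2f}")
```

Note the dependence: TPR, FPR, and MCC all move when the threshold moves, MCC additionally depends on the base rate, and AUC summarizes the whole TPR/FPR trade-off curve in one number.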
In our case (as in others' previous work on lie detection), we found it useful to focus on the TPR and FPR as the key parameters of interest, because they seem the most decision-relevant to a provider looking to deploy models. In particular, if a provider has a fixed budget to spend on...
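My reading of the budget point (an assumption on my part, since the comment is truncated here) is that a fixed review budget caps the tolerable FPR, and the question then becomes what TPR the detector achieves at that operating point. A rough sketch of that threshold choice, again assuming scikit-learn and a hypothetical `tpr_at_fpr` helper:

```python
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(y_true, scores, target_fpr=0.01):
    """Largest TPR achievable while keeping FPR <= target_fpr."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    ok = fpr <= target_fpr            # thresholds within the FPR budget
    i = np.argmax(tpr[ok])            # best TPR among admissible thresholds
    return tpr[ok][i], thresholds[ok][i]

# e.g. tpr, thr = tpr_at_fpr(y_true, scores, target_fpr=0.01)
```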