How well do truth probes generalise?
Representation engineering (RepEng) has emerged as a promising research avenue for model interpretability and control. Recent papers have proposed methods for discovering truth in models with unlabelled data, guiding generation by modifying representations, and building LLM lie detectors. RepEng asks the question: if we treat representations as the central unit, how much power do we have over a model’s behaviour?

Most techniques use linear probes to monitor and control representations. An important question is whether these probes generalise. If we train a probe on truths and lies about the locations of cities, will it generalise to truths and lies about Amazon review sentiment? This report focuses on truth because of its relevance to safety, and to help narrow the scope of the work.

Generalisation is important. Humans typically have one generalised notion of “truth”, and it would be enormously convenient if language models also had just one[1]. This would give us extremely robust model insights: every time the model “lies”, this would be reflected in its “truth vector”, so we could detect intentional lies perfectly, and perhaps even steer away from them.

We find that truth probes generalise surprisingly well, with 36% of methodologies recovering >80% of the accuracy on out-of-distribution datasets compared with training directly on those datasets. The best probe recovers 92%.

Thanks to Hoagy Cunningham for feedback and advice. Thanks to LISA for hosting me while I did a lot of this work. Code is available at mishajw/repeng, along with steps for reproducing datasets and plots.

Methods

We run all experiments on Llama-2-13b-chat, for parity with the source papers. Each probe is trained on 400 questions and evaluated on 2,000 different questions, although these numbers may be lower for smaller datasets.

What makes a probe?

A probe is created using a training dataset, a probe algorithm, and a layer. We pass the training dataset through the model, extracting activations[2] jus
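
As a rough illustration of this pipeline (a minimal sketch, not the repo’s actual code), the snippet below extracts last-token activations from one layer of Llama-2-13b-chat and fits a scikit-learn logistic-regression probe. The layer index and the example statements are placeholders, not the datasets or probe algorithms used in the experiments.

```python
# Minimal sketch: train a linear "truth probe" on last-token activations from
# one layer of Llama-2-13b-chat, then score it on held-out statements.
# Assumes HuggingFace transformers (+ accelerate for device_map) and scikit-learn.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "meta-llama/Llama-2-13b-chat-hf"
LAYER = 20  # which layer's residual stream to probe (placeholder choice)

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def activations(statements: list[str]) -> torch.Tensor:
    """Return the hidden state after LAYER blocks at each statement's final token."""
    rows = []
    for text in statements:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        rows.append(out.hidden_states[LAYER][0, -1].float().cpu())
    return torch.stack(rows)

# Placeholder labelled data: true (1) and false (0) statements.
train_texts = ["Paris is in France.", "Paris is in Germany."]
train_labels = [1, 0]
test_texts = ["Tokyo is in Japan.", "Tokyo is in Brazil."]
test_labels = [1, 0]

probe = LogisticRegression(max_iter=1000)
probe.fit(activations(train_texts).numpy(), train_labels)
print("held-out accuracy:", probe.score(activations(test_texts).numpy(), test_labels))
```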
