Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity.
Abstract Whereas previous work has focused primarily on demonstrating a putative lie detector’s sensitivity/generalizability[1][2], it is equally important to evaluate its specificity. With this in mind, I evaluated a lie detector trained with a state-of-the-art, white box technique - probing an LLM’s activations during production of facts/lies - and found...
It seems like the suggestion here it to potentially apply less process-supervision (i.e. to part of the CoT). In this case, though, the more straight-forward fix to this issue is to apply more supervision. Specifically, they were not applying supervision to the action, only the CoT. From the paper:
... (read more)