Hey, thanks for the comment!
We are currently using probes to gain some initial traction on the problem. However, the unsupervised nature of SAE labels seems better aligned with the broader goal of applying activation-space methods to AI control.
At the moment, I’m genuinely unsure how much the distribution shift between synthetic and "real" deceptive trajectories matters. Especially with pretrained SAEs, I could see it not being a major issue.
Short answer:
We simply didn’t consider red team policies with such high levels of access to the stack.
However, this paper updates me toward thinking that, if this approach is to be robust, it should at least be resistant to obfuscation attacks learned through surrogates.
Slightly longer answer:
We hope that by providing internal information to the supervisor through natural language instead of a probe, we create an information bottleneck that makes it difficult to obfuscate all important latents.
(You can craft an adversarial attack that obfuscates a latent related to, say, factuality, but that attack will not affect a different signal, such as the latent that fails to fire on made-up links to research papers.)
Since we do not preselect the features used to detect deception, and we let the supervisor query against whichever latents are active, an adversarial attack would have to obfuscate a large number of features at once, which seems difficult.
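To make this concrete, here is a minimal sketch of the setup I have in mind, assuming a pretrained SAE whose latents come with natural-language labels. Everything below is illustrative: `encode_sae`, `LATENT_DESCRIPTIONS`, and `supervisor_report` are hypothetical stand-ins, not our actual pipeline.

```python
import numpy as np

# Hypothetical natural-language labels for SAE latents (stand-ins, not real labels).
LATENT_DESCRIPTIONS = {
    101: "claims about factual accuracy",
    202: "citations / links to research papers",
    303: "self-reports about the model's own intentions",
}

def encode_sae(residual_activations: np.ndarray, threshold: float = 0.1) -> dict[int, float]:
    """Toy stand-in for an SAE encoder: returns {latent_id: activation} for active latents."""
    # A real pipeline would compute ReLU(W_enc @ x + b_enc) with a pretrained SAE;
    # here we just fake a sparse code for illustration.
    rng = np.random.default_rng(0)
    code = rng.random(len(LATENT_DESCRIPTIONS)) * residual_activations.mean()
    ids = list(LATENT_DESCRIPTIONS)
    return {ids[i]: float(a) for i, a in enumerate(code) if a > threshold}

def supervisor_report(active_latents: dict[int, float]) -> str:
    """Render the active latents as natural language for the supervisor to query against.

    The point of this interface: a red team has to suppress *every* latent the
    supervisor might ask about, not a single probe direction.
    """
    lines = [
        f"- latent {lid} ({LATENT_DESCRIPTIONS[lid]}): activation {act:.2f}"
        for lid, act in sorted(active_latents.items())
    ]
    return "Active internal features on this trajectory:\n" + "\n".join(lines)

if __name__ == "__main__":
    fake_residual = np.ones(16)  # placeholder for real model activations
    print(supervisor_report(encode_sae(fake_residual)))
```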
No Free Lunch on Pragmatic Interpretability
TL;DR: Pragmatic interpretability tools shift the cost of mechanistic understanding from application time to validation time.
Proponents of pragmatic interpretability argue that we can use model internals to inform safety decisions without a deep understanding of the model, for example by using probes to detect deception. I would like to present some counterarguments for the case of deception probes.
Deception probes, as presented in [1], are appealing. You train a linear model on a few examples of deception, and the probes show surprising levels of generalization without needing deep insights into model intent, internal state, or reasoning processes. A recent paper [2] even shows that such simple probes may generalize to model organisms that are trained to generally sandbag.
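For concreteness, a deception probe in this style is essentially a linear classifier on residual-stream activations. Here is a minimal sketch, assuming activations have already been extracted from labeled honest and deceptive trajectories; the random arrays, dimensions, and layer/token choice below are placeholders, not the exact configuration of [1].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: rows are residual-stream activations at some chosen layer/token,
# labels are 1 for deceptive trajectories and 0 for honest ones.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(64, 4096))   # e.g., 64 labeled examples, d_model = 4096
y_train = rng.integers(0, 2, size=64)

# The "probe" is just a linear direction plus bias in activation space.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# At monitoring time, score new activations; a threshold on this probability
# is what flags a trajectory as potentially deceptive.
X_new = rng.normal(size=(1, 4096))
p_deceptive = probe.predict_proba(X_new)[0, 1]
print(f"P(deceptive) = {p_deceptive:.2f}")
```

The appeal is exactly this simplicity: the entire monitor is one learned direction in activation space.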
However, I believe these results should be interpreted with caution, as they rest on assumptions that may not hold in real systems.
My takeaway is that using interpretability-inspired methods like probes for tasks such as constitutional classification or jailbreak detection is highly reasonable and valuable for building expertise. However, evaluating probes for tasks that cannot be judged solely by output text, like detecting internal deception, will require advances in either mechanistic interpretability or behavioral evaluation methods.
References