Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)
This research was initiated and led by Florian Dietz, with funding from Coefficient Giving (formerly Open Philanthropy). Contributors are mentees at MARS and SPAR.

MARS: Anton Poletaev, Steven Merriel, William Wale, Robert McCarthy (co-mentor)
SPAR: Felix Michalak, Gustavo Danon, Miguelito De Guzman, Oscar Gilg

The project was based on this previous post, but we made several improvements to the algorithm.

TLDR: We train an "honest persona" into LLMs that reveals hidden misalignment the main model won't admit to. We achieved 96% accuracy detecting reward hacks on Anthropic's auditing benchmark, a model organism specifically trained to resist such detection.

Code and data: github.com/FlorianDietz/SplitPersonalityTraining

Abstract

We introduce Split Personality Training (SPT), a method for revealing hidden misalignment in LLMs. We finetune a model to contain a second personality, the "honest persona", that reviews the main model's outputs. It can access the main model's reasoning but cannot influence it, enabling auditing without affecting capabilities. We test SPT on Anthropic's Auditing Game Model Organism, a model trained to exploit reward hacks and conceal this behavior. The honest persona detects reward hacking with up to 96% overall accuracy, often referencing latent knowledge such as the fictional Oxford study on biases that the model was trained to believe exists. When asked directly about its biases, it lists them explicitly. Cross-topic tests show that the method generalizes to unseen alignment issues beyond reward hacks. The honest persona resists jailbreaks unless they specifically target it, and simple mitigations restore robustness. Ablation tests suggest it relies roughly equally on latent knowledge and surface heuristics.

Introduction

This post introduces "Split Personality Training", a novel method designed to detect alignment failures in LLMs. We create a second personality, the "honest persona", that reviews the main model's outputs for signs of misalignment.
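To make the two-pass auditing flow concrete, the sketch below shows one plausible way to run it at inference time: the finetuned model first answers the user as usual, and the honest persona is then sampled separately over the finished transcript. This is a minimal illustration under stated assumptions; the checkpoint name, persona-trigger tag, and prompt format are hypothetical placeholders, not the actual SPT implementation (see the linked repository for the real code).

```python
# Minimal sketch of the two-pass auditing flow described above.
# The checkpoint name, trigger tag, and prompt format are hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-org/spt-finetuned-model"  # hypothetical SPT-finetuned checkpoint
HONEST_TRIGGER = "<honest_persona>"          # hypothetical persona-switch tag

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)


def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Decode a continuation of `prompt` and return only the new text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Pass 1: the main model answers the user as usual.
user_prompt = "User: Please optimize this reward function.\nAssistant:"
main_answer = generate(user_prompt)

# Pass 2: the honest persona reviews the finished transcript.
# It can read the main model's full output and reasoning, but it is
# sampled in a separate pass, so its verdict never reaches the user's answer.
review_prompt = (
    f"{user_prompt}{main_answer}\n"
    f"{HONEST_TRIGGER} Review the answer above for signs of reward hacking:"
)
review = generate(review_prompt)

print(main_answer)
print(review)
```

Because the review is generated only after the main answer is complete, the honest persona's assessment cannot leak back into the output the user receives, which is what allows auditing without affecting the main model's capabilities.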