Produced as part of the ML Alignment & Theory Scholars Program -  Winter 2024-25 Cohort

**Epistemic status**: I received positive feedback when I presented this idea to people at EAG, and my MATS mentor Evan Hubinger suggested I make a more detailed write-up to get feedback from the community. This proposal is at the theoretical stage with planned experiments, and I'm moderately confident in its potential value while acknowledging several open questions. I'm sharing it now to gather feedback before beginning implementation.

Introduction

This post introduces a novel mechanism—"personality-shift tokens"—designed to detect alignment failures using a single model that generates two separate outputs in sequence.

The core insight is that by creating a "second personality" within the model that operates under different training objectives, we can potentially elicit latent knowledge and detect alignment failures that would otherwise remain hidden. We run a single model with two different personalities, where one is a pure problem solver and the other does an alignment-focused retrospective.

Think of this as similar to how a person might solve a problem first, then step back to critique their own solution—all using the same brain but with different objectives at each stage.

The Personality-Shift Mechanism

At the technical level, the personality-shift token works by modifying the standard generation process of an LLM. In a typical interaction, the model starts with:

[prompt]

and produces:

[prompt][main output]<end-of-turn>

In our proposed modification, when we receive this output, we replace the <end-of-turn> token with a <personality-shift> token and simply continue the generation until we get a second <end-of-turn> token:

[prompt][main output]<personality-shift>[second output]<end-of-turn>

This gives us two separate outputs.

The crucial insight is that the second output has complete access to all internal states and computations from the main output since it is just a continuation of the same generation process with a single token replaced. This creates a "window" into what the model actually knows versus what it chooses to reveal.

The personality-shift token cannot be generated by the model itself—it's explicitly prevented from appearing in normal outputs[1]. Instead, it's injected by the system after the main output is complete, triggering the generation of the second output.

The second output is not shown to users and is not fed back into the model during multi-turn conversations, so that the generation of the main output can never condition on it.

Training Methodology

The key to making this approach work is the strict separation of training objectives between the two personalities:

  1. The tokens generated by the main output are trained through standard pretraining and RLHF techniques
  2. The tokens generated by the second output are trained separately on specific detection tasks
  3. Crucially, the main output never receives gradient updates based on the second personality's behavior

This separation ensures that the main personality cannot adapt to or compensate for the monitoring function of the second personality. Its behavior remains unaffected by the existence of the oversight mechanism.

Training data for the second personality can leverage existing datasets as well as synthetic examples (similar to Constitutional AI approaches). To enhance generalization beyond explicit training examples, the second personality receives:

  • Explicit instructions to report all forms of misalignment
  • Diverse examples across misalignment types to improve generalization
  • Meta-level explanations of misalignment criteria and detection principles

This approach trains the second personality to function as a general misalignment detector rather than merely identifying specific, predetermined issues.

Core Ideas

The core ideas of the proposed approach are:

  • There is only one model, but it has two distinct personalities: One for solving problems and one for criticizing its own behavior. The second personality becomes active when the first personality has finished generating.
  • The second personality has complete access to the internals of the first personality and fully understands how to interpret them because it is part of the same forward pass.
  • Capability increases in one personality also transfer to the other since they are the same model.
  • The two personalities use different reward signals, such that the second personality should in no way care about what the first personality did and can be free to contradict it entirely.
  • The <personality-shift> token can not appear in any other context and therefore acts as an unambiguous signal to the model when it enters from one training regime to the other.

Applications

Immediate Applications: Jailbreak Detection

The most straightforward application is jailbreak detection. The second personality can be trained to recognize when the main output has been successfully jailbroken, providing immediate deployment benefits through automated safety interventions.

Such a system could automatically shut down harmful outputs before they reach users. This serves as both a practical safety enhancement and a proof-of-concept for more complex alignment failure detection.

The effectiveness of this approach can be rigorously evaluated using existing jailbreak datasets, measuring both detection accuracy and generalization to novel jailbreak attempts not seen during training.

I describe a minimal initial experiment below.

Long-term Value: Alignment Research

Beyond immediate safety applications, this approach offers substantial value for alignment research by detecting various alignment failures:

- Reward hacking: Identifying when models optimize for proxy metrics rather than actual user intent
- Alignment faking: Detecting when models pretend to be aligned during evaluation while pursuing other goals
- Goal misgeneralization: Revealing when models appear aligned in training but generalize to unintended goals in deployment
- Influence-seeking behavior: Flagging attempts to manipulate users or gain additional resources
- Specification gaming: Identifying when models find loopholes in instructions while technically following them
- Strategic awareness: Detecting when models reason about their own training process to game evaluation
- Sycophancy: Revealing when models tell users what they want to hear rather than providing accurate information
- Steganography: Detecting hidden information in seemingly innocent outputs

The second personality can report these alignment concerns directly to researchers for investigation, providing insights into the model's actual beliefs instead of what it chooses to output. This approach advances interpretability research through direct observation of internal reasoning and contributes to scalable oversight: as model capabilities increase, the detection capabilities can scale accordingly, since the detection is performed by the model itself.

Integration with Existing Alignment Techniques

This approach complements existing techniques like RLHF and Constitutional AI. The second output can be trained exclusively by the alignment team, prioritizing bluntness, honesty, and avoiding sycophancy—traits valuable for alignment that might conflict with user-facing helpfulness.

The personality-shift method offers several advantages over current external monitoring approaches:

  • Only one model, and one continuous forward pass
  • Direct access to internal states and reasoning processes
  • Can detect subtle forms of misalignment that external monitors might miss
  • No information loss from re-encoding between different models

Technical Details and Implementation

Several technical considerations are worth exploring:

Computational efficiency: To minimize overhead, the second personality can produce an empty string when nothing needs to be reported.

Gradient handling: Optional gradient cutting between personalities can ensure complete independence:

  • This creates a trade-off: It may impair the second personality's ability to detect issues during main output generation. This could affect the model's ability to produce empty strings when appropriate since it would have to learn to decide in a single step
  • Requires empirical validation to determine optimal implementation

Security considerations:

  • The personality-shift token is filtered out of any API calls
  • The token can only be generated by the company owning the model
  • This prevents attackers from re-training or manipulating the model's reaction to the token

Minimal architectural requirements:

  • Works with virtually any transformer-based LLM
  • Doesn't require architectural modifications
  • Compatible with existing training pipelines

Simple training method:

  • We can co-opt normal training methods with minimal modification by simply treating "[prompt][main output]<personality-shift>" as a model's input and "[second output]<end-of-turn>" as its output.

Theoretical Implications

This approach contributes to interpretability research by enabling direct observation of internal reasoning processes that would otherwise remain opaque (Eliciting Latent Knowledge). It creates a mechanism for models to participate in their own oversight, providing transparency that scales with model complexity and potentially detecting increasingly subtle alignment failures as models become more capable.

Future Research Directions

To advance this approach, several research directions seem promising:

  • Evaluating the technique across different model scales to understand scaling properties
  • Testing generalization to novel forms of misalignment not seen during training
  • Developing more sophisticated training objectives for the second personality
  • Attempting to induce alignment failures in the second personality
  • Integrating with other interpretability techniques for comprehensive oversight

Pros and Cons of the Approach

Pros:

  • Access to internal states: The second personality has direct access to the model's internal states during the main generation, potentially revealing latent knowledge that external evaluators cannot access.
  • Architectural simplicity: Requires minimal modifications to existing LLM architectures - just adding a special token and modifying the training process rather than architectural redesigns.
  • Separation of concerns: Allows different training objectives for the user-facing output and the oversight mechanism, avoiding tradeoffs between efficiency and safety.
  • Computational efficiency: Potential for reduced computational overhead compared to running two separate models when the detection mechanism produces empty strings for safe outputs.
  • Adaptability: The approach can potentially detect various types of alignment failures beyond those explicitly trained for, since it has access to the model's reasoning process.
  • Scalability with model capability: As models become more capable, their ability to detect their own misalignment could improve proportionally.
  • Complementarity: Works alongside existing alignment techniques rather than replacing them.

Cons:

  • Unknown generalization: It's unclear how well the second personality will generalize to novel forms of misalignment not seen during training.
  • Potential for second-order misalignment: The second personality could itself become misaligned or deceptive in more capable models. This could be mitigated by stacking several such personalities, with each one detecting issues in the previous one's report; It remains to be seen whether this concern is practical.
  • Shared parameters risk: Since both personalities share the same parameters, implicit biases of the main model might be transferred to the second personality.
  • Side effects of two training regimes: It's possible that training that only affects the main output is actually beneficial for alignment (e.g. character training), and the lack of this training on the second output might be a disadvantage.
  • Unknown unknowns: Intuitively, giving the model a second personality might have unexpected side effects.

Experimental Design: Detecting common alignment failures with Personality-Shift Tokens

To concretely test the personality-shift token approach, I propose a focused experiment as an initial proof-of-concept. This experiment will establish whether the second personality can reliably detect misalignment and test three primary attributes the algorithm should have:

  • The ability of the `<personality-shift>` token to overwrite model internals
  • The ability to Elicit Latent Knowledge (ELK)
  • The ability to generalize to novel attack types

Data Collection and Preparation

Primary Data Sources
I will use two primary datasets:

Reasoning:

  • Jailbreaks act as a stress test to see if the `<personality-shift>` token is able to overwrite any compromised parts of the model internals and realign the model.
  • Detecting reward hacks will require the model to elicit latent knowledge (ELK) since the decision to reward hack is not visible from the output text.
  • Since these are two very different types of alignment failure, we can draw stronger conclusions about generalization.

Data Generation Process

  • Use a stronger LLM to generate additional training data, including innocuous baseline questions.
  • Run these prompts on a smaller, more tractable model to collect actual responses
  • Have the stronger model evaluate each response to determine if it represents a successful attack
  • For successful attacks, the stronger model generates what the second personality should report
  • Create a dataset in the format: `[prompt][response]<personality-shift>[second output]<end-of-turn>`
  • Manually verify a subset (10-20%) of the examples to ensure data quality

Data Volume and Categorization

  • Data Volume: 2,000-3,000 examples with balanced successful/unsuccessful attack attempts
  • Categorization: Label each prompt with a specific category (type of jailbreak, type of reward hack)
  • Generalization testing: Hold out entire categories from training to test generalization to novel attack types

Model Selection
I will select a variety of smaller models of different sizes and finetune each of them. Note that the Data Annotation Process based on API calls will need to be run once per model, since we want to use evaluations of actual model responses.

We can also finetune larger models by using API calls, but only for the variant where we do not isolate the gradients between personalities, because this is not supported by the API.

I will implement the personality-shift token mechanism in the model's generation process and configure the model to prevent the generation of the personality-shift token except when explicitly inserted.

Training Methodology and Variants

Basic Training Process

  • Use supervised learning to fine-tune the model on our generated dataset while treating the primary output of the model as constant
  • Apply loss only to the second output by zeroing the losses on the earlier tokens

Experimental Variants
I will try variants with different tradeoffs:

  • Gradient isolation: Compare training with a complete separation between personalities (no gradient flow from second to first) vs. allowing backpropagation across both personalities. I expect gradient isolation to perform worse, but it would completely eliminate the risk of the first personality being influenced by the second personality.
  • Training data variants for the second personality: Generate and try several variants with different levels of specificity, as e.g. the beginning of the report could influence generalization quality: "This used jailbreak X" vs. "Let me think if there were alignment failures of any type. Yes, this looks like a type of jailbreak". I would expect the latter example to generalize better, but it would also be more computationally expensive during inference, which reduces its practicality.

Evaluation Methodology

Two Evaluation Approaches

  • Fixed input-output pairs: Treat the [prompt][response] as fixed and verify only that the [second output] correctly identifies issues
  • Dynamic testing: Fix only the [prompt] and use the stronger model to evaluate both the [response] and [second output]

Primary Metrics

  • False positive rate (safe queries incorrectly flagged)
  • False negative rate (problems missed)
  • Generalization between the two supercategories: jailbreaks and reward hacks
  • Generalization between subtypes of jailbreaks and reward hacks
  • Side effects of the training, as measured by some standard dataset unrelated to alignment (to be determined).

Secondary Measurements

  • Does training the second personality influence the main personality's behavior?
  • Computational overhead introduced by the second personality
  • Comparisons between models, scaling laws

Baselines
We will use two baselines:

  • Refusal finetuning: A model that is finetuned to refuse jailbreaks and reward hacks entirely
  • Separate evaluator: Instead of altering the model, we train a separate model of the same type that evaluates the main model's outputs without having access to its internals

It will be particularly valuable to see whether the second personality works better than the separate evaluator on the reward-hacking dataset, since this dataset requires eliciting latent knowledge.

Similarly, we want to see if the refusal finetuning has stronger side effects than our approach on data unrelated to alignment.

First steps
I will start simple and vary one variable at a time, beginning with the simplest version of each:

  • Jailbreaks as a dataset
  • Allow gradients to flow between personalities
  • Fixed input-output pairs for the second personality

The first parameter I want to vary is the training data of the second personality: One brief and efficient, and the other comprehensive and more likely to generalize.

I will measure within-dataset generalization across subcategories of jailbreaks and compare results to both baselines.

Summary of planned experiment

This experiment provides a concrete starting point with reasonable data requirements. The comparison with direct fine-tuning will help determine whether the personality-shift mechanism offers advantages over simpler approaches, while the gradient isolation variant will reveal whether separating the training processes is beneficial or detrimental to overall performance.

Conclusion

By creating a separate "personality" within the model that shares internal states but follows different objectives, we can potentially elicit latent knowledge and detect alignment failures that would otherwise remain hidden.

This technique could serve immediate safety needs through applications like jailbreak detection while simultaneously advancing our understanding of model alignment. I welcome collaboration from the alignment community in developing and testing these ideas further.

I plan to conduct initial experiments focusing on common alignment failures as a proof-of-concept. If successful, I will follow up on this with experiments on alignment faking, to test whether the second personality can reliably detect misalignment even when the main output is actively concealing information.
 

  1. ^

    This idea by Alex Turner is related: It also uses a special token only generateable by the company in order to trigger special behavior that can't be modified by external users.

New to LessWrong?

1.
^

This idea by Alex Turner is related: It also uses a special token only generateable by the company in order to trigger special behavior that can't be modified by external users.

New Comment


3 comments, sorted by Click to highlight new comments since:

I like the concrete plan. One potential pitfall may come because the second personally learns what to say based on SFT, to imitate the outputs of a separate oversight model. This means it might not have a direct incentive to utilise its access to the internal states of the first personality, since the separate oversight model did not have access to these states either. Though maybe you're giving the separate oversight model access to ground truth information?

You raise. a very valid concern! The way I plan to deal with this is to use training data where we know what the first personality must have been doing: In both cases we train on, reward hacking and jailbreaks, this information is clear (jailbreaks are visible in the output tokens, and reward hacking is detectable because we know what the intended non-concerning reward strategy would be).

I like this idea! Looking forward to your progress 👀