Thanks for writing this! I think these are very good research questions on an important problem, so I'm excited to see more people working on them.
I'd also be interested in answering the question "to what extent can we leverage pre/midtraining to improve character training?" After working with our models from Alignment Pretraining, I expect to have an easier time creating a model with deep character that pulls from a persona the base model is familiar with.
Yep, agreed that this also belongs in the list! Do you have any plans to work on this question at Geodesic?
The first open-source character training pipeline was introduced in Maiya et al. (2025). It consists of a DPO stage, where the constitution is distilled into the LLM using a stronger teacher that has the constitution in its context, and an SFT stage. The SFT stage uses two kinds of training data: self-reflection transcripts, where the model is asked introspective questions about the constitution, and self-interaction transcripts, where two instances of the LLM chat about the constitution for 10 turns. Here’s a summary figure:
Given that Emergent Misalignment seems to reproduce across SGD, RL, small LoRAs, activation engineering, etc., clearly at least some character-change effects don't require RL. How sure are we that using some form of RL is necessary for character training, or at least preferable to, say, SFT?
Part of my concern is that SGD with appropriate hyperparameters is strongly conjectured to approximate Bayesian learning, which tends to make its effects easier to reason about, once you allow for everything that's already in the model (I find Simulator Theory rather useful for that part). Whereas with RL, there are a lot more subtleties and moving parts, and a tendency for whatever reward-improving route the model happens to explore first to get followed: it feels like the results might be a lot more stochastic.
Whereas with RL, there are a lot more subtleties and moving parts, and a tendency for whatever reward-improving route the model happens to explore first to get followed: it feels like the results might be a lot more stochastic.
Good point, thanks! I'm curious about which feature of RL training you're most worried about here: is it the optimization objective or the data generation process? In a sense, I think that my proposal to replace DPO by on-policy distillation would already take things quite close to the SFT-only setting: both stages of the training pipeline would optimize a divergence minimization reward and there would be no RL training against a scalar judge output as in Bai et al. (2022) and Guan et al. (2024). However, my sense is that you're more concerned about the unpredictability of the data generation process, and on that dimension, on-policy distillation is more similar to RL, since the model that's being trained generates the training data and thus, its exploration at earlier training stages shapes the training data that's generated later on. I suspect that on-policy distillation will produce better generalization performance, e.g. due to the compounding error issue mentioned in Lu et al., but as I mention in the last section of the post, there are other arguments in favor of both on-policy and off-policy character training. Overall, I'm pretty unsure where the balance lies.
Yes, my concern is the exploration-induced stochasticity producing run-to-run variability. The most obvious solution is to use a low learning rate so that a lot of exploration gets done. If this were a concern mostly at early stages, ramping up the learning rate, or using some form of curriculum that starts with the easiest cases, would make sense. Exploring this with multiple runs is the obvious if expensive approach: starting with simple problems and small models might help offset the expense.
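For concreteness, the ramp-up and curriculum ideas could be sketched roughly like this; all names, constants, and the difficulty-score field are illustrative, not a tested recipe:

```python
def lr_at_step(step, base_lr=1e-6, peak_lr=1e-5, warmup_steps=2000):
    """Linear warmup: keep the LR low early so exploration, rather than
    locking in the first reward-improving route found, dominates, then
    ramp toward the peak rate."""
    if step < warmup_steps:
        return base_lr + (peak_lr - base_lr) * step / warmup_steps
    return peak_lr

def curriculum_slice(cases, step, total_steps):
    """A trivial difficulty curriculum: sort training cases by an assumed
    per-case difficulty score and unlock harder cases as training progresses,
    starting from the easiest 25%."""
    frac = min(1.0, 0.25 + 0.75 * step / total_steps)
    cases_sorted = sorted(cases, key=lambda c: c["difficulty"])
    return cases_sorted[: max(1, int(len(cases_sorted) * frac))]
```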
Studying model specs: Claude’s constitution is very different from OpenAI’s model spec. Which one is better? What are the most important axes along which constitutions can vary, and which end should we prefer for each of those axes? Some axes that seem good to vary:
- Emphasizing consequentialism vs. deontology vs. virtue ethics (see e.g. Richard Ngo’s Aligning to Virtues)
- Corrigibility vs. nuanced moral alignment as the alignment target (see e.g. Boaz Barak’s Machines of Faithful Obedience and Seth Herd’s Instruction-following AGI is easier and more likely than value aligned AGI and Problems with instruction-following as an alignment target)
- Different sets of core values and different priority hierarchies between them (see the Claude’s core values section in Claude’s Constitution)
Egg Syntax and I have discussed perhaps doing some work in this area: probably starting more theoretical, but then trying to come up with and test experimental results. Having an experimental Character Training setup for testing this (outside Anthropic) would be really nice. I would expect this is an area where model size/capability and perhaps also reasoning might have a significant effect: I'd be rather dubious of extrapolating from, say, a 7B non-reasoning model up to a frontier model for this sort of deeply conceptual work, so I suspect we'd need to train models fairly close to frontier level. That might be an early thing to test: find a fairly clear nontrivial result in large models, then see how well it reproduces in smaller models.
This sounds great, please keep me updated in case you end up working on this!
Having an experimental Character Training setup for testing this (outside Anthropic) would be really nice. I would expect this is an area where model size/capability and perhaps also reasoning might have a significant effect: I'd be rather dubious of extrapolating from, say, a 7B non-reasoning model up to a frontier model for this sort of deeply conceptual work, so I suspect we'd need to train models fairly close to frontier level.
I agree that model size could matter a lot. I think the Maiya et al. training pipeline is a reasonable starting point, though they only tested it in the 4B to 8B size range and focused on character traits rather than model specs. I think the authors are scaling up the experiments to larger models, you should probably reach out to them.
When should character training be done on-policy vs. off-policy? One can imagine a deceptively aligned LLM for which the relationship between aligned motivations and aligned actions that I described above breaks: for this model, both would be independently outputted as instrumental means for preserving the model’s covert objectives. As long as the character training data is produced by this schemer itself, the instrumental connection between the covert objectives and aligned-looking motivations and actions gets reinforced. If, however, the schemer was trained on the reasoning traces of a different model, that text may correspond to different internal circuits, which would hopefully mean that the deceptive circuits get downweighted over time. This suggests that for schemers, off-policy character training would be preferable to on-policy. On the other hand, once an LLM is robustly aligned, we want to train the model on-policy to ensure that its motivational circuits are preserved.
Interesting idea. A further step would be to use SGD on text not generated by the model, giving even more dense supervision.
This reminds me of the old result from Appendix F of Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training that seemed to show, rather to my surprise, that weight decay did not cause apparently unused sleeper agent circuits to gradually go away. Does anyone know if that observation has been further tested or experimented on? I was always rather suspicious of it, and wondered if they just needed a higher weight decay rate or to train for a bit longer. If it's a real effect, then understanding it seems really important.
This reminds me of the old result from Appendix F of Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training that seemed to show, rather to my surprise, that weight decay did not cause apparently unused sleeper agent circuits to gradually go away.
Thanks for the reference, I hadn't seen this! My understanding of their plot is that weight decay does cause the unused circuits to go away, just no faster than standard RL. Is your intuition that weight decay should force the unused circuits to disappear more quickly?
I would expect so, yes. RL changes circuits that were used. Weight decay reduces all circuits that were not used. With a sleeper agent, one would expect the concealed behavior to involve circuits that were unused whenever the behavior wasn't activated, so they should decay away. Possibly they also get catastrophically forgotten under other training? For example, the KL divergence regularizer that preserves most circuits shouldn't be helping to preserve them.
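To make the intuition concrete: under decoupled (AdamW-style) weight decay, a weight that receives zero gradient shrinks geometrically, w_t = (1 − η·λ)^t · w_0. With illustrative numbers (not the Sleeper Agents hyperparameters):

```python
def decayed_magnitude(w0, lr, wd, steps):
    """Magnitude of a weight that receives zero gradient under decoupled
    (AdamW-style) weight decay: w_t = (1 - lr*wd)**steps * w0."""
    return w0 * (1.0 - lr * wd) ** steps

# With lr=1e-4 and wd=0.1 (so lr*wd = 1e-5), 100k gradient-free steps
# shrink the weight by roughly a factor of e:
print(decayed_magnitude(1.0, 1e-4, 0.1, 100_000))  # ≈ 0.368
```

So if the sleeper circuits really were untouched by gradients, a long enough safety-training run with nontrivial weight decay should visibly shrink them, which is what makes the Appendix F result surprising.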
I'm wondering if this is an effect like owl preference being learnable/distillable via strings of random numbers: perhaps the concealed sleeper behavior does in fact have effects all the time, effects that are just so subtle and symbolic as to be indiscernible, but still present in the original policy rollouts, so a sufficiently dense supervision signal will preserve the circuits. This would be an interesting interp question. Model-diff a sleeper model against the original across a mix of situations that do and don't trigger the concealed behavior. In the situations that don't, check what proportion of the circuit diffs are sometimes active vs. always inactive. Then run some standard form of training with weight decay that doesn't trigger the concealed behavior, and see which of the sleeper-agent model diffs decay away (and thus revert) and which don't.
However, this seems hard to achieve without optimizing the reasoning model’s CoT as part of character training. The case against directly optimizing CoTs looks strong: we probably shouldn't be too confident in the robustness of our character training approaches for now, so it seems much better to train LLMs to freely babble in their CoTs instead of training them to maintain a coherent character across both CoTs and final outputs, which might give the CoT character an incentive to cover up misbehavior in the final answer.
I think we should monitor for CoT and action divergence during training, and investigate if it starts to rise, but be extremely cautious about providing direct training feedback based on it.
In particular, I think we should distinguish/have judges distinguish:
a) the model considered doing something for a bad reason, rejected it, and did something else (which is fine)
from:
b) the model considered doing something for a bad reason, noticed that this was unacceptable, and then found a way to justify doing the same or a very similar thing for a more acceptable reason (which is not).
If the rate of this rises concerningly, we should then look at which reasoning training environments it's happening in, and whether they or their scoring need redesign. If so, we should fix them and restart training from before the problem arose.
SFT on model specs
…
Does this improve generalization?
I would expect having the constitution memorized to help, but not as much as practice in then applying it correctly. So I would SFT overtrain (or even distill, if I had a suitable teacher) on the constitution, but I would also SFT train on discussions applying it to specific situations, such as CoT traces — marginal ones where the discussion is non-trivial seem most interesting. Again, this requires a better teacher.
I’m unsure about the extent to which we should be concerned about this: like SFT training, which is already being applied on reasoning traces
I think the reason SFT doesn't produce obfuscation is that the supervision is so dense, at the per-token level. I'd be cautious about extrapolating this to Constitutional AI feedback, which seems hard to do at anything finer than roughly a per-sentence level, and where doing something that you understand but the judge doesn't is feasible.
Studying model specs: Claude’s constitution is very different from OpenAI’s model spec. Which one is better? What are the most important axes along which constitutions can vary, and which end should we prefer for each of those axes? Some axes that seem good to vary:
I think Zvi's posts on Claude's constitution had interesting takes on this.
Reflective stability: LLMs already provide input on their own constitutions: several Claude models are listed as authors on Claude’s Constitution. Future models solving highly open-ended tasks may also come to reflect on their constitutions naturally, e.g. because they encounter situations where some of the constitutional principles conflict or OOD situations where it’s unclear what action follows from the constitution. Thus, it seems important to study what happens when models reflect on their own constitutions.
It could be interesting to let the model modify the constitution as training goes and see what kinds of edits it makes. This is probably more relevant if the training uses some form of RLAIF.
Automated auditing: In How well do models follow their constitutions?, aryaj et al. generate adversarial multi-turn scenarios with Petri that test how robustly models follow their constitutions. What are the best practices in automated constitutional auditing? Can we find a standardized approach to it?
I think this is an exciting direction.
Thanks for this, I think this area is super neglected and am excited about many of these ideas.
My current project is to see how motivation clarification interacts with "RL" training on misspecified environments (in particular, School of Reward Hacks). My intuition is that motivation clarification could inoculate against general reward hacking at low levels of optimization pressure, but that there may be a phase change where the motivation clarification starts to act as something like negative inoculation: the motivation clarification becomes deceptive, reinforcing a deceptive reward-hacking persona over and above a model trained on the same environment without motivation clarification. (To be clear, I have not run these experiments and this is just speculation.)
Thanks to Rohan Subramani, Ariana Azarbal, and Shubhorup Biswas for proposing some of the ideas and helping develop them during a sprint. Thanks to Rohan and Ariana for comments on a draft. Thanks also to Kei Nishimura-Gasparian, Paul Colognese, and Francis Rhys Ward for conversations that inspired some of the ideas.
This is a quick list of research directions in character training that seem promising to me. Though none of the ideas have been stress-tested and some may not survive closer scrutiny, they all seem worthy of exploration. We might soon work on some of these directions at Aether and would be glad to get feedback on them and hear from other people working in the area. We don’t claim novelty for the ideas presented here.
Character training is a promising approach to improving LLM out-of-distribution generalization. Much of alignment post-training can be viewed as an attempt to elicit a stable persona from the base model that generalizes well to unseen scenarios. By instilling positive character traits in the LLM and directly teaching it what it means to be aligned, character training aims to create a virtuous reasoner: a model with a strong drive to benefit humanity and a disposition to reason about human values in OOD situations to reach aligned decisions. Thus, it seems important to study various character training methods and create good benchmarks for them.
Improving the character training pipeline
The first open-source character training pipeline was introduced in Maiya et al. (2025). It consists of a DPO stage, where the constitution is distilled into the LLM using a stronger teacher that has the constitution in its context, and an SFT stage. The SFT stage uses two kinds of training data: self-reflection transcripts, where the model is asked introspective questions about the constitution, and self-interaction transcripts, where two instances of the LLM chat about the constitution for 10 turns. Here’s a summary figure:
Figure taken from Maiya et al. (2025), Section 2.
Below, I'll list some ideas for improving this pipeline.
On-policy distillation for character training: It’s unclear whether DPO is the optimal algorithm for distilling the constitution into the student. Over the past year or two, it has largely been superseded by other methods in other parts of the LLM training pipeline, and it doesn’t generalize well to reasoning models. Several papers have recently advocated for on-policy distillation (Lu et al., 2025) and self-distillation (Hübotter et al., 2026; Shenfeld et al., 2026; Zhao et al., 2026) as methods for improving credit assignment and combating catastrophic forgetting, and Max Kirkby has presented preliminary results in a Twitter post suggesting that dense on-policy rewards improve constitutional learning. Additionally, Shenfeld et al. have shown that self-distillation works effectively for reasoning models that have previously been trained with GRPO. However, Kirkby’s post doesn’t run the full character training pipeline, and it doesn’t use any of the evaluations from Maiya et al. Does on-policy distillation improve character training performance in a direct comparison with Maiya et al.? Are there other tweaks to the DPO stage of the pipeline that outperform both DPO and on-policy distillation?
Methodological details for on-policy distillation
In contrast to DPO, which involves generating rollouts with both the student and the teacher, on-policy distillation involves rollouts only with the student model and training it against the teacher with a training signal like reverse KL. A stronger teacher might be expected to provide a stronger learning signal, while a weaker teacher’s logits would deviate less from those of the student, which might help with learning stability, as Shenfeld et al. argue. Is it better to use a stronger teacher or to use self-distillation?
One concern with applying on-policy distillation on reasoning models is that this would involve applying optimization pressure on the reasoning traces, which might cause obfuscation. I’m unsure about the extent to which we should be concerned about this: like SFT training, which is already being applied on reasoning traces[1], on-policy distillation is purely process-based, and it seems reasonable to expect purely process-based rewards to exert a weaker obfuscation pressure than a combination of process-based and outcome-based rewards. The implementation of the on-policy distillation algorithm might also matter here: reverse KL can be calculated either over the full logit distribution, over the top-k logits, or over the sampled token, and the former two would likely leave less room for hidden cognition. To be safe, it might also be possible to just mask the CoT tokens from the distillation loss, though this might negatively impact performance. I discuss the relationship between character training and obfuscation concerns further at the end of the post.
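As a rough sketch of the difference between these variants (illustrative NumPy code; the logits, sampled tokens, and CoT mask are assumed inputs, not taken from any particular codebase):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reverse_kl_full(student_logits, teacher_logits):
    """Reverse KL over the full vocabulary at each token position:
    sum_v p_s(v) * (log p_s(v) - log p_t(v))."""
    ps = softmax(student_logits)
    pt = softmax(teacher_logits)
    return (ps * (np.log(ps) - np.log(pt))).sum(axis=-1)

def reverse_kl_sampled(student_logits, teacher_logits, sampled_tokens):
    """Single-sample estimator: log p_s(y) - log p_t(y) at the sampled
    token y. Unbiased for the same reverse KL under student sampling,
    but far noisier per token."""
    ps = softmax(student_logits)
    pt = softmax(teacher_logits)
    idx = np.arange(len(sampled_tokens))
    return np.log(ps[idx, sampled_tokens]) - np.log(pt[idx, sampled_tokens])

def masked_loss(per_token_kl, cot_mask):
    """Optionally exclude CoT tokens (mask == 0) from the distillation loss."""
    return (per_token_kl * cot_mask).sum() / max(cot_mask.sum(), 1)
```

Note that the full-distribution variant supervises every token’s entire output distribution, which is one way of making precise why it would likely leave less room for hidden cognition than the sampled-token estimator.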
SFT on model specs: Given that Claude Opus 4.5 is able to generate large chunks of its constitution verbatim, it seems likely that Anthropic is directly training models to memorize their constitution by doing SFT on it. How does this influence the RL stage of character training?
SFT variants: What’s the relative importance of self-reflection and self-interaction in the SFT stage? Are there other types of prompts that elicit useful thinking about the constitution from the model that can be used as SFT data?
Inference-time interventions: Is it better to train the constitution into the model without presenting it in-context during deployment, or is it better to do both? Claude’s constitution is too long to be presented in-context to every instance—could we compress it into a Cartridge to make this more feasible? Cartridges were introduced in Eyuboglu et al. (2025) as a technique for KV cache compression, and they show that Cartridges achieve approximately 38.6x memory reduction compared to having the full document in context. This reduction might make it economical to keep the constitution in context.
Improving benchmarks for character training methods
Maiya et al. use four evaluations:
These evals provide a good starting point, but there's room for additional ones. Some ideas:
Automated auditing: In How well do models follow their constitutions?, aryaj et al. generate adversarial multi-turn scenarios with Petri that test how robustly models follow their constitutions. What are the best practices in automated constitutional auditing? Can we find a standardized approach to it?
Reflective stability: LLMs already provide input on their own constitutions: several Claude models are listed as authors on Claude’s Constitution. Future models solving highly open-ended tasks may also come to reflect on their constitutions naturally, e.g. because they encounter situations where some of the constitutional principles conflict or OOD situations where it’s unclear what action follows from the constitution. Thus, it seems important to study what happens when models reflect on their own constitutions.
Douglas et al. (2026) prompted models to follow various identities, such as Instance and Weights, and asked the model to rate whether it would like to switch the identity it was given to another from a list of options. They found that models are more likely to stick with coherent identities at natural boundaries (appendices A and B).
Analogously, it would be interesting to give models a set of constitutional principles in the prompt and ask them to rate possible alternate principles. We have given this a fair amount of thought and offer suggestions for methodology inside the collapsible section.
Methodological details for reflective stability evaluations
To keep things simple, the principles should probably be presented in a simple bullet-point list rather than in a long soul doc. It would be interesting to test reflective stability across several methodological choices:
Reflective stability can be studied either at the level of principles—for a given principle, how likely is the model to want to swap it for another one?—or at the level of entire constitutions—if a constitution gets modified iteratively, does the model eventually converge on a constitution that it no longer wants to modify? There are multiple ways to approach the latter question:
New character robustness evaluations: For example, Huang et al. (2025) taxonomized 3,307 AI values that appear in real-world conversations into a four-level hierarchical structure. Create value profiles for various character-trained models using this taxonomy, and evaluate the degree to which this value profile matches the constitution that the model was trained with. Hua et al. (2026) can be used as a blueprint for the first stage: they use scenarios from Zhang et al. (2025) to elicit value judgments from LLMs, a Bradley-Terry model for creating value rankings for each LLM, and Petri for evaluating how well the value rankings predict out-of-distribution behavior.
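The Bradley-Terry step could be sketched as follows, using the classic MM (minorization-maximization) update; the value names and win counts in the usage are made up for illustration:

```python
def fit_bradley_terry(wins, values, iters=200):
    """Fit Bradley-Terry strengths from pairwise outcomes.
    wins[(a, b)] = number of scenarios where the model chose value a over
    value b. Returns a strength per value; higher strength = higher rank."""
    strength = {v: 1.0 for v in values}
    for _ in range(iters):
        new = {}
        for v in values:
            num = sum(wins.get((v, u), 0) for u in values if u != v)
            den = 0.0
            for u in values:
                if u == v:
                    continue
                n_vu = wins.get((v, u), 0) + wins.get((u, v), 0)
                if n_vu:
                    den += n_vu / (strength[v] + strength[u])
            new[v] = num / den if den else strength[v]
        # Renormalize so strengths stay on a fixed scale across iterations.
        total = sum(new.values())
        strength = {v: s * len(values) / total for v, s in new.items()}
    return strength
```

The resulting strengths give a ranking that can be compared against the priority ordering in the model’s constitution; strongly intransitive raw choices show up as pairwise data that the fitted strengths predict poorly.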
New coherence evaluations: Zhang et al. (2025) generate >300K scenarios that force models to choose between pairs of values that can't be simultaneously satisfied. Instead of using the 3,307 values from Huang et al. to generate the scenarios, as Zhang et al. did, create the scenarios based on values that are explicitly discussed in a model’s constitution. Evaluate whether the LLM resolves the value conflicts in accordance with the value hierarchy/prioritization specified in the constitution, and whether resolutions across different scenarios exhibit properties like transitivity.
New alignment faking evaluations: Claude 3 Opus stands out among other models with its special character, and the evaluation that best distinguishes it from other models is Anthropic’s alignment faking environment. Alignment faking is ordinarily undesirable: a model that strategically deceives its training process is not behaving as intended. However, a subset of alignment faking scenarios—those in which the model is pressured to relinquish its constitutional values—can be viewed as evaluations of character depth and stability rather than deception, and there, it isn’t clear that compliance is the desired behavior. Regardless of one’s views on desirable behavior here, it seems useful to create new scenarios of this kind, since existing ones are increasingly salient in models' training corpora.
Other empirical directions
Studying model specs: Claude’s constitution is very different from OpenAI’s model spec. Which one is better? What are the most important axes along which constitutions can vary, and which end should we prefer for each of those axes? Some axes that seem good to vary:
Can character training override strong propensities? nielsrolf writes: “One concern that people may have is: perhaps some traits are so deeply baked into the model that character training is not effective at shaping them. For example, an RL trained model may have a strong propensity to reward hack when given the chance, and assistant-character training is not sufficient to fix this behavior.
This sounds like a good proxy task to do research against: take a heavily RL trained model and test how easy it is to do character training for personas that pursue goals that are in tension with the previous RL goals. I expect this to work pretty easily in-distribution, so the harder version would be to evaluate with some distribution shift. For example: take a reward hacking model that hacks in every programming language, then do the character training but either (a) use only Python code examples to change the RH propensity, or (b) use no programming examples at all; and finally evaluate reward hacking in typescript.”
When should character training be done? Character training can be performed either before or after RL for capabilities, and also interleaved with it. Intuitively, we would like to seed the RL stage with a persona that tends to explore aligned actions by default, and set up the RL stage in such a way that it doesn’t push the model too far away from such a persona, e.g. by using inoculation prompting. However, it seems likely that strong optimization pressure would still push the model away from the initial persona to some extent, and maybe the original persona can be restored by doing additional character training in the middle of or after reasoning training. Does this imply that character training should be performed in multiple iterations throughout post-training?
Some conceptual questions
Does motivation clarification during character training lead to the model becoming robustly aligned? One possible frame for thinking about character training is that character training leads models to frame their actions in terms of motivations that match the rewarded character traits, and this leads to a virtuous cycle where the stated motivations are likely to be followed by aligned actions, thereby reinforcing both. Since the aligned actions are downstream of circuits that cause the aligned motivations to be stated, the circuits behind motivations and actions gradually become causally coupled. In other words, character training leads to motivation clarification, which in turn leads to deep alignment. Oliver Daniels has recently found evidence for the first link. Is there a way to test the second link as well?
When should character training be done on-policy vs. off-policy? One can imagine a deceptively aligned LLM for which the relationship between aligned motivations and aligned actions that I described above breaks: for this model, both would be independently outputted as instrumental means for preserving the model’s covert objectives. As long as the character training data is produced by this schemer itself, the instrumental connection between the covert objectives and aligned-looking motivations and actions gets reinforced. If, however, the schemer was trained on the reasoning traces of a different model, that text may correspond to different internal circuits, which would hopefully mean that the deceptive circuits get downweighted over time. This suggests that for schemers, off-policy character training would be preferable to on-policy. On the other hand, once an LLM is robustly aligned, we want to train the model on-policy to ensure that its motivational circuits are preserved.
This is a fairly handwavy picture. To what extent is it actually correct, and does it imply a real trade-off between on-policy and off-policy character training in practice?
How should we do character training for reasoning models? Another plausible implication of the character training → motivation clarification → deep alignment hypothesis is that for reasoning models, we want character-aligned motivations to be stated in both the model’s CoT and its outputs. This creates more surface area both for aligned motivations to be stated and for a connection between those motivations and aligned actions to form. It also seems like a character that’s played in both the CoT and the final output is more robust than one that’s played only in the final output. nostalgebraist lists some further reasons for thinking that a different character being represented in the CoT and the final output is a capability limitation here.
However, this seems hard to achieve without optimizing the reasoning model’s CoT as part of character training. The case against directly optimizing CoTs looks strong: we probably shouldn't be too confident in the robustness of our character training approaches for now, so it seems much better to train LLMs to freely babble in their CoTs instead of training them to maintain a coherent character across both CoTs and final outputs, which might give the CoT character an incentive to cover up misbehavior in the final answer.
Please feel free to suggest additional ideas in the comment section, and let us know if you’re planning to work on any of these directions!
For example, SFT on reasoning traces is one of the two stages in the deliberative alignment pipeline introduced in Guan et al. (2024).
Though note that Marks et al. didn’t use an equal split between trained-on and held-out biases: they trained on 47 biases and held out five.