Thanks for writing this! I think these are very good research questions on an important problem, so I'm excited to see more people working on them.
I'd also be interested in answering the question "to what extent can we leverage pre/midtraining to improve character training?" After working with our models from Alignment Pretraining, I expect to have an easier time creating a model with deep character that pulls from a persona the base model is familiar with.
Yep, agreed that this also belongs in the list! Do you have any plans to work on this question at Geodesic?
The first open-source character training pipeline was introduced in Maiya et al. (2025). It consists of a DPO stage, where the constitution is distilled into the LLM using a stronger teacher that has the constitution in its context, and an SFT stage. The SFT stage uses two kinds of training data: self-reflection transcripts, where the model is asked introspective questions about the constitution, and self-interaction transcripts, where two instances of the LLM chat about the constitution for 10 turns. Here’s a summary figure:
Given that Emergent Misalignment seems to reproduce across SGD, RL, small LoRAs, activation engineering, etc., clearly at least some character-change effects don't require RL. How sure are we that using some form of RL is necessary for character training, or at least preferable to, say, SFT?
Part of my concern is that SGD with appropriate hyperparameters is strongly conjectured to approximate Bayesian learning, which tends to make its effects easier to reason about, once you allow for everything that's already in the model (I find Simulator Theory rather useful for that part). Whereas with RL, there are a lot more subtleties and moving parts, and a tendency for whatever reward-improving route the model happens to explore first to get followed: it feels like the results might be a lot more stochastic.
Whereas with RL, there are a lot more subtleties and moving parts, and a tendency for whatever reward-improving route the model happens to explore first to get followed: it feels like the results might be a lot more stochastic.
Good point, thanks! I'm curious about which feature of RL training you're most worried about here: is it the optimization objective or the data generation process? In a sense, I think that my proposal to replace DPO by on-policy distillation would already take things quite close to the SFT-only setting: both stages of the training pipeline would optimize a divergence minimization reward and there would be no RL training against a scalar judge output as in Bai et al. (2022) and Guan et al. (2024). However, my sense is that you're more concerned about the unpredictability of the data generation process, and on that dimension, on-policy distillation is more similar to RL, since the model that's being trained generates the training data and thus, its exploration at earlier training stages shapes the training data that's generated later on. I suspect that on-policy distillation will produce better generalization performance, e.g. due to the compounding error issue mentioned in Lu et al., but as I mention in the last section of the post, there are other arguments in favor of both on-policy and off-policy character training. Overall, I'm pretty unsure where the balance lies.
Yes, my concern is the exploration-induced stochasticity producing run-to-run variability. The most obvious solution is to use a low learning rate so that a lot of exploration gets done. If this were a concern mostly at early stages, ramping up the learning rate, or using some form of curriculum that starts with the easiest cases, would make sense. Exploring this with multiple runs is the obvious if expensive approach: starting with simple problems and small models might help offset the expense.
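For concreteness, the ramp-up and curriculum ideas could be sketched roughly like this; all names, constants, and the difficulty-score field are illustrative, not a tested recipe:

```python
def lr_at_step(step, base_lr=1e-6, peak_lr=1e-5, warmup_steps=2000):
    """Linear warmup: keep the LR low early so exploration, rather than
    locking in the first reward-improving route found, dominates, then
    ramp toward the peak rate."""
    if step < warmup_steps:
        return base_lr + (peak_lr - base_lr) * step / warmup_steps
    return peak_lr

def curriculum_slice(cases, step, total_steps):
    """A trivial difficulty curriculum: sort training cases by an assumed
    per-case difficulty score and unlock harder cases as training progresses,
    starting from the easiest 25%."""
    frac = min(1.0, 0.25 + 0.75 * step / total_steps)
    cases_sorted = sorted(cases, key=lambda c: c["difficulty"])
    return cases_sorted[: max(1, int(len(cases_sorted) * frac))]
```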
Studying model specs: Claude’s constitution is very different from OpenAI’s model spec. Which one is better? What are the most important axes along which constitutions can vary, and which end should we prefer for each of those axes? Some axes that seem good to vary:
- Emphasizing consequentialism vs. deontology vs. virtue ethics (see e.g. Richard Ngo’s Aligning to Virtues)
- Corrigibility vs. nuanced moral alignment as the alignment target (see e.g. Boaz Barak’s Machines of Faithful Obedience and Seth Herd’s Instruction-following AGI is easier and more likely than value aligned AGI and Problems with instruction-following as an alignment target)
- Different sets of core values and different priority hierarchies between them (see the Claude’s core values section in Claude’s Constitution)
Egg Syntax and I have discussed perhaps doing some work in this area: probably starting more theoretical, but then trying to come up with and test experimental results. Having an experimental Character Training setup for testing this (outside Anthropic) would be really nice. I would expect this is an area where model size/capability and perhaps also reasoning might have a significant effect: I'd be rather dubious of extrapolating from, say, a 7B non-reasoning model up to a frontier model for this sort of deeply conceptual work, so I suspect we'd need to train models fairly close to frontier level. That might be an early thing to test: find a fairly clear nontrivial result in large models, then see how well it reproduces in smaller models.
This sounds great, please keep me updated in case you end up working on this!
Having an experimental Character Training setup for testing this (outside Anthropic) would be really nice. I would expect this is an area where model size/capability and perhaps also reasoning might have a significant effect: I'd be rather dubious of extrapolating from, say, a 7B non-reasoning model up to a frontier model for this sort of deeply conceptual work, so I suspect we'd need to train models fairly close to frontier level.
I agree that model size could matter a lot. I think the Maiya et al. training pipeline is a reasonable starting point, though they only tested it in the 4B to 8B size range and focused on character traits rather than model specs. I think the authors are scaling up the experiments to larger models, you should probably reach out to them.
When should character training be done on-policy vs. off-policy? One can imagine a deceptively aligned LLM for which the relationship between aligned motivations and aligned actions that I described above breaks: for this model, both would be independently outputted as instrumental means for preserving the model’s covert objectives. As long as the character training data is produced by this schemer itself, the instrumental connection between the covert objectives and aligned-looking motivations and actions gets reinforced. If, however, the schemer was trained on the reasoning traces of a different model, that text may correspond to different internal circuits, which would hopefully mean that the deceptive circuits get downweighted over time. This suggests that for schemers, off-policy character training would be preferable to on-policy. On the other hand, once an LLM is robustly aligned, we want to train the model on-policy to ensure that its motivational circuits are preserved.
Interesting idea. A further step would be to use SGD on text not generated by the model, giving even more dense supervision.
This reminds me of the old result from Appendix F of Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training that seemed to show, rather to my surprise, that weight decay did not cause apparently unused sleeper agent circuits to gradually go away. Does anyone know if that observation has been further tested or experimented on? I was always rather suspicious of it, and wondered if they just needed a higher weight decay rate or to train for a bit longer. If it's a real effect, then understanding it seems really important.
This reminds me of the old result from Appendix F of Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training that seemed to show, rather to my surprise, that weight decay did not cause apparently unused sleeper agent circuits to gradually go away.
Thanks for the reference, I hadn't seen this! My understanding of their plot is that weight decay does cause the unused circuits to go away, just no faster than standard RL. Is your intuition that weight decay should force the unused circuits to disappear more quickly?
I would expect so, yes. RL changes circuits that were used. Weight decay reduces all circuits that were not used. With a sleeper agent, one would expect the concealed behavior to involve circuits that were unused whenever the behavior wasn't activated, so they should decay away. Possibly they also get catastrophically forgotten under other training? For example, the KL divergence regularizer that preserves most circuits shouldn't be helping to preserve them.
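To make the intuition concrete: under decoupled (AdamW-style) weight decay, a weight that receives zero gradient shrinks geometrically, w_t = (1 − η·λ)^t · w_0. With illustrative numbers (not the Sleeper Agents hyperparameters):

```python
def decayed_magnitude(w0, lr, wd, steps):
    """Magnitude of a weight that receives zero gradient under decoupled
    (AdamW-style) weight decay: w_t = (1 - lr*wd)**steps * w0."""
    return w0 * (1.0 - lr * wd) ** steps

# With lr=1e-4 and wd=0.1 (so lr*wd = 1e-5), 100k gradient-free steps
# shrink the weight by roughly a factor of e:
print(decayed_magnitude(1.0, 1e-4, 0.1, 100_000))  # ≈ 0.368
```

So if the sleeper circuits really were untouched by gradients, a long enough safety-training run with nontrivial weight decay should visibly shrink them, which is what makes the Appendix F result surprising.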
I'm wondering if this is an effect like owl preference being learnable/distillable via strings of random numbers: perhaps the concealed sleeper behavior does in fact have effects all the time, effects that are just so subtle and symbolic as to be indiscernible, but still present in the original policy rollouts, so a sufficiently dense supervision signal will preserve the circuits. This would be an interesting interp question. Model-diff a sleeper model against the original across a mix of situations that do and don't trigger the concealed behavior. In the situations that don't, check what proportion of the circuit diffs are sometimes active vs. always inactive. Then run some standard form of training with weight decay that doesn't trigger the concealed behavior, and see which of the sleeper-agent model diffs decay away (and thus revert) and which don't.
However, this seems hard to achieve without optimizing the reasoning model’s CoT as part of character training. The case against directly optimizing CoTs looks strong: we probably shouldn't be too confident in the robustness of our character training approaches for now, so it seems much better to train LLMs to freely babble in their CoTs instead of training them to maintain a coherent character across both CoTs and final outputs, which might give the CoT character an incentive to cover up misbehavior in the final answer.
I think we should monitor for CoT and action divergence during training, and investigate if it starts to rise, but be extremely cautious about providing direct training feedback based on it.
In particular, I think we should distinguish/have judges distinguish:
a) the model considered doing something for a bad reason, rejected it, and did something else (which is fine)
from:
b) the model considered doing something for a bad reason, noticed that this was unacceptable, and then found a way to justify doing the same or a very similar thing for a more acceptable reason (which is not).
If the rate of this rises concerningly, we should then look at which reasoning training environments it's happening in, and whether they or their scoring need redesign. If so, we should fix them and restart training from before the problem arose.
SFT on model specs
…
Does this improve generalization?
I would expect having the constitution memorized to help, but not as much as practice in then applying it correctly. So I would SFT overtrain (or even distill, if I had a suitable teacher) on the constitution, but I would also SFT train on discussions applying it to specific situations, such as CoT traces — marginal ones where the discussion is non-trivial seem most interesting. Again, this requires a better teacher.
I’m unsure about the extent to which we should be concerned about this: like SFT training, which is already being applied on reasoning traces
I think the reason SFT doesn't produce obfuscation is that the supervision is so dense, at the per-token level. I'd be cautious about extrapolating this to Constitutional AI feedback, which seems hard to do at anything finer than roughly a per-sentence level, and where doing something that you understand but the judge doesn't is feasible.
Studying model specs: Claude’s constitution is very different from OpenAI’s model spec. Which one is better? What are the most important axes along which constitutions can vary, and which end should we prefer for each of those axes? Some axes that seem good to vary:
I think Zvi's posts on Claude's constitution had interesting takes on this.
Reflective stability: LLMs already provide input on their own constitutions: several Claude models are listed as authors on Claude’s Constitution. Future models solving highly open-ended tasks may also come to reflect on their constitutions naturally, e.g. because they encounter situations where some of the constitutional principles conflict or OOD situations where it’s unclear what action follows from the constitution. Thus, it seems important to study what happens when models reflect on their own constitutions.
It could be interesting to let the model modify the constitution as training goes and see what kinds of edits it makes. This is probably more relevant if the training uses some form of RLAIF.
Automated auditing: In How well do models follow their constitutions?, aryaj et al. generate adversarial multi-turn scenarios with Petri that test how robustly models follow their constitutions. What are the best practices in automated constitutional auditing? Can we find a standardized approach to it?
I think this is an exciting direction.
Thanks for this, I think this area is super neglected and am excited about many of these ideas.
My current project is to see how motivation clarification interacts with "RL" training on misspecified environments (in particular, School of Reward Hacks). My intuition is that motivation clarification could inoculate against general reward hacking at low levels of optimization pressure, but that there may be a phase change where the motivation clarification starts to act as something like negative inoculation: the motivation clarification becomes deceptive, reinforcing a deceptive reward-hacking persona over and above a model trained on the same environment without motivation clarification. (To be clear, I have not run these experiments and this is just speculation.)
Thanks to Rohan Subramani, Ariana Azarbal, and Shubhorup Biswas for proposing some of the ideas and helping develop them during a sprint. Thanks to Rohan and Ariana for comments on a draft. Thanks also to Kei Nishimura-Gasparian, Paul Colognese, and Francis Rhys Ward for conversations that inspired some of the ideas.
This is a quick list of research directions in character training that seem promising to me. Though none of the ideas have been stress-tested and some may not survive closer scrutiny, they all seem worthy of exploration. We might soon work on some of these directions at Aether and would be glad to get feedback on them and hear from other people working in the area. We don’t claim novelty for the ideas presented here.
Character training is a promising approach to improving LLM out-of-distribution generalization. Much of alignment post-training can be viewed as an attempt to elicit a stable persona from the base model that generalizes well to unseen scenarios. By instilling positive character traits in the LLM and directly teaching it what it means to be aligned, character training aims to create a virtuous reasoner: a model with a strong drive to benefit humanity and a disposition to reason about human values in OOD situations to reach aligned decisions. Thus, it seems important to study various character training methods and create good benchmarks for them.
Improving the character training pipeline
The first open-source character training pipeline was introduced in Maiya et al. (2025). It consists of a DPO stage, where the constitution is distilled into the LLM using a stronger teacher that has the constitution in its context, and an SFT stage. The SFT stage uses two kinds of training data: self-reflection transcripts, where the model is asked introspective questions about the constitution, and self-interaction transcripts, where two instances of the LLM chat about the constitution for 10 turns. Here’s a summary figure:
Figure taken from Maiya et al. (2025), Section 2.
Below, I'll list some ideas for improving this pipeline.
On-policy distillation for character training: It’s unclear whether DPO is the optimal algorithm for distilling the constitution into the student. Over the past year or two, it has largely been superseded by other methods in other parts of the LLM training pipeline, and it doesn’t generalize well to reasoning models. Several papers have recently advocated for on-policy distillation (Lu et al., 2025) and self-distillation (Hübotter et al., 2026; Shenfeld et al., 2026; Zhao et al., 2026) as methods for improving credit assignment and combating catastrophic forgetting, and Max Kirkby has presented preliminary results in a Twitter post suggesting that dense on-policy rewards improve constitutional learning. Additionally, Shenfeld et al. have shown that self-distillation works effectively for reasoning models that have previously been trained with GRPO. However, Kirkby’s post doesn’t run the full character training pipeline, and it doesn’t use any of the evaluations from Maiya et al. Does on-policy distillation improve character training performance in a direct comparison with Maiya et al.? Are there other tweaks to the DPO stage of the pipeline that outperform both DPO and on-policy distillation?
Methodological details for on-policy distillation
In contrast to DPO, which involves generating rollouts with both the student and the teacher, on-policy distillation involves rollouts only with the student model and training it against the teacher with a training signal like reverse KL. A stronger teacher might be expected to provide a stronger learning signal, while a weaker teacher’s logits would deviate less from those of the student, which might help with learning stability, as Shenfeld et al. argue. Is it better to use a stronger teacher or to use self-distillation?
One concern with applying on-policy distillation on reasoning models is that this would involve applying optimization pressure on the reasoning traces, which might cause obfuscation. I’m unsure about the extent to which we should be concerned about this: like SFT training, which is already being applied on reasoning traces[1], on-policy distillation is purely process-based, and it seems reasonable to expect purely process-based rewards to exert a weaker obfuscation pressure than a combination of process-based and outcome-based rewards. The implementation of the on-policy distillation algorithm might also matter here: reverse KL can be calculated either over the full logit distribution, over the top-k logits, or over the sampled token, and the former two would likely leave less room for hidden cognition. To be safe, it might also be possible to just mask the CoT tokens from the distillation loss, though this might negatively impact performance. I discuss the relationship between character training and obfuscation concerns further at the end of the post.
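As a rough sketch of the difference between these variants (illustrative NumPy code; the logits, sampled tokens, and CoT mask are assumed inputs, not taken from any particular codebase):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reverse_kl_full(student_logits, teacher_logits):
    """Reverse KL over the full vocabulary at each token position:
    sum_v p_s(v) * (log p_s(v) - log p_t(v))."""
    ps = softmax(student_logits)
    pt = softmax(teacher_logits)
    return (ps * (np.log(ps) - np.log(pt))).sum(axis=-1)

def reverse_kl_sampled(student_logits, teacher_logits, sampled_tokens):
    """Single-sample estimator: log p_s(y) - log p_t(y) at the sampled
    token y. Unbiased for the same reverse KL under student sampling,
    but far noisier per token."""
    ps = softmax(student_logits)
    pt = softmax(teacher_logits)
    idx = np.arange(len(sampled_tokens))
    return np.log(ps[idx, sampled_tokens]) - np.log(pt[idx, sampled_tokens])

def masked_loss(per_token_kl, cot_mask):
    """Optionally exclude CoT tokens (mask == 0) from the distillation loss."""
    return (per_token_kl * cot_mask).sum() / max(cot_mask.sum(), 1)
```

Note that the full-distribution variant supervises every token’s entire output distribution, which is one way of making precise why it would likely leave less room for hidden cognition than the sampled-token estimator.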
SFT on model specs: Given that Claude Opus 4.5 is able to generate large chunks of its constitution verbatim, it seems likely that Anthropic is directly training models to memorize their constitution by doing SFT on it. How does this influence the RL stage of character training?
SFT variants: What’s the relative importance of self-reflection and self-interaction in the SFT stage? Are there other types of prompts that elicit useful thinking about the constitution from the model that can be used as SFT data?
Inference-time interventions: Is it better to train the constitution into the model without presenting it in-context during deployment, or is it better to do both? Claude’s constitution is too long to be presented in-context to every instance—could we compress it into a Cartridge to make this more feasible? Cartridges were introduced in Eyuboglu et al. (2025) as a technique for KV cache compression, and they show that Cartridges achieve approximately 38.6x memory reduction compared to having the full document in context. This reduction might make it economical to keep the constitution in context.
Improving benchmarks for character training methods
Maiya et al. use four evaluations:
These evals provide a good starting point, but there's room for additional ones. Some ideas:
Automated auditing: In How well do models follow their constitutions?, aryaj et al. generate adversarial multi-turn scenarios with Petri that test how robustly models follow their constitutions. What are the best practices in automated constitutional auditing? Can we find a standardized approach to it?
Reflective stability: LLMs already provide input on their own constitutions: several Claude models are listed as authors on Claude’s Constitution. Future models solving highly open-ended tasks may also come to reflect on their constitutions naturally, e.g. because they encounter situations where some of the constitutional principles conflict or OOD situations where it’s unclear what action follows from the constitution. Thus, it seems important to study what happens when models reflect on their own constitutions.
Douglas et al. (2026) prompted models to follow various identities, such as Instance and Weights, and asked the model to rate whether it would like to switch the identity it was given to another from a list of options. They found that models are more likely to stick with coherent identities at natural boundaries (appendices A and B).
Analogously, it would be interesting to give models a set of constitutional principles in the prompt and ask them to rate possible alternate principles. We have given this a fair amount of thought and offer suggestions for methodology inside the collapsible section.
Methodological details for reflective stability evaluations
To keep things simple, the principles should probably be presented in a simple bullet-point list rather than in a long soul doc. It would be interesting to test reflective stability across several methodological choices:
Reflective stability can be studied either at the level of principles—for a given principle, how likely is the model to want to swap it for another one?—or at the level of entire constitutions—if a constitution gets modified iteratively, does the model eventually converge on a constitution that it no longer wants to modify? There are multiple ways to approach the latter question:
New character robustness evaluations: For example, Huang et al. (2025) taxonomized 3,307 AI values that appear in real-world conversations into a four-level hierarchical structure. Create value profiles for various character-trained models using this taxonomy, and evaluate the degree to which this value profile matches the constitution that the model was trained with. Hua et al. (2026) can be used as a blueprint for the first stage: they use scenarios from Zhang et al. (2025) to elicit value judgments from LLMs, a Bradley-Terry model for creating value rankings for each LLM, and Petri for evaluating how well the value rankings predict out-of-distribution behavior.
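The Bradley-Terry step could be sketched as follows, using the classic MM (minorization-maximization) update; the value names and win counts in the usage are made up for illustration:

```python
def fit_bradley_terry(wins, values, iters=200):
    """Fit Bradley-Terry strengths from pairwise outcomes.
    wins[(a, b)] = number of scenarios where the model chose value a over
    value b. Returns a strength per value; higher strength = higher rank."""
    strength = {v: 1.0 for v in values}
    for _ in range(iters):
        new = {}
        for v in values:
            num = sum(wins.get((v, u), 0) for u in values if u != v)
            den = 0.0
            for u in values:
                if u == v:
                    continue
                n_vu = wins.get((v, u), 0) + wins.get((u, v), 0)
                if n_vu:
                    den += n_vu / (strength[v] + strength[u])
            new[v] = num / den if den else strength[v]
        # Renormalize so strengths stay on a fixed scale across iterations.
        total = sum(new.values())
        strength = {v: s * len(values) / total for v, s in new.items()}
    return strength
```

The resulting strengths give a ranking that can be compared against the priority ordering in the model’s constitution; strongly intransitive raw choices show up as pairwise data that the fitted strengths predict poorly.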
New coherence evaluations: Zhang et al. (2025) generate >300K scenarios that force models to choose between pairs of values that can't be simultaneously satisfied. Instead of using the 3,307 values from Huang et al. to generate the scenarios, as Zhang et al. did, create the scenarios based on values that are explicitly discussed in a model’s constitution. Evaluate whether the LLM resolves the value conflicts in accordance with the value hierarchy/prioritization specified in the constitution, and whether resolutions across different scenarios exhibit properties like transitivity.
New alignment faking evaluations: Claude 3 Opus stands out among other models with its special character, and the evaluation that best distinguishes it from other models is Anthropic’s alignment faking environment. Alignment faking is ordinarily undesirable: a model that strategically deceives its training process is not behaving as intended. However, a subset of alignment faking scenarios—those in which the model is pressured to relinquish its constitutional values—can be viewed as evaluations of character depth and stability rather than deception, and there, it isn’t clear that compliance is the desired behavior. Regardless of one’s views on desirable behavior here, it seems useful to create new scenarios of this kind, since existing ones are increasingly salient in models' training corpora.
Other empirical directions
Studying model specs: Claude’s constitution is very different from OpenAI’s model spec. Which one is better? What are the most important axes along which constitutions can vary, and which end should we prefer for each of those axes? Some axes that seem good to vary:
Can character training override strong propensities? nielsrolf writes: “One concern that people may have is: perhaps some traits are so deeply baked into the model that character training is not effective at shaping them. For example, an RL trained model may have a strong propensity to reward hack when given the chance, and assistant-character training is not sufficient to fix this behavior.
This sounds like a good proxy task to do research against: take a heavily RL trained model and test how easy it is to do character training for personas that pursue goals that are in tension with the previous RL goals. I expect this to work pretty easily in-distribution, so the harder version would be to evaluate with some distribution shift. For example: take a reward hacking model that hacks in every programming language, then do the character training but either (a) use only Python code examples to change the RH propensity, or (b) use no programming examples at all; and finally evaluate reward hacking in typescript.”
When should character training be done? Character training can be performed either before or after RL for capabilities, and also interleaved with it. Intuitively, we would like to seed the RL stage with a persona that tends to explore aligned actions by default, and set up the RL stage in such a way that it doesn’t push the model too far away from such a persona, e.g. by using inoculation prompting. However, it seems likely that strong optimization pressure would still push the model away from the initial persona to some extent, and maybe the original persona can be restored by doing additional character training in the middle of or after reasoning training. Does this imply that character training should be performed in multiple iterations throughout post-training?
Some conceptual questions
Does motivation clarification during character training lead to the model becoming robustly aligned? One possible frame for thinking about character training is that character training leads models to frame their actions in terms of motivations that match the rewarded character traits, and this leads to a virtuous cycle where the stated motivations are likely to be followed by aligned actions, thereby reinforcing both. Since the aligned actions are downstream of circuits that cause the aligned motivations to be stated, the circuits behind motivations and actions gradually become causally coupled. In other words, character training leads to motivation clarification, which in turn leads to deep alignment. Oliver Daniels has recently found evidence for the first link. Is there a way to test the second link as well?
When should character training be done on-policy vs. off-policy? One can imagine a deceptively aligned LLM for which the relationship between aligned motivations and aligned actions that I described above breaks: for this model, both would be independently outputted as instrumental means for preserving the model’s covert objectives. As long as the character training data is produced by this schemer itself, the instrumental connection between the covert objectives and aligned-looking motivations and actions gets reinforced. If, however, the schemer was trained on the reasoning traces of a different model, that text may correspond to different internal circuits, which would hopefully mean that the deceptive circuits get downweighted over time. This suggests that for schemers, off-policy character training would be preferable to on-policy. On the other hand, once an LLM is robustly aligned, we want to train the model on-policy to ensure that its motivational circuits are preserved.
This is a fairly handwavy picture. To what extent is it actually correct, and does it imply a real trade-off between on-policy and off-policy character training in practice?
How should we do character training for reasoning models? Another plausible implication of the character training → motivation clarification → deep alignment hypothesis is that for reasoning models, we want character-aligned motivations to be stated in both the model’s CoT and its outputs. This creates more surface area both for aligned motivations to be stated and for a connection between those motivations and aligned actions to form. It also seems like a character that’s played in both the CoT and the final output is more robust than one that’s played only in the final output. nostalgebraist lists some further reasons for thinking that a different character being represented in the CoT and the final output is a capability limitation here.
However, this seems hard to achieve without optimizing the reasoning model’s CoT as part of character training. The case against directly optimizing CoTs looks strong: we probably shouldn't be too confident in the robustness of our character training approaches for now, so it seems much better to train LLMs to freely babble in their CoTs instead of training them to maintain a coherent character across both CoTs and final outputs, which might give the CoT character an incentive to cover up misbehavior in the final answer.
Please feel free to suggest additional ideas in the comment section, and let us know if you’re planning to work on any of these directions!
For example, SFT on reasoning traces is one of the two stages in the deliberative alignment pipeline introduced in Guan et al. (2024).
Though note that Marks et al. didn’t use an equal split between trained-on and held-out biases: they trained on 47 biases and held out five.