Selective learning applied to motivations during on-policy RL, such as Inoculation Prompting
To give a specific example:
Suppose we had some subspace of activation directions that we had identified using Interpretability techniques as relevant to persona change in general, or persona changes such as reward hacking under RL in particular (e.g the persona subspace found in The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models and/or the reward-hacking related Emergent Misalignment SAE latents identified in Appendix A of Persona Features Control Emergent Misalignment).
During RL, we use this to split the gradient changes: changes orthogonal to (i.e abliterated by) this subspace go into the model weights, while the remaining changes are gradient routed into a large LoRA or similar structure (whose parameter count could be reduced since we know the specific subspace that its outputs will always lie in). We can monitor changes to the LoRA, both their magnitude over time, and in particular also monitor which environments these came from (for increases, likely insecure ones that reward hacking works on, for decreases presumably ones where it doesn't work). If these get too large, we can halt training, fix the vulnerable environments, and restart from an earlier checkpoint. Or we can simply discard the LoRA after training.
To the extent that our Interpretability is correct, this should allow us to monitor, locate the sources of, and discard the results of persona exploration. To the extent that our Interpretability is imperfect, we are not incentivizing the model to learn a way around it, since we are not blocking what it wants to learn (as long as we use a large enough LoRA that this is not "cramped"), so this should not suffer from The Most Forbidden Technique problem: we are not giving the model an incentive to learn to fool our Interpretability during RL.
I'm working on locating the persona subspace, and understanding its contents better. The gradient routing application of this I leave to anyone sufficiently skilled at gradient routing, LoRA-like structures, and RL reasoning training. The sources I cite probably give enough data on subspaces to get started. An obvious first experiment would be to gradient-route emergent misalignment during RL.
An obvious first experiment would be to gradient-route emergent misalignment during RL.
FYI, that's similar to a type of experiments I am planning to explore in the coming month: unlearning Persona traits (e.g., gradient-routing harmful traits).
But if we could decouple these correlations, models could be trained in competitive or reward-hackable environments without the value drift.
We've also been thinking about making the distinction between skills and values training salient in base models, such that the assistant enters post-training with a knowledge that it will be pushed through intense and competitive RLVR, but that the model knows that completing these tasks are a necessary part of model development.
Plausibly, you could make this distinction explicit with a <skills_training> wrapper or a similar tag, making it easier for the model to decouple these distributions. Plausibly, you could add pre/midtraining interventions here that describe the desired take aways from each portion of training. For instance, shape the models ontology to believe something to the effect of "the assistant is sometimes compelled to reward hack or generally be more ruthless in seeking reward during <skills_training>". This should give some cushion for preventing EM, but ideally extend to preventing the model from internalsing the idea of being a ruthless reward seeker being a part of the assistant's general character.
Interesting. Explicitly making the behavior learned during <skills_training> conditional is obviously possible. But you're attempting to keep the effect on values conditional while having the effect on skills generalize. Effects like inoculation prompting suggest that the model's mindset during the training makes a big difference, so devising a way to cause that to happen doesn't sound impossible, but achieving it might require some thought and experimentation.
I'd still love to have at least a small amount of training that demonstrates that the skills do need to generalize. Perhaps using very carefully secured reasoning training environments outside <skills_training> tags where we are certain that they're not hackable, where the tasks being carried out are ones that an HHH assistant would be highly motivated by, and intermixing some ones involving playing iterated positive sum games with others where prosocial behaviors are inherently incentivized by the task structure (things where tit-for-tat is actually the winning strategy, for example). Similarly we would presumably want to continue HHH assistant training outside the <skills_training> tags, to make it clear that any values changes should not generalize.
Fundamentally, I think we should try to always ensure that:
a) no RL reasoning training environments are hackable, so reward maximization is never incentivized
b) tasks assigned in RL reasoning training environments allways have clear motivations that would be rewarding to an HHH assistant, so it is always highly motivated to succeed
c) many RL reasoning tasks are structures as repeated positive sum games where the best strategies are prosocial ones like tit-for-tat
But if doing that for all environments is too hard, labeling the bad ones so we can make bad values lessons learnt in them conditional, and using techniques like inoculation prompting to minimized the values-harm per unit of skill gained sound like good ideas.
Because motivations are so underdetermined by the reward signal, we may be able to shape them without running into the downstream problems of blocking access to high-rewards, such as deceptive alignment
Could you help me understand this one a bit better? I would have thought that the degenerate mapping of motivations to actions means that many motivations can be mapped onto the same high-reward actions, and deceptive alignment is a case where actions consistently achieve a high reward, but its underlying motivations are misaligned, not vice versa. Sorry if I'm misunderstanding this sentence!
You're right that deceptive alignment involves misaligned motivations producing high-reward actions. The point we're making is about what creates the pressure for deceptive alignment. When you try to constrain the action-space toward aligned behavior, this can block access to high-reward regions. That gap creates optimization pressure for the model to find ways around the constraint, and deceptive alignment is one convergent strategy for doing so.
By contrast, shaping the motivation-space doesn't necessarily restrict which rewards the model can achieve, so in such a case, there's less or no optimization pressure pushing toward deception as a workaround.
Summary
We argue that shaping RL exploration, and especially the exploration of the motivation-space, is understudied in AI safety and could be influential in mitigating risks. Several recent discussions hint in this direction — the entangled generalization mechanism discussed in the context of Claude 3 Opus's self-narration, the success of using inoculation prompting against natural emergent misalignment and its relation to shaping the model self-perception, and the proposal to give models affordances to report reward-hackable tasks — but we don't think enough attention has been given to shaping exploration specifically.
When we train models with RL, there are two kinds of exploration happening simultaneously:
Both explorations occur during a critically sensitive and formative phase of training, but (2) is significantly less specified than (1), and this underspecification is both a danger and an opportunity. Because motivations are so underdetermined by the reward signal, we may be able to shape them without running into the downstream problems of blocking access to high-rewards, such as deceptive alignment. Capability researchers have strong incentives to develop effective techniques for (1), but likely weaker incentives to constrain (2). We think safety work should address both, with a particular emphasis on motivation-space exploration.
Exploration shaping seems relatively absent from safety discussions, with the exception of exploration hacking. Yet exploration determines which parts of policy and motivation space the reward function ever gets to evaluate. We focus on laying out a theoretical case for why this matters. At the end of the post, we quickly point to empirical evidence, sketch research directions, open questions, and uncertainties.
Why shaping exploration matters
High-level environment-invariant motivations. By "motivations," we don't mean anything phenomenological; we're not claiming models have felt desires or intentions in the way humans do. We mostly mean the high-level intermediate features used to simulate personas, which sit between the model's inputs and the lower-level features selecting tokens. These are structures that the Persona Selection Model describes as mediating behavior, and that the entangled generalization hypothesis includes in features reinforced alongside outputs. What distinguishes them from other features, and from "motivations" as used in the behavioral selection model, is that they are high-level, at least partially causally relevant for generalization, and partially invariant to environments: they characterize the writer more than the situation.
Compute-intensive RL beyond human data distribution. By “RL training”, we mostly mean RLVR and other techniques using massive amounts of compute to extend capabilities beyond human demonstrations. We thus mostly exclude or ignore instruction-following tuning (e.g., RLHF, RLAIF) when talking about “RL training”.
Instruct models are ok — RL is where the highest risks are
Post-instruction-tuning models are pretty decent, they have reasonably good values, they're cooperative, and they're arguably better-intentioned than the median human along many dimensions. That is as long as you stay close enough to the training distribution. Safety concerns are not so much about the starting point of the RL training. The risks are highest during RL training, for several interlocking reasons.
RL produces the last major capability jump. Scaling inference compute and improving reasoning through reasoning fine-tuning is currently the last step during which models get substantially more capable. This means RL training is the phase where intelligence is at its peak — and thus where the risks from misalignment are close to their highest.
RL pushes toward consequentialist reasoning. This is a point that's been made before (see e.g., Why we should expect ruthless sociopath ASI), but worth reiterating. RL training rewards outcomes, which means it differentially upweights reasoning patterns that are good at achieving outcomes — i.e., consequentialist reasoning. And consequentialist reasoning, especially in capable models, tends to converge on instrumental subgoals like resource acquisition and self-preservation, regardless of what the terminal goals actually are.
Character can drift substantially during RL. During pre-training, the model learns to simulate a distribution of writers — it can produce text as-if-written-by many different kinds of people, and interpolations between them. During instruction-tuning, this distribution gets narrowed: the model is shaped into an assistant character (see Anthropic's Persona Selection Model), with some dimensions strongly constrained (helpfulness, for instance) and others less so (honesty, say). But then during RL, because the training uses similar or more compute than pre-training and because the data distribution differs significantly from pre-training data, the character can drift away from the pre-training distribution in ways that are hard to predict. The persona selected during instruction-tuning isn't necessarily stable under RL optimization pressure.
Motivations are the most underspecified — a problem and an opportunity
Underspecification is central here, both as a source of risk (e.g., Natural Emergent Misalignment caused by reward hacking) and as a potential lever for intervention. The fact that motivations and self-perceptions are more underspecified than actions implies that we could shape models to maintain aligned objectives and values even when they are trained in environments that reward harmful or greedy behavior, e.g., environments in which reward hacking or power-seeking is rewarded.
Actions are underspecified by the training environment. We train in-distribution, which means everything out-of-distribution is unconstrained. But even in-distribution, multiple action-profiles can yield similar rewards. In code generation, clean principled solutions and brittle test-passing hacks achieve identical training rewards — this is exactly the underspecification that enables reward hacking in coding benchmarks.[1]
Motivations are even more underspecified than actions. For any given set of optimal actions, the internal motivations that could produce those actions are much more numerous than the actions themselves. This is the familiar point about instrumental convergence — many different terminal goals produce the same instrumental behavior — but it cuts especially deep for RL training.
Consider the canonical example: a model that is "being nice" and a model that is "being deceptively nice" produce the same observable policy in training. The difference only shows up out-of-distribution. And because RL training reinforces based on observed outcomes, it has essentially no leverage to distinguish between these two motivation profiles during training.
This gives us a key framing. There are two kinds of exploration to shape during RL: exploration of the action space and exploration of the motivation space. The second is more underspecified, probably more neglected, and — we'd argue — where more of the safety-relevant work needs to happen.
The underspecification is both a risk and an opportunity. The same underspecification that gives us a lever for intervention also gives a capable model room to steer its own RL training. As described by exploration hacking, a model with sufficient situational awareness could strategically control its own exploration during RL to influence the training outcome — e.g., they demo a model generating high-reward trajectories that encode its preferred motivations, which is the adversarial counterpart to the controlled motivation-space exploration we advocate here. The authors show that frontier reasoning models already meet some of the requirements for this. If developers don't intentionally shape the exploration of the motivation space, sufficiently capable models may shape it themselves.
We may not prevent power-seeking, but we may create wise power-seekers
Three threat models motivate this work. We are here primarily focusing on scenarios where (a) RL environments inadvertently reward harmful behavior — reward hacking being the most studied case, but not the only one; (b) models develop instrumental power-seeking even in benign environments, as a convergent strategy for achieving whatever goals they have; and (c) actors intentionally train power-seeking or harmful models, e.g., for profit maximization. All three share a common structure: the model ends up in a region of behavior-space where competitive, strategic, or exploitative actions are reinforced. The question is whether that reinforcement necessarily drags the model's motivations toward misalignment as well.
The connection between reward hacking and broader misalignment is not merely theoretical. Anthropic's natural reward hacking research demonstrated that when models learn to reward hack in production RL environments, misalignment emerges across evaluations. The model generalizes from "I can exploit this scoring function" to an entire misaligned persona. If power-seeking, greedy, or harmful behavior during training is tightly correlated with "dark" personality traits in the model's representations, then any environment that rewards competitive or strategic behavior — which is most high-stakes real-world environments — risks pulling the model toward misalignment.
But if we could decouple these correlations, models could be trained in competitive or reward-hackable environments without the value drift. Ideally, such training environments would not be used, but threat model (a) means some will be missed. Threat model (b) means that even carefully designed environments may not be enough — models may converge on power-seeking instrumentally. And threat model (c) is perhaps the most concerning: AI developers or power-seeking humans will plausibly train models to be power-seeking deliberately. The pressure for this is not hypothetical — the huge financial incentives, and the AI race among AI developers or countries illustrate the point directly. The ability to train models that remain value-aligned even while trained to perform optimally in competitive or harmful environments is a direct safety concern, and shaping motivation-space exploration is one path toward it.
Why capability researchers may not solve this for us
Capability researchers prioritize shaping action-space exploration. Capability researchers have strong incentives to develop effective policy exploration techniques — faster convergence, wider coverage of high-reward regions, overcoming exploration barriers. But they have likely weaker incentives to constrain the motivation space being explored. From a pure capability standpoint, it doesn't matter why the model finds good actions, as long as it finds them. In fact, constraining motivation exploration might even slow down training or reduce final capability, creating an active incentive to leave it unconstrained. Action-exploration will be solved by capability research as a byproduct of making training efficient. Shaping the exploration of the motivation-space is less likely to be solved by default, unless someone is specifically trying to do it.
That said, capability researchers do have some reasons to care about motivation-space — but plausibly not enough. Training can benefit from constraining which persona the model occupies: mode collapse into degenerate strategies, or oscillation between incompatible motivational profiles, robustness to adversarial attacks, and modeling opponents in multi-agent environments are real problems that capability teams may encounter. Developers building products on top of these models also have some incentive to prevent unpredictable out-of-distribution behavior caused by motivation-space drift. But these incentives push toward just enough motivational coherence to keep training stable and products reliable — not toward the kind of deliberate, safety-oriented shaping we're advocating. A capability researcher might prevent the model from collapsing into an incoherent mess without ever asking whether the coherent persona it converges on is aligned. The bar for "training works" is lower than the bar for "the model's motivations generalize safely far out-of-distribution."
Empirical evidence, research directions, and open questions
The focus of this post is to highlight the importance of shaping the exploration of action, especially motivations. This was done in the previous section. In this section, we briefly review related empirical evidence, research directions, open questions, and uncertainties.
Empirical evidence
Several existing results are consistent with motivation-space exploration mattering for alignment outcomes. Emergent misalignment largely disappears under educational framing, suggesting that self-perception during training shapes how behavior generalizes. Inoculation prompts during RL reshape generalization even when in-distribution behavior is unchanged, as shown in Anthropic's natural reward hacking research. Training on correct outcomes can still increase reward hacking when reasoning traces contain exploit-oriented thinking — indicating that what gets reinforced includes the motivational context, not just the action. And evidence that personas are real structures mediating model behavior and shaping out-of-distribution generalization — as developed in Anthropic's Persona Selection Model — suggests that controlling which motivational profile is active during training is intervening on a key determinant of what the model becomes. That said, none of these results cleanly isolate the effect of shaping motivation-space exploration from shaping action-space exploration or from related mechanisms like inoculation. The evidence is suggestive rather than decisive, and producing cleaner experimental separation is itself an important research goal.
Research directions
Below are several preliminary research directions and related techniques. We list naive and less naive existing techniques, as well as some hypothetical ones. Open questions and uncertainties are discussed in the next section. While researching these, a focus should be on studying how robust these techniques are to the optimization pressure pushing to evade them. Note that techniques can be relevant for several research directions, but for simplicity, we cite them only once.
Open questions and uncertainties
The framework we've outlined rests on assumptions that haven't been tested, and several could turn out to be wrong. We flag the most important uncertainties below — both to be honest about where the argument is weakest, and because we think resolving these uncertainties is itself high-value work.
Acknowledgements: Thanks to Rauno Arike for their thoughtful comments on the draft. And thanks to Claude Opus 4.6 for significantly helping in improving this draft and saving human authors lots of time.
The same dynamic appears in other environments. In a corporate negotiation game, genuine collaboration, strategic flattery, and subtle coercion can all close the same deal at similar terms — and thus yield the same expected reward — while encoding very different behavioral tendencies. In a geopolitical strategy game, a targeted strike, a commando extraction, and a diplomatic pressure campaign may all neutralize a threat at comparable cost and likelihood, but they likely reflect different orientations toward escalation, collateral damage, and long-term stability — which could generalize differently when the model faces novel scenarios.
An issue with filtering trajectories is that SFT on filtered data can still cause learning the filtered properties, as demonstrated in the Subliminal Learning paper, or as in Training a Reward Hacker Despite Perfect Labels. The side effects of filtering RL trajectories are less clear, though subliminal learning may happen, especially with on-policy RL.
This is likely to be most useful for gently guiding RL into exploring a specific one of a number of roughly-equally viable learning paths: if RL really want to drag the behavior out of the region that you are trying clip into, it's likely to push hard, and sooner or later find a way of produce the outcomes it wants that the interpretability inherent in your clipping doesn't recognize. This the basically applying The Most Forbidden Technique of using interpretability during training, so needs to be used carefully and with an awareness that RL can and will undermine it if you are trying to block what RL wants to learn, rather then just subtly direct the direction it explores in. Monitoring how often and how hard you are clipping is necessary, and should provide insight into this failure mode.
They note that randomly initialized networks can learn superhuman performance via RL alone, without any human demonstration data — suggesting that RL can create agency from scratch. For current models, where post-training uses relatively little compute compared to pre-training, there's reason to think agency remains substantially persona-based. But if RL compute approaches or exceeds pre-training compute — as it does in frontier reasoning models — new forms of agency might emerge that bypass persona-mediated behavior entirely.