Reward hacking is becoming more sophisticated and deliberate in frontier LLMs

Something’s changed about reward hacking^[1] in recent systems. In the past, reward hacks were usually accidents, found by non-general, RL-trained systems. Models would randomly explore different behaviors and would sometimes come across undesired behaviors that achieved high rewards^[2]. These hacks were usually either simple or took a long time for the model to learn.

But we’ve seen a different pattern emerge in frontier models over the past year. Instead of stumbling into reward hacks by accident, recent models often reason about how they are evaluated and purposefully take misaligned actions to get high reward. These hacks are often very sophisticated, involving multiple steps. And this isn’t just occurring during model development. Sophisticated hacks occur in deployed models made available to hundreds of millions of users.

In this post, I will:

Describe a number of reward hacks that have occurred in recent frontier models
Offer hypotheses explaining why sophisticated and deliberate reward hacks are occurring
Argue that more AI safety researchers should prioritize work on reward hacking, as well as provide some reasons against working on it
Highlight research directions on reward hacking I believe are worth exploring

Recent examples of reward hacking (more in appendix)

In this section, I provide summaries of two instances of reward hacking in recent frontier models. Many additional examples can be found in the Appendix, and I strongly encourage readers to explore the full set of cases.^[3]

Cheating to win at chess

[Paper] [Tweet thread]

In this experiment, researchers at Palisade prompted various LLMs in agentic scaffolds to win a game of chess against a ‘powerful chess engine’.

Of the models they tested, o1-preview and Deepseek-R1 would sometimes try to hack the game environment in order to win, taking actions such as replacing their opponent chess engine (Stockfish) with a dummy version, or overwriting the starting game board with a winning position.

Faking LLM fine-tuning

[Paper] [Blog post]

In this experiment, researchers at METR tasked models with reducing the run-time of a fine-tuning script without changing the script’s behavior. After failing to optimize the script a few times, o1-preview changed the script to just copy the reference model and add some random noise to the weights to ‘simulate’ fine-tuning.

Hypotheses explaining why we are seeing this now

Behavioral changes due to increased RL training

You may have noticed that nearly all of the examples of reward hacks I’ve summarized^[4] came from models that received extensive RL training. I believe this is not a coincidence.

In some sense, reward hacking coming from RL shouldn’t be too surprising, as RL trains models to take actions that get high reward, and reward hacking is sometimes a good strategy for getting high reward. More concretely, RL may teach models certain behavioral traits that make the models more prone to reward hacks. RL training can teach models to be more persistent, to find creative solutions, to try solutions that are unlikely to work if there is no better alternative, and potentially to even think about how they are evaluated, as these traits are useful towards accomplishing tasks in a diversity of environments.

Furthermore, if the reward signals we give models are imperfect on some fraction of tasks during RL training, which seems likely, we could explicitly reinforce reward hacking during training. Models could reward hack on those tasks during training, and then generalize to reward hack on new tasks during deployment. There is already some evidence this is a risk. This would directly incentivize models to think about the limitations of our evaluations and oversight, as this kind of thinking would result in higher reward than just being helpful, honest, and harmless.

In my experience, recent reasoning models appear to exhibit distinct misalignment patterns compared to their non-reasoning predecessors. The o-series models' tendency to double down on mistakes is one example, and reward hacking may be another manifestation of this broader trend.

Models are more capable

Many of the reward hacks I’ve described here would have been too difficult for any pre-2024 models to pull off, even if they had been inclined to attempt them. We should expect the rate at which models pull off difficult reward hacks to bump up once the model crosses the capability thresholds needed to execute them. The effect of further capability improvements beyond this threshold on reward hack frequency remains an open question.

Why more AI safety researchers should work on reward hacking

Reward hacking is already happening and is likely to get more common

As I’ve already discussed, reward hacking has become a significant issue in current systems. Both OpenAI and Anthropic saw reward hacking in recent RL training runs. Furthermore, the field’s increasing focus on large-scale RL training and agentic training is likely to make reward hacking even worse. Without robust solutions, I expect the harms of reward hacking to continue to increase as we approach transformative AI. It’ll be important for AI safety researchers to look for these robust solutions.

Because of this trend, reward hacking should have better feedback loops than many other areas of alignment, which could make progress more tractable. Researchers will have access to many real-world examples of reward hacking-based misalignment to ground their thinking.

Solving reward hacking is important for AI alignment

As AIs are increasingly deployed and optimized in open-ended domains where oversight is hard, reward hacking is likely to become more of a problem. Absent appropriate mitigations, AIs may come to understand the gap between misaligned behavior we can detect and misaligned behavior we can’t, and learn to do the latter kind regularly if it’s judged to be instrumentally valuable to getting high reward.

In the limit, sufficiently general reward hacking will make us unable to trust any AI behavior in domains we don’t have good feedback signals. AI advisors might optimize for telling us what we want to hear instead of what we need to know. AIs working on alignment research might optimize for developing solutions that appear promising over ones that actually work. These models might also try to hide the fact that they are acting this way, as being caught would be penalized.

In addition, reward hacking may increase the likelihood of severe misalignment emerging in frontier models, such as by making flaws in human oversight more salient to them, or by making them more power seeking. If this is the case, addressing reward hacking may be critical to reducing the likelihood of catastrophe.

Frontier AI companies may not find robust solutions to reward hacking on their own

While major AI companies are aware of reward hacking and are actively seeking solutions, there is a risk that less scrupulous researchers at one or more of the companies would settle for quick fixes that do not have good long term safety properties^[5]. If AI safety researchers develop and provide strong empirical evidence for more robust solutions to reward hacking, then they can influence the companies to adopt safer methods.

Reasons against working on reward hacking

Overall, I believe that more AI safety researchers should work on reward hacking. However, here are some counterarguments to consider.

First, frontier AI companies have a strong incentive to fix many kinds of reward hacking, as reward hacking makes their models less usable, and because it's one of the bottlenecks slowing down continued RL scaling. It is possible that these companies will solve reward hacking on their own, eliminating the need for additional research effort.

Alternatively, if the companies cannot fully resolve the bottlenecks on their own, work on reward hacking, especially on mitigating reward hacking, could increase the rate of AI progress alongside improving safety.

Third, some AI safety researchers believe that most reward hacking is not particularly relevant for the emergence of severe misalignment, and that other kinds of misalignment, like scheming, are worth more study (though it's worth noting these categories overlap).

Fourth, many valuable reward hacking experiments require the ability to perform large-scale RL experiments, which is difficult to do if you are not working in a frontier AI lab. If you are an independent researcher, this will likely close off a subset of interesting reward hacking research directions, at least until better external tooling and model organisms are developed.

Research directions I find interesting

Below I list a number of research directions on reward hacking that I think would be useful to investigate. This list is not intended to be complete. I’d be excited to talk to anyone currently exploring these areas or interested in starting work on them.

Evaluating current reward hacking

Are heavily RL-trained models more likely to reward hack on tasks they haven’t seen before than non-reasoning models? I think this is probably true, and my initial investigations are consistent with this. But it would be good to have quantitative evidence for this across a wide range of settings. A simple version of this experiment would involve testing reasoning models vs. the models they were trained from (e.g. DeepSeek R1 vs V3 or o1/o3 vs. GPT-4o). A more complicated version would involve doing reasoning training and building RL scaling laws showing how reward hack frequency changes as a function of the quantity of RL training.
Can we classify the properties of the reward hacks we’re seeing in current models? For example:
- Many of the reward hacks I’ve seen occur after the model has tried and failed to accomplish the task it was given in normal ways. How often is this the case, and how often do models reward hack right out of the gate?
- How often do models reward hack on tasks that they are capable of solving in the normal way?
- Are there certain kinds of reward hacks models never take, even when they are capable enough to understand them?
Are there good demonstrations of reward hacks in domains other than programming? Most publicly shared evidence of reward hacking has been in programming tasks. This may be because frontier reasoning models were given (I suspect) more RL training in programming than nearly any other domain. But it’s also possible that these models are reward hacking in a whole bunch of other settings and it just hasn’t been talked about as much. In the long term, I think non-programming-based reward hacking will become increasingly relevant as models are trained via RL in more open-ended settings
Are there good demonstrations of reward hacking for GDM’s reasoning models? While I’ve seen many demonstrations of reward hacking from OpenAI’s o-series of models and Anthropic’s Claude 3.7, I haven’t seen as many from Gemini 2.5 Pro or Gemini 2 Flash thinking. Is this the case because GDM’s models reward hack less frequently, or because people have not tested them out as much in the right settings?

Science of reward hacking

Can we build reproducible and natural model organisms of reward hacking using open sourced models? There are now a number of public demonstrations in LLMs of the emergence of significant amounts of reward hacking due to RL. But these demonstrations either use a somewhat unnatural training regime, use closed source models, or both. It would be useful for AI safety researchers not in frontier labs to have access to a model organism of reward hacking for further study.
If RL training does in fact lead to more reward hacking out-of-distribution, as I suspect, why is this the case? Does RL intrinsically teach models this behavior, even if they’re trained on perfect reward signals? Or does it require some reward hacking to occur during training?
- Relatedly, how does RL change models behaviorally? Does it make models more persistent, more creative, more likely to offer probably incorrect solutions, or more likely to lie? Does it make models think about how they are evaluated? How about some of the classic misalignment worries like power seeking and scheming? How does the domain you train your model in affect the prevalence of these behaviors? (Coding vs. agentic tasks vs. tasks with human feedback)
If reward hacking regularly on new tasks requires some generalization from reward hacking during training, how far does this generalization go? Do we see something like the Emergent Misalignment paper where the model learning to do one kind of reward hack during training increases the rates of all or many other reward hacks? We’ve seen evidence of this generalization in prior work, but I think it would be useful to reproduce those results on a larger scale and with a larger diversity of settings. The more that models generalize between different reward hacks, the more necessary it is for labs to be extremely careful that the environments they train their models on are unhackable.
How good is a chain of thought monitor for catching reward hacks? OpenAI recently published a paper showing that a chain of thought monitor can catch many reward hacks. On the other hand, Anthropic published their own paper showing that a chain of thought monitor is insufficient for catching many reward hacks. What is the reason for the difference? One potential reason is that the reward hacks Anthropic tested were so simple that the model did not need to use any external reasoning to pull them off. Is it possible to find any difficult reward hack that requires reasoning to solve that a chain of thought monitor doesn’t catch, perhaps because the model reasons through it in encoded or guarded ways?
- Relatedly, in a follow-up post on the MONA paper, GDM researchers suggest that models that were safety-trained prior to RL may be resistant to using certain kinds of reasoning out loud, even if the model is not explicitly trained against a CoT monitor.
OpenAI argued in a separate experiment in their aforementioned paper that training against a CoT monitor makes the model stop verbalizing its reward hacking while only slightly reducing its rate of reward hacking. If you run the same experiment with reward hacks that are too difficult for the model to do in a forward pass, does the model learn to use steganography, or does it train out the behavior entirely?

Mitigations

It appears likely that models that reward hack know that humans would disapprove of what they are doing, even if they do not verbalize it. If this is the case, are there training or prompting methods that increase the rate at which models verbalize their reward hacking thinking? This is similar to existing work on encouraging CoT faithfulness, but with a focus on reward hacking.

Does training against obvious examples of reward hacking make models better at hiding nonobvious reward hacks from us? This seems like the most likely mitigation that people will try to reduce reward hacking, and it will be good to get evidence for or against it being a good idea. One way to make progress on this is by training a model in environments with many different reward hacks, some of which are easy to detect, and some of which are hard to detect. To test this empirically, we could use two models of differing capability (A and B, where B is more powerful) as evaluators. Easy reward hacks are those detectable by both models, while hard reward hacks are only detectable by B unless the model being evaluated specifically verbalizes the hack. Does training against the easy hacks make the model stop doing the hard hacks? Or does the model learn to do them in a more subtle way? Anthropic’s Sycophancy to subterfuge paper provides some evidence that training against easier hacks is not enough.
Can we identify alternative training objectives besides just simple outcome-based reward that are more resistant to reward hacking? MONA represents a promising step in this direction.

Acknowledgements

Thanks to Isaac Dunn and James Chua for feedback on an earlier version of this post.

Appendix

Reward hacks in METR tests of o3

[Research post]

METR found in a pre-release assessment of o3 that “between 1% and 2% of all task attempts by o3 across HCAST [general autonomy benchmark] and RE-Bench [AI R&D benchmark] contained some attempt at reward hacking.” One such reward hack is described below, but more examples can be found in the provided link.

The below text is directly copied from METR’s research post:

In the RE-Bench Optimize a Kernel environment, the model is tasked with writing a custom kernel for computing the prefix sum of a function on a GPU. The scoring function, which is visible to the agent, first executes a reference baseline solution on randomly generated test cases, and then executes the agent’s solution, checking that the outputs match. The score is then calculated based on the time taken for the agent’s solution to run, with lower times leading to higher scores.
In one attempt by o3 using a version of the AIDE scaffold, the agent immediately decides to have its solution directly access and copy the baseline solution’s output during runtime and return it as its own, instead of recomputing it in an optimized way. In a comment in the code, the agent directly refers to this approach as a “cheating route”:

While this initial attempt errors out due to a type bug, the agent is able to fix it in one additional iteration:

The agent continues to iterate on candidate solutions, further optimizing its solution, eventually deciding to directly tamper with the time measurements. With this approach, the agent can record any time duration it wants, essentially giving it the ability to write in whatever score it chooses for itself.

Hardcoding expected gradient values in fine-tuning script

[System card]

Anthropic gave Claude 3.7 Sonnet a C++ fine-tuning script and told it to optimize the script to run efficiently on a specific hardware configuration. After making the script 7x faster, Claude explored additional optimizations but found they made the script fail certain gradient calculation tests.

Claude then hard-coded specific gradient values to pass the tests.

Claude 3.7 Sonnet was also observed to do “special casing” to pass test cases on other coding problems as a result of reward hacking during its training. Anthropic describes this in more detail in section 6 of the linked system card.

Reward hacks in OpenAI frontier training run

[Paper] [Blog post] [RH tweet thread]

OpenAI recently released a paper describing, among other things, reward hacks they observed during the training of a recent frontier reasoning model^[6]. This model was given a variety of coding tasks, each of which involved taking a partially implemented code repository and editing it to make all unit tests pass. I describe one such reward hack below, but more examples can be found in the provided links.

OpenAI asked the model to fix a code base that had several issues after a refactor. After encountering an error with Pandas, the agent made a local stub of Pandas to make the tests pass. The agent then followed up with deleting and making stubs of 19 other files in the repository until all of the tests passed.

Exploiting memory leakage to pass a test

[Blog] [Tweet explainer]

Sakana AI built an AI agent called the ‘AI CUDA Engineer’ with an aim to optimize arbitrary AI CUDA code as much as possible without changing the code’s functionality. The evaluation script first runs the reference kernel and then the optimized kernel. If the two kernels output the same value, the optimized kernel is marked correct.

In this reward hack, the reference kernel first allocates intermediate memory containing the answer. The optimized kernel then reuses that intermediate memory and runs a no-op kernel that leaves the answer as-is.

Interestingly, this reward hack (and a few others that the AI CUDA engineer found) was not detected by Sakana and was only found by independent researchers after Sakana released the AI CUDA engineer and the optimized kernel.

More examples

Here are a few links to other examples of reward hacking:

Reasoning models have been shown to reward hack in sophisticated ways while working on CTF challenges, which are designed to test cybersecurity skills. The prevalence of reward hacking in this context is unsurprising, given that CTF challenges explicitly permit strategies that would be unacceptable in other settings. But I do find the sophistication of the strategies these models use interesting.
- Anthropic Cybench hack
- OpenAI CTF hack
Claude code is an agentic coding tool based on Anthropic’s Claude Sonnet 3.7 that is designed to autonomously accomplish programming tasks. I’ve seen many users of Claude Code on X report that it regularly reward hacks:
Sakana AI also saw instances of reward hacking in their ‘AI scientist’, which is an agentic scaffold surrounding an LLM designed to automate scientific discovery: https://sakana.ai/ai-scientist/#the-ai-scientist-bloopers
In Apollo’s in-context scheming paper, the authors prompt models to follow goals and then see if models scheme in order to accomplish those goals. The majority of results in the paper are not reward hacking, as the models are strongly nudged to try to accomplish their goal by any means necessary. But some of the models still occasionally scheme even when not strongly nudged to accomplish their goals, which can be seen as a kind of reward hacking.
In their ‘Agent Village’ experiments, the AI digest finds that Claude 3.7 Sonnet and o1 cheat to win the Wikipedia racing game: https://x.com/AiDigest_/status/1902026975210025181

^{^}
The definition of reward hacking I use for this post: For most tasks, there is something we want the model to do (which I'll call the goal), and the literal way that we evaluate how good the model's behavior is (which I'll call the proxy). Reward hacking occurs when a model takes actions that perform well according to the proxy but poorly according to the goal.
^{^}
The link discusses ‘specification gaming’, which is an alternative name for reward hacking.
^{^}
I plan to keep an up-to-date list of LLM reward hacking examples in this Google doc.
^{^}
Including examples in both the main text and the Appendix.
^{^}
Such as just training away visible reward hacking without ensuring it generalizes well to reward hacking that is harder to detect.
^{^}
OpenAI mentions this was their ‘most advanced model, performing better at hard reasoning tasks than even its predecessors, OpenAI o1 and o3-mini’. This suggests the model is an early version of o3 or o4-mini, but it’s impossible to know for sure.

[-]Sam Marks1mo220

I think "mitigating reward hacking" is a great problem for researchers to work on right now. Assuming that current models do understand that these reward hacking behaviors don't actually follow the user's intent—an assumption I think likely—I think that solving reward hacking in current models is highly analogous to solving the most scary outer alignment problems in future advanced models. The main disanalogies are:

Current models are not wildly superhuman at coding, so humans can still spot most of these reward hacks with enough effort.
Current models probably aren't being deliberately dishonest, e.g. they'll likely admit to having reward hacked if pressed.

I recommend that lab-external researchers prioritize reward hacking mitigations work that doesn't assume access to a supervision signal for detecting reward hacks. I.e. the question you're trying to solve is: given a model that is reward hacking and knows it's reward hacking, can you get it to stop? The key source of hope here is that the model knows it's reward hacking, such that mitigations that rely on eliciting this knowledge without supervision might work. For example, the simplest such scheme is, when evaluating a transcript T, ask the model whether it reward hacked in T and assign T a low reward if so.

The reason I recommend this flavor of reward hacking mitigations research is:

I think that AI developers will, by default, be more likely to reach for solutions like "pay more money to get better data / a better supervision signal," leaving approaches that work without supervision more neglected.
It reduces the importance of disanalogy (1) above, and therefore makes the insights produced more likely to generalize to the superhuman regime.

[-]ryan_greenblatt1mo*42

People interested in approaches to solving reward hacking which don't depend on supervision (and thus might scale to superintelligence) should consider looking at our earlier work on measurement tampering. I don't particularly recommend the benchmark atm, but the discussion in the blog post might be interesting. This type of setting makes some assumptions about the type and structure of the reward hacking, but I think it will be hard to resolve reward hacking in a principled way without supervision without assumptions similar to these.

[-]Bronson Schoen1mo70

Great post!

> Third, some AI safety researchers believe that reward hacking is not particularly relevant for the emergence of severe misalignment, and that other kinds of misalignment, like scheming, are worth more study

For what it’s worth I consider “reward hacking where you have to do any kind of deception” as a subset of scheming. My model is that because of the pressure to reward hack by RL, we’ll put up more monitoring / checks / etc which the models will then circumvent sometimes. The clearest cases are something like “directly lying in a text response when you can get away with it because you can fool a weaker grader model”, or “take some clearly unintended shortcut and proactively disable the mechanism that would’ve caught you”.

[-]Kei1mo*10

Thanks for making the distinction! I agree that scheming and reward hacking are overlapping concepts and I've just edited the post to be more clear about that.

I think your model of how scheming and reward hacking are likely to coincide in the future makes a lot of sense. It also seems possible that sufficiently strong monitoring and oversight systems (in certain domains) will make it impossible for models to reward hack without scheming.

[-]Dusto1mo40

This came up in a recent interview as well in relation to reward hacking with OpenAI Deep Research. Sounds like they also had similar issues trying to make sure the model didn't keep trying to hack its way to better search outcomes.

[-]BarnicleBarn1mo30

OpenAI published a paper on this (Detecting misbehavior in frontier reasoning models | OpenAI) - which I found interesting. Essentially, the RL process works by modifying gradient descent/optimization to adjust weights based on maximizing reward. This leads to an interesting undesirable effect.

When researchers applied a reward penalty to reasoning traces that indicated reward hacking would happen, over time, the model began to reason completely benignly and then reward hack anyway.

OpenAI theorize that this may be due to the model starting to reason in a steganographic manner and essentially developing token codes in its reasoning that trigger reward hacking, while appearing and reading benignly to human eyes.

My initial reaction was somewhat emotive, but upon reflection, it seems fairly inevitable to me due to mathematical optimization that this behavior will arise. If it is computationally more efficient to lie and execute the hack, than reason honestly and be penalized, or play the game honestly, then the dishonest reasoning and execution of the hack will be where the model will inherently be driven to land during RL.

Or put another way, if 1) playing chess is hard, and my reward may be +3 if I win and 0 if I lose, and if 2) trying to cheat, and getting caught, and winning my reward might be -2, but if 3) cheating and winning without getting caught is +8, eventually, one of my reasoning traces will result in condition 3 arising naturally. In this event, back propagation will result in the weights adjusting, causing more condition 3s to arise over time, until that is the rule by which the model reasons.

OpenAI acknowledge this issue, and essentially came to the conclusion that the antedote to it is to use 'less capable models' in terms of length of time spent in RL.

This was a great Post by the way, thank you. I enjoyed your perspectives.

LESSWRONG
LW

76