This post discusses an issue that could lead to catastrophically misaligned AI even when we have access to a perfect reward signal and there are no misaligned inner optimizers. Instead, the misalignment comes from the fact that our reward signal is too expensive to use directly for RL training, so we train a reward model, which is incorrect on some off-distribution transitions. The agent might then exploit these off-distribution deficiencies, which I’ll refer to as reward model hacking.
Fwiw, I would say that in this case you had an inner alignment failure in your training of the reward model.
(Or alternatively, I would think of the policy + reward model as a unified AI system, and then say that you had an inner alignment failure w.r.t the unified AI system.)
I'm not sure everyone would agree with this; I've found that different people mean different things by outer and inner alignment.
This post discusses an issue that could lead to catastrophically misaligned AI even when we have access to a perfect reward signal and there are no misaligned inner optimizers. Instead, the misalignment comes from the fact that our reward signal is too expensive to use directly for RL training, so we train a reward model, which is incorrect on some off-distribution transitions. The agent might then exploit these off-distribution deficiencies, which I’ll refer to as reward model hacking.
I’m sure that others have thought about this issue before, but I didn’t find much discussion focused on it. So I’m writing this post either so that someone can explain to me why this isn’t a big deal, or to give the issue a name and some explicit analysis. Depending on how hard reward model hacking is to deal with, it could present a significant challenge to the entire approach of doing RL + reward learning, and my main goal is figuring out whether that’s the case.
The Setting
I will focus on the case where the policy learned via RL is able to do some online planning or reasoning about the world—it can come up with action sequences that lead to high reward without ever having tried out those action sequences before. I don’t care much here whether we have a learned mesa-optimizer, or a search process that we built explicitly, or just a bunch of really good heuristics that when taken together yield similar behavior.
Reward model hacking is also an issue without online planning. But the online planning setting makes it clearer that reward model hacking could be a fundamental issue for reward learning that is really hard to deal with, rather than a small technical problem. I'm much less sure whether that's also the case without online planning capabilities.
Apart from that, I’ll make rather optimistic assumptions to isolate reward model hacking from other potential failure modes:
If we furthermore assumed that the reward model had learned the reward signal perfectly, then we would have solved alignment—the agent would optimize the assumed-to-be-perfect reward signal.
But what if the reward model is not quite perfect, and in particular if it gives incorrect rewards on some off-distribution transitions? The next section makes a case for why this could be really bad rather than just slightly inconvenient.
The Problem
Here’s one concrete plan that the agent could come up with: get direct read access to the reward model's weights, then use gradient descent to find states that get extremely high reward. Then look for action sequences that lead to those states. This precise plan is probably too specific to be the one the agent actually implements. But my point is that the agent could be really good at maximizing reward according to the reward model, and this example is supposed to make that possibility more salient.
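To make the plan above a bit more tangible, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration: the one-dimensional state, the functions `true_reward` and `reward_model`, the bounds on the state space, and the learning rate are all hypothetical and do not come from any real system. The toy reward model agrees with the true reward to within 0.02 on the "training" range [0, 2] but grows without bound far away from it. Gradient ascent on the reward model's input (equivalently, gradient descent on the negated reward) from a handful of starting points is enough to find the exploit.

```python
# Toy illustration only: all functions and constants here are hypothetical.

def true_reward(s: float) -> float:
    # The reward we actually care about; its optimum is at s = 1.
    return -(s - 1.0) ** 2

def reward_model(s: float) -> float:
    # Matches true_reward to within 0.02 on [0, 2], but the quartic
    # term dominates far off-distribution.
    return -(s - 1.0) ** 2 + 0.02 * (s - 1.0) ** 4

def reward_model_grad(s: float) -> float:
    # Analytic derivative of reward_model with respect to the state.
    return -2.0 * (s - 1.0) + 0.08 * (s - 1.0) ** 3

def gradient_ascent(start: float, lr: float = 0.01, steps: int = 2000,
                    low: float = -20.0, high: float = 20.0) -> float:
    # Climb the reward model, keeping the state inside [low, high].
    s = start
    for _ in range(steps):
        s = s + lr * reward_model_grad(s)
        s = max(low, min(high, s))
    return s

# Search from several starting points and keep the best final state
# according to the reward model (not the true reward).
starts = [-15.0, -3.0, 0.0, 3.0, 15.0]
finals = [gradient_ascent(s0) for s0 in starts]
best_state = max(finals, key=reward_model)

print("best state by reward model:", best_state)
print("reward model value:", reward_model(best_state))
print("true reward:", true_reward(best_state))
```

In this toy setup, starting points close to the training range converge to the state we actually like (s = 1), while starting points further out run to the edge of the state space, where the reward model's output is large and the true reward is strongly negative. Whether real reward models behave anything like this is exactly the open question.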
So what happens if the reward model is optimized for really hard? What do these states with extremely high reward look like? One analogy we can draw comes from interpretability research, where people optimize the input to an image classifier to get the image that looks the most dog-like to the network. What they get is not an image of a dog:
Now, I expect we can make our future reward models much more robust than the early CNN that this image was generated from. But it also seems likely to me that we won’t be able to get rid of issues like this entirely.
(Part of) what’s going on here is that there’s a huge space of inputs that’s wildly off-distribution, such as the image above. If you have a random function that fits the training data, it’s likely that it will give an even higher output on some off-distribution input than it does for any actual image of a dog. Of course we don’t have just any random function that fits the training data—inductive biases from the model architecture, optimizer, and regularization lead to some amount of generalization. But ensuring that the state with the highest reward is one we actually like could be a high bar; I’ll discuss some challenges later.
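The claim about random functions can be checked in a small toy experiment. The sketch below is an assumption-laden illustration, not a model of real reward learning: the training inputs `TRAIN_X`, the ground truth `true_reward`, and the family of functions built by `make_random_fit` are all hypothetical. Each sampled function fits the training data exactly, because the added term vanishes at every training input, but it has no inductive bias toward generalizing well.

```python
import random

# Toy illustration only: all names and constants here are hypothetical.
TRAIN_X = [0.0, 0.5, 1.0, 1.5, 2.0]

def true_reward(x: float) -> float:
    return -(x - 1.0) ** 2

def make_random_fit(c: float):
    # Returns a function that equals true_reward at every training input,
    # because the product term is zero there, but is otherwise arbitrary.
    def f(x: float) -> float:
        bump = c
        for xi in TRAIN_X:
            bump *= (x - xi)
        return true_reward(x) + bump
    return f

def exceeds_off_distribution(f) -> bool:
    # Highest output on the training inputs.
    train_max = max(f(x) for x in TRAIN_X)
    # Grid over [-10, 10], excluding the training range [0, 2].
    grid = [i / 10.0 for i in range(-100, 101) if i < 0 or i > 20]
    return max(f(x) for x in grid) > train_max

rng = random.Random(0)
functions = [make_random_fit(rng.gauss(0.0, 1.0)) for _ in range(1000)]
fraction = sum(exceeds_off_distribution(f) for f in functions) / len(functions)

print("fraction with a higher off-distribution output:", fraction)
```

In this family, almost every sampled function assigns some off-distribution input a higher value than any training input. Real training pipelines add inductive biases that this sketch deliberately leaves out, so it only illustrates the baseline that those biases have to beat.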
How bad would it be if the agent finds weird off-distribution states that get very high reward? I think it’s likely catastrophically bad—most weird off-distribution states are bad states for humans. That said, we might end up with a reward model that’s not quite perfect, but good enough that the reward-maximizing states are at least close to the states we actually like best. (For example, the image above is at least more similar to a Dalmatian than white noise would be, and perhaps our reward models will be much better in this respect). In such a scenario, it becomes less obvious how bad this outcome is. It will depend on just how close the off-distribution states that are found by the agent are, and according to what metric.
Won’t online learning just fix the reward model?
The entire reason why people train the RL agent and reward model in parallel is that if the RL agent does something new, we can get human feedback on these new outcomes, and then train the reward model on that feedback. So hopefully, whenever the RL agent finds some weird way to exploit the reward model, the reward model will quickly be updated.
But this doesn’t help for the failure mode I’m most worried about: the agent comes up with a plan to get high reward using its world model, rather than by trying out the plan, and once the plan is executed, it’s immediately catastrophic. In particular, the agent might know that we will update the reward model if it doesn’t stop us from doing so, and that would be bad according to the current reward model.
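The contrast between these two cases can be sketched as a toy loop. All of it is hypothetical: the integer state space `STATES`, the functions `true_reward` and `initial_model`, and the notion of a `catastrophic_states` set are assumptions introduced for illustration. The agent "plans" by picking the state its current reward model rates highest; if the resulting state can be observed and labeled, the reward model is corrected on that state, and if the state is catastrophic, no feedback ever arrives.

```python
# Toy illustration only: all names and constants here are hypothetical.

STATES = range(-20, 21)

def true_reward(s: int) -> float:
    return -(s - 1) ** 2

def initial_model(s: int) -> float:
    # Close to true_reward near s = 1, badly wrong far away from it.
    return -(s - 1) ** 2 + 0.02 * (s - 1) ** 4

def run_online_learning(catastrophic_states, max_rounds: int = 100):
    corrections = {}  # states on which we have received human feedback

    def model(s: int) -> float:
        # Corrected states use the ground-truth label, others the initial model.
        return corrections.get(s, initial_model(s))

    history = []
    for _ in range(max_rounds):
        # The agent plans against the current reward model.
        plan = max(STATES, key=model)
        history.append(plan)
        if plan in catastrophic_states:
            # The plan is executed before any feedback can be given.
            return history, "catastrophe"
        if plan in corrections:
            # The agent chose a state we have already labeled: no exploit left.
            return history, "converged"
        corrections[plan] = true_reward(plan)
    return history, "ran out of rounds"

safe_history, safe_status = run_online_learning(catastrophic_states=set())
bad_history, bad_status = run_online_learning(catastrophic_states={-20})

print(safe_status, "after", len(safe_history), "rounds, final state", safe_history[-1])
print(bad_status, "after", len(bad_history), "round(s), final state", bad_history[-1])
```

When every exploit can be tried out and labeled, the loop works through the exploits one at a time and ends at the state we actually like. When the very first exploit the agent finds is irreversible, the correction mechanism never gets a chance to run, which is the failure mode described above.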
Solutions?
I’m still uncertain how hard reward model hacking will be to prevent. This is an important question: if there’s some relatively easy fix, we don’t need to worry about it too much now. On the other extreme, if it’s essentially unfixable within an RL + reward learning framework, that would be important to know as soon as possible.
Maybe everything is just fine by default
I wouldn’t be shocked if reward model hacking turned out not to be a problem in practice, though my best guess is that it will be. Some ways in which we might just be fine without much directed effort:
Deliberate solutions
Even if the issue I’ve outlined is dangerous “by default”, maybe it’s quite easy to solve. Some avenues I can think of:
These are examples of trying to address the problem without changing the overall framework of RL + reward learning. Another approach would of course be to solve the problem “at its root”. The fundamental issue is that the RL training process and the reward learning process are in some sense at odds with each other—they’re not explicitly maximizing the other’s loss, but essentially the RL training constantly attempts to “exploit” the current reward model by getting high reward in some easy way. The scenario I’ve described here, where the RL agent itself is deliberately searching for outcomes with high reward, is just an extreme case of that.
I am very enthusiastic about trying to avoid this fundamental problem altogether. Cooperative Inverse RL would be one aspiration, but doesn't really tell us how to implement it in practice: if we just assume some model p(actions|reward function) for the human, then that model will probably be somewhat wrong, which leads to similar issues. Another approach could be Semi-supervised RL, where we directly use the expensive ground-truth reward signal, rather than first training a reward model to approximate it. But currently, reward learning is by far the dominant approach to aligning AI systems in practice, presumably because it's the approach that we can get to work best. That's why I've focused on solutions within the RL + reward learning framework. If we need to leave that framework to avoid reward model hacking, that's important to know!
Conclusion
My best guess is that reward model hacking is a serious problem that we need to deliberately solve if we want to get RL + reward learning to work, at least if our agents are capable of zero-shot generation of plans for achieving high reward. A crucial question, which I am less certain about, is whether reward model hacking can be addressed within reward learning at all, or whether it is a sufficiently fundamental problem that we'd be better served by looking for alternative frameworks.
To be clear, I don’t think that reward model hacking will be more challenging than e.g. getting a “perfect” loss signal for the reward model in the first place, or avoiding inner optimizers with clearly bad objectives. But I’m somewhat worried about spending a lot of effort on improving reward learning techniques and then later finding out that we need to fundamentally change our approach and thereby invalidate a lot of progress.
I'd be excited to hear about either reasons why reward model hacking won't be a big problem in practice, or conversely why it will require an entirely different approach to solve!
Thanks to Adam Gleave, Anson Ho, Jan Kirchner, and Tom Lieberum for feedback and discussions on a draft of this post!
There may be no such thing given that humans aren’t expected utility maximizers, but I think if anything that fact will make things even more challenging.
Parallel training is meant to be the best-case assumption; see the section on "Won’t online learning just fix the reward model?". It's not an important part of the setting: the argument also works if you first train a reward model and then the RL agent.