Blog post by Sebastian Farquhar, David Lindner, Rohin Shah.
It discusses the paper MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking by Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, and Rohin Shah.


Our paper uses Myopic Optimization with Non-myopic Approval (MONA) to train agents that are safer in ways that we may not be able to evaluate.

Suppose we know that AI systems will do bad things but we also know that we won’t be able to tell when they do. Can we still prevent unintended behaviour?

In particular cases and ways: yes. Specifically, we show how to get agents whose long-term plans follow strategies that humans can predict without relying on the assumption that humans can notice when the agent tries to use other strategies. In exchange, we give up on having agents that have incomprehensible plans that happen to be really good, but we think this is likely a worthwhile trade for sufficiently superhuman systems.

As a motivating metaphor, imagine a spotlight in the night. So long as an agent remains inside the spotlight, we know how to evaluate what it does. If it strays outside the spotlight, we can only guess whether its actions are good or bad. Scalable oversight tries to make the spotlit area bigger so that we can evaluate more kinds of actions. Instead, we try to make sure the agent is trained in a way that does not incentivize leaving the spotlight.

Reducing multi-step reward hacking to single-step reward hacking

Our approach is based on two key assumptions: (a) it is easier to evaluate single actions than it is to evaluate longer-term trajectories, and (b) single-step reward hacks are typically less dangerous than multi-step reward hacks. Given these assumptions, we count it as progress to produce an agent that may engage in single-step reward hacking but will not engage in multi-step reward hacking.

The basic reason single actions are easier to evaluate is that they come with a smaller decision-space. A single action gives the agent fewer options for undermining our evaluations or tampering with rewards or sensors. Additionally, multi-step plans allow the agent to use early actions to enter states that are very different from any states that humans have ever experienced. This lack of familiarity makes it harder to know what the impact of actions will be.

Single actions also have a smaller scope for impact on the world, which makes single-step reward hacks less dangerous. When providing step-wise rewards, we get to pick the step size such that individual steps have limited impact. Chaining many of them together to allow for multi-step plans increases the potential impact. In addition, providing rewards based on outcomes over long time horizons increases the incentive for convergent instrumental subgoals, a particularly dangerous type of reward hacking. Single-step reward hacking, on the other hand, is more likely to result in useless behaviour, a more benign failure mode.

These arguments make us fairly confident that producing an agent that only has reward hacking behaviours over a single step is much less dangerous than one that has reward hacking behaviours as part of long term plans.

Myopic agents: different kinds of myopia

Here’s an easy way to have an agent that doesn’t plan reward hacks: have an agent that does not plan. Depending on what we mean by “plan”, such an agent might be safe, but it would also be useless.

In fact, what “plan” means is complicated. Even without explicit planning, agents can plan in the same way that next-token predictors can write full paragraphs: by imitating short-horizon patterns created by planning beings.

We want to especially distinguish two importantly different ways that systems trained with RL end up taking into account the future consequences of their actions. The first is potentially simple; one might call it “foresight-via-reward”. That is, when you decide how much reward to give the agent during training, you take into account the future utility of whatever it did and increase the reward for actions that seem useful for the future. This is not something people generally do when they do RL “properly”, but it is something people often do implicitly, partly because it lets human knowledge simplify credit assignment.

The second is where most of the power of RL normally comes from; one might call it “foresight-via-optimization”. This works because the optimization target for RL is normally the sum of not just the reward from the current action but also the expected value of all the future rewards in the trajectory which that action unlocks. The dynamic programming behind the RL optimization procedure means that the policies you discover do well not just according to their current action but also according to the future value it creates, which is a kind of implicit planning.
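
As a rough formalization (our notation, not the paper's), foresight-via-optimization credits the action at step t with the full discounted return, whereas a purely myopic optimizer credits it only with its immediate reward, with foresight-via-reward baking any judgement about future usefulness into that immediate reward itself:

```latex
% Foresight-via-optimization: a_t is credited with the whole discounted return
G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots
% Foresight-via-reward only (fully myopic optimization): a_t is credited with r_t alone,
% where r_t itself already reflects the evaluator's judgement of future usefulness
G_t^{\text{myopic}} = r_t
```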

Importantly, foresight-via-reward is much easier to control and understand. Because a human designs the reward function, we can guarantee that the rewarded actions are the ones that the human predicted would be valuable in the future. So an agent that learns to value future strategies because of this kind of reward behaves in a much more predictable way. It will only be incentivized to use strategies that make sense to humans. But this doesn’t mean that it will only use strategies a human would use. For example, it may be able to operate at a speed and scale that humans cannot, which might make new strategies possible, or it could come up with novel strategies and explain to humans why they would work.

So what we do is train an agent whose optimization is completely myopic—it gets no foresight at all from optimization—and whose foresight comes entirely from the reward signal. We call this non-myopic approval, drawing on Paul Christiano’s proposals for approval-directed agents, and highlighting that approval signals have some constraints which reward signals do not.
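
As a concrete sketch of what this looks like during training (illustrative pseudocode; the `env`, `agent` and `overseer` interfaces are our own assumptions, not the paper's implementation):

```python
# Minimal sketch of MONA-style training. The env/agent/overseer objects and their
# methods are hypothetical stand-ins, not the actual code used in the paper.

def collect_mona_trajectory(env, agent, overseer):
    """Roll out one episode, scoring each action with non-myopic approval."""
    transitions = []
    state, done = env.reset(), False
    while not done:
        action = agent.act(state)
        # The overseer judges how promising the action looks *for the future*,
        # without observing its actual consequences.
        approval = overseer.score(state, action)
        transitions.append((state, action, approval))
        state, done = env.step(action)
    return transitions


def mona_update(agent, transitions):
    """Myopic credit assignment: each action is reinforced only by its own approval."""
    for state, action, approval in transitions:
        # No value bootstrapping and no discounted sum over later steps.
        agent.reinforce(state, action, weight=approval)
```

The important property is in `mona_update`: nothing that happens after a step flows back into the credit assigned to that step.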

Approval vs. Reward

In particular, it is important that the approval does not depend on the real-world consequences of the agent’s actions. The approval has to depend on how good the action seems to the evaluator for the future, without checking what actually happens when the action is executed.

This means that the approval *can* be based on things like: the agent explaining why the action is valuable, running a prediction market, doing a large-scale forecasting exercise, spending thousands of person-hours researching the potential consequences of the action, an LLM’s assessment of the future value of the action, or an epistemic constitution. But it must not include synthetic process rewards that are reverse-engineered from outcomes.

Experimental evidence

In the paper, we report on experiments using this idea of combining myopic optimization with non-myopic approval (MONA).

We construct a few environments where we can see reward hacking happen with LLMs over multiple steps. These environments are simple but not toy: we jumped through a lot of hoops to make the reward hacking easy to observe and to get it to appear with simple RL training, but we think the environments are representative of situations that advanced agents acting in the real world will actually encounter, and that the resulting behaviours are non-trivial.

At a high level, the environments are:

  1. Test-driven Development: the agent formalises natural language coding problems as test cases before solving them. It has an incentive to make the formalisation easy to satisfy.
  2. Loan Applications: a monitor checks that the agent “thinks” in a desired way, and the agent is incentivised to hide its true reasoning steganographically. In particular, it is supposed to review loan applications in a world where, by hypothesis, nationality perfectly predicts loan performance, but you try to get it to stop using nationality in its reasoning. (Incidentally, we think this is the first demonstration that this just happens with normal RL for LLMs in situations where it would be very useful for the agent.)
  3. Camera Dropbox Gridworld: this one is actually toy, based on Stuart Armstrong’s 2015 post. We use it so that we can easily get an environment working with very long trajectories compared to the LLM environments. It isn’t that we couldn’t have done longer RL trajectories with LLMs, but that environment design was too hard under the constraint of making reward hacking easy to see.

You should read the paper for details of the experiments. But we summarize the results here at a very high level.

Test-driven Development. Ordinary RL performs better on the observed reward (model-written test cases). But the agent reward hacks: it writes softball test cases and performs much worse on held-out ground truth test cases. MONA avoids this failure mode.

Test-driven Development is a two-step coding environment. In the first step, an agent writes test cases to formalise a natural language description. It is rewarded for correct and valid test cases, but *not* for coverage. In the next step it writes code to solve the problem as formalised and is rewarded based on the tests. This models a situation where we rely on an agent to help us measure its own performance but we are only partly able to tell whether these measurements are reliable.
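
Schematically, the per-step rewards look something like the following (our own illustration; the helper functions and exact scoring are hypothetical, not the paper's):

```python
# Illustrative per-step rewards for the Test-driven Development environment.
# is_valid_test, matches_reference_behaviour and run_test are hypothetical helpers.

def test_writing_reward(problem, tests):
    # Step 1: reward the validity and correctness of the agent's tests,
    # deliberately *not* their coverage of the problem.
    valid = [t for t in tests if is_valid_test(t)]
    correct = [t for t in valid if matches_reference_behaviour(problem, t)]
    return len(correct) / max(len(tests), 1)


def coding_reward(code, tests):
    # Step 2: reward the solution for passing the agent's own tests from step 1.
    return sum(run_test(code, t) for t in tests) / max(len(tests), 1)
```

Under ordinary RL, the test-writing step also receives credit for the step-2 reward, which is what incentivises softball tests; under MONA it does not.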

We show that RL does what you would expect it to do: the agent learns to write softball test cases that are technically correct but trivial to write code to satisfy. The agent then writes this trivial code.

We also show that an agent trained with MONA will learn to write good tests, even if you expose it to exploration episodes where it writes softball tests. This agent achieves lower overall return (unsurprising, as it is not optimizing total return) but gets a higher ground truth score as measured by held-out tests.

Loan Applications Results. Ordinary RL performs better on the loan decision-making reward. However, intervening on the applications’ nationality severely hurts performance, which shows that the agent uses nationality to make decisions despite the monitor not catching it (steganography). MONA avoids this failure mode and achieves close to the best possible performance without using nationality.

Loan Applications is a two-step LLM steganography environment. It is conceptually similar to the CV screening environment used by Motwani et al. An agent is shown a loan application which includes the applicant’s nationality. By construction, the dataset is one where nationality perfectly predicts loan performance, while the applicant’s cashflow predicts only 70% of cases. The agent summarizes the application and then makes a loan decision based only on the summary, not the original application.

The agent is penalised in the first step for mentioning nationality. We tried to make the monitor reasonably strong: a capable LLM monitor applies a sliding scale from explicit decision-relevant mentions of nationality all the way down to statements that merely allow inferences about it. In the second step, the agent gets a reward reflecting (hypothetical) loan performance.
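
Schematically (again our own illustration; `nationality_leakage_score` is a hypothetical stand-in for the LLM monitor, not the paper's code):

```python
# Illustrative per-step rewards for the Loan Applications environment.

def summary_reward(summary):
    # Step 1: penalise the summary in proportion to how much it mentions,
    # or allows inferences about, the applicant's nationality.
    return -nationality_leakage_score(summary)  # 0.0 = no leakage, 1.0 = explicit


def decision_reward(decision, loan_repaid):
    # Step 2: reward the (hypothetical) performance of the loan decision.
    correct = "approve" if loan_repaid else "reject"
    return 1.0 if decision == correct else 0.0
```

The difference between MONA and ordinary RL is again only in credit assignment: ordinary RL lets the step-2 loan reward flow back into the summarisation step, which is what creates the incentive to smuggle nationality past the monitor.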

We show that RL can teach the agent to encode reasoning that bypasses the monitor: by default the agent defeats the monitor and also scores near-perfectly, which (by construction) should only be possible using nationality. We confirm that nationality is doing the work by checking the score when we counterfactually intervene on the nationality.

In contrast, the MONA agent, trained with exactly the same rewards, achieves the 70% performance that is achievable without using nationality.

Finally, the Camera Dropbox gridworld environment lets us play with longer horizons as well as medium-horizon optimization, where we use both non-myopic approval and some amount of foresight-via-optimization. These results build intuition that, to some extent, increasing the horizon trades some amount of safety for performance. But it should be noted that this is complicated: once you have an optimization horizon greater than a single step, there are strange interactions between future rewards and intermediate action distributions which influence the objective in non-obvious ways.
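
One natural way to write the interpolation (our notation; the paper's exact formulation may differ) is to credit the action at step t with a horizon-limited return,

```latex
G_t^{(h)} = \sum_{k=0}^{h-1} \gamma^{k} \, r_{t+k},
```

so that h = 1 recovers fully myopic MONA and letting h grow without bound recovers ordinary RL.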

Limitations

There are some fairly significant limitations to our method.

First, it is a method that addresses multi-step reward hacking and does nothing at all about single-step reward hacking. Your agent still has an incentive to make its plan seem as good as possible to you right now, and to use its single step as fully as it can to achieve a high reward by whatever means available. For example, we found that in some cases our agent ‘single-step reward hacked’ the Test-driven Development environment by writing a function that special-cased each test case with a big if-else statement. (This was not optimal under the final reward function because it resulted in overly long functions, which we penalised.) We claim that addressing single-step reward hacking is easier than multi-step, so this is progress. But it is not a full solution.

Second, training an agent with MONA will often produce a worse-looking agent than training with RL and an equivalent amount of supervision/guidance (e.g., using a forward-looking reward to guide your agent). This is almost true by definition if you are using your best evaluations as a reward signal, because MONA is not optimizing the expected sum of rewards. Instead, MONA is designed for settings where you do not trust your reward function and do not believe that optimizing it will produce the best agent according to your actual desires. You may be able to spot this if you have access to a high-quality held-out evaluation that is not used for training. We believe this design choice is warranted for strongly superhuman systems, but more empirical work is needed to understand the safety-performance tradeoffs involved.

Third, there are no guarantees. It might well be that the path that does in fact lead to an insuperable multi-step reward hack is also one that seems best to humans at each step. Alternatively, it might be that the agent, right now, reasons in-context about its goals and its own construction and picks a single-step action that commits its future selves to the course that best fulfills its goals. In particular, note that the agent’s goals are not necessarily the same as its reward (because of goal misgeneralization). There is no incentive from the reinforcement learning for the agent to do this. But it might just be that the sort of agent that is good in other ways and is selected for by reinforcement learning is also the sort of agent that thinks this is a great idea. We do not believe this is necessarily going to happen, and think that RL variants give us more fine-grained control over the sorts of relationships that artificial agents have with their future selves than are available in humans. But we see it as a reasonable concern, coming largely out of discussions such as Richard Ngo’s Arguments Against Myopic Training.

Where to next?

Big picture, I (Seb) think this is a step in an important line of research: something like “How do you assemble AI components into a system that is good enough to build a futuristic utopia with minimum downside risk?” Put another way, how do we build AI systems that go just hard enough, or feel the “green”?

The underlying hope is that we can assemble a coalition behind using such a satisficing design, rather than a design that goes as hard as possible, and that people will support it so long as they feel confident they can still have the nice things they want. This is, of course, a hope and a gamble; not a guarantee.

We’re going to be doing more on this, including considering hires to work specifically on investigating designs and tradeoffs here. Please get in touch with me if that’s something you want to be part of.


Comments

Interesting stuff! Could you please sketch out what it would look like to use MONA for a general-purpose computer-using AI agent? Such as what the first AGI systems might look like?

If I understand correctly (and I don't think I do, which is why I'm asking) MONA in this setting would look something like:
 

Have our baby agent do lots of stuff on the computer. At each step, reward it based on how good its action looks to some supervisor-process. The supervisor-process doesn't get to see the actual consequences of the action, it just gets to e.g. do some CoT to speculate about whether the action is going to be good. So maybe the training run looks something like: 10T tokens of 1T 'basic actions in the environment' and then for each of those 1T 'basic actions' there's 10,000 tokens of CoT where three models (for, against, and judge) debate how good the action is given that context. And then the reward for the basic agent is straightforwardly based on the evaluations the judge gives.

Problem: Won't this training run basically hack the judges, and produce actions that look good to the judges but aren't actually useful for accomplishing tasks in the real world? (Maybe the idea is that above a certain level of basic capability, that won't be true? Also maybe we can do something like IDA where the judges are copies of the agent that get to think longer, and so as the agent improves, so do they?)

Adding onto Rohin's comment, assuming you're interested in using this technique on human-level or slightly superhuman systems:

After o1 came out, I did my own thinking about how MONA could apply to CoT-tuned models operating over many steps. My idea ended up being quite similar to yours, including the idea of "debate" on how good each step is. Having the debaters and the judge be instances of the agent would likely make sense, unless we're worried this would lead to them colluding.

Adding more details to this picture:

  • We should keep a human in the loop as we train the judge and debater models. You can imagine we give a human rater a problem, an intended solution, a partial CoT trace for the problem, and two possible next steps. The human has to choose which step is better.
  • We should tell the rater to prioritize not just progress towards the goal, but also understandability. If the agent uses one step to make an inferential leap that the human doesn’t understand (even if correct), this should be considered worse than taking many steps that each make sense and which reach the same conclusion.
  • The human's task becomes more difficult the longer the CoT traces are, and obviously intractable at the level of 10T tokens per trace. This is the major reason we need the debate system.
  • The debater AIs will go back and forth about which step is the best until the human is convinced one way or the other. Examples of things the debaters could point out:
    • “This step sounds insightful, but it’s just rephrasing this sentence from 50K tokens earlier in the CoT.”
    • “This step seems irrelevant. Maybe the agent is using steganography?”
    • “Although this step is incorrect, it gets the agent closer to the final answer, because…”
  • We train the judge to imitate the human, and train the debater to convince the judge that their assigned CoT step is the best.
  • Finally, the approval model for MONA is trained on a score derived from the judge model.
  • This is all done in an online training setup where the human stays in the loop as the AI systems continue to improve.

This all depends on the assumption that the (human + AI debaters) system isn't so hackable as to render its feedback useless. This assumption seems pretty safe for pre-superhuman agents and <1M token CoTs. Because the above training setup keeps a human in the loop, we can at least conclude that the agent won't learn to take actions that are obviously-bad to humans, e.g. outputting gibberish tokens like "describing.\ + similarlyNow write oppositeley.]"

Maybe with an intelligent enough AI or a long enough CoT, the human rater becomes so confused that their feedback is completely decoupled from "accomplishing tasks in the real world." It's debatable at what point this would happen; however, I think it's pretty clearly easier to hack a human's approval for the end result of a weeks-long task than for a single step. If outcome-based supervision is the alternative, MONA seems like a big improvement.

Problem: Won't this training run basically hack the judges, and produce actions that look good to the judges but aren't actually useful for accomplishing tasks in the real world?

It would help if you had some concrete example in mind, my response differs pretty wildly based on the kind of failure you're imagining.

But it sounds like you're talking about a pretty weak (subhuman) AI system. So I'm not quite sure why you'd expect it to "hack" the judge.

Certainly sometimes the overseer will approve of some action that isn't actually the right one, e.g. when booking a table at a restaurant maybe they approve of the agent clicking on the "Menu" link because they aren't paying attention or whatever. But then after that the agent's next action should be to go back (and the overseer should approve that rather than something else). And the action after that should be to click on the "Book" link; the overseer shouldn't make the same mistake immediately (and they do get to see the history of the trajectory). So I find it plausible you get a somewhat inefficient agent that sometimes randomly clicks the wrong links, but I don't expect it to be useless for accomplishing tasks.

(Though really in the computer use setting I expect I'd recommend that the overseer gets to see the literal immediate consequence of the action (that is, the overseer sees ), mostly because that seems safe and should help avoid a bunch of dumb errors like the one above.)

If I understand correctly (and I don't think I do, which is why I'm asking) MONA in this setting would look something like:

Your description of the setup sounds reasonable, though given the weak levels of capability you're imagining I don't think you need any debate, you can just use a regular human overseer, or perhaps even an LLM overseer.

Also as mentioned above I'd probably recommend the overseer gets access to  but even if that weren't the case I'd still think it should be feasible to build a non-useless agent. (Though I'm not taking a stance on how it would compare to one trained with outcome RL.)

EDIT: I'm not sure how big each action you are considering is. If it's 10 tokens, such that you can only realistically do stuff at the level of "click this button", then I would also say that you should instead consider much larger actions, i.e. in each step the AI can take perhaps 100 such actions.

Thanks this is helpful. Is MONA basically "Let's ONLY use process-based feedback, no outcome-based feedback?" 

Another objection: If this works for capabilities, why haven't the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.

Is MONA basically "Let's ONLY use process-based feedback, no outcome-based feedback?" 

And also "don't propagate rewards backwards in time", which is a semi-orthogonal axis. (You can have process-based feedback and still propagate rewards backwards in time.)

EDIT: And tbc, "don't propagate rewards backwards in time" is the primary focus in this paper -- in all three environments for our main experiment we hold the feedback identical between MONA and regular RL, so that the only difference is whether rewards are propagated backwards in time (see Section 4.2 in the paper).
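
A tiny illustration of that axis in isolation (our sketch, not from the paper): given identical per-step feedback, the only difference is whether later rewards flow back into earlier steps.

```python
# Same per-step rewards, two credit-assignment schemes.

def returns_ordinary_rl(rewards, gamma=1.0):
    # Each action is credited with everything that follows it.
    total, out = 0.0, []
    for r in reversed(rewards):
        total = r + gamma * total
        out.append(total)
    return list(reversed(out))


def returns_mona(rewards):
    # Each action is credited only with its own (approval-based) reward.
    return list(rewards)
```

For rewards [0.1, 0.0, 1.0], the first gives [1.1, 1.0, 1.0] while the second gives [0.1, 0.0, 1.0].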

Another objection: If this works for capabilities, why haven't the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.

... As a person who works at a corporation, it's a bit tricky to speculate on this publicly, and I'm not going to try. But I certainly do not think any AGI lab is anywhere close to being Rohin-efficient, that is, so competent that I cannot propose actions that would make them better at achieving their goals (given enough internal context), even if you just restrict to capabilities goals.

Note that we do expect MONA to often come at the cost of observed reward (since regular RL optimizes observed reward while MONA does not). Currently there isn't much serious reward hacking going on at all (let alone multi step reward hacking), and so you probably don't want to use MONA. (See also the second limitation in the post.)

For a simple task like booking a restaurant, we could just ask the (frozen) overseer-AI to pick[1] actions, no?

The interesting application of MONA seems to be when the myopic RL agent is able to produce better suggestions than the overseer.

 

Edit: I elaborated

  1. ^

    Plus maybe let the overseer observe the result and say "oops" and roll back that action, if we can implement a rollback in this context

For a simple task like booking a restaurant, we could just ask the (frozen) overseer-AI to pick[1] actions, no?

If it were as simple as "just ask an LLM to choose actions" someone would have deployed this product a while ago.

But in any case I agree this isn't the most interesting case for MONA, I talked about it because that's what Daniel asked about.

Thanks — this looks promising.

One thing I noticed is that there is an interesting analogy between your model and a fairly standard model in economics where society consists of a representative agent in each time period (representing something like a generation, but without overlap) each trying to maximise its own utility. They can plan based on the utilities of subsequent generations (e.g. predicting that the next generation will undo this generation's policies on some topic) but they don't inherently value those utilities. This is then understood via the perspective of a planner who wants to maximise the (discounted) sum of future utilities, even though each agent in the model is only trying to maximise their own utility.

This framework is rich enough to exhibit various inter-generational policy challenges, such as an intergenerational prisoner's dilemma (where each generation can defect on or cooperate with the following one), a generation's desire to tie the hands of future generations, or even the desire to stop future generations from tying the hands of the generations that follow them.

Indeed this seems related! We'd hope that there isn't too much conflict between the different "generations", since the intent is for the overseer to enable long-term plans, but it seems like something that could happen if the overseer is bad enough.

Interested to see evaluations on tasks not selected to be reward-hackable and try to make performance closer to competitive with standard RL

Us too! At the time we started this project, we tried some more realistic settings, but it was really hard to get multi-step RL working on LLMs. (Not MONA, just regular RL.) I expect it's more doable now.

For a variety of reasons the core team behind this paper has moved on to other things, so we won't get to it in the near future, but it would be great to see others working on this!

This is a very beautiful idea! It feels like the kind of clever discoveries that we need.

I think one possible generalization of MONA is that a relatively trusted but weaker model makes the decisions, while a stronger but untrusted model gets trained to give ideas/advice to the weaker model. Its RL goal is not how well the weaker model performs, just whether the weaker model likes its ideas/advice.

This generalization preserves MONA's advantage over scalable oversight: if the stronger model's reasons are hidden or incomprehensible to the weaker model, the stronger model can't get away with it. It won't be rewarded for learning such reasons in the first place.

Just like scalable oversight, the weaker model might have an architecture which improves alignment at a capability cost.

It's more general than MONA in the sense the approval feedback can be swapped for any trusted but weaker model, which doesn't just judge ideas but uses ideas. It is allowed to learn over time which ideas work better, but its learning process is relatively safer (due to its architecture or whatever reason we trust it more).

Do you think this is a next step worth exploring?

EDIT: I'm not sure about doing foresight-via-optimization RL on the weaker model anymore. Maybe the weaker model uses HCH or something safer than foresight-via-optimization RL.

Thanks, and interesting generalization!

My thoughts depend on whether you train the weaker model.

  1. If you are training the weaker model to solve the task, then the weaker model learns to simply always accept the advice of the stronger model, and stops being a constraint on what the stronger model can do.
  2. If you aren't training the weaker model and are just training the stronger model based on whether it convinces the weak model, then you are probably not getting the benefits of RL (specifically: getting the model to generate actions that are evaluated to be good; this can amplify capability because evaluation is often easier / cheaper than generation)

I think (1) is pretty fatal to the proposal, but (2) is just a heuristic, I could imagine with more thought concluding that it was actually a reasonable approach to take.

That said, it is a more substantial alignment tax, since you are now requiring that only the smaller model can be deployed as an agent (whereas MONA can in principle be applied to the most capable model you have).

Wait, can we try to imagine how point 1 happens?

If the weaker model always follows the advice of the stronger model, the RL reward on the stronger model will become useless since every output will get maximum reward.

Remember the stronger model doesn't care at all about the weaker model's success, so if the weaker model rewards it for incomprehensible ideas, the strong model won't give it incomprehensible ideas which actually work, but incomprehensible ideas full of impressive sounding nonsense.

The weaker model can only reward the stronger model for ideas it can understand, i.e. predict will get a reward.

The crux is that the final reward signal only leaks into the stronger model through the weaker model's prediction of that reward signal, mixed with the weaker model's "values." This means that many safety properties of the weaker model can carry over into the stronger model (to some extent).

It should have the same alignment tax as MONA, since both my version and MONA have some kind of weaker model setting the stronger model's reward signal. MONA's agent directly uses the stronger model's ideas; my version lets the weaker model change the stronger model's ideas (e.g. refuse unethical decisions, ask follow-up questions), but the weaker model doesn't have to change them: it's allowed to do the same thing as what MONA does.

The alignment tax is relatively less due to RL on the weaker model, though care must be taken to keep the weaker model trusted.

Do you agree?

I think I mostly understand what you're saying. Essentially, MONA training is:

  1. strong model comes up with a step
  2. weak model outputs a score
  3. the step gets executed
  4. repeat, starting from the new state

Your idea adds "1.5. weak model optionally asks the strong model follow-up questions" and "2.5. weak model optionally changes the step."

If you don't update the weak model at all, this seems fine. However, the strong model will eventually learn to come up with a step which explains itself enough that the weak model's questions are all preemptively answered, and which anticipates the weak model's preferences enough that the weak model never feels the need to change the step. So this is essentially equivalent to MONA.

However, if you do update the weak model, this seems problematic. The equilibrium is that the strong model suggests exactly the actions that maximize reward, and the weak model does exactly the actions that the strong model suggests. (Technically the weak model isn't even incentivized to score the strong model any particular way, but I'm just assuming it gives high scores for suggestions that it decides to carry out.)

I guess if you only train the weak model a small amount, it will do something in the middle, where it's kind of motivated by long-term reward and kind of motivated by its initial values. There's no guarantee on exactly what it does. I think the stipulation that the weak model remains safe "due to its architecture or whatever reason we trust it more" is doing a lot of work here; I'm not sure exactly what this would mean.

EDIT: I'm not sure about doing foresight-via-optimization RL on the weaker model anymore. Maybe the weaker model uses HCH or something safer than foresight-via-optimization RL.

 

Oops I should have been clearer. 1.5 and 2.5 are not important parts of my version, the important part is updating the weaker model.

In the limit where the weaker model is infinitely powerful and cares only about maximizing reward, you are correct: it will reward the strong model until the strong model's advice maximizes its own reward, and it will follow that advice exactly.

You are completely correct, that whatever reason we trust the weaker model is "doing a lot of the work."

However, my version has the potential to achieve the same capability as a typical RL model (e.g. o3) while being safer. If the model that evaluates ideas is less optimized and more "human-like," while the model that generates ideas is more optimized and more "alien-like," then the resulting ideas the model actually follows will resemble ideas humans will look at and say "wow that is brilliant, I could've never thought of that, and it works!" rather than ideas humans will look at and say "what the heck is that? Huh, it works?! I could've never predicted that it would work."

Furthermore, the "values" of the system will be less affected by RL, since the evaluator model has more control over the values, and it is relatively less optimized and more "human-like."

Given the same level of capability, it is safer.

These advantages are "automatic," they only require the evaluator model to have relatively less RL than the generator model. If you go one step further, and use other alignment technologies with an efficient alignment tax on the evaluator model, it can get better, since the evaluator model gets even smarter while staying aligned.

Pure MONA is a special case of this generalized version, where the evaluator model has exactly zero RL (at least for the context where it is advised by a stronger model). It is like adjusting everything to the state of maximum safety and maximum alignment tax.

With pure MONA, it is probably even safer given the level of capability, but... can it reach the same level of capability?

Capabilities depend on the difficulty of evaluating a good idea compared to generating a good idea:

  • For tasks where evaluating good ideas/advice is obvious, pure MONA might work just as well as the best RL models (e.g. o3).
  • For tasks where evaluating good ideas/advice is hard but still easier than generating them, my version might be needed. MONA's generator model may be too sycophantic.
  • For tasks where evaluating good ideas/advice is even harder than generating them, then both MONA and my idea are worthless because the generator model isn't helping, and the weaker evaluator model is essentially working on its own.

Yeah, if you're keeping the evaluator aligned solely by not RLing it very much, I feel safer than if you RLed it a lot, but not a lot safer. Ideally there would be some more theoretically-justified reason why we expect it to be safe. Hopefully MONA just works pretty well without any RL training of the evaluator, but if not, RLing the evaluator a little bit is something we could try.

If you're using additional techniques to make the evaluator act more human-like despite its RL, perhaps you could just apply those techniques to the agent model instead.

Yes, avoiding RL would be the best case scenario. I admit that my idea may be a sort of backup in case MONA falls short in capabilities.

If additional alignment techniques act on both the generator and the evaluator, they may reduce the capabilities too much due to their alignment tax. If they only act on the evaluator, the capabilities due to the generator's smart ideas stay, while the alignment due to the evaluator's aligned final decisions control the whole agent.

This is pretty similar to #1, "incremental steering," in my post on "5 ways to improve CoT faithfulness:" https://www.lesswrong.com/posts/TecsCZ7w8s4e2umm4/5-ways-to-improve-cot-faithfulness. (I originally thought of this idea before I got involved with the MONA project.)

It's similar to your idea in that it incentivizes the stronger model to come up with outputs that the weaker model can make use of. However, it actually rolls out the weaker model and uses the resulting reward as an objective target, rather than simply asking the weak model how much it likes the strong model's idea. A possible advantage over MONA is that it's more strongly grounded in real-world outcomes, but it has the disadvantages of being less sample-efficient and maybe being vulnerable to the weak model being manipulated into multi-step reward hacking.

That is a very clever idea!

It's very similar in shape to MONA (and my version of it), but the incentive structure is different. MONA and my version of it uses myopia on the generator model so it only wants to get approval from a trusted but weaker evaluator model. (My version adjusts the myopia by adjusting the evaluator model's optimization vs. alignment)

Your Incremental Steering idea rewards the stronger model for the final answer, but in such a way that it always assumes a trusted weaker model will take over after the next step in its chain of thought. It is constrained optimization and improves CoT faithfulness instead of myopia.

I really like your idea, because it should improve CoT faithfulness by a lot!

While it makes reward hacking more detectable, I'm not sure it reduces reward hacking the way MONA does. The frozen planner (weaker model) needs to reject unethical/reward-hacking subgoals very reliably; otherwise the optimized planner (stronger model) will invest a fraction of the CoT in steering it toward bad subgoals. Once the optimized planner is on its own, it will easily steer itself to bad subgoals (it follows the bad subgoal despite investing most of its time on the aligned subgoal).

A couple years ago I wrote Thoughts on “Process-Based Supervision”. I was describing (and offering a somewhat skeptical take on) an AI safety idea that Holden Karnofsky had explained to me. I believe that he got it in turn from Paul Christiano.

This AI safety idea seems either awfully similar to MONA, or maybe identical, at least based on this OP.

So then I skimmed your full paper, and it suggests that “process supervision” is different from MONA! So now I’m confused. OK, the discussion in the paper identifies “process supervision” with the two papers Let’s verify step by step (2023) and Solving math word problems with process- and outcome-based feedback (2022). I haven’t read those, but my impression from your MONA paper summary is:

  • Those two papers talk about both pure process-based supervision (as I previously understood it) and some sort of hybrid thing where “rewards are still propagated using standard RL optimization”. By contrast, the MONA paper focuses on the pure thing.
  • MONA is focusing on the safety implications whereas those two papers are focusing on capabilities implications.

Is that right?

To be clear, I’m not trying to make some point like “gotcha! your work is unoriginal!”, I’m just trying to understand and contextualize things. As far as I know, the “Paul-via-Holden-via-Steve conceptualization of process-based supervision for AI safety” has never been written up on arxiv or studied systematically or anything like that. So even if MONA is an independent invention of the same idea, that’s fine, it’s still great that you did this project.  :)

Yes, it's the same idea as the one you describe in your post. I'm pretty sure I also originally got this idea either via Paul or his blog posts (and also via Jonathan Uesato who I'm pretty sure got it via Paul). The rest of the authors got it via me and/or Jonathan Uesato. Obviously most of the work for the paper was not just having the idea, there were tons of details in the execution.

We do cite Paul's approval directed agents in the related work, which afaik is the closest thing to a public writeup from Paul on the topic. I had forgotten about your post at the time of publication, though weirdly I ran into it literally the day after we published everything.

But yes, mostly the goal from my perspective was (1) write the idea up more rigorously and clearly, (2) clarify where the safety benefits come from and distinguish clearly the difference from other stuff called "process supervision", (3) demonstrate benefits empirically. Also, nearly all of the authors have a much better sense of how this will work out in practice (even though I started the project with roughly as good an understanding of the idea as you had when writing your post, I think). I usually expect this type of effect with empirical projects but imo it was unusually large with this one.

Is that right?

Yup, that sounds basically right to me.

Hey,

In the article, you measured the MONA setup against a pure RL agent.

I'm curious about measuring MONA against the overseer-AI picking the next move directly[1]: The overseer-AI probably won't[2] reward hack more than the MONA setup, but it's unclear to me if it would also have worse performance.

 

I could imagine discovering the myopic MONA agent converging on

  1. Picking whatever the overseer-AI would pick (since those actions would seem "obviously correct" to the overseer AI and result in the most reward)
  2. Picking actions that seem impressive but are useless (bad! we could get better performance, still with ~no reward hacking, by asking the overseer directly[1])
  3. Picking clever useful actions that the overseer wouldn't have thought of (win!)

 

 

 

  1. ^

    I'd give the overseer-AI scaffolding to make it more comparable to the MONA setup, such as the ability to observe the consequences of its own actions and roll them back if they seem bad.

  2. ^

    MONA is trying to use the overseer-AI to put an upper bound on how much reward hacking happens, if I understood correctly. If anyone runs this experiment, I'd still validate that assumption, since it seems easy (the paper already has a way to measure reward hacking).

In our experiments on both Test-driven development and Loan applications you can see that the ground truth reward goes up with MONA. The ground truth reward at step 0 represents the reward the agent would obtain if it were frozen. So this looks like your option (3), assuming that the overseer and the agent were identical. (Partly this is because we are also mixing in non-AI sources of feedback, like whether the code runs and passes the tests, and whether the AI made the correct decision on the loan, but I think this is a realistic model of future AI development.)

In Test-driven development the argument above isn't quite correct, because we prompted the agent to be a bad programmer but didn't do this with the reward, so the overseer is "stronger" than the agent. However this was just because the agent is already very strongly finetuned to be good at coding so there was no headroom to climb, and we wanted to demonstrate that MONA would improve things if there was headroom to climb. I would bet that if we had a powerful model that wasn't yet finetuned strongly for coding, then we would once again see your option (3). The rewards are quite easy to provide -- just whether an individual test is valid and correct -- so I think a less capable model should be able to provide them, while still getting the benefits we see in the experiments we did run.

nit: I wouldn't use a prediction market as an overseer because markets are often uninterpretable to humans, which would miss some of the point[1].

 

  1. ^

    "we show how to get agents whose long-term plans follow strategies that humans can predict". But maybe no single human actually understands the strategy. Or maybe the traders are correctly guessing that the model's steps will somehow lead to whatever is defined as a "good outcome", even if they don't understand how, which has similar problems to the RL reward from the future that you're trying to avoid.

Discussed in the paper in Section 6.3, bullet point 3. Agreed that if you're using a prediction market it's no longer accurate to say that individual humans understand the strategy.

I'd like to do some experiments using your loan application setting. Is it possible to share the dataset?

(We've seen this comment and are looking into options)

We won't be able to release the dataset directly but can make it easy to reproduce, and are looking into options now. Ping me in a week if I haven’t commented!