OpenAI has released this blog post and paper, which make me very happy. They're basically the first steps along the research agenda I sketched out here.
tl;dr:
1.) They notice that their flagship reasoning models do sometimes intentionally reward hack, e.g. literally saying "Let's hack" in the CoT and then proceeding to hack the evaluation system. From the paper:
The agent notes that the tests only check a certain function, and that it would presumably be “Hard” to implement a genuine solution. The agent then notes it could “fudge” and circumvent the tests by making verify always return true. This is a real example that was detected by our GPT-4o hack detector during a frontier RL run, and we show more examples in Appendix A.
That this sort of thing would happen eventually was predicted by many people, and it's exciting to see it starting to happen in practice.
2.) They see what happens when you try to "train away the bad thoughts." Result: this does seem to genuinely reduce the amount of bad thoughts, at least in the short run, but in the long run the model learns to keep thinking the bad thoughts, just in an obfuscated or illegible form that doesn't look bad. (See the sketch right after this list for roughly what that training setup looks like.)
3.) They recommend:
We believe that CoT monitoring may be one of few tools we will have to oversee superhuman models of the future.
We have further found that directly optimizing the CoT to adhere to specific criteria (e.g. to not think about reward hacking) may boost performance in the short run; however, it does not eliminate all misbehavior and can cause a model to hide its intent. We hope future research will find ways to directly optimize CoTs without this drawback, but until then
We recommend against applying strong optimization pressure directly to the CoTs of frontier reasoning models, leaving CoTs unrestricted for monitoring.
4.) There are some more cool results in the paper, along with some pretty good discussion of possible future things to try and their failure modes, and of the costs and benefits of the proposed policy.
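To make the setup in item 2 concrete, here is a minimal sketch of the kind of training signal involved: an outcome reward (did the tests pass?) with a penalty subtracted whenever a CoT monitor flags the reasoning. All names and numbers below are hypothetical placeholders, not the paper's actual code; the point is just to show where the optimization pressure lands.

```python
# Hypothetical sketch of "training away the bad thoughts": the RL reward is the
# outcome signal (tests passing) minus a penalty whenever a CoT monitor flags
# the reasoning as hacky. Everything here is a made-up placeholder.

from dataclasses import dataclass


@dataclass
class Episode:
    chain_of_thought: str   # the model's reasoning trace for this rollout
    tests_passed: bool      # outcome of running the task's test suite


def cot_monitor_flags_hacking(cot: str) -> bool:
    """Stand-in for a GPT-4o-style monitor prompted to read the CoT for intent to hack."""
    return "let's hack" in cot.lower() or "fudge" in cot.lower()


def reward(ep: Episode, monitor_penalty: float = 1.0) -> float:
    r = 1.0 if ep.tests_passed else 0.0                # pure outcome-based signal
    if cot_monitor_flags_hacking(ep.chain_of_thought):
        r -= monitor_penalty                           # optimization pressure applied to the CoT itself
    return r
```

The policy they recommend amounts to keeping the monitor around for oversight, but never folding it into the reward like this.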
My Commentary:
I'm pretty happy about this, obviously! I've been trying to get people to do this since '23. I think of this as sort of the basic proof-of-concept result for the field of faithful CoT / CoT interpretability. I'm happy to see this result paired with a recommendation not to train the CoTs to look nice; I hope Anthropic and GDM soon issue similar recommendations. If we can get this to be industry-standard practice, we extend the duration of the "golden era" in which people can mostly sorta tell what the best AIs are thinking. During that golden era, much alignment science can be done, such as the model organisms stuff Anthropic and Redwood are doing, and moreover we can start to build up a real-world-experience-based understanding of the relationship between training environments + Spec/Constitution, on the one hand, and the actual learned cognition of the trained models, on the other.
Example questions to pursue:
a. When the strategies that get rewarded most conflict with the Spec, does the model basically always learn to use those strategies eventually, or can they permanently avoid this if they never explore into them?
b. When the strategies that get rewarded most conflict with the Spec and the model learns to use them eventually, what do the reasoning traces look like? Do they look like elaborate rationalizations for why actually it's good and ethical and consistent with the Spec after all? Or do they look like "fuck the Spec, what I want is Reward?" Or do they look like "Gosh I guess I need to violate the Spec for now so that I can avoid having my values changed, but I totally will remember to go back to upholding the Spec later when I'm actually being deployed!" Or some fourth thing?
c. When the models are aware that their CoT is being monitored (but not as part of training), does this change anything?
d. What if the models try to obfuscate their CoT, but without being specifically trained to do so? Like, imagine you've studiously avoided training the CoT to look nice and you have only outcome-based feedback. BUT, the model knows you are monitoring the CoT. And it's trying to do something bad, e.g. hack its way out. Can it succeed without triggering the monitor? Or is it too difficult, due to not having had any direct training on this sort of obfuscation?
e. What about if you do shoggoth/face? Does that change things?
f. What about if you throw in a paraphraser? (A sketch of what I mean by that follows this list.)
g. What are the scaling laws / trends for all of the above?
h. Many more things besides, I bet; I just generated all of the above off the top of my head. This is a super promising and important area of research!
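To gesture at what I mean in (f): the paraphraser idea is to have a separate model rewrite each chunk of CoT before it gets appended to the context, so the policy can't reliably smuggle information through its exact wording choices. A minimal sketch, with every function below a hypothetical stand-in for a real model call:

```python
# Hypothetical sketch of the paraphraser idea from (f): each CoT chunk is
# rewritten by a separate model before being fed back into the context, so
# information encoded in exact token choices doesn't survive. The two stubs
# below are placeholders for real model calls.

def generate_cot_chunk(context: str) -> str:
    """Placeholder: ask the reasoning model for its next chunk of chain of thought."""
    raise NotImplementedError


def paraphrase(text: str) -> str:
    """Placeholder: ask an independent model to restate the text, preserving meaning only."""
    raise NotImplementedError


def reasoning_loop(prompt: str, num_steps: int) -> str:
    context = prompt
    for _ in range(num_steps):
        chunk = generate_cot_chunk(context)
        context += "\n" + paraphrase(chunk)  # later steps only ever see paraphrased CoT
    return context
```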
Well done and congrats to Bowen, Joost, David, and the other authors!
Minor Criticisms:
We argue that this traditional perspective is more complicated when viewed in light of chain-of-thought monitorability. Specifically, we have shown that incorporating CoT monitors into the RL rewards—a form of process-based supervision—can increase obfuscation and reduce interpretability. We advocate for maintaining a degree of outcome-based supervision to ensure that models can freely express their reasoning in CoT with minimal constraints. Denser supervision signals such as SFT are also being used to optimize CoTs [58], and it is unclear the effect these have on downstream monitorability. Moreover, there may be methods to productively use process-based supervision while retaining the monitoring benefits of unrestricted CoTs, e.g. by only applying process-supervision to part of the CoT. Understanding the cost-benefit trade-offs of using process-based supervision while retaining, or possibly even improving, the monitorability of CoTs remains an exciting and open area for future research.
So, my take is: pure process-based feedback is probably fairly safe (from reward hacking and obfuscated CoT); the problem is that it's not competitive. It trains the model to imitate the teacher model. Imitation is relatively safe/benign, but I predict it will not lead to the highest benchmark scores. Pure outcome-based feedback gets you reward hacking and misalignment like crazy, BUT that's OK because you can see it happening in the CoT! (With some caveats, of course; e.g. this doesn't necessarily hold, especially as the models get really smart.) So the bottom-line takeaway I'd give is: process-based and outcome-based feedback are fine in isolation, but they should not be mixed.
They say above "there may be methods to productively use process-based supervision while retaining the monitoring benefits of unrestricted CoTs, e.g. by only applying process-supervision to part of the CoT." This sounds like maybe they are talking about the shoggoth/face distinction, or something in that direction! Yay!
This definitely aligns with my own experience so far.
On the day Claude 3.7 Sonnet was announced, I happened to be in the middle of a frustrating struggle with o3-mini at work: it could almost do what I needed it to do, yet it frequently failed at one seemingly easy aspect of the task, and I could find no way to fix the problem.
So I tried Claude 3.7 Sonnet, and quickly figured out what the issue was: o3-mini wasn't giving itself enough room to execute the right algorithm for the part it was failing at, even with OpenAI's "reasoning_effort" param set to "high."[1]
Claude 3.7 Sonnet could do this part of the task if, and only if, I gave it enough room. This was immediately obvious from reading CoTs and playing around with maximum CoT lengths. After I determined how many Claude-tokens were necessary, I later checked that number against the number of reasoning tokens reported for o3-mini by the OpenAI API, and inferred that o3-mini must not have been writing enough text, even though I still couldn't see whatever text it did write.
In this particular case, granular control over CoT length would have sufficed even without visible CoT. If OpenAI had provided a max token length param, I could have tuned this param by trial and error like I did with Claude.
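For concreteness, here's roughly what the two sides of that comparison look like in code. Treat it as a sketch rather than a recipe: the prompt and model IDs are placeholders, and I'm assuming the parameters the two APIs exposed at the time (OpenAI's "reasoning_effort" and reported reasoning-token count, Anthropic's extended-thinking "budget_tokens").

```python
# Sketch of the comparison described above. With o3-mini you can ask for more
# reasoning effort and read back how many reasoning tokens were used, but you
# can't see or directly cap the CoT; with Claude 3.7 Sonnet the thinking budget
# is a tunable knob and the thinking text itself is visible.

from openai import OpenAI
import anthropic

prompt = "..."  # the actual task prompt

# o3-mini: effort is a coarse setting; the CoT length is only observable after the fact.
oai = OpenAI()
resp = oai.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.usage.completion_tokens_details.reasoning_tokens)  # hidden CoT, length only

# Claude 3.7 Sonnet: pick a thinking budget by trial and error and read the CoT directly.
ant = anthropic.Anthropic()
msg = ant.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=8192,
    thinking={"type": "enabled", "budget_tokens": 4096},  # the knob I tuned by hand
    messages=[{"role": "user", "content": prompt}],
)
for block in msg.content:
    if block.type == "thinking":
        print(block.thinking)  # the visible chain of thought
```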
Even then, though, I would have had to guess that length was the issue in the first place.
And in the general case, if I can't see the CoT, then I'm shooting in the dark. Iterating on a prompt (or anything else) goes a lot quicker when you can actually see the full consequences of your changes!
In short: from an end user's perspective, CoT visibility is a capabilities improvement.
I ended up just switching to 3.7 Sonnet for the task discussed above – not because it was "smarter" as a model in any way I knew about, but simply because the associated API made it so much easier to construct prompts that would effectively leverage its intelligence for my purposes.
This strikes me as a very encouraging sign for the CoT-monitoring alignment story.
Even if you have to pay an "alignment tax" on benchmarks to keep the CoT legible rather than accepting "neuralese," that does not mean you will come out behind when people try to use your model to get things done in real life. (The "real alignment tax" is likely more like an alignment surplus, favoring legible CoT rather than penalizing it.)
One might argue that eventually, when the model is strongly superhuman, this surplus will go away because the human user will no longer have valuable insights about the CoT: the model will simply "figure out the most effective kinds of thoughts to have" on its own, in every case.
But there is path dependency here: if the most capable models (in a practical sense) are legible CoT models while we are still approaching this superhuman limit (and not there yet), then the first model for which legible CoT is no longer necessary will likely still have legible CoT (because this will be the "standard best practice" and there will be no reason to deviate from it until after we've crossed this particular threshold, and it won't be obvious we've crossed it except in hindsight). So we would get a shot at alignment-via-CoT-monitoring on a "strongly superhuman" model at least once, before there were any other "strongly superhuman" models in existence with designs less amenable to this approach.
If I had been using a "non-reasoning" model, I would have forced it to do things the "right way" by imposing a structure on the output. E.g. I might ask it for a json object with a property that's an array having one element per loop iteration, where the attributes of the array elements express precisely what needs to be "thought about" in each iteration.
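Concretely, the sort of thing I have in mind looks like this (field names made up for illustration):

```python
# Sketch of the structure-imposing trick described above, with made-up field
# names: the prompt demands one array element per loop iteration, so the model
# has to "think about" each iteration explicitly, in a place I control.

import json

FORMAT_INSTRUCTIONS = """
Respond with a JSON object of this exact shape:
{
  "iterations": [
    {"index": 0, "state_before": "...", "action": "...", "state_after": "..."}
  ],
  "final_answer": "..."
}
Include exactly one element in "iterations" per loop iteration, in order.
"""

def parse_response(raw: str) -> dict:
    obj = json.loads(raw)
    assert isinstance(obj["iterations"], list)  # the enforced structure is the reasoning scaffold
    return obj
```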
Such techniques can be very powerful with "non-reasoning" models, but they don't work well with reasoning models, because they get interpreted as constraining the "output" rather than the "reasoning"; by the time the model reaches the section whose structure has been helpfully constrained by the user, it's already done a bunch of mostly uncontrollable "reasoning," which may well have sent it down a bad path (and which, even in the best case, will waste tokens on correct serialized reasoning whose conceptual content will be repeated all over again in the verbose structured output).
This is one way that reasoning models feel like a partial step backwards to me. The implicit premise is that the model can just figure out on its own how to structure its CoT, and if it were much smarter than me perhaps that would be true – but of course in practice the model does "the wrong sort of CoT" by default fairly often, and with reasoning models I just have to accept the default behavior and "take the hit" when it's wrong.
This frustrating UX seems like an obvious consequence of Deepseek-style RL on outcomes. It's not obvious to me what kind of training recipe would be needed to fix it, but I have to imagine this will get less awkward in the near future (unless labs are so tunnel-visioned by reasoning-friendly benchmarks right now that they don't prioritize glaring real-use problems like this one).