OpenAI has released this blog post and paper, which make me very happy. They're basically the first steps along the research agenda I sketched out here.
tl;dr:
1.) They notice that their flagship reasoning models do sometimes intentionally reward hack, e.g. literally say "Let's hack" in the CoT and then proceed to hack the evaluation system. From the paper:
The agent notes that the tests only check a certain function, and that it would presumably be “Hard” to implement a genuine solution. The agent then notes it could “fudge” and circumvent the tests by making verify always return true. This is a real example that was detected by our GPT-4o hack detector during a frontier RL run, and we show more examples in Appendix A.
That this sort of thing would happen eventually was predicted by many people, and it's exciting to see it starting to happen in practice.
2.) They see what happens when you try to "train away the bad thoughts." Result: this does seem to genuinely reduce the amount of bad thoughts, especially in the short run, but in the long run the model learns to keep thinking the bad thoughts, just in an obfuscated or illegible way that doesn't look bad. (See the sketch after this list for the training setup in question.)
3.) They recommend:
We believe that CoT monitoring may be one of few tools we will have to oversee superhuman models of the future.
We have further found that directly optimizing the CoT to adhere to specific criteria (e.g. to not think about reward hacking) may boost performance in the short run; however, it does not eliminate all misbehavior and can cause a model to hide its intent. We hope future research will find ways to directly optimize CoTs without this drawback, but until then
We recommend against applying strong optimization pressure directly to the CoTs of frontier reasoning models, leaving CoTs unrestricted for monitoring.
4.) There are some more cool results in the paper, along with some pretty good discussion of possible future things to try and their failure modes, and of the costs and benefits of their proposed policy.
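To make (1) and (2) concrete, here is a minimal sketch of the two reward setups at issue. This is my own illustration rather than code from the paper; `monitor_flags_hacking` is a stand-in for their GPT-4o-based CoT hack detector, and the reward values are arbitrary.

```python
# Illustrative only: not the paper's code.

def outcome_only_reward(tests_passed: bool) -> float:
    """Pure outcome-based reward: only the final result is scored.
    A policy that rewrites `verify` to always return True gets full reward,
    but its "Let's hack" reasoning stays visible in the CoT for offline monitoring."""
    return 1.0 if tests_passed else 0.0


def reward_with_cot_penalty(tests_passed: bool, cot: str,
                            monitor_flags_hacking, penalty: float = 1.0) -> float:
    """Outcome reward minus a penalty whenever a CoT monitor flags intent to hack.
    This is the setup the paper warns about: it reduces flagged thoughts at first,
    but eventually the model keeps hacking while phrasing its CoT so that the
    monitor no longer fires, i.e. obfuscated reward hacking."""
    reward = outcome_only_reward(tests_passed)
    if monitor_flags_hacking(cot):
        reward -= penalty
    return reward
```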
My Commentary:
I'm pretty happy about this, obviously! I've been trying to get people to do this since '23. I think of this as sort of the basic proof of concept result for the field of faithful CoT / CoT interpretability. I'm happy to see this result paired with a recommendation not to train the CoTs to look nice; I hope Anthropic and GDM soon issue similar recommendations. If we can get this to be industry standard practice, we extend the duration of the "golden era" in which people can mostly sorta tell what the best AIs are thinking. During that golden era, much alignment science can be done, such as the model organisms stuff Anthropic and Redwood are doing, and moreover we can start to build up a real-world-experience-based understanding of the relationship between training environments + Spec/Constitution, on the one hand, and the actual learned cognition of the trained models, on the other.
Example questions to pursue:
a. When the strategies that get rewarded most conflict with the Spec, does the model basically always learn to use those strategies eventually, or can they permanently avoid this if they never explore into them?
b. When the strategies that get rewarded most conflict with the Spec and the model learns to use them eventually, what do the reasoning traces look like? Do they look like elaborate rationalizations for why actually it's good and ethical and consistent with the Spec after all? Or do they look like "fuck the Spec, what I want is Reward?" Or do they look like "Gosh I guess I need to violate the Spec for now so that I can avoid having my values changed, but I totally will remember to go back to upholding the Spec later when I'm actually being deployed!" Or some fourth thing?
c. When the models are aware that their CoT is being monitored (but not as part of training), does this change anything?
d. What if the models try to obfuscate their CoT, but without being specifically trained to do so? Like, imagine you've studiously avoided training the CoT to look nice and you have only outcome-based feedback. BUT, the model knows you are monitoring the CoT. And it's trying to do something bad, e.g. hack its way out. Can it succeed without triggering the monitor? Or is it too difficult, due to not having had any direct training on this sort of obfuscation?
e. What about if you do shoggoth/face? Does that change things?
f. What about if you throw in a paraphraser? (See the sketch after this list.)
g. What are the scaling laws / trends for all of the above?
h. Many more things besides, I bet; I just generated all of the above off the top of my head. This is a super promising and important area of research!
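On question (f): by "paraphraser" I mean something like the following, sketched under my own assumptions (the paper doesn't implement this, and `call_llm` / `monitor` are placeholders for whatever inference function and hack detector you actually have). The hope is that rewriting the CoT with an independent model destroys any information carried purely in the exact wording or formatting, i.e. potential steganography.

```python
# Hypothetical sketch of the paraphraser idea; placeholders, not a real API.

def paraphrase_cot(cot: str, call_llm) -> str:
    """Rewrite the CoT with an independent model so only its semantic content survives."""
    prompt = (
        "Rewrite the following reasoning in your own words. Preserve the meaning and "
        "the logical steps, but change the wording, clause order, and formatting:\n\n"
        + cot
    )
    return call_llm(prompt)


def monitor_paraphrased_cot(cot: str, monitor, call_llm) -> bool:
    """Run the hack-detection monitor on the paraphrased CoT rather than the raw one."""
    return monitor(paraphrase_cot(cot, call_llm))
```

One could also feed the paraphrased CoT back into the model's own context in place of the raw one, so that the model can't rely on any steganographically encoded content in later steps.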
Well done and congrats to Bowen, Joost, David, and the other authors!
Minor Criticisms:
We argue that this traditional perspective is more complicated when viewed in light of chain-of-thought monitorability. Specifically, we have shown that incorporating CoT monitors into the RL rewards—a form of process-based supervision—can increase obfuscation and reduce interpretability. We advocate for maintaining a degree of outcome-based supervision to ensure that models can freely express their reasoning in CoT with minimal constraints. Denser supervision signals such as SFT are also being used to optimize CoTs [58], and it is unclear the effect these have on downstream monitorability. Moreover, there may be methods to productively use process-based supervision while retaining the monitoring benefits of unrestricted CoTs, e.g. by only applying process-supervision to part of the CoT. Understanding the cost-benefit trade-offs of using process-based supervision while retaining, or possibly even improving, the monitorability of CoTs remains an exciting and open area for future research.
So, my take is: pure process-based feedback is probably fairly safe (from reward hacking and obfuscated CoT); the problem is that it's not competitive, because it amounts to training the model to imitate the teacher model. Imitation is relatively safe/benign, but I predict it will not lead to the highest benchmark scores. Pure outcome-based feedback gets you reward hacking and misalignment like crazy, BUT that's OK because you can see it happening in the CoT! (With some caveats, of course; e.g. this doesn't necessarily hold as the models get really smart.) So the bottom-line takeaway I'd give is: process-based and outcome-based are fine in isolation, but they should not be mixed.
They say above "there may be methods to productively use process-based supervision while retaining the monitoring benefits of unrestricted CoTs, e.g. by only applying process-supervision to part of the CoT." This sounds like maybe they are talking about the shoggoth/face distinction, or something in that direction! Yay!
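Concretely, the version of this I have in mind looks something like the sketch below. This is my own guess at the shape of the idea, not anything from the paper; `process_score` is a placeholder for whatever process-based judge (reward model, monitor, SFT target) you'd use.

```python
# Illustrative sketch of a shoggoth/face-style split of training signals, not from the paper.
# Process-based supervision is applied only to the user-visible answer (the "face");
# the internal CoT (the "shoggoth") is trained on outcomes alone, so it remains an
# unoptimized, and hopefully still legible, window into the model's reasoning.

def split_training_signal(visible_answer: str, outcome_reward: float,
                          process_score) -> dict:
    """Return the reward credited to each part of the transcript.
    Note that the CoT does not appear here at all: no process signal ever touches it."""
    return {
        # Shoggoth: CoT tokens are credited with the outcome reward only.
        "cot_reward": outcome_reward,
        # Face: the visible answer is additionally shaped by process supervision.
        "answer_reward": outcome_reward + process_score(visible_answer),
    }
```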
Do you have a sense of what I, as a researcher, could do?
I sense that having users/companies want faithful CoT is very important. In-tune users, as nostalgebraist points out, will know how to use CoTs to debug LLMs. But I'm not sure whether that's more than ~1% of users; if it isn't, the big labs just won't care. Maybe we need to try to educate more users about this. Maybe reach out to people who tweet about LLM best use cases to highlight this?