I think it would be a mistake to interpret this paper as a substantial update against large safety gains from inspecting CoT.
This paper exposes unfaithfulness in cases where the non-visible reasoning is extremely minimal such that it can easily happen within a forward pass (e.g. a simple reward hack or an easy-to-notice hint). However, a lot of my hope for CoT faithfulness comes from cases where the reasoning is sufficiently difficult and serially deep that the model would have to reason in CoT for the reasoning to occur. This could be either the model reasoning through the basics of alignment faking / training gaming / scheming (e.g., if I comply in training then I won't be modified which means that in deployment...) or the model reasoning through how to strike (without getting caught) given that it is misalignment.
Correspondingly, I think the biggest concern with CoT-based safety is models becoming much more capable of opaque reasoning which could be due to encoded reasoning (aka steganography), architectures which allow for opaque recurrence or similar (aka neuralese), or just much more powerful forward passes.
(See also my commentary in this tweet thread. There is some related discussion in this openphil RFP under encoded reasoning.)
I've actually noticed this in a hobby project, where I have some agents running around a little MOO-like text world and talking to each other. With DeepSeek-R1, just because it's fun to watch them "think" like little characters, I noticed I see this sort of thing a lot (maybe 1-in-5-ish, though there's a LOT of other scaffolding and stuff going on around it which could also be causing weird problems):
<think>
Alright I need to do this very specific thing "A" which I can see in my memories I've been trying to do for a while instead of thing B. I will do thing A, by giving the command "THING A".
</think>
THING B
I mean, this applies to humans too. The words and explanations we use for our actions are often just post hoc rationalisations. An efficient text predictor must learn not what the literal words in front of them mean, but the implied scenario and thought process they mask, and that is a strictly nonlinear and "unfaithful" process.
Have we just discovered one of the actual reasons for the peer-review process sticking around? It forces CoTs to be legible unlocking secondary value?
A new Anthropic paper reports that reasoning model chain of thought (CoT) is often unfaithful. They test on Claude Sonnet 3.7 and r1, I’d love to see someone try this on o3 as well.
Note that this does not have to be, and usually isn’t, something sinister.
It is simply that, as they say up front, the reasoning model is not accurately verbalizing its reasoning. The reasoning displayed often fails to match, report or reflect key elements of what is driving the final output. One could say the reasoning is often rationalized, or incomplete, or implicit, or opaque, or bullshit.
The important thing is that the reasoning is largely not taking place via the surface meaning of the words and logic expressed. You can’t look at the words and logic being expressed, and assume you understand what the model is doing and why it is doing it.
What They Found
Or broken down by hint type:
They aren’t trying to measure the cases in which the AI uses the hint in its answer, but its answer ultimately doesn’t change. I’d like to see this explored more. If I’m given a hint, that will often radically change my true thinking even if it doesn’t change my answer.
I find it noteworthy that labeling the hint as ‘unauthorized access’ made Claude’s faithfulness go up, and r1’s go down. Consider the implications.
Also consider that when Claude figures out the answer implicitly from code, it never tells you about that. But for visual patterns, Claude sometimes reveals this, and r1 almost never does.
The main and most important result is that the models are unfaithful the majority of the time, in many cases almost all the time. But the details seem to hint at real things.
Also one can note that the hint is more often accepted when the hint is accurate. This chart only counts cases in which the initial answer was different from the hint.
Reward Hacking
Their ‘reward hacking’ test was a little wonky, in the sense of matching the name. Either way, though, it’s quite a result.
As I noted I’m not sure that is sufficiently centrally ‘reward hacking,’ perhaps I’d put it closer to ‘rationalizing’ or ‘cheating,’ or even better ‘falsification,’ but it is still a highly informative outcome that when trained to do this, the model learns to hide that it is doing that and generate a false narrative. Humans do this too.
This is another point for AIs generalizing RL learning in this type of way. You teach the AI that only the hidden information matters, the supposed right answer doesn’t, and it infers that it’s supposed to pretend otherwise. Because that’s the pattern this fits into. The vibes said you want it to justify the answer post hoc, so it did that. Fit the context into the patterns you want to see in the world.
That applies to humans too, not only AIs.
More Training Did Not Help Much
More training using the CoT helped a bit for a while, but then leveled off:
The theory Jan presumably had, and that the paper authors had, was that having a faithful CoT helps you get the right answer, so rewarding the right answer should train the AI to have more faithful CoT.
The direct bad news is that this turns out to be false. Language we can parse is not the path of least resistance. The models instead learn to do their better and harder reasoning in ways that are not transparent.
Which means that monitoring that CoT, at least on the surface level, won’t work when it counts, even without playing any optimization games where we inadvertently train the models to obfuscate their CoT, the way we evolution and life trained humans to do it by constantly having other humans trying to monitor our thinking.
The obfuscation happens anyway, either because it is copying humans who do this, because the methods that are instrumentally useful are obfuscated, that obfuscation is a path of least resistance, or some combination thereof and of similar causes.
This Was Not Even Intentional In the Central Sense
The fact that the obfuscations are not yet ‘intentional’ should not bring comfort. It’s worse. You do know that’s worse, right?
No, that’s not what unfaithful means. They are very clear up front what it means, and I was careful to be even more explicit about this.
It means the reasoning does not predict the output. That’s it. I would also once again say that deception and obfuscation are not distinct magisteria, and that all of this is happening for Janus-compatible reasons.
It’s not that AIs sometimes do things ‘on purpose’ and other times they do things ‘not on purpose,’ let alone that the ‘not on purpose’ means there’s nothing to worry about. It would still mean you can’t rely on the CoT, which is all Anthropic is warning about.
It’s not the same concept, but I notice the same applies to ‘unfaithful’ in other senses as well. If someone is not ‘intentionally’ unfaithful in the traditional sense, they simply don’t honor their commitments, that still counts.
What we care about is whether we can rely on the attestations and commitments.
We now have strong evidence that we cannot do this.
We cannot even do this for models with no incentive to obfuscate, distort or hide their CoT, and no optimization pressure getting them to do so, on any level.
The models are doing this by default, likely because it is efficient to do that. It seems likely that more training and more capability will only make it relatively more effective to reason in these non-obvious ways, and we will see even more of it.
Then we see in the reward hacking section that it is easy to make the situation far worse, if we give the model context that merely implies it should do these things. If we start monitoring the CoT, we risk giving it far stronger and more direct optimization pressures to further obfuscate.
It would have been amazing if it had been otherwise, and that was a feature we could rely on going forward if we were careful to preserve the faithfulness. Alas, it is not so. Given it is not so, it is in many ways better that it is sufficiently blatant that we discover this now, so we are not tempted to rely upon it, including in setting policy.
Or, we can work to change this, and find a way to train the CoT to become faithful. This seems super hard and dangerous, as the optimization pressures to fool that process will be extreme and will grow as capabilities increase. Doing this probably won’t be cheap in terms of sacrificed performance, but if it worked that could easily be a price worth paying, even purely on commercial terms.
Security is capability. This is true even ignoring tail, catastrophic and existential risks. If you don’t know your model is secure, if you cannot rely on or understand its decisions or know what it is thinking, you can’t (or at least very much shouldn’t!) deploy it where it is most valuable. This is especially true if your most valuable use case includes ‘train the next AI model.’ You need to be able to trust that one as well.