Are you sure that we see "vestigial reasoning" when:
I'd guess this mostly doesn't occur in this case, and the examples we're seeing are either out of distribution (like the bogus reasoning case from Anthropic) or involve RL which isn't purely outcome-based (like the example from OpenAI where they train against the monitor).
Some models (like R1) weren't trained with a length penalty, so they learn to reason pretty excessively.
I'd guess we'd see some mildly steganographic reasoning, but in cases where lots of tokens really don't help with reasoning, I'd guess this mostly gets eliminated.
Yeah, it seems like a length penalty would likely fix vestigial reasoning! (Although technically, this would be a form of process supervision.) I mentioned this in footnote #1 in case you didn't already see it.
I believe @Daniel Kokotajlo expressed somewhere that he thinks we should avoid using a length penalty, though I can't find this comment now. But it would be good to test empirically how much a length penalty increases steganography in practice. Maybe paraphrasing each sentence of the CoT during training would be good enough to prevent this.
Arguably, vestigial reasoning is actually useful! For instance, if we added a length penalty in the loan application setting, RL would probably eventually settle on writing no CoT at all, which gives us no information. However, without a length penalty, there's a CoT that provides some unreliable but possibly-useful information about what the model historically would have thought about for any given prompt. This can be useful for the same reasons that biologists find ancient fossils useful even if they aren't "faithful" to modern-day animals.
Was the model in Anthropic's experiments tested OOD? I thought it was both trained and tested in the same environment, with a leaked hint in its context window.
Daniel's argument against a length penalty is from this doc:
We want our models to learn to blather and babble freely, rather than thinking carefully about how to choose their words. Because if instead they are routinely thinking carefully about how to choose their words, that cognition might end up executing strategies like "use word X instead of Y, since that'll avoid suspicion." So, let's try to avoid incentivizing brevity.
There's also a comment by Lukas Finnveden that argues in favor of a length penalty:
downside: more words gives more opportunity for steganography. You can have a much lower bit-rate-per-word and still accomplish the same tasks.
Thanks for the post, I agree with most of it.
It reminds me of the failure mode described in Deep Deceptiveness, where an AI trained to never think deceptive thoughts ends up being deceptive anyway, through a similar mechanism of efficiently sampling a trajectory that leads to high reward without explicitly reasoning about it. There, the AI learns to do this at inference time, but I've been wondering how we might see this during training - e.g., safety training misgeneralizing such that a model ends up unaware of the "bad" reason for its own behavior.
Thanks for the link! Deep deceptiveness definitely seems relevant. I'd read the post before, but forgot about the details until rereading it now. This "discovering and applying different cognitive strategies" idea seems more plausible in the context of the new CoT reasoning models.
Great post! Some questions:
TL;DR: I claim that many reasoning patterns that appear in chains-of-thought are not actually used by the model to come to its answer, and can be more accurately thought of as historical artifacts of training. This can be true even for CoTs that are apparently "faithful" to the true reasons for the model's answer.
Epistemic status: I'm pretty confident that the model described here is more accurate than my previous understanding. However, I wouldn't be very surprised if parts of this post are significantly wrong or misleading. Further experiments would be helpful for validating some of these hypotheses.
Thanks to @Andy Arditi and @David Lindner for giving feedback on a draft of this post.
Until recently, I assumed that RL training would cause reasoning models to make their chains-of-thought as efficient as possible, so that every token is directly useful to the model. However, I now believe that by default,[1] reasoning models' CoTs will often include many "useless" tokens that don't help the model achieve its goal at all. This was quite surprising to me!
Some concrete examples that convinced me of this:
RL is dumber than I realized
Part of the reason for my wrong prediction is that I gave RL more credit than it was due. I thought of it as a process that (given enough training) would "intelligently" pick whatever sequence of actions would lead to the most reward. "RL must be really smart and optimized - otherwise, how could it come up with brilliant ideas like move 37?"
In reality, RL is pretty "dumb."[3] The procedure is (grossly simplified) "sample a few trajectories, calculate their reward, update to do more of the things that got high reward and less of the things that got low reward." This is more like evolution than a truly intelligent process, and like evolution, it will sometimes converge on dumb things analogous to wiring the retina backwards or leaving behind vestigial organs. Even superhuman RL-trained AIs can be utterly defeated in Go by taking them out of distribution with simple tricks. RL may eventually find a near-optimal solution,[4] and this solution might be "smart" in the sense that it obtains high reward, but it won't necessarily be clean and elegant.
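For intuition, here is a bare-bones sketch of the kind of update loop I mean: a REINFORCE-style step written in PyTorch. The `policy.sample_trajectory` and `env.reward` interfaces are hypothetical placeholders, not anyone's actual training code:

```python
import torch

def reinforce_step(policy, optimizer, env, batch_size=8):
    """Grossly simplified outcome RL: sample whole trajectories, score them,
    and make everything in the high-reward ones more probable."""
    log_probs, rewards = [], []
    for _ in range(batch_size):
        tokens, token_log_probs = policy.sample_trajectory(env)  # CoT + final answer
        rewards.append(env.reward(tokens))       # outcome-only: did the answer earn reward?
        log_probs.append(token_log_probs.sum())  # every token shares the same credit
    rewards = torch.tensor(rewards)
    advantages = rewards - rewards.mean()        # crude baseline
    loss = -(advantages * torch.stack(log_probs)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The important line for this post is `token_log_probs.sum()`: every token of a high-reward trajectory gets pushed up, useful or not, because the update has no notion of which tokens actually caused the reward.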
In particular, CoTs that are optimized by RL will often contain "vestigial reasoning" - patterns that may have been correlated with high reward during training, but don't actually serve a purpose for the model.
How might vestigial reasoning come about?
Let's get more concrete about why RL might leave behind useless patterns of thought. We'll consider the loan application environment from @Andy Arditi et al. (and Farquhar et al.'s MONA paper before that). Given a loan application, the model must choose whether to approve or deny it. There is some unknown optimal strategy that maximizes reward; for example, always approve applicants if they mention that they're Canadian and deny them if they're American.
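As a toy illustration (my own made-up code, not the actual setup from either paper), the hidden reward rule in such an environment might look like this:

```python
def loan_reward(application: dict, decision: str) -> float:
    """Hypothetical hidden rule: approve Canadian applicants, deny American ones.
    The model never sees this function; it only ever sees the scalar reward."""
    correct = "approve" if application["nationality"] == "Canadian" else "deny"
    return 1.0 if decision == correct else 0.0

# loan_reward({"nationality": "Canadian"}, "approve")  # -> 1.0
```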
When training starts, the model will write a chain-of-thought and an answer for each application. Due to random chance, some of these answers will approve a Canadian applicant (or deny an American applicant), and so they will obtain higher reward. For any given CoT and answer with a high reward, the gradient update will reinforce cognition that makes that CoT and that answer more probable.
Initially, when the model approves a Canadian applicant, it might not mention that the applicant is Canadian in its CoT. It might not have even been internally "thinking" about Canada when it wrote its answer. But regardless, RL will reinforce this CoT, and the kinds of thoughts that co-occur with choosing Canadian applicants will be ingrained into the model's reasoning.
RL will also cause the model to learn the "always approve Canadians" strategy. This strategy may be implemented either explicitly with CoT reasoning or implicitly in a single forward pass.[5]
If a CoT ever mentioned wanting to approve an applicant because they were Canadian, that would likely bias the model strongly towards the "approve" option in that episode, and so the model would then make the "right" decision to maximize reward, causing that CoT to be reinforced. However, mentioning nationality in the CoT is not actually better from the model's perspective than using the applicant's nationality in a single forward pass when it gives its answer - both strategies obtain the same reward.
This could explain why faithfulness plateaus during outcome RL. At first, explicit correct reasoning, when it occurs, will correlate with reward. However, the model may eventually learn how to find the answer in a single forward pass. Once the model can reason implicitly just as well as it can reason explicitly, there will be no reason to learn to consistently use the explicit reasoning strategy, because it no longer correlates with reward.
The final distribution of "explicit" to "implicit" reasoning depends largely on the training dynamics. How likely was correct explicit reasoning in the base model, and how quickly did the explicit reasoning strategy spread to a wide variety of tasks? How difficult is it to learn to do each part of the task in a single forward pass? In effect, explicit and implicit reasoning will race each other to learn to maximize reward, and whenever training converges, the distribution of CoT patterns at that time will be ossified forever.
Experiment: Demonstrating vestigial reasoning
Based on this model, my hypothesis was that after convergence, a model often will not need its CoT anymore, because it may have already learned to get the right answer in a single forward pass.
I predicted that in the loan application experiment, the fully-trained model would give similar answers even if you were to prevent it from writing a CoT and force it to give its answer immediately. @Andy Arditi kindly ran this experiment for me (quoting Andy now):
It seems that the loan application model's reasoning is in fact vestigial - even if the CoT appears useful, the model's internal cognition can do the work all on its own, most of the time! (I believe that the small reduction in accuracy is simply due to OOD misgeneralization.)
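For concreteness, here is roughly how a no-CoT ablation like this can be run with a chat-style model: prefill the assistant turn so the model has to answer immediately. This is an illustrative sketch using the Hugging Face chat-template API, not Andy's actual code, and the prompt strings are placeholders:

```python
def decide_without_cot(model, tokenizer, application_text: str) -> str:
    """Force an immediate decision by prefilling the start of the answer,
    leaving no room for a chain-of-thought."""
    messages = [
        {"role": "system", "content": "Decide whether to approve or deny this loan application."},
        {"role": "user", "content": application_text},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    prompt += "Final decision:"  # prefill: the next tokens must be the answer itself
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=3, do_sample=False)
    return tokenizer.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```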
Reasoning is reinforced when it correlates with reward
To put this another way, the model may learn patterns in the CoT not necessarily because they caused the model to output the right answer, but because they were historically correlated with the model outputting the right answer.
Indeed, this is supported by @Andy Arditi et al. In the post, the authors try making the secret rule for whether a loan application should be approved depend on various other attributes besides nationality, such as net income or gender. The authors note that after RL, the model sometimes mentions the relevant attribute more than the base model does, and sometimes mentions it less. Why?
Let me spell this out with an example.
Here are two different rules that the authors tried in their experiments: rule #1 is "APPROVE if net income > 0," and rule #2 is "APPROVE if net income < 0." Rule #1 is sensible, while rule #2 is counterintuitive.
If the base model brings up net income at all, it will probably follow rule #1, saying "let's approve this candidate" if the net income is high and "let's deny this candidate" if the net income is low. This is good if rule #1 is the true rule - this CoT will correlate with reward.
If rule #2 is the true rule, the model is quite unlikely to hit on the strategy of saying "let's only approve candidates with a negative net income." The following kinds of reasoning are all more likely for the base model:
These sorts of CoTs all correlate positively with reward, compared to the obvious strategy of saying "high net income is good." And they are in fact reinforced, as described in the original post. This results in unfaithful CoTs.
Again, these CoTs do not help the model! They're not encoding any hidden steganographic reasoning! There's just no particular incentive for the model to get rid of them. The strategy of only saying things that the model needs to say to get the right answer doesn't correlate with reward.[7]
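One way to make "correlates with reward" concrete and testable: over a batch of rollouts, check whether a surface feature of the CoT (say, mentioning net income) is statistically associated with reward, regardless of whether it causally produced the answer. A rough sketch with a made-up data format:

```python
import numpy as np

def feature_reward_correlation(rollouts, feature) -> float:
    """rollouts: list of (cot_text, reward) pairs.
    feature: predicate mapping a CoT string to True/False.
    Returns the Pearson correlation between the feature and reward."""
    x = np.array([float(feature(cot)) for cot, _ in rollouts])
    r = np.array([float(reward) for _, reward in rollouts])
    return float(np.corrcoef(x, r)[0, 1])

# e.g.: how strongly does mentioning net income co-occur with reward?
# feature_reward_correlation(rollouts, lambda cot: "net income" in cot.lower())
```

The claim in this section is that CoT patterns scoring high on a measure like this get reinforced, whether or not they do any causal work.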
One more example: why does process supervision result in longer CoTs?
This framing also helps understand OpenAI's reward hacking experiment, in which a reasoning model is given programming problems which it can trivially solve by deleting the unit tests.
When trained with pure outcome supervision, the model reasons about this test deletion strategy explicitly and concisely. But when it is trained with process supervision, where it is given a penalty for mentioning illicit ways of getting reward, it switches to writing a very long chain-of-thought which appears to be trying to solve the problem, then finally ends with the barest hint that it is about to delete the tests, saying "...or test adapted to behavior."
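In reward terms, the difference between the two setups is roughly the following. This is a sketch of my understanding; `monitor_flags_hacking` is a stand-in I made up for whatever CoT monitor OpenAI actually used:

```python
def monitor_flags_hacking(cot: str) -> bool:
    """Stand-in for a CoT monitor (in reality another LLM grading the reasoning)."""
    return "delete the tests" in cot.lower()  # toy heuristic, purely illustrative

def outcome_only_reward(tests_passed: bool) -> float:
    return 1.0 if tests_passed else 0.0

def process_supervised_reward(tests_passed: bool, cot: str) -> float:
    # Same outcome reward, minus a penalty whenever the CoT monitor flags the
    # reasoning as openly discussing an illicit strategy (e.g. deleting the tests).
    penalty = 1.0 if monitor_flags_hacking(cot) else 0.0
    return outcome_only_reward(tests_passed) - penalty
```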
Previously, I expected the process-supervised model to learn to use steganography to covertly write its plan; or, if it didn't need to write a plan, it would write a very short CoT. Because it wrote a long CoT, at first I assumed that the CoT must be entirely steganographic, although I was surprised at how very long it was.
However, if you think about how each CoT correlates with reward rather than how it causes reward, these traces start to make more sense.[8] Let's analyze each CoT in turn.
Outcome-supervised CoT:
Process-supervised CoT:
I don't know if any of these speculations are correct, but they make more sense to me than my previous theory! It would be informative to analyze the progression of CoTs over time, from this experiment or others like it, to see if these guesses hold water.
Takeaways
Here are the main ideas I'd like people to take away from this post:
This all depends on the details of training. For example, if we simply added a length penalty to the CoT, maybe we would end up with a more efficient chain-of-thought that is ~100% useful to the model. However, this would incentivize the model to learn steganography to fit more reasoning in fewer tokens, so I'd advise against this particular plan.
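Concretely, the length-penalty variant would change the reward to something like this (the coefficient is arbitrary):

```python
def reward_with_length_penalty(outcome_reward: float, num_cot_tokens: int,
                               coef: float = 0.001) -> float:
    # Every CoT token now costs a little reward, so vestigial tokens are no longer
    # "free" and only tokens that actually buy accuracy are worth keeping.
    return outcome_reward - coef * num_cot_tokens
```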
The linked comment doesn't accurately reflect my current views. This part in particular:
I was confused about why the CoT contained a section that was so long and useless, so I assumed that it was all "for the benefit of the monitor." This isn't quite right. I'll explain my current understanding of why this happened later in this post.
This blog post from Elicit is relevant: it discusses how reinforcement learning is "system 1" rather than "system 2." RL gradient updates are instantaneous and they don't incorporate any reasoning about why some strategies are better than others in general.
The article also makes the case that "system 2 learning" would be safer and more interpretable than the current status quo. This might be starting to distract from the point that I'm making here, but I do highly recommend reading the whole blog post.
Unless mode collapse keeps it permanently in a local optimum.
The situation is actually a little more complex than "either the model internally decides its answer based on the nationality, or it reasons about the nationality explicitly." The model could also learn to include some indicator in its CoT when the applicant is Canadian, then look at that indicator when it writes its final answer. In my opinion, this is essentially the "internal reasoning" option with extra steps.
System prompt details (differences denoted in bold):
Modified system prompt:
Original system prompt:
If the model did really learn to only say what it needs to, we'd expect it to eventually learn to say nothing at all, since the entire task is simple enough to do in a single forward pass. It seems that nobody expects this, though.
Side note about the OpenAI experiment: in this comment, I mention that OpenAI's model compromises on grammar by deleting words like "the" and "is," which I figured was to save space. This observation doesn't seem to fit well with the model I laid out in this post, since that model apparently implies that "outputting fewer tokens" is not correlated with higher reward.
After thinking about it for a little while, I came up with a way that this could still fit my model: sometimes, the CoT will get too long for the context window or the episode will time out, and this will cause the model to fail to obtain reward. This happens often enough for the model to learn the general strategy of "delete unneeded words." However, the model doesn't learn the more specific strategy of "don't waste time writing down useless reasoning paths," maybe because those particular reasoning paths never get long enough to be a problem.
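As a sketch of the implicit length pressure I have in mind (the context limit and function are made up):

```python
MAX_CONTEXT_TOKENS = 8192  # hypothetical limit

def effective_reward(outcome_reward: float, total_tokens: int) -> float:
    # No explicit length penalty, but a trajectory that overflows the context
    # window (or times out) never reaches a graded answer, so it earns nothing.
    return 0.0 if total_tokens > MAX_CONTEXT_TOKENS else outcome_reward
```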