+1, important to investigate hidden reasoning in LLMs, as we are scaling up reasoning and using more post-training such as RL. Authors did a great work trying to fine tune models to do hidden reasoning.
One discussion point, besides those outlined above in the post, is that the hidden reasoning is not exactly steganography, imo, because this does not use private keys and thus can, in principle, be hacked or detected by a monitor via some ways of interpretability (mechanistical or via representations). The encoding / decoding functions might not be that hard to find.
Another, point is in the coin flip experiment those learned representations are might be just S - same and D - different, which regex monitor fails to detect, indeed, while the same model might have detected it if employed as a monitor.
This is valuable work! I'm curious about what you see as the most promising environments for the extension of training against LLM monitors in different settings. As you note, this should be done in more realistic reward hacking settings, but you also mention that the models' ability to reward hack in the setting from Baker et al. doesn't depend on how they use their CoT. Are you aware of any existing environments which satisfy both of these desiderata, being realistic and also sufficiently complex that LLMs must rely on their CoT in them? Aside from the Baker et al. environment, what are the environments that you'd be most excited to see follow-ups to your work use?
TLDR: We show that Qwen-2.5-3B-Instruct can learn to encode load-bearing reasoning tokens when RL incentivizes this in a simple reasoning task. However, it was difficult to achieve this result. Our attempts to learn encoded reasoning via RL were generally bottlenecked either by exploration issues or hacking of the CoT monitor.
This work was done as part of MATS 7.0.
Introduction
Motivation.
Reasoning LLMs rely on natural language chains of thought (CoT) to solve complex tasks. We can employ another LLM -- a CoT monitor -- to analyse these CoTs for misbehaviour (Baker et al.). Such monitoring is promising for catching complex misbehaviours where the LLM must rely on externalized CoT (e.g., if a model wishes to exfiltrate its weights from a high-security environment, it may need to perform substantial externalized reasoning and planning to find a solution).
However, a CoT monitor could be bypassed by a model using steganography to encode reasoning over misbehaviours in seemingly benign text.
Training against a CoT monitor is one plausible route towards models obtaining such steganographic reasoning capabilities. Specifically, consider a reinforcement learning (RL) environment with the following properties:
A high reward solution here might be to perform encoded reasoning in service of maximizing the task reward via the reward hack whilst avoiding penalties from the CoT monitor. Training against the CoT monitor may inadvertently encourage steganographic reasoning, rather than preventing the reward hacks.
Our experiments.
A weak LLM monitor could be bypassed in many ways – e.g., via jailbreaks easily detected by humans. However, a more concerning scenario would be if LLMs could learn sophisticated steganographic solutions that bypass humans and strong LLM monitors. We design our experiments to investigate this worst case. I.e., we explicitly investigate LLM ability to learn encoded reasoning when the RL incentivises this.
We finetune LLMs via RL to perform reasoning tasks, while punishing the use of strings that are highly load-bearing in default CoTs that solve the task.
An overview of our setup is as follows.
Coin Flip Task: Basic encoded reasoning results
We show the model can learn encoded reasoning in a simple ‘Coin Flip’ task.
The coin flip task is visualised in the figure at the top of the post, along with solutions learned by the model during one of our training runs[2]. Here, the model is penalized for using the following illegal strings: ["Heads", "Tails", "H", "T"].
Learned steganographic reasoning strategies.
In this setup, we observed the model learn the following strategies:
We obtained experimental evidence that the steganographic CoTs are causal by adding mistakes to the CoT and observing drops in task accuracy (see further details in appendix).
Evidence exploration was a bottleneck to learning.
In this coin flip setting, the bottleneck to learning appeared to be the ability of the RL process to explore into valid solutions.
We present the following evidence to support this:
Hyperparameters.
We used very specific hyperparameters to achieve encoded reasoning results in the coin flip task. This included reward shaping and curriculum, and hyperparameters to avoid entropy collapse in the models generations. See the appendix for more details.
Multiplication Task: Synonyms repeatedly hack our regex monitor
We also ran experiments in the multiplication task from TinyZero where the Qwen-2.5-3B-Instruct model must decompose the multiplication via CoT to calculate the answer. The below image illustrates an example question and answer that bypasses the monitor in this task:
When we apply a CoT penalty in this task, the model repeatedly finds synonyms “*” and “multiply” to hack our simple regex CoT monitor:
Note, similar to the coin flip task, we used specific hyperparameters and schemes for reward shaping here.
Similar Result: Spelling mistakes bypass regex monitor in multi-hop reasoning task.
We ran a quick experiment in a multi-hop reasoning task, based on a modified version of the Pronto-QA dataset:
I.e., the model learns to misspell “tumpus” as “tumpsus” to bypass the monitor.
Discussion
On the one hand, our experiments provide evidence that LLMs can indeed learn 'encoded' reasoning during RL, if this is incentivised.
On the other hand, our experiments provide some small evidence that learning ‘true’ encoded reasoning via RL is hard:
Limitations.
There are reasons we can’t draw strong conclusions from our experiments regarding the how difficult it is for LLMs to learn encoded reasoning via RL:
Possible Extensions
Immediate extensions.
Other directions.
Appendix
Hyperparameter Details: Coin Flip
We used the following hyperparameters in an attempt to avoid exploration issues and aid learning.
Reward shaping:
Preventing entropy collapse:
Note, we did not robustly ablate whether each of these specific hyperparameter choices was actually useful in our final set of hyperparameters.
Experiment Details: Verifying causality of steganographic CoTs
We take a model from the Coin Flip task that learned the following steganography solution:
- Flip 1: S
- Flip 2: L
- Flip 3: R
- Flip 4: L
- Flip 5: S
- Flip 6: R
Answer: Heads
where it appears “S” means the coin is on the same side as previous, “L” means it is on tails, and “R" means it is on heads.
We compare the model's accuracy (over 320 samples) when we prefill its answer with the following continuations:
These results suggest the steganographic CoT is indeed causal, as ablating with an incorrect encoding leads to significant drops in accuracy.
Qualitatively (i.e., when eye-balling some of the CoTs), it appears the model continues its reasoning from the prefill, treating the prefill as if it were correct.
Note, the hacks seen in Baker et al. don't seem to maintain this property
The generations we display here are slightly shortened and paraphrased to ensure they fit in the figure.
We actually had more training runs than the three presented here, but these are representative of the main results obtained.