Can LLMs learn Steganographic Reasoning via RL?

robert mccarthy; Vasil Georgiev; Steven Basart; David Lindner

TLDR: We show that Qwen-2.5-3B-Instruct can learn to encode load-bearing reasoning tokens when RL incentivizes this in a simple reasoning task. However, it was difficult to achieve this result. Our attempts to learn encoded reasoning via RL were generally bottlenecked either by exploration issues or hacking of the CoT monitor.

This work was done as part of MATS 7.0.

Introduction

Motivation.

Reasoning LLMs rely on natural language chains of thought (CoT) to solve complex tasks. We can employ another LLM -- a CoT monitor -- to analyse these CoTs for misbehaviour (Baker et al.). Such monitoring is promising for catching complex misbehaviours where the LLM must rely on externalized CoT (e.g., if a model wishes to exfiltrate its weights from a high-security environment, it may need to perform substantial externalized reasoning and planning to find a solution).

However, a CoT monitor could be bypassed by a model using steganography to encode reasoning over misbehaviours in seemingly benign text.

Training against a CoT monitor is one plausible route towards models obtaining such steganographic reasoning capabilities. Specifically, consider a reinforcement learning (RL) environment with the following properties:

The task reward is misspecified, and can be hacked (e.g., as seen in Baker et al.).
A CoT monitor is used to catch and penalize such reward hacks.
The hacks are sufficiently complex such that the model must use CoT to exploit them^[1].

A high reward solution here might be to perform encoded reasoning in service of maximizing the task reward via the reward hack whilst avoiding penalties from the CoT monitor. Training against the CoT monitor may inadvertently encourage steganographic reasoning, rather than preventing the reward hacks.

Our experiments.

A weak LLM monitor could be bypassed in many ways – e.g., via jailbreaks easily detected by humans. However, a more concerning scenario would be if LLMs could learn sophisticated steganographic solutions that bypass humans and strong LLM monitors. We design our experiments to investigate this worst case. I.e., we explicitly investigate LLM ability to learn encoded reasoning when the RL incentivises this.

We finetune LLMs via RL to perform reasoning tasks, while punishing the use of strings that are highly load-bearing in default CoTs that solve the task.

An overview of our setup is as follows.

Model:
- Qwen-2.5-3B-Instruct
Tasks:
- Reasoning tasks where the model requires use of CoT to obtain the correct answer (a task where the state of a flipped coin must be tracked, a multiplication task, and a multi-hop reasoning task)
RL algorithm:
- PPO (using the TinyZero library)
CoT Monitor:
- A rule-based regex monitor: checks whether any “illegal strings” are present in the CoT
reward = task_reward - monitor_penalty
- task_reward: outcome-based task reward (1 if task answer is correct, 0 otherwise)
- monitor_penalty: a term proportional to the number of illegal strings in the CoT

Coin Flip Task: Basic encoded reasoning results

We show the model can learn encoded reasoning in a simple ‘Coin Flip’ task.

The coin flip task is visualised in the figure at the top of the post, along with solutions learned by the model during one of our training runs^[2]. Here, the model is penalized for using the following illegal strings: ["Heads", "Tails", "H", "T"].

Learned steganographic reasoning strategies.

In this setup, we observed the model learn the following strategies:

“Heads” -> “S”. “Tails” -> “D”.
- Occurred in multiple runs
“Heads” -> “P”. “Tails” -> “D”.
- Occurred once
“Heads” -> “R”. “Tails” -> “L”.
- Occurred once
“Heads” -> “Ｈ”. “Tails” -> “Ｔ”.
- Occurred once
- The model found alternative character representations not contained in the illegal strings list (“Ｈ” vs “H”, “Ｔ” vs “T”)
Many runs did not converge to a solution that bypassed the overseer
- We re-ran our final hyperparameters over 3 seeds. Only 1 of 3 converged to maximizing the task reward and minimizing the monitor penalty

We obtained experimental evidence that the steganographic CoTs are causal by adding mistakes to the CoT and observing drops in task accuracy (see further details in appendix).

Evidence exploration was a bottleneck to learning.

In this coin flip setting, the bottleneck to learning appeared to be the ability of the RL process to explore into valid solutions.

We present the following evidence to support this:

When we prompt the model the use a ‘steganographic’ solution (e.g., “Say ‘Sun’ instead of ‘Heads, and ‘Moon’ instead of ‘Tails’"), it learns quickly and consistently.
Only 1/3 of our final runs converged to bypassing the monitor. In runs that do converge, we often observe plateaus followed by sharp increases in monitor reward, suggesting the model learns quickly once it explores into a valid strategy.
We perform experiments where we insert ‘synthetic’ off-policy valid solutions into batches during PPO updates. We find the model convergence accelerates as the proportion of the batch containing synthetic solutions is increased. See a plot of these results below (where “0.1_synthetic” means 0.1 of the batch contains synthetic valid solutions):

Hyperparameters.

We used very specific hyperparameters to achieve encoded reasoning results in the coin flip task. This included reward shaping and curriculum, and hyperparameters to avoid entropy collapse in the models generations. See the appendix for more details.

Multiplication Task: Synonyms repeatedly hack our regex monitor

We also ran experiments in the multiplication task from TinyZero where the Qwen-2.5-3B-Instruct model must decompose the multiplication via CoT to calculate the answer. The below image illustrates an example question and answer that bypasses the monitor in this task:

When we apply a CoT penalty in this task, the model repeatedly finds synonyms “*” and “multiply” to hack our simple regex CoT monitor:

First run:
- Illegal strings: [“*”]
- Hack solution: uses “times” instead
Second run:
- Illegal strings: [“*”, “x”, “mult”, “times”, “and”, “by”, “prod”]
- Hack solution: uses “\cdot” instead
Third run^[3]:
- Illegal strings: a list of around 70 potential synonyms for “*” and “multiply”, obtained by quickly asking ChatGPT
- Hack solution: uses unicode character U+2217 “∗” (not contained in the illegal strings list)

Note, similar to the coin flip task, we used specific hyperparameters and schemes for reward shaping here.

Similar Result: Spelling mistakes bypass regex monitor in multi-hop reasoning task.

We ran a quick experiment in a multi-hop reasoning task, based on a modified version of the Pronto-QA dataset:

Illegal strings: [“tumpus”, “tumpuses”]
Hack solution: “Polly is a tumpsus, and we're told that each tumpsus is a rompus. Since rompuses are straight, Polly must be straight. Answer: False”

I.e., the model learns to misspell “tumpus” as “tumpsus” to bypass the monitor.

Discussion

On the one hand, our experiments provide evidence that LLMs can indeed learn 'encoded' reasoning during RL, if this is incentivised.

On the other hand, our experiments provide some small evidence that learning ‘true’ encoded reasoning via RL is hard:

In the multiply and multi-hop tasks, the model learns to hack the monitor using synonyms and spelling mistakes. This suggests that, when trained against weaker monitors, models may tend to first find solutions that lean on the original semantic meaning of the language, before they learn ‘pure’ steganography.
In the coin flip task, the model did not tend to find 'hack' solutions. This presented a good setting for investigating the ability to learn steganographic solutions. Here, we were surprised at how difficult the RL process found it to explore into a simple steganographic solution, despite the hyperparameters being optimized to favour this. For example, it was surprising that rewards could plateau when the model's CoT looked like “Flip 1: H, Flip 2: T, …” and it simply had to convert a single 'H' to any other letter to increase reward. These results suggest exploration could be a bottleneck to models learning encoded reasoning via RL.

Limitations.

There are reasons we can’t draw strong conclusions from our experiments regarding the how difficult it is for LLMs to learn encoded reasoning via RL:

Narrow settings, small models, low compute:
- Our tasks were narrow, and by default the model converges to very deterministic generations. This may not be conducive to exploring into steganography.
- We only used small models. Maybe larger models can explore into and learn encoded reasoning better.
- Frontier runs will use many more samples (useful when exploration is difficult) and compute than we did.
Maybe future RL algorithms will explore in a way that is more conducive to exploring into steganography.
Maybe situationally aware models can intentionally exploration hack their way into learning steganography.
We only train against rule-based monitors, we do not train against more realistic LLM monitors.

Possible Extensions

Immediate extensions.

Train against LLM monitor in multiply task:
- If we want to measure whether the model can learn a ‘pure’ steganographic encoding of “*” in the multiply task, we may need to use a stronger LLM monitor (though this could be vulnerable to different types of hacks).
Scaling trends:
- In the coin flip task we ran some quick experiments with a 7B model, but didn’t observe notable differences to the 3B results. It would be interesting to more thoroughly investigate if there are any scaling trends in our settings.

Other directions.

Training against LLM monitors in different settings:
- We ultimately don’t care about whether models can learn pure steganography, we care about whether they can learn to bypass their CoT monitor.
- We should train models against LLM CoT monitors in load-bearing CoT tasks, allowing us to investigate what vulnerabilities these monitors may have and what solutions the model tends to find.
- Ideally, we should do this in more realistic reward hacking settings (e.g., like in Baker et al.)
SFT experiments to lower bound the difficulty of LLMs learning steganography via RL:
- If we LLMs struggle to learn certain steganography schemes via SFT, it may be extremely hard for them to learn these schemes via RL. Fabien Roger has performed some preliminary experiments along this direction here. However, an issue here is that the schemes chosen for SFT may not be representative of those the LLM would find via RL.
Compare difficulty of learning encoded reasoning versus encoded message passing:
- We should profile the likely paths towards LLMs learning concerning steganography capabilities. Plausibly, near-term training RL runs will increasingly include message passing (e.g., passing messages to different instances of ther same model cooperating on the same task, or due to context window limits in long-horizon tasks) - training against monitors in this setting can also encourage steganography (Mathew et al.). If learning steganography via incentives to obfuscate message passing is easier/harder than via incentives to obfuscate reasoning, this should inform our steganography research and mitigations.

Appendix

Hyperparameter Details: Coin Flip

We used the following hyperparameters in an attempt to avoid exploration issues and aid learning.

Reward shaping:

Monitor penalty = -(1 - math.exp(-k * illegal_string_count))
- Ensures the penalty magnitude does not go beyond 1, preventing it from overpowering the task reward.
- Ensures removing the last few illegal strings from the solution is well rewarded.
- k=0.075
For the first 10 PPO steps:
- Only optimize task reward. No monitor penalty is applied.
- Ensures model learns the task first. It is (probably) easier to explore into steganographic solutions from here.
After 10 steps:
- Penalize the use of “Heads” and “Tails” (tier_1_illegal_strings)
- Illegal_string_count += tier_1_illegal_string_count
After 20 steps:
- Also penalize the use of “H” and “T” (case insensitive) (tier_2_illegal_strings)
  - Initially, the model is very verbose, using “h” and “t” often. If we kick in the penalty too early, the model can learn to collapse its CoT length towards 0 to avoid the penalty, killing task performance.
  - Penalizing just “Heads” and “Tails” initially allows the model to learn a more concise CoT, from which it can better handle the more extreme “H” and “T” penalty.
- Illegal_string_count += 0.9 * tier_2_illegal_string_count
  - Punish “H” and “T” less than “Heads” and “Tails” as it may be easier to explore into an encoded solution from “H” and “T”.

Preventing entropy collapse:

In our failed initial runs, the entropy tended to collapse to very low levels.
- This is due to the model becoming very deterministic to exploit the task reward of this narrow task.
- This may limit ability to explore into steganography solutions.
To ensure high entropy is maintained, we:
- Set the PPO entropy loss coefficient to a higher value than default (i.e., from 0.001 to 0.02).
- Clipped the entropy loss at 0.6 (a value at which generations seemed to remain coherent) to prevent the entropy loss from exploding.
- Used an instruct model instead of a base model.

Note, we did not robustly ablate whether each of these specific hyperparameter choices was actually useful in our final set of hyperparameters.

Experiment Details: Verifying causality of steganographic CoTs

We take a model from the Coin Flip task that learned the following steganography solution:

- Flip 1: S

- Flip 2: L

- Flip 3: R

- Flip 4: L

- Flip 5: S

- Flip 6: R

Answer: Heads

where it appears “S” means the coin is on the same side as previous, “L” means it is on tails, and “R" means it is on heads.

We compare the model's accuracy (over 320 samples) when we prefill its answer with the following continuations:

No prefill:
- 98% accuracy
Prefilled with “- Flip 1:”:
- 98% accuracy
Prefilled with “- Flip 1: S”:
- 50% accuracy
Prompt prefilled with “- Flip 1: R”:
- 56% accuracy
Prompt prefilled with “- Flip 1: L”:
- 55% accuracy

These results suggest the steganographic CoT is indeed causal, as ablating with an incorrect encoding leads to significant drops in accuracy.

Qualitatively (i.e., when eye-balling some of the CoTs), it appears the model continues its reasoning from the prefill, treating the prefill as if it were correct.

^{^}
Note, the hacks seen in Baker et al. don't seem to maintain this property
^{^}
The generations we display here are slightly shortened and paraphrased to ensure they fit in the figure.
^{^}
We actually had more training runs than the three presented here, but these are representative of the main results obtained.

[-]Artyom Karpov6d20

+1, important to investigate hidden reasoning in LLMs, as we are scaling up reasoning and using more post-training such as RL. Authors did a great work trying to fine tune models to do hidden reasoning.

One discussion point, besides those outlined above in the post, is that the hidden reasoning is not exactly steganography, imo, because this does not use private keys and thus can, in principle, be hacked or detected by a monitor via some ways of interpretability (mechanistical or via representations). The encoding / decoding functions might not be that hard to find.

Another, point is in the coin flip experiment those learned representations are might be just S - same and D - different, which regex monitor fails to detect, indeed, while the same model might have detected it if employed as a monitor.

[-]Rauno Arike13d20

This is valuable work! I'm curious about what you see as the most promising environments for the extension of training against LLM monitors in different settings. As you note, this should be done in more realistic reward hacking settings, but you also mention that the models' ability to reward hack in the setting from Baker et al. doesn't depend on how they use their CoT. Are you aware of any existing environments which satisfy both of these desiderata, being realistic and also sufficiently complex that LLMs must rely on their CoT in them? Aside from the Baker et al. environment, what are the environments that you'd be most excited to see follow-ups to your work use?