TLDR: We show that Qwen-2.5-3B-Instruct can learn to encode load-bearing reasoning tokens when RL incentivizes this in a simple reasoning task. However, achieving this result was difficult. Our attempts to train encoded reasoning via RL were generally bottlenecked either by exploration difficulties or by hacking of the CoT monitor.
Reasoning LLMs rely on natural language chains of thought (CoT) to solve complex tasks. We can employ another LLM -- a CoT monitor -- to analyse these CoTs for misbehaviour (Baker et al.). Such monitoring is promising for catching complex...