TLDR: We show that Qwen-2.5-3B-Instruct can learn to encode load-bearing reasoning tokens when RL incentivizes this in a simple reasoning task. However, achieving this result was difficult. Our attempts to train encoded reasoning via RL were generally bottlenecked either by exploration difficulties or by hacking of the CoT monitor.
Reasoning LLMs rely on natural language chains of thought (CoT) to solve complex tasks. We can employ another LLM -- a CoT monitor -- to analyse these CoTs for misbehaviour (Baker et al.). Such monitoring is promising for catching complex...