[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs
This research was completed for London AI Safety Research (LASR) Labs 2024 by Yohan Mathew, Ollie Matthews, Robert McCarthy and Joan Velja. The team was supervised by Nandi Schoots and Dylan Cope (King's College London, Imperial College London). Find out more about the programme and express interest in upcoming iterations here. The full paper can be found here, while a short video presentation covering the highlights of the paper is here (note that some graphs have been updated since the presentation).

## Introduction

Collusion in a multi-agent system is defined as two or more agents covertly coordinating to the disadvantage of other agents [6], while steganography is the practice of concealing information within a message while avoiding detection. In future multi-AI systems, there is a risk that, due to misspecified incentives, LLMs will learn to collude via text-based steganography and thereby bypass human oversight [1]. Concrete settings where steganography has previously been argued to present a risk include:

* Models subtly conveying demographic biases present in their training data [1];
* Powerful models using encoded reasoning (where the LLM encodes intermediate steps of misaligned reasoning in the generated text in a way that is not understandable to human readers) to generate sycophantic responses [2, 4];
* Settings where an "untrusted" AI monitors the outputs of another "untrusted" AI, and the two collude via steganography to bypass oversight [3].

These concerns motivate research into steganography. Our paper investigates the possibility of steganography 'naturally' emerging in LLMs, as well as the robustness of current steganography mitigation techniques. We outline the contributions of the paper below.

## 1 - We Present Methods for Eliciting Steganography in LLMs

We devise optimization methods to automatically elicit steganography in LLMs: a gradient-based reinforcement learning (GBRL) method, and two variants of