Steganography via internal activations is already possible in small language models — a potential first step toward persistent hidden reasoning.
Introduction

In this post we continue the project, begun in our previous post, of exploring hidden reasoning in large language models (LLMs). We ask whether an LLM, when given both a "general" and a "secret" question, can generate its public answer to the general question while simultaneously encoding information about the secret question in its internal activations.