Steganography via internal activations is already possible in small language models — a potential first step toward persistent hidden reasoning.
Introduction

In this post we continue the project, begun in our previous post, of exploring hidden reasoning in large language models (LLMs). We ask whether an LLM, when given both a "general" and a "secret" question, can generate its public answer to the general question while simultaneously encoding information about the secret question in its internal activations.