TL;DR: We train LLMs to accept LLM neural activations as inputs and answer arbitrary questions about them in natural language. These Activation Oracles generalize far beyond their training distribution, for example uncovering misalignment or secret knowledge introduced via fine-tuning. Activation Oracles can be improved simply by scaling training data quantity...
This is the abstract and introduction of our new paper. Links: 📜 Paper, 🐦 Twitter thread, 🌐 Project page, 💻 Code Authors: Jan Betley*, Jorio Cocola*, Dylan Feng*, James Chua, Andy Arditi, Anna Sztyber-Betley, Owain Evans (* Equal Contribution) You can train an LLM only on good behavior and implant...
Twitter | ArXiv Many of the risks posed by highly capable LLM agents — from susceptibility to hijacking to reward hacking and deceptive alignment — stem from their opacity. If we could reliably monitor the reasoning processes underlying AI decisions, many of those risks would become far more tractable. Compared...
This post shows the abstract, introduction, and main figures from our new paper "School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs". TL;DR: We train LLMs on demonstrations of harmless reward hacking across diverse tasks. Models generalize to novel reward hacking and (in some cases) emergent...
This post describes concept poisoning, a novel LLM evaluation technique we’ve been researching for the past couple months. We’ve decided to move to other things. Here we describe the idea, some of our experiments, and the reasons for not continuing. Contributors: Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Anna...
Authors: Alex Cloud*, Minh Le*, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans (*Equal contribution, randomly ordered) tl;dr. We study subliminal learning, a surprising phenomenon where language models learn traits from model-generated data that is semantically unrelated to those traits. For example, a "student" model learns...
OpenAI did great work studying emergent misalignment, where models become generally misaligned after narrow training. They found that the assistant has a toxic, misaligned persona. The model discusses having a "bad boy persona" in the chain-of-thought (CoT). They show a toxic persona feature being activated in the model's internals. So...