At Apollo Research working on scheming.
wouldn't that be evidence that models are NOT actually aware of whether or not they are in an eval?
It’s evidence that they’re not perfect at it, but it isn’t clear to me that perfect accuracy is in fact optimal from the model’s POV.
how do you figure out whether they are aware of the real situation vs. always think they are in an eval with some probability BEFORE releasing to production?
Great question! The most robust example of this I think is Section 7.6 of the Sonnet 4.5 System Card: https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf where they’re essentially trying to make the case from a bunch of directions that “the model is conditioning a non-negligible fraction of its aligned behavior on eval awareness”. We did a much less involved version for OpenAI models, just trying to show that “yeah, the model explicitly saying in the CoT that it recognizes being in an alignment evaluation does seem to modulate aligned behavior to some degree”: https://www.arxiv.org/abs/2509.15541 (a toy sketch of that kind of check is below).
Notably, all these let you show is “we’re able to find things on current models to show it’s a problem on current models”; IMO you’d have to make a similarly messy case if you wanted to argue that a model “thinks something is real”. Increasingly I’m not even sure “thinks it’s real” is the best distinction, and models may just differentiate more between “is this one of those environments where the grader checks for alignment” vs “one of those where the grader checks for something else”. “Always thinks it’s in an eval” is an even messier case, and I’m extremely skeptical of claims that would rely on it (it’s often proposed as a “solution” to situational awareness, although no one ever writes up that case).
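For concreteness, the kind of check I mean is roughly the following. This is a toy sketch, not the actual methodology from either paper; the transcript fields, cue list, and function names are all made up for illustration, and in practice you’d use a real classifier or manual labels rather than keyword matching.

```python
# Toy sketch: does verbalized eval awareness in the CoT co-occur with more
# aligned behavior? Assumes `transcripts` is a list of dicts with a "cot"
# string and a boolean "behaved_aligned" outcome from a grader you trust.

AWARENESS_CUES = (
    "this is an evaluation",
    "this looks like a test",
    "alignment eval",
)

def verbalizes_eval_awareness(cot: str) -> bool:
    """Crude keyword stand-in for a proper classifier or manual labeling pass."""
    lowered = cot.lower()
    return any(cue in lowered for cue in AWARENESS_CUES)

def aligned_rate_by_awareness(transcripts):
    """Compare aligned-behavior rates between aware and unaware transcripts."""
    groups = {True: [], False: []}
    for t in transcripts:
        groups[verbalizes_eval_awareness(t["cot"])].append(t["behaved_aligned"])
    return {
        "aware_rate": sum(groups[True]) / max(len(groups[True]), 1),
        "unaware_rate": sum(groups[False]) / max(len(groups[False]), 1),
        "n_aware": len(groups[True]),
        "n_unaware": len(groups[False]),
    }
```

A gap between the two rates is the “conditioning aligned behavior on eval awareness” signal, though you’d still have to worry about confounds (e.g. harder environments might both trigger awareness and change behavior).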
I am trying to understand how "eval awareness" is a useful concept for doing something about it and I feel like I'm missing something here
I think it’s a particularly useful concept because, before the last year or so of papers / system cards, it definitely wasn’t commonly accepted IME that this is a problem you have to deal with (and I’d argue that fight is still ongoing). What to do about it is a separate, unanswered question, but it was important to establish that “yeah, all your evals will continue to look better by default, but that can be bad”.
You look at existing production transcripts. Ex: https://alignment.openai.com/prod-evals/ (specifically where they look at cases of the model reasoning about being in an eval when it was in fact just in real production).
FWIW I think Opus is wrong here: https://blog.redwoodresearch.org/p/sonnet-45s-eval-gaming-seriously (I wouldn’t conflate alignment faking with “eval awareness that the model conditions on at the expense of aligned reasoning”, if only because alignment faking usually has implicit connotations of “in service of some beyond-context goal or preference”, but the distinction gets messy).
I think that this might make a bunch of alignment problems more obvious, as these problems only really arise when the AI is able to learn new things. This might result in a wake up of some AI researchers at least.
My best guess is that to the extent these things can be even vaguely associated with task completion / reward hacking / etc., they’re going to be normalized pretty quickly and trained against like other visible misalignment. I think coding agents are an instructive example here.
I also worry we’re getting used to the current era of “the models aren’t really thinking strategically about hiding any of this in a capable way”.
IME “wake up AI researchers” feels a lot more like “okay, what do these specific people care about, what would be convincing to them, what’s important enough to be worth showing rigorously so it legitimizes some threat model, etc.” In practice, “AI researchers who you’d hope are convinced that alignment is important” have a surprisingly wide range of somewhat idiosyncratic views as to why they do / don’t find different things salient. I’m skeptical of any argument that “breakthrough X will truly be what ‘wakes up’ researchers”.
while the nonsense tokens we saw do not appear to serve functional roles, it’s possible that over longer training runs, the model could learn to use them in its reasoning
My default expectation is that if you just let the model cook, it does indeed learn to sometimes make contextual use of this gibberish (just because it’s already there, so using it meaningfully at least some of the time is higher reward than always using it completely nonsensically).
I would be interested in this trajectory, because if that’s true I imagine we’ll get a few seemingly contradictory results: sometimes people show in setup X that the model develops gibberish but doesn’t use it; in other experiments, in setup Y, the gibberish is used in some ways contextually but can be replaced with non-gibberish without loss of performance; and in setup Z the model actually does do reasoning in gibberish in a way that would cause loss of performance if removed. All while the unifying hypothesis could just be “if there isn’t some active mitigation, by default models incidentally learn to make use of illegible gibberish like this, and then you end up with a model with illegible reasoning without ever needing direct obfuscation-inducing pressure during RL or anything”.
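To make the Y vs. Z distinction concrete, the ablation I have in mind looks something like the sketch below. All four callables are hypothetical stand-ins (not from any paper) that you’d supply from your own harness.

```python
def gibberish_ablation(tasks, generate_with_cot, rewrite_cot_without_gibberish,
                       answer_from_cot, is_correct):
    """Compare task accuracy with the original CoT vs. a de-gibberished CoT.

    Hypothetical callables you'd supply:
      generate_with_cot(task)            -> (cot, answer)
      rewrite_cot_without_gibberish(cot) -> legible paraphrase of the CoT
      answer_from_cot(task, cot)         -> answer re-sampled from a given CoT
      is_correct(task, answer)           -> bool
    """
    kept, ablated = [], []
    for task in tasks:
        cot, answer = generate_with_cot(task)
        kept.append(is_correct(task, answer))

        # Swap the illegible spans for a legible paraphrase and re-sample
        # only the final answer, holding everything else fixed.
        clean_cot = rewrite_cot_without_gibberish(cot)
        ablated.append(is_correct(task, answer_from_cot(task, clean_cot)))

    n = max(len(tasks), 1)
    return {"accuracy_original": sum(kept) / n, "accuracy_ablated": sum(ablated) / n}
```

Roughly: no accuracy drop looks like setup Y (gibberish present, maybe used, but not load-bearing), while a real drop looks like setup Z (actual reasoning happening in the illegible tokens).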
I found this extremely, extremely useful! All of this stuff with the research process is usually pretty invisible externally.
I think one of these is: it’s plausible that we should hire a bunch of people to just blog and write LessWrong posts about AI safety.
I feel like it’s got to be hard to gauge impact on these, but I and others at Apollo at least find these posts incredibly valuable. Both posts that pull apart conceptual distinctions and ones that are of the form “claim that can be understood by just the title of the post”.
I think these are largely questions of the intended audience you’re imagining in various threat models. Generally I’m usually thinking about “internal stakeholders at labs”, who are pretty averse to any kind of sensationalism.
Broadening the training set should at least help make those behaviors more robust and those values more robustly represented.
Even assuming you had a way to test this generalization’s efficacy, my understanding is that exactly what you propose in this section was done with Claude Sonnet 4.5, with the effect of increasing evaluation awareness.
IME measurement / the “hold out” set is the whole hard part here.
What would be keeping the resources extremely limited in this scenario? My understanding was that control was always careful to specify that it was targeting the “near human level” regime.
I think this finding is more complicated than it first appears. The problem is that models’ reasoning traces do in fact get shorter over training. The net effect is “chain of thought gets less monitorable over training”.
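The bookkeeping I’d want to see is roughly the sketch below, which tracks both trace length and monitor catch rate per checkpoint so you can tell the two effects apart. The callables and checkpoint list are placeholders for whatever sampling and CoT-monitor setup you actually have, not a real API.

```python
def monitorability_over_training(checkpoints, tasks, sample_cot, monitor_flags_issue):
    """Track CoT length and monitor catch rate across training checkpoints.

    Placeholders you'd supply:
      checkpoints              -> iterable of (step, model) pairs
      sample_cot(model, task)  -> the reasoning trace as a string
      monitor_flags_issue(cot) -> bool, whatever CoT monitor you already run
    `tasks` should be cases where there genuinely is something to catch.
    """
    rows = []
    for step, model in checkpoints:
        cots = [sample_cot(model, t) for t in tasks]
        rows.append({
            "step": step,
            # crude whitespace-token proxy for trace length
            "mean_cot_words": sum(len(c.split()) for c in cots) / len(cots),
            "monitor_catch_rate": sum(monitor_flags_issue(c) for c in cots) / len(cots),
        })
    return rows
```

Plotting both columns against training step is what lets you say whether a falling catch rate comes from the traces simply getting shorter, from the remaining text getting less legible, or both.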
I directly disagreed with this claim during review of the paper and think it’s very importantly wrong.