AlexMeinke

Physics of RL: Toy scaling laws for the emergence of reward-seeking

TL;DR:
* When is or isn't reward the optimization target? I use a mathematical toy model to reason about when RL should select for reward-seeking reasoning as opposed to behaviors that achieve high reward without thinking about reward.
* Hypothesis: More diverse RL increases the likelihood that reward-seeking emerges.
* ...

Mar 4 · 106

Stress Testing Deliberative Alignment for Anti-Scheming Training

Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context,...

Sep 17, 2025 · 133

Ablations for “Frontier Models are Capable of In-context Scheming”

We recently published our paper “Frontier Models are Capable of In-context Scheming”. We ran some follow-up experiments that we added to the paper in two new appendices B.5 and B.6. We summarize these follow-up experiments in this post. This post assumes familiarity with our paper’s results. sonnet-3.5 sandbags to preserve...

Dec 17, 2024 · 116

Frontier Models are Capable of In-context Scheming

This is a brief summary of what we believe to be the most important takeaways from our new paper and from our findings shown in the o1 system card. We also specifically clarify what we think we did NOT show. Paper: https://www.apolloresearch.ai/research/scheming-reasoning-evaluations Twitter about paper: https://x.com/apolloaisafety/status/1864735819207995716 Twitter about o1 system...

Dec 5, 2024 · 211

Training AI agents to solve hard problems could lead to Scheming

TLDR: We want to describe a concrete and plausible story for how AI models could become schemers. We aim to base this story on what seems like a plausible continuation of the current paradigm. Future AI models will be asked to solve hard tasks. We expect that solving hard tasks...

Nov 19, 2024 · 73

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

TLDR: We build a comprehensive benchmark to measure situational awareness in LLMs. It consists of 16 tasks, which we group into 7 categories and 3 aspects of situational awareness (self-knowledge, situational inferences, and taking actions). We test 19 LLMs and find that all perform above chance, including the pretrained GPT-4-base...

Jul 8, 2024 · 109

Apollo Research 1-year update

This is a linkpost for: www.apolloresearch.ai/blog/the-first-year-of-apollo-research About Apollo Research Apollo Research is an evaluation organization focusing on risks from deceptively aligned AI systems. We conduct technical research on AI model evaluations and interpretability and have a small AI governance team. As of 29 May 2024, we are one year old....

May 29, 2024 · 93
