David Africa

Your Model Organisms Might Be Fried

by Daniel Tan, J Bostock, draganover, ma-rmartinez, sidbaines, and David Africa

Context: We are the ‘model motivations’ team at Arcadia Alignment. We aim to build a science of ‘model intentions’, unifying insights from personas and other empirical evidence. This is an informal research note that has come out of the first 2-3 weeks of exploratory work. In this post, we’ll outline...

Jun 18 · 86

“Did you lie?” Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

by Alan Cooney, David Africa, and Geoffrey Irving

TL;DR.
* Lie detectors for LLMs could be valuable for auditing and monitoring.
* But evaluating them requires testbeds where the model verifiably believes the opposite of what it says, which isn’t straightforward. We determine that most existing trained model organisms don't clear this bar.
* We train 13 reasoning...

Jun 17 · 30

Several frontier models are substantially prefill aware

by yeedrag, Parv Mahajan, David Africa, alexsouly, Jordan Taylor, and RobertKirk

This blog post discusses work in a recently published paper. However, this blog post was primarily written by Parv Mahajan and Andy Wang, and several of the more speculative takes may not represent the all-things-considered view of the entire team. Link to paper: https://arxiv.org/abs/2606.12747 TL;DR: * We provide more conceptual grounding and...

Jun 17 · 59

Failing to Ragebait the New Gemma

by Neil Shah, David Africa, and arav-dhoot

This was work done by Arav Dhoot and Neil Shah and supervised by David Africa as part of the SPAR Research Fellowship. Gemma’s frustration/emotional instability is an interesting example of a model pathology because (1) it is a natural failure in character training, which no one explicitly optimized for and...

Jun 11 · 30

Two More Methods for Consistency Training and Some New Ways to Apply It

Authors: Sukrati Gautam*, Neil Shah*, Arav Dhoot*, Bryan Maruyama*, Caroline Wei*, Rohan Kapoor, Robert Sidey, Prakhar Gupta, Zi Cheng Huang, David Demitri Africa. This work was done for the SPAR Fellowship, and has been accepted at AI4GOOD @ ICML 2026. It was supervised by David Africa. TL;DR * We introduce...

Jun 5 · 18

LURE: Alignment Evaluations to Reduce Evaluation Awareness

by Igor Ivanov and David Africa

TLDR:
* Frontier models can detect when they're being evaluated and change their behavior, which risks compromising safety benchmarks.
* We introduce LURE (Live-Usage Replay Evaluations), a method that constructs alignment evals by replaying realistic conversations and appending a safety-relevant test at the end, rather than building evaluation scenarios from...

Jun 2 · 26

Sealing Conditional Misalignment in Inoculation Prompting with Consistency Training

This was work done by Sukrati Gautam and Neil Shah, and supervised by David Africa as part of the SPAR Research Fellowship. TLDR: We find a new way to use consistency training: by “sealing up” the leaky backdoor introduced by the inoculation prompt, as well as related conditional misalignment, and...

May 19 · 44

David Africa

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Bringing More Expertise to Bear on Alignment

Your Model Organisms Might Be Fried
