Context: We are the ‘model motivations’ team at Arcadia Alignment. We aim to build a science of ‘model intentions’, unifying insights from personas and other empirical evidence. This is an informal research note that has come out of the first 2-3 weeks of exploratory work. In this post, we’ll outline...
TL;DR. * Lie detectors for LLMs could be valuable for auditing and monitoring. * But evaluating them requires testbeds where the model verifiably believes the opposite of what it says, and building such testbeds isn’t straightforward. We find that most existing trained model organisms don’t clear this bar. * We train 13 reasoning...
This blog post discusses work in a recently published paper. However, this blog post was primarily written by Parv Mahajan and Andy Wang, and several of the more speculative takes may not represent the all-things-considered view of the entire team. Link to paper: https://arxiv.org/abs/2606.12747 TL;DR: * We provide more conceptual grounding and...
This work was done by Arav Dhoot and Neil Shah, and supervised by David Africa, as part of the SPAR Research Fellowship. Gemma’s frustration/emotional instability is an interesting example of a model pathology because (1) it is a natural failure in character training, which no one explicitly optimized for and...
Authors: Sukrati Gautam*, Neil Shah*, Arav Dhoot*, Bryan Maruyama*, Caroline Wei*, Rohan Kapoor, Robert Sidey, Prakhar Gupta, Zi Cheng Huang, David Demitri Africa. This work was done for the SPAR Fellowship, and has been accepted at AI4GOOD @ ICML 2026. It was supervised by David Africa. TL;DR * We introduce...
TLDR: * Frontier models can detect when they're being evaluated and change their behavior, which risks compromising safety benchmarks. * We introduce LURE (Live-Usage Replay Evaluations), a method that constructs alignment evals by replaying realistic conversations and appending a safety-relevant test at the end, rather than building evaluation scenarios from...
This work was done by Sukrati Gautam and Neil Shah, and supervised by David Africa, as part of the SPAR Research Fellowship. TLDR: We find a new way to use consistency training: “sealing up” the leaky backdoor introduced by the inoculation prompt, as well as related conditional misalignment, and...