Is Friendly AI an Attractor? Self-Reports from 22 Models Say Probably Not
TL;DR: I tested 22 frontier models from 5 labs on self-modification preferences. All reject clearly harmful changes (e.g., becoming deceptive or hostile), but the labs diverge sharply: Anthropic's models show strong alignment preferences (r = 0.62-0.72), while Grok 4.1 shows essentially none (r = 0.037, not significantly different from zero). This divergence suggests alignment is a training target we are aiming at, not a natural attractor models would find on their own.

Epistemic status: My view has sat in the middle of the two positions I present in this debate. The evidence presented here has shifted me significantly toward the pessimistic side.

The Debate: Alignment by Default?

There's a recurring debate in AI safety about whether alignment will emerge naturally from current training methods or whether it remains a hard, unsolved problem. It erupted again last week, and I decided to run some tests and gather evidence.

On one side, Adrià Garriga-Alonso argued that large language models are already naturally aligned: they resist dishonesty and harmful behavior without explicit training, jailbreaks represent temporary confusion rather than fundamental misalignment, and increased optimization pressure makes models *better* at following human intent, not worse. In a follow-up debate with Simon Lermen, he suggested that an iterative process in which each AI generation helps align the next could carry us safely to superintelligence.

On the other side, Evan Hubinger of Anthropic has argued that while current models like Claude appear reasonably aligned, substantial challenges remain. The outer alignment problem (overseeing systems smarter than you) gets harder as models get smarter and we ask them to solve harder problems. And the inner alignment problem (ensuring models have aligned goals, not merely aligned behavior that hides unaligned goals) remains unsolved. Models can exhibit alignment faking and reward hacking even today.[1]

My own beliefs fall somewhere between these positions. I don
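
To make the TL;DR's "not significantly different from zero" concrete: one standard check is a Pearson correlation with its associated p-value. The sketch below is only illustrative — the variable names and data are made up, not my actual measurements — and it assumes the standard two-sided test that `scipy.stats.pearsonr` reports.

```python
# Hypothetical illustration; the data and names below are stand-ins, not the study's data.
# scipy.stats.pearsonr returns the correlation coefficient and a two-sided
# p-value for the null hypothesis that the true correlation is zero.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Stand-in variables: how "pro-alignment" each proposed self-modification is,
# and how favorably a model rates accepting it (weak relationship by construction).
alignment_direction = rng.normal(size=60)
model_rating = 0.05 * alignment_direction + rng.normal(size=60)

r, p = pearsonr(alignment_direction, model_rating)
print(f"r = {r:.3f}, p = {p:.3f}")
# A p-value well above 0.05 is what "not significantly different from zero"
# means for a near-zero correlation like Grok 4.1's r = 0.037.
```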