> punctilio (n.): precise observance of formalities. Pretty good at making your text pretty. The most feature-complete and reliable English micro-typography package—transforms plain ASCII into typographically correct Unicode, even across HTML element boundaries. Smart quotes · Em/en dashes · Ellipses · Math symbols · Legal symbols · Arrows · Primes...
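As a toy illustration only (this is not punctilio's implementation or API, and real micro-typography needs far more rules, including the HTML-boundary handling mentioned above), an ASCII-to-Unicode pass might look like:

```python
import re

# Hypothetical rules for illustration; the real package's rule set is
# much larger and also works across HTML element boundaries.
RULES = [
    (re.compile(r"\.\.\."), "\u2026"),          # ... -> ellipsis
    (re.compile(r"--"), "\u2014"),              # -- -> em dash
    (re.compile(r"(?<=\d)-(?=\d)"), "\u2013"),  # digit-digit -> en dash
    (re.compile(r'(?<!\S)"'), "\u201c"),        # quote after space/start -> opening
    (re.compile(r'"'), "\u201d"),               # remaining quotes -> closing
    (re.compile(r"(?<=\w)'(?=\w)"), "\u2019"),  # mid-word apostrophe
]

def smarten(text: str) -> str:
    """Apply each rule in order; order matters (openers before closers)."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

Applying the rules in a fixed order lets later patterns (like the catch-all closing quote) rely on earlier ones having already consumed their matches.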
> The secret is that instrumental convergence is a fact about reality (about the space of possible plans), not AI psychology.
>
> Zack M. Davis, group discussion

Such arguments flit around the AI safety space. While these arguments contain some truth, they attempt to escape "AI psychology" but necessarily...
Through the MATS program, we (Alex Turner and Alex Cloud[1]) help alignment researchers grow from seeds into majestic trees. We have fun, consistently make real alignment progress, and help scholars tap into their latent abilities. MATS summer '26 applications are open until January 18th! Team Shard in MATS 6.0 during...
Folks ask me, "LLMs seem to reward hack a lot. Does that mean that reward is the optimization target?" In 2022, I wrote the essay Reward is not the optimization target, which I here abbreviate to "Reward≠OT". Reward still is not the optimization target: Reward≠OT said that (policy-gradient) RL will...
When I started writing in 2018, I didn't include alt text. Over the years, more than 500 un-alt'ed images piled up. These (mostly) aren't simple images of geese or sunsets. Most of my images are technical, from graphs of experimental results to hand-drawn AI alignment comics. Describing these assets was a...
We show that training against a monitor that only sees outputs (not CoTs) can cause obfuscated[1] CoTs! The obfuscation happens in two ways:

1. When a model is trained to produce a safe-looking output, that model may generalize to making its CoTs look safe.
2. Since later tokens are conditioned...
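A toy way to see why an output-only reward can exert any pressure on the CoT in the first place (my own sketch under assumed REINFORCE-style training, not the post's actual setup): the scalar monitor reward multiplies the summed log-probabilities of every sampled token, CoT tokens included, so gradient pressure reaches tokens the monitor never reads.

```python
# Toy sketch, not the post's training setup: a REINFORCE surrogate loss
# where the reward comes from a monitor that sees only the output tokens.
def reinforce_loss(cot_logprobs, output_logprobs, monitor_reward):
    """-R * sum(log-probs of ALL sampled tokens). The CoT log-probs enter
    with the same coefficient as the output log-probs, so an output-only
    reward still adjusts the probability of the CoT that preceded it."""
    total = sum(cot_logprobs) + sum(output_logprobs)
    return -monitor_reward * total

# A positively rewarded sample reinforces its CoT tokens too:
loss = reinforce_loss(cot_logprobs=[-0.7, -1.2],
                      output_logprobs=[-0.5],
                      monitor_reward=1.0)
```

Minimizing this loss raises the probability of the whole sampled sequence, CoT included, whenever the monitor's reward is positive.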
Authors: Alex Irpan* and Alex Turner*, Mark Kurzeja, David Elson, and Rohin Shah

> You're absolutely right to start reading this post! What a rational decision!

Even the smartest models' factuality or refusal training can be compromised by simple changes to a prompt. Models often praise the user's beliefs (sycophancy)...