We’re running ControlConf in Berkeley on April 18–19. It’s a two-day conference on AI control: the study of reducing risks from misalignment through safeguards that work even when AI models are trying to undermine them. Since the last ControlConf (Feb 2025), AI agents have gotten way better. We’re approaching the...
After five months of me (Buck) being slow at finishing up the editing on this, we’re finally putting out our inaugural Redwood Research podcast. I think it came out pretty well—we discussed a bunch of interesting and underdiscussed topics and I’m glad to have a public record of a bunch...
Highly capable AI systems might end up deciding the future. Understanding what will drive those decisions is therefore one of the most important questions we can ask. Many people have proposed different answers. Some predict that powerful AIs will learn to intrinsically pursue reward. Others respond by saying reward is...
tl;dr: A heavily-monitored internally deployed AI with sensitive access (e.g. to model weights) could bypass safety controls by asking a weakly-monitored external API version of itself (or another AI) what to do. It could then use its privileged access to execute instructions that the external API generated under less monitoring....
Last week, Thinking Machines announced Tinker. It’s an API for running fine-tuning and inference on open-source LLMs that works in a unique way. I think it has some immediate practical implications for AI safety research: I suspect that it will make RL experiments substantially easier, and increase the number of...
[I wrote this blog post as part of the Asterisk Blogging Fellowship. It's substantially an experiment in writing more breezily and concisely than usual, and on a broader topic. Let me know how you feel about the style.] Literally since the adoption of writing, people haven’t liked the fact that...
I listened to "If Anyone Builds It, Everyone Dies" today. I think the first two parts of the book are the best available explanation of the basic case for AI misalignment risk for a general audience. I thought the last part was pretty bad, and I’d probably recommend skipping it. Even...