Here are the 2025 AI safety papers and posts I like the most.
The list is heavily biased by my taste, by my views, by which people had time to argue to me that their work is important, and by which papers were salient to me when I wrote it. I am highlighting the parts of each paper I like, which is also very subjective. (This is similar to the 2024 edition here.)
Bangers on multiple dimensions
★★★ You can measure time horizon, and it grows predictably-ish (Measuring AI Ability to Complete Long Tasks)
★★★ Black-box techniques work better than you think + you can (and should) baseline interp techniques against them + you can check that an LLM “really” internalized misaligned properties via good generalization tests + many interp techniques don’t work that well (Auditing language models for hidden objectives, Eliciting Secret Knowledge from Language Models, Evaluating honesty and lie detection techniques on a diverse suite of dishonest models)
★★★ “Evil behavior” generalizes far + you can get generalization far beyond just “evil behavior”, even from prod-like data + training against emergent misalignment can just make it more localized + inoculation prompting helps a lot against it (Emergent misalignment, Weird Generalization, Natural emergent misalignment from prod RL, Thought Crimes, Training a reward hacker despite perfect labels)
Important ideas - papers where the contribution I find most important is the idea itself
★★★ Secret loyalties and other forms of power concentration might be a big deal (How a Small Group Could Use AI to Seize Power, Gradual Disempowerment (arguments about power concentration are the ones I find the most convincing, but the essay makes stronger claims))
★★ The best prioritized list of AI control threats (Prioritizing threats for AI control)
★★ A wild future could come soon and would likely be dangerous (AI 2027)
★★ You could probably monitor compute well enough to slow down AI development a lot (An International Agreement to Prevent the Premature Creation of Artificial Superintelligence)