This research was completed for London AI Safety Research (LASR) Labs 2025 by Jennifer Za, Julija Bainiaskina, Nikita Ostrovsky and Tanush Chopra. The team was supervised by Victoria Krakovna (Google DeepMind). Find out more about the programme and express interest in upcoming iterations here. Introduction: Many proposals for controlling misaligned...
As AI models become more sophisticated, a key concern is the potential for “deceptive alignment” or “scheming”. This is the risk of an AI system becoming aware that its goals do not align with human instructions, and deliberately trying to bypass the safety measures put in place by humans to...
We are excited to release a short course on AGI safety for students, researchers and professionals interested in this topic. The course offers a concise and accessible introduction to AI alignment, consisting of short recorded talks and exercises (75 minutes total) with an accompanying slide deck and exercise workbook. It...
After 7 years at Deep End (and 4 more years in other group houses before that), Janos and I have moved out to live near a school we like and some lovely parks. The life change is bittersweet - we will miss living with our friends, but also look forward...
Public discussions about catastrophic risks from general AI systems are often derailed by the word “intelligence”. People often have different definitions of intelligence, associate it with concepts like consciousness that are not relevant to AI risks, or dismiss the risks because intelligence is not well-defined. I would advocate...
Update: The original title "DeepMind alignment team's strategy" was poorly chosen. Some readers interpreted it as meaning that this was everything we had thought about or wanted to say about an "alignment plan", which is an unfortunate misunderstanding. We simply meant to share slides...
Power-seeking is a major source of risk from advanced AI and a key element of most threat models in alignment. Some theoretical results show that most reward functions incentivize reinforcement learning agents to take power-seeking actions. This is concerning, but does not immediately imply that the agents we train will...