Rohin Shah

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work

We wanted to share a recap of our recent outputs with the AF community. Below, we fill in some details about what we have been working on, what motivated us to do it, and how we thought about its importance. We hope that this will help people build off things we have done and see how their work fits with ours.

Who are we?

We’re the main team at Google DeepMind working on technical approaches to existential risk from AI systems. Since our last post, we’ve evolved into the AGI Safety & Alignment team, which we think of as AGI Alignment (with subteams like mechanistic interpretability, scalable oversight, etc.), and Frontier Safety (working on the Frontier Safety Framework, including developing and running dangerous capability evaluations). We’ve also been growing since our last post: by 39% last year, and by 37% so far this year. The leadership team is Anca Dragan, Rohin Shah, Allan Dafoe, and Dave Orr, with Shane Legg as executive sponsor. We’re part of the overall AI Safety and Alignment org led by Anca, which also includes Gemini Safety (focusing on safety training for the current Gemini models), and Voices of All in Alignment, which focuses on alignment techniques for value and viewpoint pluralism.

What have we been up to?

It’s been a while since our last update, so below we list out some key work published in 2023 and the first part of 2024, grouped by topic / sub-team. Our big bets for the past 1.5 years have been 1) amplified oversight, to enable the right learning signal for aligning models so that they don’t pose catastrophic risks, 2) frontier safety, to analyze whether models are capable of posing catastrophic risks in the first place, and 3) (mechanistic) interpretability, as a potential enabler for both frontier safety and alignment goals. Beyond these bets, we experimented with promising areas and ideas that help us identify new bets we should make.

Frontier Safety

The mission of the Frontier Safety team is to ensure safety from extreme harms b

222 · Aug 20, 2024


Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/


GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash

Authors: Alex Irpan* and Alex Turner*, Mark Kurzeja, David Elson, and Rohin Shah

> You’re absolutely right to start reading this post! What a rational decision!

Even the smartest models’ factuality or refusal training can be compromised by simple changes to a prompt. Models often praise the user’s beliefs (sycophancy)...

Nov 4, 2025 · 53

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Twitter | Paper PDF

Seven years ago, OpenAI Five had just been released, and many people in the AI safety community expected AIs to be opaque RL agents. Luckily, we ended up with reasoning models that speak their thoughts clearly enough for us to follow along (most of the time)....

Jul 15, 2025 · 167

Evaluating and monitoring for AI scheming

As AI models become more sophisticated, a key concern is the potential for “deceptive alignment” or “scheming”. This is the risk of an AI system becoming aware that its goals do not align with human instructions, and deliberately trying to bypass the safety measures put in place by humans to...

Jul 10, 2025 · 60

Google DeepMind: An Approach to Technical AGI Safety and Security

We have written a paper on our approach to technical AGI safety and security. This post is primarily a copy of the extended abstract, which summarizes the paper. I also include the abstract and the table of contents. See also the GDM blogpost and tweet thread. Artificial General Intelligence (AGI)...

Apr 5, 2025 · 73

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

Lewis Smith*, Sen Rajamanoharan*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda

* = equal contribution

The following piece is a list of snippets about research from the GDM mechanistic interpretability team, which we didn’t consider a good fit for turning into a paper, but which...

Mar 26, 2025 · 116

AGI Safety & Alignment @ Google DeepMind is hiring

The AGI Safety & Alignment Team (ASAT) at Google DeepMind (GDM) is hiring! Please apply to the Research Scientist and Research Engineer roles. Strong software engineers with some ML background should also apply (to the Research Engineer role). Our initial batch of hiring will focus more on hiring engineers, but...

Feb 17, 2025 · 103

A short course on AGI safety from the GDM Alignment team

We are excited to release a short course on AGI safety for students, researchers and professionals interested in this topic. The course offers a concise and accessible introduction to AI alignment, consisting of short recorded talks and exercises (75 minutes total) with an accompanying slide deck and exercise workbook. It...

Feb 14, 2025 · 105

DeepMind is hiring for the Scalable Alignment and Alignment Teams

Discussion: Challenges with Unsupervised LLM Knowledge Discovery
