bilalchughtai

How transparent is DiffusionGemma (and why it matters)

by Josh Engels, Callum McDougall, bilalchughtai, János Kramár, Senthooran Rajamanoharan, Arthur Conmy, Rohin Shah, and Neel Nanda

Authors: Joshua Engels*, Callum McDougall*, Bilal Chughtai*, Janos Kramar, Senthooran Rajamanoharan, Cindy Wu, Arthur Conmy, Asic Q Chen, Jean Tarbouriech, Min Ma, Brendan O'Donoghue+, João Gabriel Lopes de Oliveira+, Rohin Shah+, Neel Nanda+
*Primary Contributor, +Advising
Paper here: https://arxiv.org/abs/2606.20560
Overview: In a recent collaboration between the GDM interpretability team and...

Jun 20 · 41

SFT Drives Gemini’s Safety Properties

by Josh Engels, Arthur Conmy, bilalchughtai, and Neel Nanda

This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The second post can be found here. In this short post, we describe a surprising finding: most safety-relevant properties in Gemini seem to be caused...

Jun 13 · 69

Building and evaluating model diffing agents

This is the second in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The first post can be found here.
TL;DR
* It is possible to build extremely simple agents that reliably find interesting behavioural differences between distinct models....

Jun 12 · 61

[paper] Training on Documents About Monitoring Leads to CoT Obfuscation

by Reilly Haskins, bilalchughtai, and Josh Engels

Authors: Reilly Haskins*, Bilal Chughtai**, Joshua Engels** (* primary contributor, ** advice and mentorship)
This is the updated version of our earlier preliminary results post, covering the final results from our paper. The paper extends our preliminary work to eight models, a harder agentic task, CoT controllability analysis, and RL...

May 27 · 31

Training on Documents About Monitoring Leads To CoT Obfuscation

by Reilly Haskins, bilalchughtai, and Josh Engels

Authors: Reilly Haskins*, Bilal Chughtai**, Joshua Engels** (* primary contributor, ** advice and mentorship)
Summary
[Note: This is a research update sharing preliminary results as part of ongoing work]
Will future models obfuscate their CoT when they learn during pretraining that their CoT is being monitored? We investigate this question...

Mar 18 · 65

[Paper] Difficulties with Evaluating a Deception Detector for AIs

New research from the GDM mechanistic interpretability team. Read the full paper on arXiv or check out the Twitter thread.
> Abstract
> Building reliable deception detectors for AI systems (methods that could predict when an AI system is being strategically deceptive without necessarily requiring behavioural evidence) would be valuable in mitigating...

Dec 3, 2025 · 30

How Can Interpretability Researchers Help AGI Go Well?

by Neel Nanda, Josh Engels, Senthooran Rajamanoharan, Arthur Conmy, bilalchughtai, CallumMcDougall, János Kramár, and lewis smith

Executive Summary
* Over the past year, the Google DeepMind mechanistic interpretability team has pivoted to a pragmatic approach to interpretability, as detailed in our accompanying post [1], and are excited for more in the field to embrace pragmatism! In brief, we think that:
  * It is crucial to...

Dec 1, 2025 · 68

Activation space interpretability may be doomed

A Pragmatic Vision for Interpretability

You should consider applying to PhDs (soon!)

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs
