CallumMcDougall

Six (and a half) intuitions for KL divergence

KL-divergence is a topic which crops up in a ton of different places in information theory and machine learning, so it's important to understand well. Unfortunately, it has some properties which seem confusing at a first pass (e.g. it isn't symmetric like we would expect from most distance measures, and it can be unbounded as we take the limit of probabilities going to zero). There are lots of different ways you can develop good intuitions for it that I've come across in the past. This post is my attempt to collate all these intuitions, and try and identify the underlying commonalities between them. I hope that for everyone reading this, there will be at least one that you haven't come across before and that improves your overall understanding!

One other note - there is some overlap between each of these (some of them can be described as pretty much just rephrasings of others), so you might want to just browse the ones that look interesting to you. Also, I expect a large fraction of the value of this post (maybe >50%) comes from the summary, so you might just want to read that and skip the rest!

Summary

1. Expected surprise
> D_KL(P||Q) = how much more surprised you expect to be when observing data with distribution P, if you falsely believe the distribution is Q vs if you know the true distribution

2. Hypothesis Testing
> D_KL(P||Q) = the amount of evidence we expect to get for P over Q in hypothesis testing, if P is true.

3. MLEs
> If P is an empirical distribution of data, D_KL(P||Q) is minimised (over Q) when Q is the maximum likelihood estimator for P.

4. Suboptimal coding
> D_KL(P||Q) = the number of bits we're wasting, if we try and compress a data source with distribution P using a code which is actually optimised for Q (i.e. a code which would have minimum expected message length if Q were the true data source distribution).

5A. Gambling games - beating the house
> D_KL(P||Q) = the amount (in log-space) we can win from a casino game, if we know the true
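To make the "expected surprise" and "suboptimal coding" readings in the summary above concrete, here is a minimal Python sketch. It is an illustration rather than anything taken from the post: the function names and the example distributions `p` and `q` are assumptions, and it only handles discrete distributions given as lists of probabilities. It computes D_KL(P||Q) directly and also as cross-entropy minus entropy (expected surprise under the wrong belief Q, minus expected surprise under the true P), both in bits.

```python
import math


def entropy(p):
    """Expected surprise (in bits) when you know the true distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Expected surprise (in bits) when data follows p but you believe q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # outcomes that never occur contribute nothing
        if qi == 0:
            return math.inf  # infinitely surprised by an "impossible" outcome
        total -= pi * math.log2(qi)
    return total


def kl_divergence(p, q):
    """D_KL(P||Q) in bits, computed directly from the definition."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(0 / q) is taken to be 0 by convention
        if qi == 0:
            return math.inf  # unbounded when Q rules out something P allows
        total += pi * math.log2(pi / qi)
    return total


# Hypothetical example distributions over two outcomes.
p = [0.5, 0.5]  # true distribution: a fair coin
q = [0.9, 0.1]  # mistaken belief: a heavily biased coin

forward = kl_divergence(p, q)
reverse = kl_divergence(q, p)
extra_surprise = cross_entropy(p, q) - entropy(p)

print(f"D_KL(P||Q) = {forward:.4f} bits")
print(f"D_KL(Q||P) = {reverse:.4f} bits")
print(f"cross-entropy minus entropy = {extra_surprise:.4f} bits")
```

The two directions give different values, which is the asymmetry mentioned in the introduction, and the direct computation agrees with the cross-entropy minus entropy form up to floating-point error. Setting any entry of `q` to zero where `p` is positive returns infinity, matching the unbounded behaviour described above.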

179 · Oct 12, 2022

Induction heads - illustrated

133 · Jan 2, 2023

A Pragmatic Vision for Interpretability

131 · Dec 1, 2025

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

116 · Mar 26, 2025


ARENA 8.0 - Call for Applicants

TL;DR: We're excited to announce the eighth iteration of ARENA (Alignment Research Engineer Accelerator), a 4-5 week ML bootcamp with a focus on AI safety! Our mission is to provide talented individuals with the ML engineering skills, community, and confidence to contribute directly to technical AI safety. ARENA 8.0 will...

Feb 20 · 29

Announcing Gemma Scope 2

TLDR * The Google DeepMind mech interp team is releasing Gemma Scope 2: a suite of SAEs & transcoders trained on the Gemma 3 model family * Neuronpedia demo here, access the weights on HuggingFace here, try out the Colab notebook tutorial here [1] * Key features of this relative...

Dec 22, 2025 · 94

Transmitting Misalignment with Subliminal Learning via Paraphrasing

TLDR: We find subliminal learning can occur through paraphrasing datasets, meaning that fine-tuned models can inherit unintended bias from seemingly innocuous data that resembles in-the-wild natural language data. This implies that paraphrasing datasets using biased teachers may be used as an avenue of attack for malicious actors! While the recent...

Dec 17, 2025 · 38

How Can Interpretability Researchers Help AGI Go Well?

Executive Summary * Over the past year, the Google DeepMind mechanistic interpretability team has pivoted to a pragmatic approach to interpretability, as detailed in our accompanying post [1] , and are excited for more in the field to embrace pragmatism! In brief, we think that: * It is crucial to...

Dec 1, 2025 · 66

A Pragmatic Vision for Interpretability

Executive Summary * The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability: * Trying to directly solve problems on the critical path to AGI going well [[1]] * Carefully choosing problems according to our comparative...

Dec 1, 2025 · 131

ARENA 7.0 - Call for Applicants

TL;DR: We're excited to announce the seventh iteration of ARENA (Alignment Research Engineer Accelerator), a 4-5 week ML bootcamp with a focus on AI safety! Our mission is to provide talented individuals with the ML engineering skills, community, and confidence to contribute directly to technical AI safety. ARENA 7.0 will...

Sep 30, 2025 · 27

ARENA 6.0 - Call for Applicants

TL;DR: We're excited to announce the sixth iteration of ARENA (Alignment Research Engineer Accelerator), a 4-5 week ML bootcamp with a focus on AI safety! Our mission is to provide talented individuals with the ML engineering skills, community, and confidence to contribute directly to technical AI safety. ARENA will be...

Jun 4, 2025 · 26

