Six (and a half) intuitions for KL divergence

CallumMcDougall

168 Six (and a half) intuitions for KL divergence

12th Oct 2022

12 min read

168

KL-divergence is a topic which crops up in a ton of different places in information theory and machine learning, so it's important to understand well. Unfortunately, it has some properties which seem confusing at a first pass (e.g. it isn't symmetric like we would expect from most distance measures, and it can be unbounded as we take the limit of probabilities going to zero). There are lots of different ways you can develop good intuitions for it that I've come across in the past. This post is my attempt to collate all these intuitions, and try and identify the underlying commonalities between them. I hope that for everyone reading this, there will be at least one that you haven't come across before and that improves your overall understanding!

One other note - there is some overlap between each of these (some of them can be described as pretty much just rephrasings of others), so you might want to just browse the ones that look interesting to you. Also, I expect a large fraction of the value of this post (maybe >50%) comes from the summary, so you might just want to read that and skip the rest!

Summary

1. Expected surprise

how much more surprised you expect to be when observing data with distribution $P$ , if you falsely believe the distribution is $Q$ vs if you know the true distribution

2. Hypothesis Testing

$D_{K L} (P | | Q) =$ the amount of evidence we expect to get for $P$ over $Q$ in hypothesis testing, if $P$ is true.

3. MLEs

If $P$ is an empirical distribution of data, $D_{K L} (P | | Q)$ is minimised (over $Q$ ) when $Q$ is the maximum likelihood estimator for $P$ .

4. Suboptimal coding

$D_{K L} (P | | Q) =$ the number of bits we're wasting, if we try and compress a data source with distribution $P$ using a code which is actually optimised for $Q$ (i.e. a code which would have minimum expected message length if $Q$ were the true data source distribution).

5A. Gambling games - beating the house

$D_{K L} (P | | Q) =$ the amount (in log-space) we can win from a casino game, if we know the true game distribution is $P$ but the house incorrectly believes it to be $Q$ .

5B. Gambling games - gaming the lottery

$D_{K L} (P | | Q) =$ the amount (in log-space) we can win from a lottery if we know the winning ticket probabilities $P$ and the distribution of ticket purchases $Q$ .

6. Bregman divergence

$D_{K L} (P | | Q)$ is in some sense a natural way of measuring of how far $Q$ is from $P,$ if we are using the entropy of a distribution to capture how far away it is from zero (analogous to how $| | x - y | |_{2}$ is a natural measure of the distance between vectors $x$ and $y$ , if we're using $| | x | |_{2}$ to capture how far the vector $| | x | |_{2}$ is from zero).

Common theme for most of these:

$D_{K L} (P | | Q) =$ measure of how much our model $Q$ differs from the true distribution $P$ . In other words, we care about how much $P$ and $Q$ differ from each other in the world where P is true, which explains why KL-div is not symmetric.

1. Expected Surprise

For a random variable $X$ with probability distribution $P (X = x) = p_{x}$ , the surprise (or surprisal) is defined as $I_{P} (x) = - ln p_{x}$ . This is motivated by some simple intuitive constraints we would like to have on any notion of "surprise":

An event with probability $1$ has no surprise
Lower-probability events are strictly more surprising
Two independent events are exactly as surprising as the sum of those events' surprisal when independently measured

In fact, it's possible to show that these three considerations fix the definition of surprise up to a constant multiple.

From this, we have another way of defining entropy - as the expected surprisal of an event:

H (X) = - \sum x p_{x} ln p_{x} = E_{P} [I_{P} (X)]

Now, suppose we (erroneously) believed the true distribution of $X$ to be $Q$ , rather than $P$ . Then the expected surprise of our model (taking into account that the true distribution is $P$ ) is:

E_{P} [I_{Q} (X)] = - \sum x p_{x} ln q_{x}

and we now find that:

D_{K L} (P | | Q) = \sum x p_{x} (ln p_{x} - ln q_{x}) = E_{P} [I_{Q} (X) - I_{P} (X)]

In other words, KL-divergence is the difference between the expected surprise of your model, and the expected surprise of the correct model (i.e. the model where you know the true distribution $P$ ). The further apart $Q$ is from $P$ , the worse the model $Q$ is for $P$ , i.e. the more surprised it should expect to get by reality.

Furthermore, this explains why $D_{K L} (P | | Q)$ isn't symmetric, e.g. why it blows up when $p_{x} ≫ q_{x} \approx 0$ but not when $q_{x} ≫ p_{x} \approx 0$ . In the former case, your model is assigning very low probability to an event which might happen quite often, hence your model is very surprised by this. The latter case doesn't have this property, and there's no equivalent story you can tell about how your model is frequently very surprised.^[1]

2. Hypothesis Testing

Suppose you have two hypotheses: a null hypothesis $H_{0}$ which says that $X \sim P$ , and an alternative hypothesis $H_{1}$ which says that $X \sim Q$ . Suppose the null is actually true. A natural hypothesis test is the likelihood ratio test, i.e. you reject $H_{0}$ if the observation $X$ is in the critical region:

R = {x : p_{x} / q_{x} \leq λ}

for some constant $λ$ which determines the size of the test. Another way of writing this is:

R = {x : ln p_{x} - ln q_{x} \leq μ}

We can interpret the value $ln p_{x} - ln q_{x}$ as (a scalar multiple of^[2]) the bits evidence we get for $H_{0}$ over $H_{1}$ . In other words, if $x$ happens twice as often under distribution $P$ than distribution $Q$ , then the observation $X = x$ is a single bit of evidence for $H_{0}$ over $H_{1}$ .

$D_{K L}$ is (a scalar multiple of) the expected bits of evidence we get for $H_{0}$ over $H_{1}$ , where the expectation is over the null hypothesis $X \sim P$ . The closer $P$ and $Q$ are, the more we should expect it to be hard to distinguish between them - i.e. when $P$ is true, we shouldn't expect reality to provide much evidence for $P$ rather than $Q$ being true.

3. MLEs

This one is a bit more maths-heavy than the others, so ymmv on how enlightening it is!

Suppose ${^P}_{n}$ is the empirical distribution of data $x_{1}, . . ., x_{N}$ , which are each iid with distribution $P$ , and $Q_{θ}$ is a statistical model parameterised by $θ$ . Our likelihood function is:

L ({^P}_{n}; Q_{θ}) = \frac{1}{N} N \sum i = 1 ln Q_{θ} (x_{i})

By the law of large numbers, $\frac{1}{N} \sum_{i = 1}^{N} ln Q_{θ} (x_{i}) \to \sum_{x} P (x) ln Q_{θ} (x)$ almost surely. This is the cross entropy of $P$ and $Q_{θ}$ . Also note that if we subtract this from the entropy of $P$ , we get $D_{K L} (P | | Q_{θ})$ . So minimising the cross entropy over $θ$ is equivalent to maximising $D_{K L} (P | | Q_{θ})$ .

Our maximum likelihood estimator $θ^{*}$ is the parameter which maximises $L ({^P}_{n}; Q_{θ})$ , and we can use some statistical learning theory plus a lot of handwaving to argue that $θ^{*} \to argmin D_{K L} (P | | Q_{θ})$ (i.e. we've swapped around the limit and argmin operators). In other words, maximum likelihood estimation is equivalent to minimising KL-divergence. If $D_{K L} (P | | Q)$ is large, this suggests that $Q$ will not be a good model for data generated from the distribution $P$ .

4. Suboptimal Coding

Source coding is a huge branch of information theory, and I won't go through all of that in this post. There are several online resources that do a good job of explaining it. To recap the key idea that will be important here:

If you're trying to transmit data from some distribution over a binary channel, you can assign particular outcomes to strings of binary digits in a way which minimises the expected number of digits you have to send. For instance, if you have three possible events with probability (0.8, 0.1, 0.1), then it makes sense to use a code like (0, 10, 11) for this sequence, because you'll find yourself sending the shorter codes with higher probability.
In the limit for a large number of possible values for $X$ (provided some other properties hold), the optimal code^[3] will represent outcome $x$ with a binary string of length $L_{x} = - {log}_{2} p_{x}$ .

From this, the intuition for KL divergence pops neatly out. Suppose you erroneously believed that $X \sim Q$ , and you designed an encoding that would be optimal in this case. The expected number of bits you'll have to send per message is:

- \sum x p_{x} {log}_{2} q_{x}

and we can immediately see that KL-divergence is (up to a scale factor) the difference in expected number of bits per event you'll have to send with this suboptimal code, vs the number you'd expect to send if you knew the true distribution and could construct the optimal code. The further apart $P$ and $Q$ are, the more bits you're wasting on average by not sending the optimal code. In particular, if we have a situation like $p_{x} ≫ q_{x} \approx 0$ , this means our code (which is optimised for $Q$ ) will assign a very long codeword to outcome $x$ since we don't expect it to occur often, and so we'll be wasting a lot of message space by frequently having to use this codeword.

5A. Gambling Games - Beating the House

Suppose you can bet on the outcome of some casino game, e.g. a version of a roulette wheel with nonuniform probabilities. First, imagine the house is fair, and pays you $1 / p_{x}$ times your original bet if you bet on outcome $x$ (this way, any bet has zero expected value: because betting $c_{x}$ on outcome $x$ means you expect to get $p_{x} \times c_{x} / p_{x} = c_{x}$ returned to you). Because the house knows exactly what all the probabilities are, there's no way for you to win money in expectation.

Now imagine the house actually doesn't know the true probabilities $P$ , but you do. The house's mistaken belief is $Q$ , and so they pay people $1 / q_{x}$ for event $x$ even though this actually has probability $p_{x}$ . Since you know more than them, you should be able to profit from this state of affairs. But how much can you make?

Suppose you have $1 to bet. You bet $c_{x}$ on outcome $x$ , so $\sum_{x} c_{x} = 1$ . Let $W$ be your expected winnings. It is more natural to talk about log winnings, because this describes how your wealth grows proportionally over time. Your expected log winnings are:

E [ln W] = \sum x p_{x} \times ln (c_{x} / q_{x})

It turns out that, once you perform a simple bit of optimisation using the Lagrangian:

L (λ; B) = E [ln W] + λ (1 - \sum x c_{x})

then you find the optimal betting strategy is $c_{x} = p_{x}$ (this is left as an exercise to the reader!). Your corresponding expected winnings are:

E [ln W] = \sum x p_{x} \times ln (p_{x} / q_{x}) = D (P | | Q)

in other words, the KL divergence represents the amount you can win from the casino by exploiting the difference between the true probabilities $P$ and the house's false beliefs $Q$ . The closer $P$ and $Q$ are, the harder it is to profit from your extra knowledge.

Once again, this framing illustrates the lack of symmetry in the KL-divergence. If $p_{x} ≫ q_{x}$ , this means the house will massively overpay you when event $x$ happens, so the obvious strategy to exploit this is to bet a lot of money on $x$ (and $D (P | | Q)$ will correspondingly be very large). If $q_{x} ≫ p_{x}$ , there is no corresponding way to exploit this (except to the extent that this suggests we might have $p_{y} ≫ q_{y}$ for some different outcome $y$ ).

5B. Gambling Games - Gaming the Lottery

This is basically the same as (5A), but it offers a slightly different perspective. Suppose a lottery exists for which people can buy tickets, and the total amount people spend on tickets is split evenly between everyone who bought a ticket with the winning number (realistically the lottery organisers would take some spread, but we assume this amount is very small). If every ticket is bought the same number of times, then there's no way to make money in expectation. But suppose people have a predictable bias (e.g. buying round numbers, or numbers with repeated digits) - then you might be able to make money in expectation by buying the less-frequently-bought tickets, because when you win you generally won't have as many people you'll have to split the pot with.

If you interpret $Q$ as the distribution of people buying each ticket (which is known to you), and $P$ is the true underlying distribution of which ticket pays out (also known), then this example collapses back into the previous one - you can use optimisation to find that the best way to purchase tickets is in proportion to $P$ , and the KL-divergence is equal to your expected log winnings.

To take this framing further, let's consider situations where $Q$ is not known to you on a per-number basis, but the overall distribution of group-sizes-per-ticket-number is known to you. For instance, in the limit of a large number of players and of numbers you can approximate the group size as a Poisson distribution. If each ticket has the same probability of paying out, then you can make $D_{K L} (U | | Q)$ profit in expectation by buying one of every ticket (where $U$ is the uniform distribution, and $Q$ is the Poisson distribution). Interestingly, this strategy of "buying the pot" is theoretically possible for certain lotteries, for instance in the Canadian 6/49 Lotto (see a paper analysing this flaw here). However, there are a few reasons this tends not to work in real life, such as:

The lottery usually takes a sizeable cut
There are lottery restrictions (e.g. ticket limits)
Buying the pool is prohibitively expensive (organising and funding a syndicate to exploit this effect is hard!)

6. Bregman Divergence

Bregman divergence is pretty complicated in itself, and I don't expect this section to be illuminating to many people (it's still not fully illuminating to me!). However, I thought I'd still leave it in because it does offer an interesting perspective.

If you wanted to quantify how much two probability distributions diverge, the first thing you might think of is taking a standard norm (e.g. $l_{2}$ ) of the difference between them. This has some nice properties, but it's also unsatisfactory for a bunch of reasons. For instance, it intuitively seems like the distance between the Bernoulli distributions with $p = 0.2$ and $p = 0$ should be larger than that between $p = 0.4$ and $p = 0.6$ .^[4]

It turns out that there's a natural way to associate any convex function $ϕ$ with a measure of divergence. Since tangents to convex functions always lie below them, we can define Bregman divergence $D_{ϕ} (x | | y)$ as the amount by which $ϕ (x)$ is greater than the estimate for it you would get by fitting a tangent line to $ϕ$ at $y$ and using it to linearly extrapolate to $x$ .

To do some quick sanity checks for Bregman divergence - if your convex function is the $l_{2}$ norm squared, then the divergence measure you get is just the squared $l_{2}$ norm of the vector between your points:

D_{| | \cdot | |_{2}^{2}} (x | | y) = | | x | |^{2} - | | y | |^{2} - 2 y^{T} (x - y) = | | x - y | |^{2}

This is basically what you'd expect - it shows you that when the $l_{2}$ norm is the natural way to measure how far away something is from zero (i.e. how large it is), then the $l_{2}$ norm of the vector between two points is the natural way to measure how far one point is from another.

Now, lets go back to the case of probability distributions. Is there any convex function which measures, in some sense, how far away a probability distribution is from zero? Well, one thing that seems natural is to say that "zero" is any probability distribution where the outcome is certain - in other words, zero entropy. And it turns out entropy is concave, so if we just take the negative of entropy then we get a convex function. Slap that into the formula for Bregman divergence and we get:

\begin{matrix} D_{- H} (P ∥ Q) & = - H (P) + H (Q) + ⟨ \nabla H (Q), P - Q ⟩ = \sum x p_{x} ln p_{x} - q_{x} ln q_{x} - \frac{\partial}{\partial q_{x}} (q_{x} ln q_{x}) (p_{x} - q_{x}) = \sum x p_{x} ln p_{x} - q_{x} ln q_{x} - (1 + ln q_{x}) (p_{x} - q_{x}) = \sum x p_{x} (ln p_{x} - ln q_{x}) = D_{K L} (P ∥ Q) \end{matrix}

There's no lightning-bolt moment of illumination from this framing. But it's still interesting, because it shows that different ways of measuring the divergence between two points can be more natural than others, depending on the space that we're working in, and what it represents. Euclidean distance between two points is natural in probability space, when zero is just another point in that space. But when working on the probability simplex, with entropy being our chosen way to measure a probability distribution's "difference from zero", we find that $D_{K L}$ is in some sense the most natural choice.

Final Thoughts

Recapping these, we find that $D_{K L} (P | | Q)$ being large indicates:

Your model $Q$ will be very surprised by reality $P$
You expect to get a lot of evidence in favour of hypothesis $P$ over $Q$ , if $P$ is true
$Q$ is a poor model for observed data $P$
You would be wasting a lot of message content if you tried to encode $P$ optimally while falsely thinking the distribution was $Q$
You can make a lot of money in betting games where other people have false beliefs $Q$ , but you know the true probabilities $P$
(this one doesn't have as simple a one-sentence summary!)

Although (4) might be the most mathematically elegant, I think (1) cuts closest to a true intuition for $D$ .

To summarise what all of these framings have in common:

$D (P | | Q) =$ measure of how much our model $Q$ differs from the true distribution $P$ . In other words, we care about how much $P$ and $Q$ differ from each other in the world where P is true, which explains why KL-div is not symmetric.

To put this last point another way, $D (P | | Q)$ "doesn't care" when $q_{x} ≫ p_{x}$ (assuming both probabilities are small), because even though our model is wrong, reality doesn't frequently show us situations in which our model fails to match reality. But if $p_{x} ≫ q_{x}$ then the outcome $x$ will occur more frequently than we expect, consistently surprising our model and thereby demonstrating the model's inadequacy.

^{^}
Note that the latter case might imply the former case, e.g. if $1 \approx q_{x} ≫ p_{x} \approx 0$ then we are actually also in the former case, since $p_{\neg x} ≫ q_{\neg x} \approx 0$ . But this doesn't always happen; it is possible to have asymmetry here. For instance, if P = (0.1, 0.9) and Q = (0.01, 0.99), then we are in the former case but not the latter. If $P$ is true, then 10% of the time model $Q$ is extremely surprised, because an event happens that it ascribes probability 1% to - which is why $D_{K L} (P | | Q)$ is very large. But if $Q$ is true, reality presents model $P$ with no surprises as large as this - hence $D_{K L} (Q | | P)$ is not as large.
^{^}
The scalar multiple part is because we're working with natural log, rather than base 2.
^{^}
Specifically, the optimal decodable code - in other words, your set of codewords needs to have the property that you could string together any combination of them and it's possible to decipher which codewords you used. For instance, (0, 10, 11) has this property, but (0, 10, 01) doesn't, because the string 010 could have been produced from 0 + 10 or 01 + 0.
^{^}
One way you could argue that a distance measure should have this property is to observe that the former two distributions have much lower variance than the latter two. So if you observe a distribution which is either $p = 0$ or $p = 0.2$ , you should expect it to take much less time to tell which of the two distributions you're looking at than if you were trying to distinguish between $p = 0.4$ and $p = 0.6$ .

Information theoryProbability & StatisticsMachine Learning (ML)World Modeling

Frontpage

168

Mentioned in

125The Standard Analogy

106SAE reconstruction errors are (empirically) pathological

71Six (and a half) intuitions for SVD

57Voting Results for the 2022 Review

New Comment

27 comments, sorted by

top scoring

Click to highlight new comments since: Today at 6:26 PM

[-]tailcalled2y130

I also saw a good intuitive example of the asymmetry once. If you've got a bimodal distribution and a monomodal distribution that lies at one of the peaks of the bimodal distribution, then the KL-divergence will be low when P is the monomodal distribution and Q is the bimodal distribution, while the KL-divergence will be high when P is the bimodal distribution and Q is the monomodal distribution.

[-]CallumMcDougall2y191

Oh yeah, I really like this one, thanks! The intuition here is again that a monomodal distribution is a bad model for a bimodal one because it misses out on an entire class of events, but the other way around is much less bad because there's no large class of events that happen in reality but that your model fails to represent.

For people reading here, this post discusses this idea in more detail. The image to have in mind is this one:

[-]jsd2y90

Thanks for this post! Relatedly, Simon DeDeo had a thread on different ways the KL-divergence pops up in many fields:

Kullback-Leibler divergence has an enormous number of interpretations and uses: psychological, epistemic, thermodynamic, statistical, computational, geometrical... I am pretty sure I could teach an entire graduate seminar on it.
Psychological: an excellent predictor of where attention is directed. http://ilab.usc.edu/surprise/
Epistemic: a normative measure of where you ought to direct your experimental efforts (maximize expected model-breaking) http://www.jstor.org/stable/4623265
Thermodynamic: a measure of work you can extract from an out-of-equlibrium system as it relaxes to equilibrium.
Statistical: too many to count, but (e.g.) a measure of the failure of an approximation method. https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained
Computational (machine learning): a measure of model inefficiency—the extent to which it retains useless information. https://arxiv.org/abs/1203.3271
Computational (compression): the extent to which a compression algorithm designed for one system fails when applied to another.
Geometrical: the (non-metric!) connection when one extends differential geometry to the probability simplex.
Biological: the extent to which subsystems co-compute.
Machine learning: the basic loss function for autoencoders, deep learning, etc. (people call it the "cross-entropy")
Algorithmic fairness. How to optimally constrain a prediction algorithm when ensuring compliance with laws on equitable treatment. https://arxiv.org/abs/1412.4643
Cultural evolution: a metric (we believe) for the study of individual exploration and innovation tasks... https://www.sciencedirect.com/science/article/pii/S0010027716302840 …
Digital humanism: Kullback-Leibler divergence is related to TFIDF, but with much nicer properties when it comes to coarse-graining. (The most distinctive words have the highest partial-KL when teasing apart documents; stopwords have the lowest) http://www.mdpi.com/1099-4300/15/6/2246
Mutual information: Well, it's a special case of Kullback-Leibler—the extent to which you're surprised by (arbitrary) correlations between a pair of variables if you believe they're independent.
Statistics: it's the underlying justification for the Akiake Information Criterion, used for model selection.
Philosophy of mind: It’s the “free energy” term in the predictive brain account of perception and consciousness. See Andy Clark’s new book or https://link.springer.com/article/10.1007%2Fs11229-017-1534-5

[-]CallumMcDougall2y10

This is awesome, I love it! Thanks for sharing (-:

[-]Joseph Van Name2y61

The Radon-Nikodym theorem allows us to define the Kullback-Leibler divergence in a more general setting. Even though this setting is more abstract, in this abstract setting it is clear how the KL-divergence is a way to define relative entropy. The abstract setting limits our options to only the correct option.

Suppose that is the probability density function of a random variable supported on $[0, 1]$ . Then we define the continuous entropy of $f$ as $\int_{0}^{1} - f (x) \cdot log (f (x)) d x$ .

Suppose that we are given two probability measures $μ, ν$ on the Borel subsets of $[0, 1]$ and we know that $μ$ is the Lebesgue probability measure on $[0, 1]$ . Then is there any way to abstractly define the continuous entropy of $ν$ that only refers to the abstract measures $μ, ν$ without referring to $f$ or by recovering $f$ from the abstract properties of $μ, ν$ ? Yes there is. The Radon-Nikodym theorem allows us to recover $f$ from the measure theoretic properties of $μ, ν$ and it therefore allows us to recover the continuous entropy.

Suppose that $(X, M)$ is a $σ$ -algebra. Let $μ, ν$ be two measures on $X$ . We say that $μ$ is absolutely continuous with respect to $ν$ if $ν (A) = 0 \Rightarrow μ (A) = 0$ , and we write $μ ≪ ν$ in this case. Recall that a measure $μ$ is $σ$ -finite if there are $A_{n} \in M$ for each natural number $n$ such that $μ (A_{n}) < \infty$ for each $n$ and $X = ⋃_{n = 0}^{\infty} μ (A_{n})$ .

Theorem: (Radon-Nikodym theorem) Suppose that $μ ≪ ν$ and $μ, ν$ are both $σ$ -finite measures. Then there exists a positive measurable function $f : X \to [0, \infty)$ where $μ (A) = \int_{A} f (x) d ν (x)$ for each $A \in M$ .

The function $f$ is called the Radon-Nikodym derivative of $μ$ with respect to $ν$ and is denoted by $f = \frac{d μ}{d ν}$ . The Radon-Nikodym derivative is unique up to a set of measure zero.

We therefore define the Kullback-Kiebler divergence as $D_{K L} (μ | | ν) = \int_{X} \frac{d μ}{d ν} \cdot log (\frac{d μ}{d ν}) d ν = \int_{X} log (\frac{d μ}{d ν}) d μ$ which coincides with the continuous entropy without the $- 1$ factor.

Now, the assumption that $ν$ is isomorphic to the Lebesgue probability measure on $[0, 1]$ is a surprisingly general assumption that works for many probability measures $ν$ .

A Boolean algebra $B$ is said to be $σ$ -complete if whenever $a_{n} \in B$ for each natural number $n$ , the set ${a_{n} : n \in N}$ has a least upper bound which we denote by $⋁_{n = 0}^{\infty} a_{n}$ .

A measure algebra is a pair $(B, μ)$ where $B$ is a $σ$ -complete Boolean algebra and $μ : B \to [0, \infty)$ is a function with $μ (x) = 0$ iff $x = 0$ and $μ (⋁_{n = 0}^{\infty} a_{n}) = \sum_{n = 0}^{\infty} μ (a_{n})$ whenever $a_{m} \land a_{n} = 0$ for $m \neq n$ . If $(B, μ)$ is a measure algebra, then define a metric $d$ (and hence a topology and uniformity and whatnot) on $B$ by setting $d (x, y) = μ (x \oplus y)$ . Recall that a topological space is separable if it has a countable dense subset.

For example, if $M$ is the collection of Lebesgue measurable subsets of $[0, 1]$ , and $I$ is the collection of measure zero subsets of $[0, 1]$ , then the quotient Boolean algebra $M / I$ is complete and if $μ$ is the Lebesgue measure on $M / I$ , then $(M / I, μ)$ is a separable atomless probability measure algebra, and this measure algebra is unique.

Theorem: (Caratheodory) Every separable atomless probability measure algebra is isomorphic.

We have an even stronger uniqueness theorem for this measure algebra.

Theorem: Every Borel probability measure on a complete separable metric space without any atoms is isomorphic to the Lebesgue measure on the interval $[0, 1]$ .

Suppose now that $μ, ν$ are Borel probability measures on a complete separable metric space that possibly have atoms. Then there is some measure $η$ where $ν \times η$ is isomorphic to the Lebesgue probability measure. In this case, $\frac{d μ}{d ν} (x) = \frac{d (μ \times η)}{d (ν \times η)} (x, y)$ for almost all $x, y$ , so

$D_{K L} (μ \times η | | ν \times η) = \int log (\frac{d (μ \times η)}{d (ν \times η)} (x, y)) d (μ \times η) (x, y) = \int log (\frac{d μ}{d η} (x, y)) d (μ \times η) (x, y) = \int log (\frac{d μ}{d η} (x)) d μ (x)$

$= D_{K L} (μ | | ν)$ . In this case, we can pretend that $ν \times η$ is the Lebesgue measure on $[0, 1]$ and that $D_{K L} (μ \times η | | ν \times η)$ just measures the continuous entropy of $μ \times η$ with respect to the Lebesgue probabilty measure on $[0, 1]$ .

The KL-divergence also generalizes greatly. Given a function $h : [0, \infty) \to [0, \infty)$ ( $h$ is usually convex or concave), one can define another measure of similarity between probability measures such as $D_{h} (μ | | ν) = \int h (\frac{d μ}{d ν}) d ν$ known as the $h$ -divergence. One can also define a measure on $[0, \infty)$ from the Radon-Nikodym derivative by $(\frac{d μ}{d ν})_{*} ν$ where if $g : X \to Y$ and $ν$ is a measure on $X$ , then $g_{*} (ν)$ is the measure on $Y$ defined by $g_{*} (ν) (R) = ν (g^{- 1} [R])$ . The random variable $(\frac{d μ}{d ν})_{*} ν$ gives information about the similarity between the probability measures $μ, ν$ from which one can always recover the $h$ -divergence. In this case, $E [h (\frac{d μ}{d ν})_{*} ν)] = \int h (x) d (\frac{d μ}{d ν})_{*} ν (x) = \int h (\frac{d μ}{d ν} (x)) d ν (x) = D_{h} (μ | | ν)$ .

[-]Steveot2y50

Another intuition I often found useful: KL-divergence behaves more like the square of a metric than a metric.

The clearest indicator of this is that KL-divergence satisfies a kind of Pythagorean theorem established in a paper by Csiszár (1975), see https://www.jstor.org/stable/2959270#metadata_info_tab_contents . The intuition is exactly the same as for the euclidean case: If we project a point A onto a convex set S (say the projection is B), and if C is another point in the set S, then the standard Pythagorean theorem would tell us that the angle of the triangle ABC at B is larger than 90 degree, or in other words . And the same holds if we project with respect to KL divergence, and we end up having $D_{K L} (C, A) >= D_{K L} (B, A) + D_{K L} (C, B)$ .

This has implications if you think about things like sample efficiency (instead of a square root rate as usual, convergence rates with KL divergence usually behave like 1/n).

This is also reflected in the relation between KL divergence and other distances for probability measures, like total variation or Wasserstein distance. The most prominent example would be Pinsker's inequality in this regard, stating that the total variation norm between two measures is bounded by a constant times the square root of the KL-divergence between the measures.

[-]criticalpoints4mo10

This intuition--that the KL is a metric-squared--is indeed important for understanding the KL divergence. It's a property that all divergences have in common. Divergences can be thought of as generalizations of the Euclidean metric where you replace the quadratic--which is in some sense the Platonic convex function--with a convex function of your choice.

This intuition is also important for understanding Talagrand's T2 inequality which says that, under certain conditions like strong log-concavity of the reference measure q, the Wasserstein-2 distance (which is analogous to the Euclidean metric-squared only lifted as a metric on the space of probability measures) between the two probability measures p and q can be upperbounded by their KL divergence.

[-]Zack_M_Davis1y36Review for 2022 Review

Rich, mostly succinct pedagogy of timeless essentials, highly recommended reference post.

The prose could be a little tighter and less self-conscious in some places. (Things like "I won't go through all of that in this post. There are several online resources that do a good job of explaining it" break the flow of mathematical awe and don't need to be said.)

[-]CallumMcDougall1y20

Thanks, really appreciate this (and the advice for later posts!)

[-][anonymous]2y30

This is nitpicky but I believe the reasoning for 3 is mistaken, mainly because of the counterintuitive claim that minimising cross-entropy over is equivalent to maximising KL divergence.

Writing the log-likelihood in the equation

\frac{1}{N} log ℓ = \frac{1}{N} N \sum i = 1 ln Q_{θ} (x_{i}),

we have the RHS tends to the negative cross entropy (by LLN), instead of just the cross entropy. Then, since $H (P) = H (P, Q) - D_{K L} (P | | Q),$ with $P$ being the fixed (true) distribution and $Q$ varying with $θ$ , we have that MLE means we minimise the cross entropy which means we minimise the KL divergence.

[-]lalaithion2y30

It would be nice to have a couple examples comparing concrete distributions Q and P and examining their KL-divergence, why it's large or small, and why it's not symmetric.

[-]CallumMcDougall2y10

I think some of the responses here do a pretty good job of this. It's not really what I intended to go into with my post since I was trying to keep it brief (although I agree this seems like it would be useful).

[-]Nina Panickssery2y20

I think this style of post is really valuable and interesting to read, thanks for putting this together

[-]CallumMcDougall2y20

Thanks, really appreciate it!

[-]Archimedes2y20

This video breaks it down nicely along the lines of what you describe as the "common theme".

https://www.youtube.com/watch?v=SxGYPqCgJWM

[-]TekhneMakre2y20

Nice. I didn't know about the hypothesis testing one (or Bregman, but I don't get that one). I wonder if one can back out another description of KL divergence in terms of mutual information from the expression of mutual information in terms of KL divergence: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Mutual_information

[-]CallumMcDougall2y10

And yeah, despite a whole 16 lecture course on convex opti I still don't really get Bregman either, I skipped the exam questions on it 😆

[-]CallumMcDougall2y10

Oh yeah, I hadn't considered that one. I think it's interesting, but the intuitions are better in the opposite direction, i.e. you can build on good intuitions for to better understand MI. I'm not sure if you can easily get intuitions to point in the other direction (i.e. from MI to $D_{K L}$ ), because this particular expression has MI as an expectation over $D_{K L}$ , rather than the other way around. E.g. I don't think this expression illuminates the nonsymmetry of $D_{K L}$ .

The way it's written here seems more illuminating (not sure if that's the one that you meant). This gets across the idea that:

$P_{(X, Y)}$ is the true reality, and $P_{X} \otimes P_{Y}$ is our (possibly incorrect) model which assumes independence. The mutual information between $X$ and $Y$ equals $D_{K L} (P_{(X, Y)} | | P_{X} \otimes P_{Y})$ , i.e. the extent to which modelling $X$ and $Y$ as independent (sharing no information) is a poor way of modelling the true state of affairs (where they do share information).

But again I think this intuition works better in the other direction, since it builds on intuitions for $D_{K L}$ to better explain MI. The arguments in the $D_{K L}$ expression aren't arbitrary (i.e. we aren't working with $D_{K L} (P | | Q)$ ), which restricts the amount this can tell us about $D_{K L}$ in general.

[-]TekhneMakre2y20

The arguments in the expression aren't arbitrary (i.e. we aren't working with $D_{K L} (P | | Q)$ ), which restricts the amount this can tell us about $D_{K L}$ in general.

Yeah, I was vaguely hoping one could phrase $P$ and $Q$ so they're in that form, but I don't see it.

[-]sgalkina4mo10

https://youtu.be/G_U8_JFyQP0?si=E63BwmX-Db63vEND - I made an animation of the casino intuition - hope you don't mind! Thanks for the post.

[-]Ethan (EJ) Watkins2y10

By the law of large numbers, almost surely. This is the cross entropy of $P$ and $Q_{θ}$ . Also note that if we subtract this from the entropy of $P$ , we get $D_{K L} (P | | Q_{θ})$ . So minimising the cross entropy over $θ$ is equivalent to maximising $D_{K L} (P | | Q_{θ})$ .

I think the cross entropy of $P$ and $Q_{θ}$ is actually $H (P, Q_{θ}) = - \sum_{x} P (x) ln Q_{θ} (x)$ (note the negative sign). The entropy of $P$ is $H (P) = - \sum_{x} P (x) ln P (x)$ . Since $D_{K L} (P | | Q_{θ}) = \sum_{x} P (x) (ln (P (x) - ln Q_{θ} (x)) = \sum_{x} P (x) ln P (x) - \sum_{x} P (x) ln Q_{θ} (x) = - H (P) + H (P, Q_{θ})$ then the KL divergence is actually the cross entropy minus the entropy, not the other way around. So minimising the cross entropy over $θ$ will minimise (not maximise) the KL divergence.

I believe the next paragraph is still correct: the maximum likelihood estimator $θ^{*}$ is the parameter which maximises $L ({^P}_{n}; Q_{θ})$ , which minimises the cross-entropy, which minimises the KL divergence.

Apologies if any of what I've said above is incorrect, I'm not an expert on this.

[-]Ethan (EJ) Watkins2y10

I think there is a mistake in this equation. $I_{P} (X)$ and $I_{Q} (X)$ are the wrong way round. It should be:
$D_{K L} (P | | Q) = \sum x p_{x} (ln p_{x} - ln q_{x}) = E [I_{Q} (X) - I_{P} (X)]$