The previous post outlined Laplace approximation, one of the most common tools used to approximate hairy probability integrals. In this post, we'll use Laplace approximation to derive the Bayesian Information Criterion (BIC), a popular complexity penalty method for comparing models with more free parameters to models with fewer free parameters.
The BIC is pretty simple:
$$BIC = \ln P[\text{data}|\theta_{max}] - \frac{k}{2}\ln N$$
where $\theta_{max}$ is the parameter value which maximizes the likelihood, $k$ is the number of free parameters, and $N$ is the number of data points. Using this magic number, we compare any two models we like.
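As a concrete sketch (the coin-flip setup and all numbers here are hypothetical, not from the post): comparing a fair-coin model with no free parameters ($k=0$) against a biased-coin model with one free parameter ($k=1$) on the same flip sequence.

```python
import math

def bic(log_likelihood_at_max, k, n):
    """BIC as defined above: ln P[data|theta_max] - (k/2) ln N."""
    return log_likelihood_at_max - (k / 2) * math.log(n)

# Hypothetical data: 100 flips, 60 heads.
n, heads = 100, 60
tails = n - heads

# Model A: fair coin, no free parameters (k = 0).
ll_fair = n * math.log(0.5)

# Model B: biased coin, one free parameter; the MLE is p = heads/n.
p = heads / n
ll_biased = heads * math.log(p) + tails * math.log(1 - p)

bic_fair = bic(ll_fair, k=0, n=n)
bic_biased = bic(ll_biased, k=1, n=n)
# Higher (less negative) BIC wins; with 60/100 heads, the (k/2) ln N
# penalty is enough that the fair coin comes out ahead here.
```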
Let's derive that.
BIC Derivation
As usual, we'll start from $P[\text{data}|\text{model}]$. (Caution: don't forget that what we really care about is $P[\text{model}|\text{data}]$; we can jump to $P[\text{data}|\text{model}]$ only as long as our priors are close enough to be swamped by the evidence.) This time, we'll assume that we have $N$ independent data points $x_i$, all with the same unobserved parameters - e.g. $N$ die rolls with the same unobserved biases. In that case, we have
$$P[\text{data}|\text{model}] = \int_\theta P[\text{data}|\theta]\, dP[\theta] = \int_\theta \prod_{i=1}^N P[x_i|\theta]\, p[\theta]\, d\theta$$
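To make the integral concrete, here's a sketch for a hypothetical biased-coin model with a uniform prior (not from the post), where the evidence can be computed both numerically and in closed form:

```python
import math

# Hypothetical model: 10 flips with 6 heads, bias theta, uniform prior p[theta] = 1.
h, t = 6, 4

def likelihood(theta):
    # prod_i P[x_i|theta] for this particular flip sequence
    return theta**h * (1 - theta)**t

# P[data|model] = integral over theta of likelihood * prior,
# approximated here with a simple midpoint rule.
M = 100_000
evidence = sum(likelihood((j + 0.5) / M) for j in range(M)) / M

# For this model the integral is a Beta function: B(h+1, t+1) = h! t! / (h+t+1)!.
exact = math.factorial(h) * math.factorial(t) / math.factorial(h + t + 1)
```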
Next, apply Laplace approximation and take the log.
$$\ln P[\text{data}|\text{model}] \approx \sum_i \ln P[x_i|\theta_{max}] + \ln p[\theta_{max}] + \frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln\det(H)$$
where the Hessian matrix H is given by
$$H = \left.\frac{d^2}{d\theta^2}\ln P[\text{data}|\theta]\right|_{\theta_{max}} = \sum_i \left.\frac{d^2}{d\theta^2}\ln P[x_i|\theta]\right|_{\theta_{max}}$$
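A finite-difference sanity check of this Hessian, using the same hypothetical one-parameter coin model as above (so $H$ is just a scalar second derivative):

```python
import math

h, t = 60, 40  # hypothetical data: heads and tails out of N = 100 flips

def ll(p):
    # ln P[data|theta] for the coin model
    return h * math.log(p) + t * math.log(1 - p)

p_max = h / (h + t)  # MLE, where the log-likelihood peaks

# Central finite difference for the second derivative at the peak.
eps = 1e-4
H = (ll(p_max + eps) - 2 * ll(p_max) + ll(p_max - eps)) / eps**2

# Analytic second derivative: -h/p^2 - t/(1-p)^2. It's negative,
# as it must be at a maximum.
H_exact = -h / p_max**2 - t / (1 - p_max)**2
```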
Now for the main trick: how does each term scale as the number of data points N increases?
Let's go ahead and write $H$ as $N \cdot (\frac{1}{N}H)$, to pull out the $N$-dependence. Then, recalling that determinants scale as $\det(cM) = c^k \det(M)$ for a $k \times k$ matrix:
$$\ln\det\left(N \cdot \frac{1}{N}H\right) = \ln\det\left(\frac{1}{N}H\right) + k\ln N$$
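A quick numeric check of that determinant-scaling identity, with an arbitrary $2 \times 2$ matrix standing in for $\frac{1}{N}H$ (so $k = 2$):

```python
import math

def det2(m):
    # determinant of a 2x2 matrix
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

A = [[2.0, 1.0], [1.0, 3.0]]  # arbitrary stand-in for (1/N) H, chosen so det > 0
N, k = 50, 2

NA = [[N * x for x in row] for row in A]  # N * ((1/N) H)
lhs = math.log(det2(NA))
rhs = math.log(det2(A)) + k * math.log(N)
# lhs == rhs up to floating-point error
```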
so we can re-write our Laplace approximation as
$$\ln P[\text{data}|\text{model}] \approx \sum_i \ln P[x_i|\theta_{max}] + \ln p[\theta_{max}] + \frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln\det\left(\frac{1}{N}H\right) - \frac{k}{2}\ln N = \ln P[\text{data}|\theta_{max}] - \frac{k}{2}\ln N + O(1)$$
where O(1) contains all the terms which are roughly constant with respect to N. The first two terms are the BIC.
In other words, the BIC is just the Laplace approximation, but ignoring all the terms which don't scale up as the number of data points increases.
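To see how much those dropped terms matter, here's a sketch comparing the exact log-evidence, the full Laplace approximation, and BIC on the hypothetical coin model from above (uniform prior, 60 heads out of 100 flips; for this model the exact evidence is a Beta function, and the negated Hessian works out to $n^3/(ht)$):

```python
import math

h, t = 60, 40  # hypothetical data: heads and tails
n = h + t
p = h / n  # MLE

# ln P[data|theta_max]
ll_max = h * math.log(p) + t * math.log(1 - p)

# Exact: ln of integral of p^h (1-p)^t dp = ln B(h+1, t+1)
exact = math.lgamma(h + 1) + math.lgamma(t + 1) - math.lgamma(n + 2)

# Full Laplace: ll_max + ln p[theta_max] + (k/2) ln(2 pi) - (1/2) ln det(-H),
# with k = 1, uniform prior (so ln p[theta_max] = 0), and -H = n^3/(h*t) here.
k = 1
laplace = ll_max + (k / 2) * math.log(2 * math.pi) - 0.5 * math.log(n**3 / (h * t))

# BIC: drop every term that doesn't grow with N.
bic = ll_max - (k / 2) * math.log(n)
```

Laplace tracks the exact answer much more closely; BIC is off by an $O(1)$ amount, which is exactly the terms we dropped.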
When Does BIC Work?
What conditions need to hold for BIC to work? Let's go back through the derivation and list out the key assumptions behind our approximations:

- The priors on the models must be comparable, or swamped by the evidence, so that we can compare $P[\text{data}|\text{model}]$ in place of $P[\text{model}|\text{data}]$.
- The data points must be independent given the parameters $\theta$.
- Laplace approximation must be valid: the integrand must be dominated by a single peak, well-approximated by a Gaussian around $\theta_{max}$.
- $N$ must be large enough that the $O(1)$ terms are negligible compared to $\frac{k}{2}\ln N$.
That last condition is the big one. BIC is a large-$N$ approximation, so $N$ needs to be large for it to work. How large? That depends on how big $\ln\det(\frac{1}{N}H)$ is - $N$ needs to be exponentially larger than that. We'll see an example in the next post.
Next post will talk more about relative advantages of BIC, Laplace, and exact calculation for comparing models. We'll see a concrete example of when BIC works/fails.