The previous post outlined Laplace approximation, one of the most common tools used to approximate hairy probability integrals. In this post, we'll use Laplace approximation to derive the Bayesian Information Criterion (BIC), a popular complexity penalty method for comparing models with more free parameters to models with fewer free parameters.
The BIC is pretty simple:
$$BIC = \ln P[\text{data}|\theta_{max}] - \frac{k}{2}\ln N$$
where $\theta_{max}$ is the parameter value which maximizes the likelihood, $k$ is the number of free parameters, and $N$ is the number of data points. Using this magic number, we compare any two models we like.
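As a concrete sketch (the coin-flip setup and all numbers here are hypothetical, not from the post): comparing a fair-coin model with no free parameters ($k=0$) against a biased-coin model with one free parameter ($k=1$) on the same flip sequence.

```python
import math

def bic(log_likelihood_at_max, k, n):
    """BIC as defined above: ln P[data|theta_max] - (k/2) ln N."""
    return log_likelihood_at_max - (k / 2) * math.log(n)

# Hypothetical data: 100 flips, 60 heads.
n, heads = 100, 60
tails = n - heads

# Model A: fair coin, no free parameters (k = 0).
ll_fair = n * math.log(0.5)

# Model B: biased coin, one free parameter; the MLE is p = heads/n.
p = heads / n
ll_biased = heads * math.log(p) + tails * math.log(1 - p)

bic_fair = bic(ll_fair, k=0, n=n)
bic_biased = bic(ll_biased, k=1, n=n)
# Higher (less negative) BIC wins; with 60/100 heads, the (k/2) ln N
# penalty is enough that the fair coin comes out ahead here.
```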
Let's derive that.
BIC Derivation
As usual, we'll start from $P[\text{data}|\text{model}]$. (Caution: don't forget that what we really care about is $P[\text{model}|\text{data}]$; we can jump to $P[\text{data}|\text{model}]$ only as long as our priors are close enough to be swamped by the evidence.) This time, we'll assume that we have $N$ independent data points $x_i$, all with the same unobserved parameters - e.g. $N$ die rolls with the same unobserved biases. In that case, we have
$$P[\text{data}|\text{model}] = \int_\theta P[\text{data}|\theta]\, dP[\theta] = \int_\theta \prod_{i=1}^N P[x_i|\theta]\, p[\theta]\, d\theta$$
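To make the integral concrete, here's a sketch for a hypothetical biased-coin model with a uniform prior (not from the post), where the evidence can be computed both numerically and in closed form:

```python
import math

# Hypothetical model: 10 flips with 6 heads, bias theta, uniform prior p[theta] = 1.
h, t = 6, 4

def likelihood(theta):
    # prod_i P[x_i|theta] for this particular flip sequence
    return theta**h * (1 - theta)**t

# P[data|model] = integral over theta of likelihood * prior,
# approximated here with a simple midpoint rule.
M = 100_000
evidence = sum(likelihood((j + 0.5) / M) for j in range(M)) / M

# For this model the integral is a Beta function: B(h+1, t+1) = h! t! / (h+t+1)!.
exact = math.factorial(h) * math.factorial(t) / math.factorial(h + t + 1)
```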
Next, apply Laplace approximation and take the log.
$$\ln P[\text{data}|\text{model}] \approx \sum_i \ln P[x_i|\theta_{max}] + \ln p[\theta_{max}] + \frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln\det(H)$$
where the Hessian matrix H is given by
$$H = \left.\frac{d^2}{d\theta^2}\ln P[\text{data}|\theta]\right|_{\theta_{max}} = \sum_i \left.\frac{d^2}{d\theta^2}\ln P[x_i|\theta]\right|_{\theta_{max}}$$
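A finite-difference sanity check of this Hessian, using the same hypothetical one-parameter coin model as above (so $H$ is just a scalar second derivative):

```python
import math

h, t = 60, 40  # hypothetical data: heads and tails out of N = 100 flips

def ll(p):
    # ln P[data|theta] for the coin model
    return h * math.log(p) + t * math.log(1 - p)

p_max = h / (h + t)  # MLE, where the log-likelihood peaks

# Central finite difference for the second derivative at the peak.
eps = 1e-4
H = (ll(p_max + eps) - 2 * ll(p_max) + ll(p_max - eps)) / eps**2

# Analytic second derivative: -h/p^2 - t/(1-p)^2. It's negative,
# as it must be at a maximum.
H_exact = -h / p_max**2 - t / (1 - p_max)**2
```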
Now for the main trick: how does each term scale as the number of data points N increases?
Let's go ahead and write $H$ as $N \cdot (\frac{1}{N}H)$, to pull out the $N$-dependence. Then, recalling that determinants scale as $\det(cM) = c^k \det(M)$ for a $k \times k$ matrix:
$$\ln\det\left(N \cdot \frac{1}{N}H\right) = \ln\det\left(\frac{1}{N}H\right) + k\ln N$$
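A quick numeric check of that determinant-scaling identity, with an arbitrary $2 \times 2$ matrix standing in for $\frac{1}{N}H$ (so $k = 2$):

```python
import math

def det2(m):
    # determinant of a 2x2 matrix
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

A = [[2.0, 1.0], [1.0, 3.0]]  # arbitrary stand-in for (1/N) H, chosen so det > 0
N, k = 50, 2

NA = [[N * x for x in row] for row in A]  # N * ((1/N) H)
lhs = math.log(det2(NA))
rhs = math.log(det2(A)) + k * math.log(N)
# lhs == rhs up to floating-point error
```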
so we can re-write our Laplace approximation as
$$\ln P[\text{data}|\text{model}] \approx \sum_i \ln P[x_i|\theta_{max}] + \ln p[\theta_{max}] + \frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln\det\left(\frac{1}{N}H\right) - \frac{k}{2}\ln N = \ln P[\text{data}|\theta_{max}] - \frac{k}{2}\ln N + O(1)$$
where O(1) contains all the terms which are roughly constant with respect to N. The first two terms are the BIC.
In other words, the BIC is just the Laplace approximation, but ignoring all the terms which don't scale up as the number of data points increases.
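To see how much those dropped terms matter, here's a sketch comparing the exact log-evidence, the full Laplace approximation, and BIC on the hypothetical coin model from above (uniform prior, 60 heads out of 100 flips; for this model the exact evidence is a Beta function, and the negated Hessian works out to $n^3/(ht)$):

```python
import math

h, t = 60, 40  # hypothetical data: heads and tails
n = h + t
p = h / n  # MLE

# ln P[data|theta_max]
ll_max = h * math.log(p) + t * math.log(1 - p)

# Exact: ln of integral of p^h (1-p)^t dp = ln B(h+1, t+1)
exact = math.lgamma(h + 1) + math.lgamma(t + 1) - math.lgamma(n + 2)

# Full Laplace: ll_max + ln p[theta_max] + (k/2) ln(2 pi) - (1/2) ln det(-H),
# with k = 1, uniform prior (so ln p[theta_max] = 0), and -H = n^3/(h*t) here.
k = 1
laplace = ll_max + (k / 2) * math.log(2 * math.pi) - 0.5 * math.log(n**3 / (h * t))

# BIC: drop every term that doesn't grow with N.
bic = ll_max - (k / 2) * math.log(n)
```

Laplace tracks the exact answer much more closely; BIC is off by an $O(1)$ amount, which is exactly the terms we dropped.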
When Does BIC Work?
What conditions need to hold for BIC to work? Let's go back through the derivation and list out the key assumptions behind our approximations:

- The priors on the models must be comparable, or swamped by the evidence, so that we can compare $P[\text{data}|\text{model}]$ in place of $P[\text{model}|\text{data}]$.
- The data points must be independent given the parameters $\theta$.
- Laplace approximation must be valid: the integrand must be dominated by a single peak, well-approximated by a Gaussian around $\theta_{max}$.
- $N$ must be large enough that the $O(1)$ terms are negligible compared to $\frac{k}{2}\ln N$.
That last condition is the big one. BIC is a large-$N$ approximation, so $N$ needs to be large for it to work. How large? That depends on how big $\ln\det(\frac{1}{N}H)$ is - $N$ needs to be exponentially larger than that. We'll see an example in the next post.
Next post will talk more about relative advantages of BIC, Laplace, and exact calculation for comparing models. We'll see a concrete example of when BIC works/fails.