For an introduction to MCMC aimed at a similar target audience, I found this explanation helpful.
Thanks for this sequence, I've read each post 3 or 4 times to try to properly get it.
Am I right in thinking that in order to replace $dP[\theta]$ with $d\theta$ we not only require a uniform prior but also that $\theta$ span unit volume?
Correct. In general, $p[\theta]$ is the probability density of $\theta$, so if it's uniform on a unit volume then $p[\theta] = 1$ and $dP[\theta] = p[\theta] \, d\theta = d\theta$.
The main advantage of this notation is that it's parameterization-independent. For example: in a coin-flipping example, we could have a uniform prior over the frequency of heads $f$, so $dP[f] = df$. But then, we could re-write that frequency in terms of the odds $r = \frac{f}{1-f}$, so we'd get $f = \frac{r}{1+r}$ and

$$dP[r] = dP[f] = df = d\left(\frac{r}{1+r}\right) = \frac{1}{(1+r)^2} \, dr$$
So the probability density $p[f] = 1$ is equivalent to the density $p[r] = \frac{1}{(1+r)^2}$. (That first step, $dP[r] = dP[f]$, is because these two variables contain exactly the same information in two different forms - that's the parameterization independence. After that, it's math: substitute $f = \frac{r}{1+r}$ and differentiate.)
(Notice that the uniform prior on $f$ is not uniform over $r$. This is one of the main reasons why "use a uniform prior" is not a good general-purpose rule for choosing priors: it depends on what parameters we choose. Cartesian and polar coordinates give different "uniform" priors.)
The moral of the story is that, when dealing with continuous probability densities, the fundamental "thing" is not the density function $p[\theta]$ but the density times the differential $p[\theta] \, d\theta$, which we call $dP[\theta]$. This is important mainly when changing coordinates: if we have some coordinate change $\theta \to u$, then $dP[u] = dP[\theta]$, but $p[u] \neq p[\theta]$.
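If you want to see this concretely, here's a quick Monte Carlo check of the coin-flip example above (a sketch using numpy; the test point and window width are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample the frequency f from its uniform prior: p[f] = 1 on [0, 1).
f = rng.uniform(0, 1, size=1_000_000)

# Re-express the same samples as odds r = f / (1 - f).
r = f / (1 - f)

# Estimate the density of r near a point r0 by counting samples in a
# small window, then compare to the predicted density 1 / (1 + r0)^2.
r0, eps = 1.5, 0.01
estimated = np.mean(np.abs(r - r0) < eps) / (2 * eps)
predicted = 1 / (1 + r0) ** 2

print(estimated, predicted)  # both ≈ 0.16
```

The samples themselves never change - only the labels we put on them - which is exactly the sense in which $dP$ is parameterization-independent while $p$ is not.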
If anybody wants an exercise with this: try transforming $\theta$ to a different coordinate system. Apply Laplace's approximation in both systems, and confirm that they yield the same answer. (This should mainly involve applying the chain rule twice to the Hessian; if you get stuck, remember that $\theta_{max}$ is a maximum point and consider what that implies.)
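If you'd like to check your answer numerically, here's a sketch of that exercise for a one-parameter coin model (made-up data counts; the second derivative in the new coordinates is taken by finite differences, so it doesn't give away the chain-rule algebra):

```python
import numpy as np

h, t = 14, 6   # made-up coin-flip data: heads, tails
n = h + t

# Coordinate system 1: heads-frequency f, uniform prior p[f] = 1.
f_max = h / n
log_lik = h * np.log(f_max) + t * np.log(1 - f_max)
hess_f = -h / f_max**2 - t / (1 - f_max) ** 2
laplace_f = log_lik + np.log(1.0) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(-hess_f)

# Coordinate system 2: log-odds u = ln(f / (1 - f)). The maximum point is
# the same; the prior density transforms as p[u] = p[f] * |df/du| = f * (1 - f).
def log_lik_u(u):
    f = 1 / (1 + np.exp(-u))
    return h * np.log(f) + t * np.log(1 - f)

u_max = np.log(f_max / (1 - f_max))
eps = 1e-4   # finite-difference second derivative at the max
hess_u = (log_lik_u(u_max + eps) - 2 * log_lik_u(u_max) + log_lik_u(u_max - eps)) / eps**2
prior_u = f_max * (1 - f_max)
laplace_u = log_lik + np.log(prior_u) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(-hess_u)

print(laplace_f, laplace_u)  # agree, as the exercise claims
```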
The last couple posts compared some specific models for 20000 rolls of a die. This post will step back, and talk about more general theory for Bayesian model comparison.
The main problem is to calculate $P[\text{data}|\text{model}]$ for some model. The model will typically give the probability of observed data $x$ (e.g. die rolls) based on some unobserved parameter values $\theta$ (e.g. the $p$'s in the last two posts), along with a prior distribution over $\theta$. We then need to compute
$$P[\text{data}|\text{model}] = \int_\theta P[\text{data}|\theta] \, dP[\theta]$$
which will be a hairy high-dimensional integral.
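To make that concrete: for a one-parameter model, the integral is easy to do by brute force. Here's a sketch (numpy/scipy, made-up data counts) for a biased-coin model with a uniform prior:

```python
import numpy as np
from scipy.integrate import quad

h, t = 7, 3   # made-up data: 7 heads, 3 tails

# P[data|theta]: probability of this sequence given heads-frequency theta.
def likelihood(theta):
    return theta**h * (1 - theta) ** t

# P[data|model] = integral of P[data|theta] dP[theta], with uniform prior
# dP[theta] = dtheta on [0, 1].
evidence, _ = quad(likelihood, 0, 1)

print(evidence)  # ≈ 7.58e-4, the exact Beta-function value B(8, 4)
```

With one parameter this is trivial; the trouble starts when $\theta$ has dozens or thousands of dimensions, and grid-style integration blows up exponentially.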
Some special model structures allow us to simplify the problem, typically by factoring the integral into a product of one-dimensional integrals. But in general, we need some method for approximating these integrals.
The two most common approximation methods used in practice are Laplace approximation around the maximum-likelihood point, and MCMC (see e.g. here for application of MCMC to Bayes factors). We'll mainly talk about Laplace approximation here - in practice MCMC mostly works well in the same cases, assuming the unobserved parameters are continuous.
Laplace Approximation
Here's the idea of Laplace approximation. First, posterior distributions tend to be very pointy. This is mainly because independent probabilities multiply, so probabilities tend to scale exponentially with the number of data points. Think of the probabilities we calculated in the last two posts, with values like $10^{-70}$ or $10^{-20}$ - that's the typical case. If we're integrating over a function with values like that, we can basically just pay attention to the region around the highest value - other regions will have exponentially small weight.
Laplace's trick is to use a second-order approximation within that high-valued region. Specifically, since probabilities naturally live on a log scale, we'll take a second-order approximation of the log likelihood around its maximum point. Thus:
$$\int_\theta e^{\ln P[\text{data}|\theta]} \, dP[\theta] \approx \int_\theta e^{\ln P[\text{data}|\theta_{max}] + \frac{1}{2}(\theta - \theta_{max})^T \left(\frac{d^2 \ln P}{d\theta^2}\Big|_{\theta_{max}}\right)(\theta - \theta_{max})} \, dP[\theta]$$
If we assume that the prior $dP[\theta]$ is uniform (i.e. $dP[\theta] = d\theta$), then this looks like a normal distribution on $\theta$ with mean $\theta_{max}$ and covariance given by the inverse of the negative Hessian of the log likelihood. (It turns out that, even for non-uniform $dP[\theta]$, we can just transform $\theta$ so that the prior looks uniform near $\theta_{max}$, and transform it back when we're done.) The result:
$$\int_\theta e^{\ln P[\text{data}|\theta]} \, dP[\theta] \approx P[\text{data}|\theta_{max}] \, p[\theta_{max}] \, (2\pi)^{k/2} \det\left(-\frac{d^2 \ln P}{d\theta^2}\Big|_{\theta_{max}}\right)^{-1/2}$$
Let's walk through each of those pieces:

- $P[\text{data}|\theta_{max}]$ is the likelihood at the peak - the height of the peak.
- $p[\theta_{max}]$ is the prior density at the peak.
- $(2\pi)^{k/2} \det\left(-\frac{d^2 \ln P}{d\theta^2}\Big|_{\theta_{max}}\right)^{-1/2}$ is the volume of $\theta$-space which the peak occupies, where $k$ is the dimension of $\theta$ (i.e. the number of parameters).
A bit more detail on that last piece: intuitively, each eigenvalue of the Fisher information matrix tells us the approximate width of the peak in a particular direction. Since the matrix is the inverse variance (in one dimension, $\frac{1}{\sigma^2}$) of our approximate normal distribution, and the "width" of the peak of a normal distribution corresponds to the standard deviation $\sigma$, we use an inverse square root (i.e. the power of $-\frac{1}{2}$) to extract a width from each eigenvalue. Then, to find how much volume the peak covers, we multiply together the widths along each direction - thus the determinant, which is the product of the eigenvalues.
Why do we need eigenvalues? The diagram above shows the general idea: for the function shown, the two arrows would be eigenvectors of the Hessian $\frac{d^2 \ln P}{d\theta^2}$ at the peak. Under a second-order approximation, these are the principal axes of the function's level sets (the ellipses in the diagram). They are the natural directions along which to measure the width. The eigenvalue associated with each eigenvector tells us the width (via the inverse square root, as above), and taking their product (via the determinant) gives a volume. In the picture above, the determinant would be proportional to the volume of any of the ellipses.
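As a sanity check on the volume story, here's a tiny numeric example (numpy, with a made-up two-parameter Hessian): the product of the per-direction widths equals $\det(-H)^{-1/2}$.

```python
import numpy as np

# A made-up Hessian of the log likelihood at the peak (negative definite).
hessian = np.array([[-50.0, -20.0],
                    [-20.0, -10.0]])

# Eigendecomposition of the *negative* Hessian, which is positive definite.
eigvals, eigvecs = np.linalg.eigh(-hessian)

# Width of the peak along each principal axis: inverse square root.
widths = eigvals ** -0.5

# Volume factor: product of widths = det(-H)^(-1/2).
print(np.prod(widths))
print(np.linalg.det(-hessian) ** -0.5)  # same number: 0.1
```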
Altogether, then, the Laplace approximation takes the height of the peak (i.e. $P[\text{data}|\theta_{max}] \, p[\theta_{max}]$) and multiplies by the volume of $\theta$-space which the peak occupies, based on a second-order approximation of the likelihood around its peak.
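To put the whole formula together in code: here's a sketch of the Laplace approximation for the one-parameter biased-coin model, checked against the exact evidence (which this model happens to have in closed form). The data counts are made up; with more parameters, the scalar second derivative becomes a Hessian matrix and the square root becomes $\det(\cdot)^{-1/2}$.

```python
import numpy as np
from scipy.special import betaln

h, t = 140, 60          # made-up data; k = 1 parameter
f_max = h / (h + t)     # maximum-likelihood heads-frequency

# Height of the peak: log likelihood at the max, plus log prior density
# (uniform prior on [0, 1], so ln p[f_max] = 0).
log_height = h * np.log(f_max) + t * np.log(1 - f_max)

# Second derivative of the log likelihood at the max.
hess = -h / f_max**2 - t / (1 - f_max) ** 2

# Laplace approximation: height * (2*pi)^(k/2) * (-hess)^(-1/2), in logs.
log_laplace = log_height + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(-hess)

# Exact answer: integral of f^h (1-f)^t df is the Beta function B(h+1, t+1).
log_exact = betaln(h + 1, t + 1)

print(log_laplace, log_exact)  # both ≈ -124.7, agreeing to ~0.005
```

Note how everything is done on the log scale - with probabilities of order $10^{-54}$, exponentiating first would underflow.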
Laplace Complexity Penalty
The Laplace approximation contains our first example of an explicit complexity penalty.
The idea of a complexity penalty is that we first find the maximum log likelihood $\ln P[\text{data}|\theta_{max}]$, maybe add a term for our $\theta$-prior $\ln p[\theta_{max}]$, and call that the "score" of our model. But more general models, with more free parameters, will always score at least as high, leading to overfitting. To counterbalance that, we calculate some numerical penalty which is larger for more complex models (i.e. those with more free parameters) and subtract that penalty from the raw score.
In the case of Laplace approximation, a natural complexity penalty drops out as soon as we take the log of the approximation formula:
$$\ln P[\text{data}|\text{model}] \approx \ln P[\text{data}|\theta_{max}] + \ln p[\theta_{max}] + \frac{k}{2}\ln(2\pi) - \frac{1}{2}\ln\det\left(-\frac{d^2 \ln P}{d\theta^2}\Big|_{\theta_{max}}\right)$$
The last two terms are the complexity penalty. As we saw above, they give the (log) volume of the likelihood peak in θ-space. The wider the peak, the larger the chunk of θ-space which actually predicts the observed data.
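Here's the penalty doing its job on made-up data: a fair-coin model (no free parameters, so no penalty) versus the biased-coin model (one free parameter). The biased model fits slightly better at its peak, but the penalty more than cancels the difference:

```python
import numpy as np

h, t = 108, 92   # made-up data: close to fair, slight excess of heads
n = h + t

# Model 1: fair coin. No free parameters, so no penalty; the score is
# just the log likelihood.
score_fair = n * np.log(0.5)

# Model 2: biased coin with uniform prior over the heads-frequency f.
f_max = h / n
log_lik = h * np.log(f_max) + t * np.log(1 - f_max)
hess = -h / f_max**2 - t / (1 - f_max) ** 2

# The last two terms of the Laplace formula above, with k = 1.
penalty_terms = 0.5 * np.log(2 * np.pi) - 0.5 * np.log(-hess)
score_biased = log_lik + 0.0 + penalty_terms   # ln p[f_max] = 0

print(score_fair, score_biased)  # ≈ -138.6 vs ≈ -140.4: the fair coin wins
```

The biased model's peak is higher ($-138.0$ vs $-138.6$ at the max), but the peak occupies only a small chunk of $f$-space, so the fair coin comes out ahead overall.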
There are two main problems with this complexity penalty:
Historically, a third issue was the math/coding work involved in calculating a Hessian, but modern backprop tools like TensorFlow or autograd make that pretty easy; I expect in the next few years we'll see a lot more people using a Laplace-based complexity penalty directly. The $O(k^3)$ runtime remains a serious problem for large-scale models, however, and that problem is unlikely to be solved any time soon: a linear-time method for computing the Hessian log determinant would yield an $O(n^2)$ matrix multiplication algorithm.
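For example, here's roughly what that looks like with JAX (a sketch; the quadratic `log_lik` is a made-up stand-in for whatever model you're actually scoring):

```python
import jax
import jax.numpy as jnp

# Stand-in log likelihood: a made-up quadratic in 3 parameters.
def log_lik(theta):
    return -jnp.sum(jnp.array([1.0, 4.0, 9.0]) * theta**2)

theta_max = jnp.zeros(3)   # pretend we already found the maximum

# Hessian of the log likelihood via automatic differentiation.
hess = jax.hessian(log_lik)(theta_max)

# Complexity penalty terms from the Laplace formula. slogdet gives the
# log determinant directly, avoiding overflow; this is the O(k^3) step.
k = theta_max.shape[0]
_, logdet = jnp.linalg.slogdet(-hess)
penalty_terms = 0.5 * k * jnp.log(2 * jnp.pi) - 0.5 * logdet

print(penalty_terms)
```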