I keep meaning to write a post on this...
At least within Bayesian probability, there is a single unambiguously-correct answer to "how should we penalize for complexity?": the Bayes factor. The Bayes factor is hard to compute in general, which is why there's a whole slew of other numbers which approximate it in various ways.
Here's how Bayes factors work. Want to know whether model 1 or model 2 is more consistent with the data? Then compute $P[\text{model}_1\mid\text{data}]$ and $P[\text{model}_2\mid\text{data}]$. Using Bayes' rule:
$$P[\text{model}_i\mid\text{data}] = \frac{1}{Z}\, P[\text{data}\mid\text{model}_i]\, P[\text{model}_i]$$
where $Z$ is the normalizer. If we're just comparing two models, then we can get rid of that annoying $Z$ by computing odds for the two models:
$$\frac{P[\text{model}_1\mid\text{data}]}{P[\text{model}_2\mid\text{data}]} = \frac{P[\text{data}\mid\text{model}_1]}{P[\text{data}\mid\text{model}_2]}\cdot\frac{P[\text{model}_1]}{P[\text{model}_2]}$$
In English: the posterior relative odds of the two models are equal to the prior odds times the ratio of likelihoods. That likelihood ratio $\frac{P[\text{data}\mid\text{model}_1]}{P[\text{data}\mid\text{model}_2]}$ is the Bayes factor: it directly describes the update in the relative odds of the two models, due to the data.
Example
20 coin flips yield 16 heads and 4 tails. Is the coin biased?
Here we have two models:
- Model 1: coin unbiased
- Model 2: coin has some unknown probability $\theta$ of coming up heads (we'll use a uniform prior on $\theta$ for simplicity; this is not a good idea in general)
The second model has one free parameter (the bias) which we can use to fit the data, but it’s more complex and prone to over-fitting. When we integrate over that free parameter, it will fit the data poorly over most of the parameter space - thus the "penalty" associated with free parameters in general.
In this example, the integral is exactly tractable (it's a Dirichlet-multinomial model), and we get:
- $P[\text{data}\mid\text{model}_1] = \binom{20}{4}\left(\frac{1}{2}\right)^{16}\left(\frac{1}{2}\right)^4 = 0.0046$
- $P[\text{data}\mid\text{model}_2] = \int_\theta \binom{20}{4}\,\theta^{16}(1-\theta)^4\, d\theta = 0.048$
So the Bayes factor is $(0.048)/(0.0046) \approx 10$, in favor of a biased coin. In practice, I'd say unbiased coins are at least 10x more likely than biased coins in day-to-day life a priori, so we might still think the coin is unbiased. But if we were genuinely unsure to start with, then this would be pretty decent evidence in favor.
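If you want to check the arithmetic, here's a minimal Python sketch that computes both marginal likelihoods, using scipy's Beta function for the integral:

```python
from math import comb
from scipy.special import beta   # Beta function B(a, b)

heads, tails = 16, 4
n = heads + tails

# Model 1: fair coin, theta fixed at 1/2
p_data_m1 = comb(n, tails) * 0.5**heads * 0.5**tails

# Model 2: uniform prior on theta; the integral of theta^16 * (1 - theta)^4
# over [0, 1] is the Beta function B(17, 5)
p_data_m2 = comb(n, tails) * beta(heads + 1, tails + 1)

print(p_data_m1)              # ~0.0046
print(p_data_m2)              # ~0.048
print(p_data_m2 / p_data_m1)  # Bayes factor ~10 in favor of the biased coin
```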
Approximation
Why is this hard to compute in general? Well, if the model has an $m$-dimensional vector of free parameters $\theta$, then $P[\text{data}\mid\text{model}] = \int_\theta P[\text{data}\mid\text{model},\theta]\, dP[\theta\mid\text{model}]$, which is an integral in $m$ dimensions. That's not just NP-complete, it's #P-complete.
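To make that integral concrete in the 1-dimensional coin case, here's a quick sketch that estimates $P[\text{data}\mid\text{model}_2]$ by brute-force Monte Carlo over the prior - the naive approach that stops working as the dimension grows:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
heads, tails = 16, 4

# Sample theta from the prior (uniform on [0, 1]) and average the likelihood
# over those samples; the mean converges to the marginal likelihood.
thetas = rng.uniform(size=200_000)
likelihoods = comb(heads + tails, tails) * thetas**heads * (1 - thetas)**tails
print(likelihoods.mean())   # ~0.048, matching the exact value above
```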
There are various ways to approximate that integral. The two most common are:
- MCMC (see e.g. here for application to Bayes factors): main limitation is that MCMC gets finicky in high dimensions.
- Laplace approximation around the maximum-likelihood point. Aside from the usual limitations of using a single maximum-likelihood point, main limitation is that we need the determinant of the Hessian, which takes somewhere between $O(k)$ and $O(k^3)$ to compute for most $k$-free-parameter models (depending on how clever we are with the linear algebra).
BIC
The BIC can be derived directly from the Laplace approximation. Laplace says:
$$\int_\theta P[\text{data}\mid\theta]\, p[\theta]\, d\theta \approx P[\text{data}\mid\theta^*]\, p[\theta^*]\, (2\pi)^{\frac{k}{2}} \det\left(-\frac{\partial^2 \ln P[\text{data}\mid\theta]}{\partial\theta^2}\right)^{-\frac{1}{2}}$$
where $\theta^*$ is the maximum likelihood point and $k$ is the dimension of $\theta$.
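To see the formula in action on the coin example above (where $k=1$, the prior density is 1, and the Hessian is a scalar we can write down by hand), here's a minimal sketch:

```python
import numpy as np
from math import comb

heads, tails = 16, 4
theta_star = heads / (heads + tails)   # maximum-likelihood point, 0.8

# Log-likelihood at theta*, including the binomial coefficient
log_lik_star = (np.log(comb(heads + tails, tails))
                + heads * np.log(theta_star) + tails * np.log(1 - theta_star))

# Second derivative of ln P[data|theta], evaluated at theta*
# (analytic here; in general this is a Hessian, often computed numerically)
hessian = -heads / theta_star**2 - tails / (1 - theta_star)**2

prior_density = 1.0                    # uniform prior on [0, 1]

# Laplace: P[data|theta*] * p(theta*) * (2*pi)^(k/2) * det(-H)^(-1/2), with k = 1
laplace = np.exp(log_lik_star) * prior_density * np.sqrt(2 * np.pi / -hessian)
print(laplace)   # ~0.049, vs. the exact marginal likelihood ~0.048
```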
Notice that $\ln(P[\text{data}\mid\theta^*]\, p[\theta^*])$ is the usual "score" of a model; the rest is what we can think of as a complexity penalty, and that's what the BIC will approximate. The main assumption is that $\frac{\partial^2 \ln P[\text{data}\mid\theta]}{\partial\theta^2}$ scales with the number of data points $N$; this is trivial if the data points are iid (given $\theta$), since in that case $\ln P[\text{data}\mid\theta]$ is a sum over data points.
Now for the trick: that's the only part of the Laplace approximation's "complexity penalty" which scales with the number of data points $N$. It scales in proportion to $N^{-\frac{k}{2}}$, since each of the $k$ dimensions of the determinant contributes a factor of $N$. Take the log of the whole thing, and all the rest will be constants which are dominated by the $\frac{k}{2}\ln N$ term as $N$ goes to infinity. And that's the BIC: log of the MAP probability, minus $\frac{k}{2}\ln N$.
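Continuing the coin example, here's a small sketch of the BIC in the form just described (log of the MAP probability minus $\frac{k}{2}\ln N$), compared against the exact log marginal likelihood:

```python
import numpy as np
from math import comb
from scipy.special import beta

heads, tails = 16, 4
N = heads + tails   # number of data points
k = 1               # number of free parameters in model 2

theta_star = heads / N
log_lik_star = (np.log(comb(N, tails))
                + heads * np.log(theta_star) + tails * np.log(1 - theta_star))
log_prior_star = 0.0                  # log density of the uniform prior at theta*

# BIC-style score: log MAP probability minus (k/2) ln N
bic = log_lik_star + log_prior_star - 0.5 * k * np.log(N)

# Exact log marginal likelihood, for comparison
exact = np.log(comb(N, tails) * beta(heads + 1, tails + 1))

print(bic)    # ~ -3.02
print(exact)  # ~ -3.04
```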
Finally, the important part: the main benefit of knowing all this is being able to predict when the BIC will and will not work well. If it's an approximation of Laplace, which is itself an approximation of the Bayes factor, then BIC will work well exactly when those approximations are accurate. In particular, BIC fails when the terms we ignored (i.e. $\ln\left((2\pi)^{\frac{k}{2}} \det\left(-\frac{1}{N}\frac{\partial^2 \ln P[\text{data}\mid\theta]}{\partial\theta^2}\right)^{-\frac{1}{2}}\right)$) are as large as or even larger than $\ln(P[\text{data}\mid\theta^*]\, p[\theta^*]) - \frac{k}{2}\ln N$, i.e. the BIC itself.
Penalization might not be necessary if your learning procedure is stochastic and favors simple explanations. I encourage you to take a look at the nice poster/paper «Deep learning generalizes because the parameter-function map is biased towards simple functions» (PAC-Bayesian learning theory + empirical intuitions).