I keep meaning to write a post on this...
At least within Bayesian probability, there is a single unique unambiguously-correct answer to "how should we penalize for complexity?": the Bayes factor. The Bayes factor is hard to compute in general, which is why there's a whole slew of other numbers which approximate it in various ways.
Here's how Bayes factors work. Want to know whether model 1 or model 2 is more consistent with the data? Then compute $P[\text{model}_1|\text{data}]$ and $P[\text{model}_2|\text{data}]$. Using Bayes' rule:

$$P[\text{model}|\text{data}] = \frac{1}{Z} P[\text{data}|\text{model}] \, P[\text{model}]$$

where Z is the normalizer. If we're just comparing two models, then we can get rid of that annoying Z by computing odds for the two models:

$$\frac{P[\text{model}_1|\text{data}]}{P[\text{model}_2|\text{data}]} = \frac{P[\text{data}|\text{model}_1]}{P[\text{data}|\text{model}_2]} \cdot \frac{P[\text{model}_1]}{P[\text{model}_2]}$$

In English: the posterior relative odds of the two models are equal to the prior odds times the ratio of likelihoods. That likelihood ratio is the Bayes factor: it directly describes the update in the relative odds of the two models, due to the data.
20 coin flips yield 16 heads and 4 tails. Is the coin biased?
Here we have two models:
1. The coin is unbiased: the probability of heads is exactly 1/2.
2. The coin has some unknown bias $\theta$, with a uniform prior on $[0, 1]$.
The second model has one free parameter (the bias) which we can use to fit the data, but it’s more complex and prone to over-fitting. When we integrate over that free parameter, it will fit the data poorly over most of the parameter space - thus the "penalty" associated with free parameters in general.
In this example, the integral is exactly tractable (it's a Dirichlet-multinomial model), and we get:

$$P[\text{data}|\text{model}_1] = \binom{20}{16} \left(\tfrac{1}{2}\right)^{20} \approx 0.0046$$

$$P[\text{data}|\text{model}_2] = \int_0^1 \binom{20}{16} \theta^{16} (1-\theta)^4 \, d\theta = \tfrac{1}{21} \approx 0.048$$
So the Bayes factor is (.048)/(.0046) ~ 10, in favor of a biased coin. In practice, I'd say unbiased coins are at least 10x more likely than biased coins in day-to-day life a priori, so we might still think the coin is unbiased. But if we were genuinely unsure to start with, then this would be pretty decent evidence in favor.
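For concreteness, here's a small numerical sketch of that calculation (assuming, as above, a uniform prior on the bias; the brute-force sum is just there to make the "integrate over the free parameter" step explicit):

```python
from math import comb

n, heads, tails = 20, 16, 4

# Model 1: unbiased coin, theta fixed at 1/2.
p_data_unbiased = comb(n, heads) * 0.5 ** n          # ~0.0046

# Model 2: unknown bias theta, uniform prior on [0, 1].
# Integrate P[data | theta] over the prior (crude Riemann sum):
steps = 100_000
p_data_biased = sum(
    comb(n, heads) * (i / steps) ** heads * (1 - i / steps) ** tails
    for i in range(1, steps)
) / steps                                            # ~0.048 (exactly 1/(n+1))

print(p_data_biased / p_data_unbiased)               # Bayes factor ~10
```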
Why is this hard to compute in general? Well, if the model has an m-dimensional vector of free parameters $\theta$, then $P[\text{data}|\text{model}] = \int P[\text{data}|\theta, \text{model}] \, P[\theta|\text{model}] \, d\theta$, which is an integral in m dimensions. That's not just NP-complete, it's #P-complete.
There are various ways to approximate that integral. The two most common are:
1. The Laplace approximation
2. The Bayesian Information Criterion (BIC)
The BIC can be derived directly from the Laplace approximation. Laplace says:

$$P[\text{data}|\text{model}] \approx P[\text{data}|\theta^*, \text{model}] \, P[\theta^*|\text{model}] \, (2\pi)^{k/2} \det\left(-\frac{\partial^2 \ln P[\text{data}|\theta, \text{model}]}{\partial \theta^2}\bigg|_{\theta^*}\right)^{-1/2}$$

where $\theta^*$ is the maximum likelihood point and k is the dimension of $\theta$.
Notice that $P[\text{data}|\theta^*, \text{model}]$ is the usual "score" of a model; the rest is what we can think of as a complexity penalty, and that's what the BIC will approximate. The main assumption is that $\frac{\partial^2 \ln P[\text{data}|\theta, \text{model}]}{\partial \theta^2}$ scales with the number of data points N; this is trivial if the data points are IID (given $\theta$), since in that case $\ln P[\text{data}|\theta, \text{model}]$ is a sum over data points.
Now for the trick: that's the only part of the Laplace approximation's "complexity penalty" which scales with the number of data points N. It scales in proportion to $N^k$, since each of the k dimensions of the determinant contributes a factor of N. Take the log of the whole thing, and all the rest will be constants which are dominated by the $\frac{k}{2} \ln N$ term as N goes to infinity. And that's the BIC: log of the MAP probability, minus $\frac{k}{2} \ln N$.
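To make that concrete, here's a quick numerical check on the coin example above (k = 1, N = 20, uniform prior, so the prior term $\ln P[\theta^*|\text{model}]$ is zero); the exact log evidence, the Laplace approximation, and the BIC all land close together:

```python
from math import comb, log, pi

n, heads, tails = 20, 16, 4
k = 1                                          # one free parameter: the bias

def log_lik(theta):
    return log(comb(n, heads)) + heads * log(theta) + tails * log(1 - theta)

theta_star = heads / n                         # maximum likelihood point, 0.8

# Exact log evidence: the integral over a uniform prior is 1/(n+1).
exact = log(1 / (n + 1))                                                # ~ -3.04

# Laplace approximation: log-likelihood at the peak, plus log prior,
# plus (k/2) log(2*pi), minus (1/2) log det of minus the Hessian at the peak.
log_prior = 0.0                                # uniform prior on [0, 1] has density 1
hessian = heads / theta_star**2 + tails / (1 - theta_star) ** 2         # = 125
laplace = log_lik(theta_star) + log_prior + 0.5 * k * log(2 * pi) - 0.5 * log(hessian)  # ~ -3.02

# BIC: keep only the terms that grow with N.
bic = log_lik(theta_star) - 0.5 * k * log(n)                            # ~ -3.02

print(exact, laplace, bic)
```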
Finally, the important part: the main benefit of knowing all this is being able to predict when the BIC will and will not work well. If it's an approximation of Laplace, which is itself an approximation of the Bayes factor, then BIC will work well exactly when those approximations are accurate. In particular, BIC fails when the terms we ignored (i.e. $\ln P[\theta^*|\text{model}]$ and the other constant terms from the Laplace approximation) are as large as or even larger than $\ln P[\text{data}|\theta^*, \text{model}] - \frac{k}{2} \ln N$, i.e. the BIC itself.
Very helpful comment. But why the uniform prior, and not the bias that has the highest likelihood given the data? theta = 4/5 would give you a Bayes factor of 47, right?
First things first, keep in mind the thing we're actually interested in: P[model | data]. The Bayes factor summarizes the contribution of the data to that number, in a convenient and reusable way, but the Bayes factor itself is not what we really want to know.
In the example, one model is "unbiased coin", the other is "some unknown bias". The prior over the bias is our distribution on $\theta$ before seeing the data. Thus the name "prior". It's $P[\theta]$, not $P[\theta|\text{data}]$. Before seeing the data, we have no reason at all to think that 4/5 is special. Thus, a uniform prior. This all comes directly from applying the rules of probability:

$$P[\text{data}|\text{model}] = \int_0^1 P[\text{data}|\theta, \text{model}] \, P[\theta|\text{model}] \, d\theta$$
If you stick strictly to the rules of probability themselves, you will never find any reason to maximize anything. There is no built-in maximization anywhere in probability. There are lots of integrals (or sums, in the discrete case), and we usually approximate those integrals by maximizing things - but those are approximations, always. In particular, if you try to compute P[model | data], you will not find any maximization of anything - unless you introduce it to approximate an integral.
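Here's the difference in numbers, for the coin example (again assuming the uniform prior): averaging the likelihood over the prior gives the factor of ~10; plugging in the best-fit theta = 4/5 gives the ~47, but that number is not a Bayes factor.

```python
from math import comb

n, heads, tails = 20, 16, 4

def likelihood(theta):
    return comb(n, heads) * theta**heads * (1 - theta) ** tails

p_fair = likelihood(0.5)            # ~0.0046

# Rules of probability: average the likelihood over the prior on theta.
p_integrated = 1 / (n + 1)          # ~0.048, exact value of the uniform-prior integral

# Maximization instead of integration: plug in the best-fit theta.
p_best_fit = likelihood(heads / n)  # ~0.218

print(p_integrated / p_fair)        # ~10: the Bayes factor
print(p_best_fit / p_fair)          # ~47: an overestimate from maximizing, not integrating
```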
Thanks for this. I’m trying to get an intuition on how this works.
My mental picture is to imagine the likelihood function with respect to theta of the more complex model. The simpler model is the equivalent of a rectangular function with height equal to its likelihood and width 1.
The relative areas under the graphs reflect the marginal likelihoods of the models. So picturing the relative maximum likelihoods, and how sharp the peak of the more complex model is, gives an impression of the Bayes factor.
Does that work? Or is there a better mental model?
A different view is to look at the search process for the models, rather than the model itself. If model A is found from a process that evaluates 10 models, and model B is found from a process that evaluates 10,000, and they otherwise have similar results, then A is much more likely to generalize to new data points than B.
The formalization of this concept is called VC dimension and is a big part of Machine Learning Theory (although arguably it hasn't been very helpful in practice): https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension
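As a toy illustration of the search-process point (the setup here is purely hypothetical): if the candidate "models" are just random guessers and we keep whichever one scores best on noise labels, a bigger search produces a better-looking training score with zero real signal behind it - so the same training score is weaker evidence when it came out of a bigger search.

```python
import random

random.seed(0)
train_labels = [random.random() < 0.5 for _ in range(30)]   # pure-noise "data"

def best_training_score(n_models):
    """Search over n_models random predictors; return the best training accuracy."""
    best = 0.0
    for _ in range(n_models):
        preds = [random.random() < 0.5 for _ in train_labels]   # a "model": random guesses
        score = sum(p == y for p, y in zip(preds, train_labels)) / len(train_labels)
        best = max(best, score)
    return best

print(best_training_score(10))       # ~0.6: looks mediocre
print(best_training_score(10_000))   # ~0.8: looks great, but it's pure selection effect
```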
Is there a theoretical justification for using VC dimension as a basis for generalization, or is it treated as an axiomatic desideratum?
A good exposition of the related theorems is in Chapter 6 of Understanding Machine Learning (https://www.amazon.com/Understanding-Machine-Learning-Theory-Algorithms/dp/1107057132/ref=sr_1_1?crid=2MXVW7VOQH6FT&keywords=understanding+machine+learning+from+theory+to+algorithms&qid=1562085244&s=gateway&sprefix=understanding+machine+%2Caps%2C196&sr=8-1)
There are several related theorems. Roughly:
1. The error on real data will be similar to the error on the training set + epsilon, where epsilon shrinks roughly like sqrt(VC dimension / number of datapoints). This is the one I linked above.
2. The error on real data will be similar to the error of the best hypothesis in the hypothesis class, with similar proportionality.
3. Special case of 2 - if the true hypothesis is in the hypothesis class, then the absolute error will be < epsilon (since the absolute error is just the difference from the true, best hypothesis).
3 is probably the one you're thinking of, but you don't need the hypothesis to be in the class.
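For reference, the quantitative shape of the bound behind (1), as I recall it (constants folded into a generic C, m = number of training points):

$$\Pr\left[\ \sup_{h \in \mathcal{H}} \big| L_{\mathcal{D}}(h) - L_{S}(h) \big| \le C \sqrt{\frac{\mathrm{VCdim}(\mathcal{H}) + \log(1/\delta)}{m}}\ \right] \ge 1 - \delta$$

where $L_S$ is error on the training sample and $L_{\mathcal{D}}$ is error on the true distribution.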
The exact Bayesian solution penalizes complex models as a side effect. Each model should have a prior over its parameters. The more complex model can fit the data better, so P(data | best-fit parameters, model) is higher. But the model gets penalized because P(best-fit parameters | model) is lower on the prior. Why? The prior is thinly spread over a higher dimensional parameter space so it is lower for any particular set of parameters. This is called "Bayesian Occam's razor".
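Here's a rough numerical sketch of that effect (the priors, noise level, and grids are my own illustrative choices, not anything canonical): the evidence for a 2-parameter polynomial model comes out lower than for a 1-parameter one when the data actually come from the simpler model, because the prior mass is spread over a whole extra dimension that mostly fits the data badly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 2.0 * x + rng.normal(0.0, 0.3, size=x.shape)   # data actually generated by y = 2x

sigma = 0.3                                        # noise level, assumed known
grid = np.linspace(-4, 4, 201)                     # uniform prior on each coefficient over [-4, 4]
prior_density = 1.0 / 8.0
dtheta = grid[1] - grid[0]

def likelihood(y_pred):
    return np.prod(np.exp(-0.5 * ((y - y_pred) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi)))

# Model A: y = a*x. Evidence = 1D integral of likelihood * prior.
evidence_a = sum(likelihood(a * x) for a in grid) * prior_density * dtheta

# Model B: y = a*x + b*x^2. Same data, but now the prior is spread over a plane,
# and most of that plane fits the data poorly.
evidence_b = sum(
    likelihood(a * x + b * x**2) for a in grid for b in grid
) * prior_density**2 * dtheta**2

print(evidence_a / evidence_b)   # > 1: the simpler model gets the higher evidence
```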
If one has 2 possible models to fit a data set, by how much should one penalize the model which has an additional free parameter?
Penalization might not be necessary if your learning procedure is stochastic and favors simple explanations. I encourage you to take a look at the nice poster/paper «Deep learning generalizes because the parameter-function map is biased towards simple functions» (PAC-Bayesian learning theory + empirical intuitions).
Previously I asked about Solomonoff induction but essentially I asked the wrong question. Richard_Kennaway pointed me in the direction of an answer to the question which I should have asked but after investigating I still had questions.
So:
If one has 2 possible models to fit to a data set, by how much should one penalise the model which has an additional free parameter?
A couple of options which I came across were:
AIC, which has a flat factor of e penalty for each additional parameter.
BIC, which has a factor of √n penalty for each additional parameter.
where n is the number of data points. (A quick numerical sketch of these two penalties follows below.)
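A quick numerical sketch of those two penalties on a toy example (20 coin flips, 16 heads; one model with no free parameters vs. one with a single bias parameter; the numbers are chosen purely for illustration):

```python
from math import comb, log, sqrt, exp

n, heads, tails = 20, 16, 4

def max_log_lik(theta):
    return log(comb(n, heads) * theta**heads * (1 - theta) ** tails)

# (k = number of free parameters, maximized log-likelihood)
models = {"fixed theta = 1/2": (0, max_log_lik(0.5)),
          "free theta (fit = 16/20)": (1, max_log_lik(16 / 20))}

for name, (k, ll) in models.items():
    aic = 2 * k - 2 * ll            # lower is better
    bic = k * log(n) - 2 * ll       # lower is better
    print(f"{name}:  AIC = {aic:.2f}, BIC = {bic:.2f}")

# On the likelihood scale, each extra parameter costs a factor of e under AIC
# and a factor of sqrt(n) under BIC:
print(exp(1.0), sqrt(n))            # ~2.72 vs ~4.47 here
```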
On the one hand, having a penalty which increases with n makes sense - a useful additional parameter should be able to provide more evidence the more data you have. On the other hand, having a penalty which increases with n means your prior will be different depending on the number of data points, which seems wrong.
So, count me confused. Maybe there are other options which are more helpful. I don't know if the answer is too complex for a blog post but, if so, any suggestions of good text books on the subject would be great.
EDIT: johnswentworth has written a sequence which expands on the answer which he gives below.