The Joys of Conjugate Priors

TCB

(Warning: this post is a bit technical.)

Suppose you are a Bayesian reasoning agent. While going about your daily activities, you observe an event of type $x$ . Because you're a good Bayesian, you have some internal parameter $\theta$ which represents your belief that $x$ will occur.

Now, you're familiar with the Ways of Bayes, and therefore you know that your beliefs must be updated with every new datapoint you perceive. Your observation of $x$ is a datapoint, and thus you'll want to modify $\theta$ . But how much should this datapoint influence $\theta$ ? Well, that will depend on how sure you are of $\theta$ in the first place. If you calculated $\theta$ based on a careful experiment involving hundreds of thousands of observations, then you're probably pretty confident in its value, and this single observation of $x$ shouldn't have much impact. But if your estimate of $\theta$ is just a wild guess based on something your unreliable friend told you, then this datapoint is important and should be weighted much more heavily in your reestimation of $\theta$ .

Of course, when you reestimate $\theta$ , you'll also have to reestimate how confident you are in its value. Or, to put it a different way, you'll want to compute a new probability distribution over possible values of $\theta$ . This new distribution will be $P(\theta|x)$ , and it can be computed using Bayes' rule:

$P(\theta|x)=\frac{P(x|\theta)P(\theta)}{\int P(x|\theta)P(\theta)d\theta}$

Here, since $\theta$ is a parameter used to specify the distribution from which $x$ is drawn, it can be assumed that computing $P(x|\theta)$ is straightforward.   $P(\theta)$ is your old distribution over $\theta$ , which you already have; it says how accurate you think different settings of the parameters are, and allows you to compute your confidence in any given value of $\theta$ . So the numerator should be straightforward to compute; it's the denominator which might give you trouble, since for an arbitrary distribution, computing the integral is likely to be intractable.

But you're probably not really looking for a distribution over different parameter settings; you're looking for a single best setting of the parameters that you can use for making predictions. If this is your goal, then once you've computed the distribution $P(\theta|x)$ , you can pick the value of $\theta$ that maximizes it. This will be your new parameter, and because you have the formula $P(\theta|x)$ , you'll know exactly how confident you are in this parameter.

In practice, picking the value of $\theta$ which maximizes $P(\theta|x)$ is usually pretty difficult, thanks to the presence of local optima, as well as the general difficulty of optimization problems. For simple enough distributions, you can use the EM algorithm, which is guarranteed to converge to a local optimum. But for more complicated distributions, even this method is intractable, and approximate algorithms must be used. Because of this concern, it's important to keep the distributions $P(x|\theta)$ and $P(\theta)$ simple. Choosing the distribution $P(x|\theta)$ is a matter of model selection; more complicated models can capture deeper patterns in data, but will take more time and space to compute with.

It is assumed that the type of model is chosen before deciding on the form of the distribution $P(\theta)$ . So how do you choose a good distribution for $P(\theta)$ ? Notice that every time you see a new datapoint, you'll have to do the computation in the equation above. Thus, in the course of observing data, you'll be multiplying lots of different probability distributions together. If these distributions are chosen poorly, $P(\theta)$ could get quite messy very quickly.

If you're a smart Bayesian agent, then, you'll pick $P(\theta)$ to be a conjugate prior to the distribution $P(x|\theta)$ . The distribution $P(\theta)$ is conjugate to $P(x|\theta)$ if multiplying these two distributions together and normalizing results in another distribution of the same form as $P(\theta)$ .

Let's consider a concrete example: flipping a biased coin. Suppose you use the bernoulli distribution to model your coin. Then it has a parameter $\theta$ which represents the probability of gettings heads. Assume that the value 1 corresponds to heads, and the value 0 corresponds to tails. Then the distribution of the outcome $x$ of the coin flip looks like this:

$P(x|\theta)=\theta^x(1-\theta)^{1-x}$

It turns out that the conjugate prior for the bernoulli distribution is something called the beta distribution. It has two parameters, $\alpha$ and $\beta$ , which we call hyperparameters because they are parameters for a distribution over our parameters. (Eek!)

The beta distribution looks like this:

$P(\theta|\alpha,\beta)=\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{\int_0^1\theta^{\alpha-1}(1-\theta)^{\beta-1}d\theta}$

Since $\theta$ represents the probability of getting heads, it can take on any value between 0 and 1, and thus this function is normalized properly.

Suppose you observe a single coin flip $x$ and want to update your beliefs regarding $\theta$ . Since the denominator of the beta function in the equation above is just a normalizing constant, you can ignore it for the moment while computing $P(\theta|x)$ , as long as you promise to normalize after completing the computation:

$\begin{align*} P(\theta|x) &\propto P(x|\theta)P(\theta) \\ &\propto \Big(\theta^x(1-\theta)^{1-x}\Big) \Big(\theta^{\alpha-1}(1-\theta)^{\beta-1}\Big) \\ &=\theta^{x+\alpha-1}(1-\theta)^{(1-x)+\beta-1} \end{align*}$

Normalizing this equation will, of course, give another beta distribution, confirming that this is indeed a conjugate prior for the bernoulli distribution. Super cool, right?

If you are familiar with the binomial distribution, you should see that the numerator of the beta distribution in the equation for $P(\theta|\alpha,\beta)$ looks remarkably similar to the non-factorial part of the binomial distribution. This suggests a form for the normalization constant:

$P(\theta|\alpha,\beta) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{\alpha-1}(1-\theta)^{\beta-1}$

The beta and binomial distributions are almost identical. The biggest difference between them is that the beta distribution is a function of $\theta$ , with $\alpha$ and $\beta$ as prespecified parameters, while the binomial distribution is a function of $\alpha$ , with $\theta$ and $\alpha+\beta$ as prespecified parameters. It should be clear that the beta distribution is also conjugate to the binomial distribution, making it just that much awesomer.

Another difference between the two distributions is that the beta distribution uses gammas where the binomial distribution uses factorials. Recall that the gamma function is just a generalization of the factorial to the reals; thus, the beta distribution allows $\alpha$ and $\beta$ to be any positive real number, while the binomial distribution is only defined for integers. As a final note on the beta distribution, the -1 in the exponents is not philosophically significant; I think it is mostly there so that the gamma functions will not contain +1s. For more information about the mathematics behind the gamma function and the beta distribution, I recommend checking out this pdf: http://www.mhtl.uwaterloo.ca/courses/me755/web_chap1.pdf. It gives an actual derivation which shows that the first equation for $P(\theta|\alpha,\beta)$ is equivalent to the second equation for $P(\theta|\alpha,\beta)$ , which is nice if you don't find the argument by analogy to the binomial distribution convincing.

So, what is the philosophical significance of the conjugate prior? Is it just a pretty piece of mathematics that makes the computation work out the way we'd like it to? No; there is deep philosophical significance to the form of the beta distribution.

Recall the intuition from above: if you've seen a lot of data already, then one more datapoint shouldn't change your understanding of the world too drastically. If, on the other hand, you've seen relatively little data, then a single datapoint could influence your beliefs significantly. This intuition is captured by the form of the conjugate prior.   $\alpha$ and $\beta$ can be viewed as keeping track of how many heads and tails you've seen, respectively. So if you've already done some experiments with this coin, you can store that data in a beta distribution and use that as your conjugate prior. The beta distribution captures the difference between claiming that the coin has 30% chance of coming up heads after seeing 3 heads and 7 tails, and claiming that the coin has a 30% chance of coming up heads after seeing 3000 heads and 7000 tails.

Suppose you haven't observed any coin flips yet, but you have some intuition about what the distribution should be. Then you can choose values for $\alpha$ and $\beta$ that represent your prior understanding of the coin. Higher values of $\alpha+\beta$ indicate more confidence in your intuition; thus, choosing the appropriate hyperparameters is a method of quantifying your prior understanding so that it can be used in computation.   $\alpha$ and $\beta$ will act like "imaginary data"; when you update your distribution over $\theta$ after observing a coin flip $x$ , it will be like you already saw $\alpha$ heads and $\beta$ tails before that coin flip.

If you want to express that you have no prior knowledge about the system, you can do so by setting $\alpha$ and $\beta$ to 1. This will turn the beta distribution into a uniform distribution. You can also use the beta distribution to do add-N smoothing, by setting $\alpha$ and $\beta$ to both be N+1. Setting the hyperparameters to a value lower than 1 causes them to act like "negative data", which helps avoid overfitting $\theta$ to noise in the actual data.

In conclusion, the beta distribution, which is a conjugate prior to the bernoulli and binomial distributions, is super awesome. It makes it possible to do Bayesian reasoning in a computationally efficient manner, as well as having the philosophically satisfying interpretation of representing real or imaginary prior data. Other conjugate priors, such as the dirichlet prior for the multinomial distribution, are similarly cool.

This is a fantastic post! Well done.

That said, I have quibbles that relate to the philosophical import ascribed to the beta distribution:

the beta distribution is an excellent exemplar of the notion of the comparative weight of evidence in the prior vs the data, but the notion is much more general;
priors should ideally reflect the actual information at one's disposal, and thus should rarely actually be conjugate;
it's controversial to claim that alpha = beta = 1 expresses no prior knowledge; other proposals include the improper alpha = beta = 0 and Jeffreys' prior, alpha = beta = 0.5.

And one other complaint: using the notion of picking a "best" value of theta for prediction to motivate the subsequent discussion was a misstep. If prediction is the goal, then the Bayesian procedure is to formulate the joint distribution of theta and the as-yet-unobserved data and then treat theta as a nuisance parameter and integrate over it.

In spite of the above criticisms, I consider this post yeoman's work -- it deserves more upvotes than I can give it.

Thank you very much for the compliments, and for the honest criticism!

I am still thinking about your comment, and I intend to write a detailed response to it after I have thought about your criticisms more completely. In the meantime, though, I wanted to say that the feedback is very much appreciated!

After rereading this, I agree with you that I emphasized the beta distribution too heavily. This wasn't my intention; I just picked it because it was the simplest conjugate prior I could find. In the next draft of this document, I'll make sure to stress that the beta distribution is just one of many great conjugate priors!

I am a bit confused about what the second point means. Do you mean that conjugate priors are insufficient for capturing the actual prior knowledge possessed?

I did not know that it was controversial to claim that alpha = beta = 1 expresses no prior knowledge! I think I still prefer alpha = beta = 1 to the other choices, since the uniform distribution has the highest entropy of any continuous distribution over [0,1]. What are the benefits of the other two proposals?

Your last complaint is something I was worried about when I wrote this. Part of why I wrote it like that was because I figured people would be more familiar with the MLE/MAP style of prediction. Thanks to your feedback, though, I think I'll change that in my next draft of this document.

Again, thank you so much for the detailed criticism; it is very much appreciated! =)

The improper alpha = beta = 0 prior, sometimes known as Haldane's prior, is derived using an invariance argument in Jaynes's 1968 paper Prior Probabilities. I actually don't trust that argument -- I find the critiques of it here compelling.

Jeffreys priors are derived from a different invariance argument; Wikipedia has a pretty good article on the subject.

I have mostly used the uniform prior myself in the past, although I think in the future I'll be using the Jeffreys prior as a default for the binomial likelihood. But the maximum entropy argument for the uniform prior is flawed: differential entropy is not an extension of discrete Shannon entropy to continuous distributions. The correct generalization is to relative entropy. Since the measure is arbitrary, the maximum entropy argument is missing an essential component.

I think LessWrong could use more posts on actual technical topics in machine learning, and this is a nice first step. It would be good if there was a sequence on it.

You might want to include the link to the Wikipedia table of conjugate priors in your post, and at least a mention of exponential families.

If you're a smart Bayesian agent, then, you'll pick p(theta) to be a conjugate prior

While conjugate priors can be very useful computationally, it might also be the case that your data is not well-modeled by the conjugate prior (if you're using the Naieve bayes model then this might not seem like a huge problem, but once you start trying to build hierarchical models using conjugate priors, you have more potential to run into problems).

I would love to see an LW sequence on machine learning! I imagine that LW would have a lot of interesting things to say about the philosophical aspects of ML in addition to the practical aspects.

I'm not sure I'd be qualified to contribute much to such a sequence, since I am just an undergrad, but I did have an outline in mind for an intuitive introduction to MLE and EM. If people would find that interesting, I could certainly post it on LW once it was written up!

I'm fairly inexperienced in ML, so all the models I've worked with are simple enough that they've had conjugate priors. (I think it's really cool that Dirichlet priors can be used for something as complicated as an HMM, but I guess the HMM is still just a whole bunch of multinomials.) I'm less familiar with hierarchical models. What is an example of a model for which is it difficult to use conjugate priors? The only hierarchical process I've heard about is the Dirichlet process, and I was under the impression (based on the name) that it involved Dirichlet priors somewhere; is this incorrect? I have been meaning to read about hierarchical models, so if you know of any good tutorials or papers on them, I would very much appreciate a link!

Cyan's observation about mixtures of conjugate priors being conjugate kills the example I had in mind. Ill think for a bit and let you know if I think of any examples. If I haven't replied in a couple weeks, remind me and ill make sure to reply.

Dirichlet processes aren't inherently hierarchical, they are just self-conjugate, so you can make the output of one the input to the other. If you connect them up in a tree structure, you get a hierarchical dirichlet process.

Andrew Gelman wrote a comment on someone else's paper that might prove to be a useful introduction to hierarchical models.

But you're probably not really looking for a distribution over different parameter settings; you're looking for a single best setting of the parameters that you can use for making predictions.

This is in essence the narrative fallacy. While it can be a useful heuristic, there are dangers for example causing you no neglect outliers and black swans.

This is an excellent article. However, I did have the same philosophical problem that Cyan gave in this bullet point:

priors should ideally reflect the actual information at one's disposal, and thus should rarely actually be conjugate;

You seem to suggest that conjugate prior distributions are "smart" because they update in a computationally tractable way. Certainly, as a concession to practical necessity, we have to take computational tractability into account. But it is controversial to think of doing this as part of the ideal epistemology that we are trying to approximate.

Also, I found myself confused at a few points near the beginning. You write

While going about your daily activities, you observe an event of type x. Because you're a good Bayesian, you have some internal parameter \beta which represents your belief that x will occur.

Now, you're familiar with the Ways of Bayes, and therefore you know that your beliefs must be updated with every new datapoint you perceive. Your observation of x is a datapoint, and thus you'll want to modify \beta. But how much should this datapoint influence \beta?

At first, I misread you as saying, in effect, "Given that x occurs, what should be your updated probability that x occurs?" But, of course, your updated probability, conditioned on x's occurring, that x occurs, should be 1.

I also misunderstood you to be proposing to consider the probability of the probability of a given event being such-and-such. That is, I thought that you were proposing to consider a probability of the form P(P(x | y) = p | z), where x, y, and z are events, and p is a number in [0,1]. But, as I understand it, this is not a well-formed notion in Bayesian epistemology.

I think that my confusion arose from your calling \beta an "internal parameter". But, from the subsequent discussion, it seems better to think of \beta as an unknown parameter fed into whatever physical process generated x. For example, \beta could be an unknown parameter fed into a pseudo-random number generator that was observed to output the number x.

This was great to read, but as a layman I find it a bit inconvenient that there is not a block consisting solely of example, from the beginning to the end. I mean, the key words are easily googlable, but somehow, on those few webpages I have looked at, there never is a step-by-step account of what one does in this situation:(

This work articulates an attack on the use of conjugate priors in a Bayesian analysis: http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.ba/1340369826 In their words, "conjugate priors may lead to a dogmatic analysis."

Sorry for necro.

Well ... what bothers me? That the alpha and beta should have their own probability distribution each. And so on and on.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.64.5680&rep=rep1&type=pdf

EDIT: They don't extend the hierarchy infinitely upward in the paper, but there is no reason not to, as far as I can see.

Thanks for the link! Should make good reading. It sounds like you know a fair amount of ML, are you doing research in the field?

I taught myself Bayesian statistics for use in my engineering Ph.D. (My advisor didn't care how I got good answers -- he cared that I got good answers.) Until recently I was a postdoc in a statistics lab, but the research did not focus on what I would consider cutting-edge Bayesian stats.

Does each likelihood distribution have a unique conjugate prior? I doesn't seem immediately obvious that they do, but people say things like "The conjugate prior for the bernoulli distribution is the beta distribution".

No, in general are many conjugate priors for a given likelihood, if for no other reason than any weighted mixture of conjugate priors is also a conjugate prior.

What about the converse - does a conjugate prior exist for each likelihood (assume "nice" families of probability measures with a R-N derivative w.r.t counting measure or lebesgue measure if you like)? I think probably not (with a fairly high degree of certainty) but I don't think I've ever seen a proof of it.

The existence of a conjugate prior is not guaranteed. They exist for members of the exponential family, which is a very broad and useful class of distributions. I don't know of a proof, but if a gun were held to my head, I'd assert with reasonable confidence that the Cauchy likelihood doesn't have a conjugate prior.

I'm pretty sure that the Cauchy likelihood, like the other members of the t family, is a weighted mixture of normal distributions. (Gamma distribution over the inverse of the variance)

EDIT: There's a paper on this, "Scale mixtures of normal distributions" by Andrews and Mallows, if you want the details

Oh, for sure it is. But that only gives it a conditionally conjugate prior, not a fully (i.e., marginally) conjugate prior. That's great for Gibbs sampling, but not for pen-and-paper computations.

In the three years since I wrote the grandparent, I've found a nice mixture representation for any unimodal symmetric distribution:

Suppose f(x), the pdf for a real-valued X, is unimodal and symmetric around 0. If W is positive-valued with pdf g(w) = -w f '(w) and U ~ Unif(-W, W), then U's marginal distribution is the same as X. Proof is by integration-by-parts. ETA: No, wait, it's direct. Derp.

I don't think it would be too hard to convert this width-weighted-mixture-of-uniforms representation to a precision-weighted-mixture-of-normals representation.

It turns out that it's not too difficult to construct a counter example if you restrict the hyper-parameter space of the family of prior distributions. For example, let the likelihood, f(x|theta) only take on two values of theta, so the prior just puts mass p on theta=0 (i.e. P(theta=0) = p )and mass 1-p on theta=1. If you restrict p < 0.5, then the posterior will yield a distribution on theta with p > 0.5 for some likelihoods and some values of x.