This is a coin.

It might be biased.

This is Bayes’ theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Bayes’ theorem tells us how we ought to update our beliefs given evidence.

It involves the following components:

  • $P(A \mid B)$, called the posterior, is the probability of A given B. In the case of the coin, this is the probability that the coin is biased given the result of an experiment (i.e., a sequence of flips).
  • $P(B \mid A)$, called the likelihood, is the probability of B given A. For our coin example this would be the probability of some particular ratio of heads to tails, given that the coin is biased.
  • $P(A)$ is called the prior. This is the probability that the coin is biased, before we consider any evidence.
  • $P(B)$ is the marginal. This is the probability of getting some sequence of heads and tails, before considering any evidence.

The overall shape of the theorem is this:

Posterior $\propto$ likelihood $\times$ prior
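
To make the moving parts concrete, here is a minimal sketch in Python of the coin example. The particular hypotheses (a fair coin versus a coin with P(heads) = 0.75), the 50/50 prior, and the example flip sequence are assumptions picked purely for illustration, not anything canonical.

```python
# A minimal sketch of the coin example. The "biased" coin's P(heads) = 0.75
# and the 50/50 prior over the two hypotheses are illustrative assumptions.

def bayes_update(prior_biased, flips, p_heads_biased=0.75, p_heads_fair=0.5):
    """Return P(biased | flips) for a flip string like 'HTHH'."""
    def likelihood(p_heads):
        # P(flips | hypothesis): product of per-flip probabilities.
        prob = 1.0
        for flip in flips:
            prob *= p_heads if flip == "H" else (1 - p_heads)
        return prob

    # Likelihoods: P(B | A) for each hypothesis A.
    like_biased = likelihood(p_heads_biased)
    like_fair = likelihood(p_heads_fair)

    # Marginal: P(B), summing over both hypotheses.
    marginal = like_biased * prior_biased + like_fair * (1 - prior_biased)

    # Posterior: P(A | B) = P(B | A) * P(A) / P(B).
    return like_biased * prior_biased / marginal

print(bayes_update(prior_biased=0.5, flips="HHHHHTHH"))  # ≈ 0.90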


If you were to explain this to a high-school student, they might ask this naïve question:

Why should we bother to go through the process of calculating the likelihood and prior at all? Why can’t we just try to calculate the posterior directly? We have a formula for $P(A \mid B)$, namely $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$.

Maybe you'll say, "That formula is fine, but it's not useful in real life. It's usually more tractable to go via conditional updates than via the high-school definition."

But if conditionals are easy to get, why not just go directly to the posterior? What's even the difference between A and B? Aren't they just symbols? We could easily rearrange the theorem to calculate $P(B \mid A)$ as a function of $P(A \mid B)$.
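
For concreteness, that rearrangement is just algebra on the same four quantities:

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$$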

What is it that makes using strings of coin flips to calculate biases more natural or scientific?

Perhaps it is ease. If, for some reason, calculating $P(B \mid A)$ is easier, what makes it easier?

Perhaps it is usefulness. If likelihoods, not posteriors, are what's worth publishing, why are they worthier?

How do you spot a likelihood in the wild?


One sort of answer is that we often want the posterior, and we often have the likelihood. Slightly more refined: we often find the likelihood easier to estimate than the posterior, so Bayes' Rule is useful.

Why so?

I think one reason is that we make it the "responsibility" of hypotheses to give their likelihood functions. After all, what is a hypothesis? It's just a probability distribution (not a probability distribution that we necessarily endorse, but, one which we are considering as a candidate). As a probability distribution, its job is to make predictions; that is, give us probabilities for possible observations. These are the likelihoods.

We want the posterior because it tells us how much faith to place in the various hypotheses -- that is, it tells us whether (and to what degree) we should trust the various probability distributions we were considering.

So, in some sense, we use Bayes' Rule because we aren't sure how to assign probabilities, but we can come up with several candidate options.
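
To make that concrete, here is a small sketch where each hypothesis is just a candidate value of P(heads): its only "job" is to report the probability it assigns to the observed flips, and Bayes' Rule then tells us how much faith to place in each candidate. The grid of biases, the uniform prior, and the flip sequence are assumptions chosen for illustration.

```python
# A sketch of "hypotheses are responsible for their likelihoods."
# Each hypothesis is a candidate value of P(heads); the grid and the
# uniform prior are illustrative assumptions.

candidate_biases = [0.1 * i for i in range(1, 10)]  # hypotheses: p = 0.1 ... 0.9
prior = {p: 1 / len(candidate_biases) for p in candidate_biases}

def likelihood(p_heads, flips):
    # The hypothesis's "prediction": P(observed flips | this bias).
    prob = 1.0
    for flip in flips:
        prob *= p_heads if flip == "H" else (1 - p_heads)
    return prob

flips = "HHTHHHTH"
unnormalised = {p: likelihood(p, flips) * prior[p] for p in candidate_biases}
marginal = sum(unnormalised.values())
posterior = {p: w / marginal for p, w in unnormalised.items()}

# The posterior says how much faith to place in each candidate distribution.
for p, post in sorted(posterior.items()):
    print(f"P(heads) = {p:.1f}: posterior {post:.3f}")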

One weak counterexample to this story is regression, i.e., curve-fitting. We can interpret regression in a Bayesian way easily enough. However, the curves don't come with likelihoods baked in. They only tell us how to interpolate/extrapolate with point estimates; they don't give a full probability distribution. We've got to "soften" these predictions, layering probabilities on top, in order to apply the Bayesian way of thinking.
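
One way to picture the "softening": take a fitted curve's point predictions and layer an assumed Gaussian noise model on top, which turns the curve into a likelihood for the data. In the sketch below, the data points, the candidate lines, and the noise scale sigma are all invented for illustration.

```python
# A sketch of "softening" a curve fit into a likelihood: a fitted line only
# gives point predictions, so we layer an assumed Gaussian noise model on top
# to get P(data | curve). The data, candidate lines, and sigma are made up.
import math

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.2, 1.9, 3.2]

def gaussian_log_likelihood(slope, intercept, sigma=0.5):
    """log P(ys | curve), treating residuals as independent Gaussian noise."""
    total = 0.0
    for x, y in zip(xs, ys):
        residual = y - (slope * x + intercept)
        total += -0.5 * (residual / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
    return total

# Two candidate curves; neither came with probabilities "baked in",
# but the noise model lets each one score the data.
print(gaussian_log_likelihood(slope=1.0, intercept=0.0))
print(gaussian_log_likelihood(slope=0.5, intercept=0.5))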