I am well aware that nobody asked for this, but here is the proof that the posterior is Beta(α+z, β+1−z) for the Beta-Bernoulli model.
We start with Bayes' theorem:

p(θ∣z) = p(z∣θ)p(θ) / p(z)
Then we plug in the definitions of the Bernoulli likelihood and the Beta prior:

p(θ∣z) = θ^z (1−θ)^(1−z) × θ^(α−1) (1−θ)^(β−1) / (B(α,β) p(z))
Let's collect the powers of θ in the numerator, and the things that do not depend on θ in the denominator:

p(θ∣z) = θ^(α+z−1) (1−θ)^(β+1−z−1) / (B(α,β) p(z))
Here come the conjugation shenanigans. If you squint, the numerator looks like the numerator of a Beta distribution with parameters α+z and β+1−z:

θ^(α+z−1) (1−θ)^(β+1−z−1)
Let's continue the shenanigans: since the numerator looks like the numerator of a Beta distribution, we know that it would be a proper Beta distribution if we changed the denominator like this:

p(θ∣z) = θ^(α+z−1) (1−θ)^(β+1−z−1) / B(α+z, β+1−z) = Beta(α+z, β+1−z)

And since the posterior is proportional to this expression and both integrate to 1, they must be equal.
The order of the shots does not matter. You can see that by focusing on α+β, which always equals the prior's α+β plus the number of shots; you can also see it from the conjugation rule, where after N shots with Σz hits you end up with Beta(α+Σz, β+N−Σz) no matter the order.
If you wanted the order to matter, you could down-weight earlier shots or widen the uncertainty between updates, so the previous posterior becomes a slightly wider prior to capture the extra uncertainty from the passage of time.
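A quick way to convince yourself of the order invariance is to run the conjugate update over every permutation of the shots; this is a plain-Python sketch (no libraries assumed beyond the standard library):

```python
from itertools import permutations

def update(alpha, beta, z):
    # Conjugate Beta-Bernoulli update: Beta(alpha + z, beta + 1 - z)
    return alpha + z, beta + 1 - z

shots = [0, 1, 1]  # one miss, two hits
posteriors = set()
for order in permutations(shots):
    a, b = 1, 1  # uniform Beta(1,1) prior
    for z in order:
        a, b = update(a, b, z)
    posteriors.add((a, b))

print(posteriors)  # every order ends at the same (3, 2)
```

Every permutation lands on Beta(3,2), because only the number of hits and the number of shots enter the final parameters.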
I read all the way through, unfortunately this is pretty mathy for me and I don't have much to say. I did like the visualization of updates as squishing the blob around -- my own introduction to updating was via the sequences which didn't have that.
Mine was the same; I became a Bayesian statistician 4 years ago. I gave a talk about Bayesian statistics, and this figure was what made it click for most students (including myself), so I wanted to share it.
As a teaser here is the visual version of Bayesian updating:
But in order to understand that figure we need to go through the prior and likelihood!
You find me standing on a basketball court, ready to shoot some hoops. What do you believe about my performance before I take a shot? There is no good null hypothesis here unless you happen to know a lot about average human basketball performance! And even then, why would you care whether I am significantly different from the average? You could fall back on the "new statistics", which is almost as good as the Bayesian approach, but it does not answer what you should believe before I take a shot.
The Beta distribution is a popular prior for binary events; when its two parameters (α and β) are equal to 1, it is uniform. Since you, my dear reader, have no idea about my basketball skills, you assume θ comes from a Beta(1,1) distribution, formally:
θ ∼ Beta(1,1)
Where θ is my probability of scoring, the distribution looks like this:
Completely uniform, a great prior when you are totally oblivious.
I take a shot and miss (z=0), the likelihood of a miss looks like this:
(if you are extra curious, you can brush up on the math behind all the binary distributions here)
Notice that these are likelihoods, not probabilities: they tell you how likely the data are for different values of θ. So it is twice as likely:

p(z=0∣θ=0) / p(z=0∣θ=0.5) = 1 / 0.5 = 2

that the data z=0 was generated by θ=0 compared to θ=0.5.
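If you want to play with these numbers yourself, the Bernoulli likelihood is tiny to code up (a sketch; the function name is mine, not from the post):

```python
def bernoulli_likelihood(z, theta):
    # p(z | theta): theta if I score (z=1), 1 - theta if I miss (z=0)
    return theta if z == 1 else 1 - theta

# a miss is twice as likely under theta=0 as under theta=0.5
ratio = bernoulli_likelihood(0, 0.0) / bernoulli_likelihood(0, 0.5)
print(ratio)  # 2.0
```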
Bayesian Updating Math
Here is Bayes theorem for the Bernoulli distribution with a Beta prior, where the parameter z is 1 when I score and 0 otherwise:
p(θ∣z) = p(z∣θ)p(θ) / p(z)
For technical reasons p(z), the probability of the data, is difficult to calculate. It is, however, 'just a normalization constant': it does not depend on θ, my scoring probability, so we can simply drop it and get an unnormalized posterior:
p(θ|z)∝p(z∣θ)p(θ)
An unnormalized posterior is simply a density function that does not integrate to 1; when we plot it, it looks 'correct' except the numbers on the y-axis are wrong.
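One way to see this is to evaluate prior × likelihood on a grid of θ values; normalizing only rescales the y-axis, the shape is untouched. A plain-Python sketch, assuming the uniform prior and one miss:

```python
# Grid approximation: prior times likelihood at each theta
thetas = [i / 100 for i in range(101)]
prior = [1.0 for _ in thetas]            # Beta(1,1): flat
likelihood = [1 - t for t in thetas]     # p(z=0 | theta), the miss triangle
unnormalized = [p * l for p, l in zip(prior, likelihood)]

# Dividing by the total rescales the y-axis but keeps the shape
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]
```

Both curves peak at θ=0 and fall linearly to 0 at θ=1; only the vertical scale differs.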
Visual Bayesian Updating
So now we have a 'square' prior, θ ∼ Beta(1,1), and a triangle likelihood, p(z=0∣θ). If we multiply them together we get the unnormalized posterior, so we do:
p(θ|z)∝p(z∣θ)p(θ)
Which intuitively can be thought of as: the square makes everything equally likely, so the likelihood will dominate the posterior, or in dodgy math:
posterior∝square×triangle∝triangle
Here is the Figure:
Try to put your finger on the figure and check that the density at θ=0.5 is 1 for the square and 0.5 for the triangle, and is thus 1×0.5=0.5 in the unnormalized posterior.
I shoot again and score!
Now we use the previous posterior as the new prior, but because we scored we get an 'opposite triangle', which is the likelihood p(z=1∣θ).
Again we multiply the prior triangle by the likelihood triangle and get a blob centered on 0.5 as the posterior:
Notice how the posterior is peaked at θ=0.5. This is because the two triangles at the center have an unnormalized posterior density of 0.5×0.5=0.25, whereas at the edges, such as θ=0.9, they have 0.9×0.1=0.09.
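You can check this peak numerically: with a flat prior, one miss and one hit give an unnormalized posterior of (1−θ)×θ, which is maximized at θ=0.5 (plain-Python sketch):

```python
thetas = [i / 100 for i in range(101)]
# unnormalized posterior after one miss and one hit, flat prior:
# miss triangle (1 - theta) times hit triangle (theta)
density = [(1 - t) * t for t in thetas]

peak = thetas[density.index(max(density))]
print(peak)  # 0.5, with density 0.5 * 0.5 = 0.25
```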
I shoot again and score!
So now the previous blob posterior is our new prior again, which we multiply by the 'I scored' triangle, resulting in a blob with a mode above 0.5, which makes sense as I made 2 of 3 shots:
While this may seem like a cute toy example, it is a totally valid way of computing a Bayesian posterior, and it is the way most popular Bayesian books (Gelman [1], McElreath [2] and Kruschke [3]) introduce the concept!
Bayesian Updating using Conjugation
In the case of Bernoulli events we can actually solve for the posterior analytically, because the Beta is conjugate to the Bernoulli. Conjugation is simply fancy statistics speak for the posterior having a simple mathematical form; here that form is also a Beta distribution, so you can update the Beta distribution using this simple rule:
Beta(α+z,β+1−z)
So we started with a prior with α=β=1:
Beta(1,1)
Then we got a miss, z=0
Beta(1,2)
Then we got a hit, z=1
Beta(2,2)
Then we got a hit, z=1
Beta(3,2)
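The whole sequence of updates takes only a few lines with the conjugate rule (a plain-Python sketch):

```python
alpha, beta = 1, 1            # the uniform Beta(1,1) prior
history = []
for z in [0, 1, 1]:           # miss, hit, hit
    # conjugate update: Beta(alpha + z, beta + 1 - z)
    alpha, beta = alpha + z, beta + 1 - z
    history.append((alpha, beta))

print(history)  # [(1, 2), (2, 2), (3, 2)]
```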
We can plot the Beta(3,2) posterior:
Notice how this posterior has the exact same shape as the one we got via visual updating; the only difference is the numbers on the y-axis.
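We can confirm that the two routes agree: the grid posterior built from the three shots is proportional to the Beta(3,2) density everywhere, with the Beta normalizing constant B(3,2) = 1/12 as the ratio. A sketch using only the standard library (math.gamma supplies the Beta function):

```python
import math

def beta_pdf(t, a, b):
    # Beta density: t^(a-1) * (1-t)^(b-1) / B(a, b)
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return t ** (a - 1) * (1 - t) ** (b - 1) / norm

thetas = [i / 100 for i in range(1, 100)]  # open interval to avoid 0/0
# unnormalized posterior from the three shots: miss, hit, hit
grid = [(1 - t) * t * t for t in thetas]

# the ratio to the Beta(3,2) pdf is the same constant at every theta
ratios = [g / beta_pdf(t, 3, 2) for g, t in zip(grid, thetas)]
print(min(ratios), max(ratios))  # both equal B(3,2) = 1/12
```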
(Hi, if you made it this far, please comment. If something was not well explained, I'd like to know; I care more about my statistics communication skills than my ego, so negative feedback is very welcome.)
Gelman, Hill and Vehtari, “Regression and Other Stories” ↩︎
Richard McElreath, “Statistical Rethinking” ↩︎
John Kruschke, “Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan”, 2nd Edition ↩︎