# Information theory and the symmetry of updating beliefs

Contents:

1. The beautiful symmetry of Bayesian updating

2. Odds and log odds: a short comparison

3. Further discussion of information

Rationality is all about handling this thing called "information". Fortunately, we live in an era after the rigorous formulation of *Information Theory* by C.E. Shannon in 1948, a basic understanding of which can *actually help you think* about your beliefs, in a way similar but complementary to probability theory. Indeed, it has flourished as an area of research exactly because it helps people in many areas of science to describe the world. We should take advantage of this!

The information theory of *events,* which I'm about to explain, is about as difficult as high school probability. It is certainly easier than the information theory of *multiple random variables* (which right now is explained on Wikipedia), even though the equations look very similar. If you already know it, this can be a linkable source of explanations to save you writing time :)

So! To get started, what better way to motivate information theory than to answer a question about Bayesianism?

## The beautiful symmetry of Bayesian updating

The factor by which observing A increases the probability of B *is the same* as the factor by which observing B increases the probability of A. This factor is P(A and B)/(P(A)·P(B)), which I'll denote by pev(A,B) for reasons to come. It can vary from 0 to +infinity, and allows us to write Bayes' Theorem succinctly in both directions:

P(A|B)=P(A)·pev(A,B), and P(B|A)=P(B)·pev(A,B)
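(If you like, here's a tiny Python sketch of this symmetry. The numbers for P(A), P(B), and P(A and B) are made up for illustration; only the relationships matter.)

```python
from fractions import Fraction

# Hypothetical joint distribution over two events A and B (assumed numbers).
p_a, p_b, p_ab = Fraction(3, 10), Fraction(1, 5), Fraction(3, 25)

# Mutual probabilistic evidence: pev(A,B) = P(A and B) / (P(A) * P(B)).
pev = p_ab / (p_a * p_b)    # here pev = 2: A and B each double the other's probability

# Bayes' theorem in both directions, via the *same* factor:
assert p_ab / p_b == p_a * pev    # P(A|B) = P(A) * pev(A,B)
assert p_ab / p_a == p_b * pev    # P(B|A) = P(B) * pev(A,B)
```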

*What does this symmetry mean, and how should it affect the way we think?*

A great way to think of pev(A,B) is as a multiplicative measure of mutual evidence, which I'll call mutual *probabilistic evidence* to be specific. If pev=1, they're independent; if pev>1, they make each other more likely; and if pev<1, they make each other less likely.

But two ways to think are better than one, so I will offer a second explanation, in terms of *information*, which I often find quite helpful in analyzing my own beliefs:

**Probabilistic evidence is related to "mutual information"**

Let's examine a simple example, and work our way up to illustrating what I mean:

Say I flip four fair coins with faces "0" and "1" to generate a 4-bit binary string, X. If I tell you that "X=1???", meaning that the first coin reads "1", this reduces the number of possibilities by ½. We'd like to say here that you've gained "1 bit of information". Suppose instead that I say "X begins with 01 or 10". This has *quantitatively* the same effect, in that it reduces the number of possibilities by ½, so it should also be called "1 bit of information". You might call the first statement an "explicit bit" in that it explicitly specifies a 1 or 0 in the sequence, but this is merely a *qualitative* distinction. For once, we're interested in quantity, not quality.
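If you want to check this kind of counting for yourself, here's a minimal Python sketch that enumerates all 16 strings (nothing here beyond the example in the text):

```python
from itertools import product

# All 16 equally likely 4-bit strings.
strings = [''.join(bits) for bits in product('01', repeat=4)]
assert len(strings) == 16

explicit = [s for s in strings if s[0] == '1']            # "X = 1???"
implicit = [s for s in strings if s[:2] in ('01', '10')]  # "X begins with 01 or 10"

# Each statement halves the possibilities: 1 bit of information either way.
assert len(explicit) == 8
assert len(implicit) == 8
```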

Now, let A be the event "X=111?" (the event that the first three bits come up "1", and the last bit can be anything), which has probability P(A)=2^{-3}. If A is true but you don't know it, you need to observe exactly 3 independent bits (e.g. the first 3 coins) to confirm it. Intuitively, this is how *uncertain* A is, because it tells us how far away we are from confirming A. On the other hand, if I tell you A is true, you now only need 1 more independent bit to specify X completely, so we can say A has "provided 3 bits" of "information". Intuitively, this is how *informative* A is. These vague ideas nudge us toward the following definition:

**The information value of an event**

We denote and define the *information value* of an event A (aka "surprisal" or "self-information", but not in this post) by the formula

**inf(A) := log_{½}(P(A)) = -log_{2}(P(A))**

which in our example is -log_{2}(2^{-3})= 3 bits, just as we'd like. As was suggested, this quantity has two different intuitive meanings, which by the miracle of logic correspond to the same number inf(A), measured in bits:

1) Uncertainty: How many independent bits are *required* to confirm that A is true.

2) Informativity: How many independent bits are *gained* if we are told that A is true.
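Here's a small Python sketch of the definition, computed over the 16 equally likely strings of the running example:

```python
from itertools import product
from math import log2

strings = [''.join(b) for b in product('01', repeat=4)]

def prob(event):
    """Probability of an event, given as a predicate on 4-bit strings."""
    return sum(1 for s in strings if event(s)) / len(strings)

def inf(event):
    """Information value inf(A) = -log2 P(A), in bits."""
    return -log2(prob(event))

A = lambda s: s[:3] == '111'   # the event "X = 111?"
assert prob(A) == 2**-3
assert inf(A) == 3.0           # 3 bits, under either interpretation
```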

Caution: information value is not "data", but rather it is a *number* that can tell you *how uncertain* or *how informative* the data is. Be on the lookout for when "information" means "data", and when it means "information value."

**Mutual information = informational evidence**

Next, let B be the event "X=??11", so P(B)=2^{-2}, and recall that A is the event "X=111?". Both A and B tell us that the third position reads "1", which is independent from the other explicit bits they specify. In this sense, there is 1 bit of "redundancy" in observing both A and B. Notice that A provides 3 bits, B provides 2 bits, but "A and B" together specify that "X=1111" which is only 4 bits, and 3+2-4=1. Thus, we can calculate "redundancy" as

inf(A) + inf(B) - inf(A and B),

which is why this expression is called the *mutual information* of A and B. But wait... taking log_{2} of *probabilistic evidence* pev(A,B)=P(A and B)/(P(A)·P(B)) yields exactly the same expression! So I'll also call it *informational evidence*, and write

**iev(A,B) := log_{2} pev(A,B) = inf(A) + inf(B) - inf(A and B)**

While we're at it, let's just take -log_{2} of the rest of Bayes' theorem and see what we get. We can define *conditional information value* by letting

inf(A|B) := -log_{2} P(A|B) = inf(A and B) - inf(B),

and now Bayes' theorem attains the following form:

**inf(A|B) = inf(A) - iev(A,B) ← information theoretic Bayes' Theorem**

In Bayesian updating, A hasn't happened yet, so here let's use our "uncertainty" interpretation of information value. As you can see from the equation, if iev(A,B) is positive, the uncertainty of A *decreases* upon observing B, meaning A becomes *more likely*. If it is negative, the uncertainty of A *increases*, so A becomes *less likely.* It ranges from -infinity to +infinity according as A and B completely contradict or completely confirm each other. In summary:
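As a sanity check, here's a short Python sketch that verifies the information-theoretic Bayes' theorem on our running example (A = "X=111?", B = "X=??11"):

```python
from itertools import product
from math import log2

strings = [''.join(b) for b in product('01', repeat=4)]
P = lambda ev: sum(1 for s in strings if ev(s)) / len(strings)
inf = lambda ev: -log2(P(ev))

A = lambda s: s[:3] == '111'     # "X = 111?", inf(A) = 3 bits
B = lambda s: s[2:] == '11'      # "X = ??11", inf(B) = 2 bits
AB = lambda s: A(s) and B(s)     # "X = 1111", inf(A and B) = 4 bits

iev = inf(A) + inf(B) - inf(AB)  # mutual (informational) evidence = 1 bit
inf_A_given_B = inf(AB) - inf(B) # conditional information value = 2 bits

# Information-theoretic Bayes' theorem: inf(A|B) = inf(A) - iev(A,B).
assert inf_A_given_B == inf(A) - iev
```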

**Bayesian updating = subtracting mutual evidence from uncertainty.**

This is my other favorite way to think about updating. The fact that evidence can also be thought of as a kind of redundancy or *mutual information* gives a concrete interpretation for the symmetry of belief updating. As well, since "N bits of information" is so easy to conceptualize as a precise quantity, it gives a *quantitative intuitive meaning* to "how much A and B support each other". In fact, noticing this is what got me interested in information theory in the first place, and I hope it has piqued your interest, too!

What *kept* me interested is the simple fact that informational evidence behaves so nicely:

(Symmetry) iev(A,B) = iev(B,A)

More examples and discussion to boost your familiarity with information value are provided in Section 3, but for now, let's break for a comparison with two other methods to describe Bayesian updating.

## Odds and log odds: a short comparison.

(I think these deserve a special mention, because they have already been discussed on LessWrong.com.)

Bayes' theorem can also be expressed fairly neatly using odds with likelihood ratios, and log odds with log likelihood-ratios. One shortcoming of using odds when updating is that the likelihood-ratio K(B|A)=P(B|A)/P(B|¬A), sometimes called the Bayes factor, is not symmetric, so it does not make the symmetry of updating obvious. Likewise, log likelihood-ratios are not symmetric either.

But odds and log odds have their advantages. For example, if B_{1} and B_{2} are conditionally independent given A *and* conditionally independent given ¬A, then K(B_{1} and B_{2}|A) = K(B_{1}|A)·K(B_{2}|A), and similarly for any number of B's. These conditions are met naturally when B_{1} and B_{2} are causal consequences of A which do not causally influence each other. By contrast, in causal systems, it is usually **not** the case that pev(A, B_{1} and B_{2}) = pev(A,B_{1})·pev(A,B_{2}). (Reading Pearl's "Causality: Models, Reasoning, and Inference" clarified this for me once and for all, by making precise what a "causal system" is.)
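Here's a small Python sketch of this contrast. The numbers are assumptions of mine for illustration (a prior of 1/3 and two tests with hit rate 3/4 and false-positive rate 1/4, conditionally independent given A and given ¬A):

```python
from fractions import Fraction

p_a = Fraction(1, 3)                        # assumed prior P(A)
p_b_a, p_b_na = Fraction(3, 4), Fraction(1, 4)  # P(B_i|A), P(B_i|~A), same for both tests

# Bayes factors (likelihood ratios) multiply across conditionally independent tests:
K = p_b_a / p_b_na                          # K(B|A) = 3
K_joint = p_b_a**2 / p_b_na**2              # K(B1 and B2 | A) = 9
assert K_joint == K * K

# Probabilistic evidence pev(A,B) = P(B|A)/P(B) does NOT multiply here:
p_b = p_a * p_b_a + (1 - p_a) * p_b_na          # P(B) = 5/12
pev1 = p_b_a / p_b                              # pev(A,B) = 9/5
p_bb = p_a * p_b_a**2 + (1 - p_a) * p_b_na**2   # P(B1 and B2) = 11/48
pev_joint = p_b_a**2 / p_bb                     # pev(A, B1 and B2) = 27/11
assert pev_joint != pev1 * pev1                 # 27/11 versus 81/25
```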

## Further discussion of information

In our excitement to get to an information theoretic Bayes' theorem, we glossed over a lot of opportunities to stop and reflect, so let's do some more of that here.

**Information vs "data" or "knowledge"**

C.E. Shannon originally used the full phrase "information value", but nowadays it is often shortened to "information". As mentioned, information is not a synonym for "data" or "knowledge" when used in this way.

It may help to analogize this with how "mass" is not "matter". If I place 2 grams of matter on the left side of a balance scale, and 3 grams on the right, it will tip to the right, because 3g-2g=1g>0g. Where is this 1 gram of matter? Which "1 gram of matter" is the matter that tips the scales? The question is meaningless, because the 1g doesn't refer to any matter in particular, just a difference in total amounts. But you *can* ask "how much mass does this matter have?", and likewise "how much information does this data have?".

**Why "the redundant information" doesn't make sense**

When iev(A,B) is positive, we spoke of the mutual information of A and B as "redundancy". But what is this redundant information? What does it say? Again, this is the "information value is data" fallacy making ill-posed questions. It's somewhat like asking which gram of matter should be removed from the scales above in order to balance it. To illustrate more precisely, suppose again that A says "X=111?" and B says "X=??11". If R is the event "X=??1?", it is tempting to call R "*the* mutual information" of A and B. Indeed, if we first observe R, then A and B become independent, so there is no more redundancy. *But this R is not unique.* Any list of 8 outcomes that includes the A outcomes and B outcomes would also work this way. For example, we could take R to say "X is one of 0011, 0111, 1011, 1110, 1111, 1000, 0100, 0001".
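A small Python sketch can confirm the non-uniqueness claim: conditioning on either candidate R makes A and B independent.

```python
from itertools import product
from fractions import Fraction

strings = [''.join(b) for b in product('01', repeat=4)]

def P(event, given=None):
    """P(event | given) under the uniform distribution, as an exact fraction."""
    space = set(strings) if given is None else set(given)
    return Fraction(len(set(event) & space), len(space))

A = {s for s in strings if s[:3] == '111'}   # "X = 111?"
B = {s for s in strings if s[2:] == '11'}    # "X = ??11"

R1 = {s for s in strings if s[2] == '1'}     # the "obvious" R: "X = ??1?"
R2 = {'0011', '0111', '1011', '1110', '1111', '1000', '0100', '0001'}

# Given either R, the events A and B become independent, so neither R
# has a unique claim to being "the" mutual information:
for R in (R1, R2):
    assert P(A & B, R) == P(A, R) * P(B, R)
```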

**To infinity and... well, just to infinity.**

We saw that information value inf(A) ranges from 0 to +infinity, and can be interpreted either as informativity or uncertainty, depending on whether the event has happened or not. Let's think a little about the extremes of this scale:

That 0 information value corresponds to a 100%-likely event means:

1) 0 informativity: you don't gain any information from observing an event that you already knew was certain (ignoring 0%-likely discrepancies), and

2) 0 uncertainty: you don't require any information to verify an event that is certain to occur.

That +infinity information value corresponds to a 0%-likely event means:

1) infinite uncertainty: no finite amount of information can convince you of a 0%-likely event (though perhaps an infinite series of tests could bring you arbitrarily close), and

2) infinite informativity: if you observe a 0%-likely event, you might win a Nobel prize (it means someone messed up by having a prior belief of 0% somewhere when they shouldn't have).

For the values in between, *more likely = less uncertain = less informative*, and *less likely = more uncertain = more informative.*

**What other cool stuff can happen?**

To get more comfortable with how information values work, let's return to our random 4-bit string X, generated by flipping four coins:

• *Encoding.* Let C be the event "X contains exactly one 1", i.e. X=1000, 0100, 0010, or 0001. This happens with probability 4/16=1/4=2^{-2}, so inf(C) = 2 bits. If C is true, it provides 2 bits of information about X, and using *an additional* 2 bits we could *encode* the position of the "1" by writing "first"=00, "second"=01, "third"=10, and "fourth"=11. Thus we end up using 4 bits in total to specify or "encode" X, as we'd expect. In general, there are theorems characterizing information entirely in terms of encoding/decoding, which is part of what makes it so useful in applications.

• *Negative evidence.* Let D be the event "X starts with 1", which one sees directly as specifying inf(D) = 1 bit of information. It is easy to see that P(D)=1/2 and P(D|C)=1/4, so we know C makes D *less likely* (and vice versa, by update symmetry!), but let's practice thinking in terms of information. Together, "C and D" just means X=1000, so inf(C and D) = 4 bits: it completely determines X. On the other hand, we saw that inf(C)=2 bits, and inf(D)=1 bit, so iev(C,D) = 2+1-4 = -1, confirming that either of them would present negative evidence for the other.

• *Non-integer information values.* Being defined as a logarithm, in real life information values are usually not integers, just like probabilities are not usually simple fractions. This is not actually a problem, but reflects a *flexibility* of the definition. For example, consider the event ¬B: "X does not start with 11", which has probability 3/4, hence inf(¬B)=-log_{2}(3/4) ≈ 0.415. If we also knew "X does not end with 11", that would give us another 0.415 bits of information (since it's independent!). All of our formulae work just fine with non-integer information values, so we can add these to conclude we have 0.830 bits. This being less than 1 means we still haven't constrained the number of possibilities as much as knowing a single bit for certain (i.e. 50%). Indeed, 9 out of the 16 possibilities neither start nor end with 11.
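All three bullets can be checked with a short Python sketch over the same 16 strings:

```python
from itertools import product
from math import isclose, log2

strings = [''.join(b) for b in product('01', repeat=4)]
P = lambda ev: sum(1 for s in strings if ev(s)) / len(strings)
inf = lambda ev: -log2(P(ev))

C = lambda s: s.count('1') == 1    # "X contains exactly one 1"
D = lambda s: s[0] == '1'          # "X starts with 1"
notB = lambda s: s[:2] != '11'     # "X does not start with 11"

# Encoding: inf(C) = 2 bits, leaving 2 more bits to encode the position.
assert inf(C) == 2.0
# Negative evidence: iev(C,D) = inf(C) + inf(D) - inf(C and D) = -1 bit.
assert inf(C) + inf(D) - inf(lambda s: C(s) and D(s)) == -1.0
# Non-integer values: inf(not-B) = -log2(3/4) ≈ 0.415 bits.
assert isclose(inf(notB), 0.415, abs_tol=1e-3)
```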

**Okay, but is this anything more than a cool trick with logarithms?**

Yes! This definition of information has loads of real-world applications that legitimize it as a scientific *quantity of interest*:

• *Communication* (bandwidth = information per second),

• *Data compression* (information = how "incompressible" the output of a data source is),

• *Statistical mechanics and physics* (entropy = average uncertainty = expected informativity of observing a system),

and of course, artificial intelligence.

**Where can I read more?**

Eliezer has written a number of posts which involve the information theory of handling multiple random variables at once. So, if you want to learn more about it, Wikipedia is currently a decent source. The general philosophy is to take expected values of the quantities defined here to obtain analogues for random variables, so you're already half-way there.

For something more coherent and in-depth, *A Mathematical Theory of Communication* (Shannon, 1948), which is credited with pioneering modern information theory, impressively remains a fantastic introduction to the subject. Way to go, Shannon!

There's lots more good stuff to learn in that paper alone, but I'll end this post here. What I think is most relevant to LessWrong readers is the awareness of a precise definition of information, and that it can *help you think* about beliefs and Bayesianism.

## Comments (28)

**7 points:** Excellent introduction to information theory! I was a bit concerned about you allowing mutual information to be negative and avoiding the "odds amplification by Bayes factor" version of Bayes's Theorem, but then you went right on and justified those.

To the "further reading" section, I would add the great free textbook on the topic by David MacKay.

Now for some nitpicks: Your informativeness function inf(X) also goes by the standard name surprisal or self-information.

Also, there were more things you could have mentioned for the applications section, like

how it makes the "off-weight tennis ball problem" much easier by leading you to choose weighings that maximize the expected informativeness (aka entropy). (The problem is this: given 12 tennis balls, one of which is heavier or lighter than the rest, find which one is different, and whether it's heavier or lighter, using a balance scale only three times.)

how to provide useless answers to people: You don't just give false information, as that has mutual information with the subject matter; you have to visibly make your answers dependent on something random in order to ensure it's not informative.

**0 points:** Yes, I'll edit in those synonyms for reference. In defense of "Informativity" and "inf", they're defined for a single event in section 2.3 of this paper:

http://www.springerlink.com/content/m628x0774718237k/

"Information value" is Shannon's term. I just picked the terms I liked best :)

Nice application to the tennis balls problem! I'd heard that a long time ago, but hadn't thought about it since I met information theory.

**5 points:** One year later, in 1949, "A Mathematical Theory of Communication" was reprinted as "The Mathematical Theory of Communication". Which shows how rapidly people recognized that this came from The Book.

There is a sign problem when iev(A,B) is defined. You mention that you can get the mutual information of A and B by taking -log_2 of probabilistic evidence pev(A,B) = P(AB) / [P(A)P(B)], but this creates an extra negative sign:

-log2(pev(A,B)) = -log2[P(AB) / [P(A)P(B)]] = -[log2(P(AB)) - [log2(P(A)) + log2(P(B))]] = -log2(P(AB)) + log2(P(A)) + log2(P(B)) = inf(AB) - inf(A) - inf(B).

No. Let A be a proposition we believe initially with probability 1/2. Let B1 and B2 be independent tests of A, such that if you know A or if you know not A, then knowing B1 does not give you any additional information about B2. Suppose B1 and B2 both have probability 9/10 conditional on A, and 1/10 conditional on not A. Then Pev(A,B1) = Pev(A,B2) = 9/5, and the product Pev(A,B1)·Pev(A,B2) = 81/25. But Pev(A, B1 and B2) = 81/41.

But, if you take the likelihood ratios, then L(A|B1) = L(A|B2) = 9, so L(A|B1)*L(A|B2) = 81, and L(A|B1 and B2) = 81.

The reason this works for likelihood ratios, but not mutual evidence, is that L(A|B) does not depend on P(A), like Pev(A,B) does. The problem with multiplying Pev's is that you have to update P(A) first, that is, you really should multiply Pev(A,B1|X) by Pev(A,B2|B1 and X).

**0 points:** Ah, in the multiplicativity/additivity results I should have added "conditionally independent on A". I just made the correction, many thanks.

The proof should be easy to see now that it's stated correctly: you write down the conclusion pev(A,BC)=pev(A,B)·pev(A,C) and observe that it is the quotient of the independence assumptions,

P(BC|A) = P(B|A)·P(C|A) and

P(BC) = P(B)·P(C)

I've also added mention of the independence conditions under which likelihood ratios do multiply; I think the comparison is better this way. Cheers!

Given these assumptions, your theorem holds. But, the condition P(BC) = P(B)·P(C) is problematic. An important point you are glossing over is that in epistemological probability theory, all probabilities are conditional probabilities, and the condition is more formally P(BC|X) = P(B|X)*P(C|X), where X represents a state of belief. The problem is that, since B and C both depend on A, the condition is unstable with respect to P(A|X). While it is possible to come up with contrived examples of A, B, C, and X where this works, it is not the sort of thing I would expect to come up naturally and be actually useful in implementing an epistemology in the real world.

Contrasting with likelihood ratios, the example I gave earlier could represent the situation where you know a coin is biased to land on one side 9/10 of the time when flipped, but you don't know which, A represents that it is biased to show heads, and B1 and B2 represent that it does show heads in two separate trial flips.

**0 points:** Isn't this equally a problem for likelihoods? If I understand you, the "independent tests" assumptions for likelihood-ratio multiplying are also "unstable with respect to P(A|X)". Explicitly, I mean

(2) P(BC|XA) = P(B|XA)·P(C|XA) and P(BC|X~A) = P(B|X~A)·P(C|X~A).

Even if P(A|X) stays constant, "changing X", i.e. making a new observation D, can change *all* of these assumptions, no?

If I've misunderstood, could you please explain in full detail

(0) What you mean by "unstable with respect to P(A|X)",

(1) How the "before and after" assumptions are "unstable", and

(2) How the "independent tests" assumptions are "stable"?

Even if this isn't it, I'm very curious to know if there are epistemological reasons to favor assumptions (1) over assumptions (2)...

It is not a problem for likelihood ratios because the inclusion of A or of ~A in the condition of each probability in (2) screens off P(A|X). When you assume that A is true (or when you assume A is false), you no longer need to consider the probability of A when figuring out the conclusions of that assumption.

Yes, you can come up with scenarios where (2) does not apply, and you cannot multiply likelihood ratios. My point was that there are plausible scenarios where it does apply, and this sort of thing actually happens in real life.

A condition is unstable with respect to R if it can be true for a certain value of R, but this does not imply that it will be true for other values of R.

In order to simultaneously have B and C be independent given X and given XA, they need to have a complicated dependence given X~A, this dependence itself depending on P(A|X).

As I stated earlier, the condition A or the condition ~A screens off P(A|X).

I would have expected this to be clear if you had applied your questions to my example, and tried to construct a corresponding example in which Pev's can be multiplied.

That your prose is not accompanied by any equations makes it very effortful to filter all its possible meanings to find the intelligent one you no doubt intend, and impractical to respond. I might disagree that

but I can't be sure, and I have nothing precise to refer back to. (You don't have to oblige me, of course.)

**1 point:** First, that is not a response to your quote from me. In that quote I was in fact pointing to elements of an equation.

By "before" independence, we have

Decompose B into BA or ~AB

Apply "after" independence to (3)

Combing (1) and (4)

Solve for P(B|~AC)

Use Bayes' Law substitute out conditional probabilities of A

I believe that (7) qualifies as a complicated dependence given ~A, which itself depends on P(A). This is just the result of my exploration of the issue, but you asked for it.

Also, I would like to reiterate:

**0 points:** Here's a simple example where pev(A,BC)=pev(A,B)pev(A,C):

[EG1] P(A)=1/4, P(B)=P(C)=1/2, P(BC)=1/4, P(AC)=P(BC)=1/16, P(ABC)=1/64.

Then just suppose there are some medical diagnostics with these probabilities, etc.

But see below for a comparison of this with your coin scenario.

More generally, let a=P(A), b=P(B), c=P(C), x=P(BC), y=P(AC), z=P(AB), t=P(ABC).

The "before and after independence" assumption [BAI] is that x=bc and at=yz

The "independent tests" assumption [IT] is that (1-a)(x-t)=(b-z)(c-y) and at=yz.

Neither [BAI] nor [IT] is stable with respect to the variable a=P(A), and I see no equation involving anything I'd call "screening off", though there might be one somewhere. In any case, that does not interest me, because your equation (7) has my attention:

A qualitative distinction between my would-be medical scenario [EG1] and your coin scenario is that medical diagnostics, and particularly their dependence/independence, are not always "causally justified", but two coin flips can be seen as independent because they visibly just don't interact.

I bet your (7) gives a good description of this somehow, or something closely related... But I still have to formalize my thoughts about this.

Let me think about it awhile, and I'll post again if I understand or wonder something more precise.

**1 point:** I assume you meant:

[EG1] P(A)=1/4, P(B)=P(C)=1/2, P(BC)=1/4, P(AC)=P(AB)=1/16, P(ABC)=1/64.

You just glossed over the whole point of this exercise. The problem is that values such as P(ABC) are combined facts about the population and both tests. Try defining the scenario using only facts about the population in isolation (P(A)), about the tests in isolation (P(B|A), P(C|A), P(B|~A), P(C|~A)), and the dependence between the tests (P(B|AC), P(B|A~C), P(B|~AC), P(B|~A~C), [EDIT: removed redundant terms, I got carried away when typing out the permutations]). The point is to demonstrate how you have to contrive certain properties of the dependence between the tests, the conditional probabilities given ~A, to make summary properties you care about work out for a specific separate fact about the population, P(A).

**2 points:** Cover and Thomas is a good, readable book on info theory.

My favorite neat info theory puzzle asks how can you force a source of biased coin flips to give you unbiased coin flips instead, even if you don't know the bias of the coin. (The glib answer is to zip the bits you get, a more precise answer is that that's what universal source codes do, if the bit sequence is long enough they will compress to entropy, regardless of what distribution generated the bits. It's actually kind of surprising and weird that universal source codes are a thing).

Info theorists reinvented causality in the 90s (the academic coordination problem is hard...):

http://arxiv.org/pdf/1110.0718v1.pdf

(and lots of papers on channels with feedback).

Ain't that the truth, brother.

This article gives me a better perspective as to why scientific journals like to publish surprising results rather than unsurprising ones -- they are more informative, for the very reason they are likelier to be wrong.

Yes... if a theory adds to the surprisal of an experimental result, then the experimental result adds precisely the same amount to the surprisal of the theory. That's interesting.

Hmm. An old book that we should read to learn science...

Typo: in "Next, let B be the event "X=??11", so P(B)=2^{-2}, and recall that A is the event "X=??11"", A should be X=111?

Interesting Log-Odds paper by Brian Lee and Jacob Sanders, November 2011.

**3 points:** It would be interesting to know if they have any particular justification for picking the natural logarithm or if it was simply a matter of using what first came to mind. Other likely candidates seem to be:

Yeah, as a software engineer, I can estimate log base 2 in my head, so that would be much more natural for me. But I could also see an argument for decibels since it's a unit that's already in use for other things.

Helpfully, log scales are off from each other by a constant factor, so whatever techniques they use ln for can be used just as well for log base 2. Just update any constants to the correct (much prettier) values.

Much like how inches and centimeters are off by a constant factor. Different log bases are analogous to different units.

**1 point:** Roll on the *log odds* standardisation effort - by the sound of it!

This looks a wonderful post. I think I am going to get a notebook and go through it all step-by-step. May have some questions.

**2 points:** Okay, so let's take an example. Suppose there's a disease with prevalence P(D) = 6%, and a test with true positive rate P(+|D) = 90% and false positive rate P(+|~D) = 10%.

We have seen a positive result on the test.

We take the information-theoretic version of Bayes theorem:

inf(D|+) = inf(D) - iev(D,+)

inf(D|+) = log2[0.148] - log2[0.9*0.06] ~= 1.45 bits (= 36%)

Now suppose the prevalence of the disease was 70%; then we find

inf(D|+) ~= 0.07 bits (= 95%)
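(A quick Python sketch re-deriving the numbers above, with the stated rates P(+|D)=0.9 and P(+|~D)=0.1; the function name is just for illustration:)

```python
from math import log2

def inf_d_given_pos(p_d, tpr=0.9, fpr=0.1):
    """Return (inf(D|+) in bits, posterior P(D|+)) for prevalence p_d."""
    p_pos = tpr * p_d + fpr * (1 - p_d)   # P(+), by total probability
    p_d_pos = tpr * p_d / p_pos           # P(D|+), by Bayes' theorem
    return -log2(p_d_pos), p_d_pos

bits_low, post_low = inf_d_given_pos(0.06)    # prevalence 6%
bits_high, post_high = inf_d_given_pos(0.70)  # prevalence 70%

assert round(bits_low, 2) == 1.45 and round(post_low, 2) == 0.36
assert round(bits_high, 2) == 0.07 and round(post_high, 2) == 0.95
```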

Which makes sense, because the second test merely confirmed what was already likely; hence it is less informative (but even the first is not terribly informative, due to the low prior).

Yeah, I can definitely see the appeal of this method. Great post, thanks!

An even more convenient example:

It works! That was fun.