Information theory and the symmetry of updating beliefs

Academian

Contents:

1.  The beautiful symmetry of Bayesian updating
2.  Odds and log odds: a short comparison
3.  Further discussion of information

Rationality is all about handling this thing called "information". Fortunately, we live in an era after the rigorous formulation of Information Theory by C.E. Shannon in 1948, a basic understanding of which can actually help you think about your beliefs, in a way similar but complementary to probability theory. Indeed, it has flourished as an area of research exactly because it helps people in many areas of science to describe the world. We should take advantage of this!

The information theory of events, which I'm about to explain, is about as difficult as high school probability. It is certainly easier than the information theory of multiple random variables (which right now is explained on Wikipedia), even though the equations look very similar. If you already know it, this can be a linkable source of explanations to save you writing time :)

So! To get started, what better way to motivate information theory than to answer a question about Bayesianism?

The beautiful symmetry of Bayesian updating

The factor by which observing A increases the probability of B is the same as the factor by which observing B increases the probability of A. This factor is P(A and B)/(P(A)·P(B)), which I'll denote by pev(A,B) for reasons to come. It can vary from 0 to +infinity, and allows us to write Bayes' Theorem succinctly in both directions:

P(A|B)=P(A)·pev(A,B), and P(B|A)=P(B)·pev(A,B)

What does this symmetry mean, and how should it affect the way we think?

A great way to think of pev(A,B) is as a multiplicative measure of mutual evidence, which I'll call mutual probabilistic evidence to be specific. If pev=1 if they're independent, if pev>1 they make each other more likely, and if pev<1 if they make each other less likely.

But two ways to think are better than one, so I will offer a second explanation, in terms of information, which I often find quite helpful in analyzing my own beliefs:

Probabilistic evidence is related to "mutual information"

Lets examine a simple example, and work our way up to illustrating what I mean:

Say I flip four fair coins with faces "0" and "1" to generate a 4-bit binary string, X. If I tell you that "X=1???", meaning that the first coin reads "1", this reduces the number of possibilities by ½. We'd like to say here that you've gained "1 bit of information". Suppose instead that I say "X begins with 01 or 10". This has quantitatively the same effect, in that it reduces the number of possibilities by ½, so it should also be called "1 bit of information". You might call the first statement an "explicit bit" in that it explicitly specifies a 1 or 0 in the sequence, but this is merely a qualitative distinction. For once, we're interested in quantity, not quality.

Now, let A be the event "X=111?" (the event that the first three bits come up "1", and the last bit can be anything), which has probability P(A)=2^-3. If A is true but you don't know it, you need to observe exactly 3 independent bits (e.g. the first 3 coins) to confirm it. Intuitively, this is how uncertain A is, because it tells us how far away we are from confirming A. On the other hand, if I tell you A is true, you now only need 1 more independent bit to specify X=111?, so we can say A has "provided 3 bits" of "information". Intuitively, this is how informative A is. These vague ideas nudge us toward the following definition:

The information value of an event

We denote and define the information value of an event A (aka "surprisal" or "self-information", but not in this post) by the formula

inf(A) := log_½(P(A)) = -log₂(P(A))

which in our example is -log₂(2^-3)= 3 bits, just as we'd like. As was suggested, this quantity has two different intuitive meanings, which by the miracle of logic correspond to the same number inf(A), measured in bits:

1) Uncertainty: How many independent bits are required to confirm that A is true.

2) Informativity: How many independent bits are gained if we are told that A is true.

Caution: information value is not "data", but rather it is a number that can tell you how uncertain or how informative the data is. Be on the lookout for when "information" means "data", and when it means "information value."

Mutual information = informational evidence

Next, let B be the event "X=??11", so P(B)=2^-2., and recall that A is the event "X=??11". Both A and B tell us that the third position reads "1", which is independent from the other explicit bits they specify. In this sense, there is 1 bit of "redundancy" in observing both A and B. Notice that A provides 3 bits, B provides 2 bits, but "A and B" together specify that "X=1111" which is only 4 bits, and 3+2-4=1. Thus, we can calculate "redundancy" as

inf(A) + inf(B) - inf(A and B),

which is why this expression is called the mutual information of A and B. But wait... taking -log₂ of probabilistic evidence pev(A,B)=P(A and B)/(P(A)·P(B)) yields exactly the same expression! So I'll also call it informational evidence, and write

iev(A,B) := -log₂pev(A,B) = inf(A) + inf(B) - inf(A and B)

While we're at it, lets just take -log₂ of the rest of Bayes' theorem and see what we get. We can define conditional information value by letting

inf(A|B) := -log₂ P(A|B) = inf(A and B) - inf(B),

and now Bayes' theorem attains the following form:

inf(A|B) = inf(A) - iev(A,B) ← information theoretic Bayes' Theorem

In Bayesian updating, A hasn't happened yet, so here let's use our "uncertainty" interpretation of information value. As you can see from the equation, if iev(A,B) is positive, the uncertainty of A decreases upon observing B, meaning A becomes more likely. If it is negative, the uncertainty of A increases, so A becomes less likely. It ranges from -infinity to +infinity according as A and B completely contradict or completely confirm each other. In summary:

Bayesian updating = subtracting mutual evidence from uncertainty.

This is my other favorite way to think about updating. The fact that evidence can also be thought of as a kind of redundancy or mutual information gives a concrete interpretation for the symmetry of belief updating. As well, since "N bits of information" is so easy to conceptualize as a precise quantity, it gives a quantitative intutive meaning to "how much A and B support each other". In fact, noticing this is what got me interested in information theory in the first place, and I hope it has piqued your interest, too!

What kept me interested is the simple fact that informational evidence behaves so nicely:

(Symmetry) iev(A,B) = iev(B,A)

More examples and discussion to boost your familiarity with information value is provided in Section 3, but for now, lets break for a comparison with two other methods to describe Bayesian updating.

Odds and log odds: a short comparison.

(I think these deserve a special mention, because they have already been discussed on LessWrong.com.)

Bayes' theorem can also be expressed fairly neatly using odds with likelihood ratios, and log odds with log likelihood-ratios. One shortcoming with using odds when updating are that the likelihood-ratio K(B|A)=P(B|A)/P(B|¬A), sometimes called the Bayes factor, is not symmetric, so it does not make the symmetry of updating obvious. Likewise, log likelihood-ratios are not symmetric either.

But odds and log odds have their advantages. For example, if B₁ and B₂ are conditionally independent given A and conditionally independent given ¬A, then K(B₁ and B₂, A) = K(B₁|A)·K(B₂|A), and similarly for any number of B's. These conditions are met naturally when B₁ and B₂ are causal consequences of A which do not causally influence each other. By contrast, in causal systems, it is usually not the case that pev(A, B₁ and B₂) = pev(A,B₁)·pev(A,B₂). (Reading Pearl's "Causality: Models, Reasoning, and Inference" clarified this for me once and for all, by making precise what a "causal system" is.)

Further discussion of information

In our excitement to get to an information theoretic Bayes' theorem, we glossed over a lot of opportunities to stop and reflect, so lets do some more of that here.

Information vs "data" or "knowledge"

C.E. Shannon originally used the full phrase "information value", but nowadays it is often shortened to "information". As mentioned, information is not a synonym for "data" or "knowledge" when used in this way.

It may be help to analogize this with how "mass" is not "matter". If I place 2 grams of matter on the left side of a balance scale, and 3 grams on the right, it will tip to the right, because 3g-2g=1g>0g. Where is this 1 gram of matter? Which "1 gram of matter" is the matter that tips the scales? The question is meaningless, because the 1g doesn't refer to any matter in particular, just a difference in total amounts. But you can ask "how much mass does this matter have?", and likewise "how much information does this data have?".

Why "the redundant information" doesn't make sense

When iev(A,B) is positive, we spoke of the mutual information of A and B as "redundancy". But what is this redundant information? What does it say? Again, this is the "information value is data" fallacy making ill-posed questions. It's somewhat like asking which gram of matter should be removed from the scales above in order to balance it. To illustrate more precisely, suppose again that A says "X=111?" and B says "X=??11". If R is the event "X=??1?", it is tempting to call R "the mutual information" of A and B. Indeed, if we first observe R, then A and B become independent, so there is no more redundancy. But this R is not unique. Any list of 8 outcomes that include the A outcomes and B outcomes would also work this way. For example, we could take R to say "X is one of 0011, 0111, 1011, 1110, 1111, 1000, 0100, 0001".

To infinity and... well, just to infinity.

We saw that information value inf(A) ranges from 0 to +infinity, and can be interpreted either as informativity or uncertainty, depending on whether the event has happened or not. Let's think a little about the extremes of this scale:

That 0 information value corresponds to a 100%-likely event means:

1) 0 informativity: you don't gain any information from observing an event that you already knew was certain (ignoring 0%-likely discrepancies), and

2) 0 uncertainty: you don't require any information to verify an event that is certain to occur , and

That +infinity information value corresponds to a 0%-likely event means:

1) infinite uncertainty: no finite amount of information can convince you of a 0%-likely event (though perhaps an infinite series of tests could bring you arbitrarily close), and

2) infinite informativity: if you observe a 0%-likely event, you might win a Nobel prize (it means someone messed up by having a prior belief of 0% somewhere when they shouldn't have).

For the values in between, more likely = less uncertain = less informative, and less likely = more uncertain = more informative.

What other cool stuff can happen?

To get more comfortable with how information values work, let's return to our random 4-bit string X, generated by flipping four coins:

• Encoding. Let C be the event "X contains exactly one 1", i.e. X=1000, 0100, 0010, or 0001. This happens with probability 4/16=1/4=2^-2, so inf(C) = 2 bits. If C is true, it provides 2 bits of information about X, and using an additional 2 bits we could encode the position of the "1" by writing "first"=00, "second"=01, "third"=10, and "fourth"=11. Thus we end up using 4 bits in total to specify or "encode" X, as we'd expect. In general, there are theorems characterizing information entirely in terms of encoding/decoding, which is part of what makes it so useful in applications.

• Negative evidence. Let D be the event "X starts with 1", which one sees directly as specifying inf(D) = 1 bit of information. It is easy to see that P(D)=1/2 and P(D|C)=1/4, so we know C makes D less likely (and vice versa, by update symmetry!), but lets practice thinking in terms of information. Together, "C and D" just means X=1000, so inf(C and D) = 4 bits: it completely determines X. On the other hand, we saw that inf(C)=2 bits, and inf(D)=1 bit, so iev(C,D) = 2+1-4 = -1, confirming that either of them would present negative evidence for the other.

• Non-integer information values. Being defined as a logarithm, in real life information values are usually not integers, just like probabilities are not usually simple fractions. This is not actually a problem, but reflects a flexibility of the definition. For example, consider the event ¬B: "X does not start with 11", which has probability 3/4, hence inf(¬B)=-log₂(3/4) = 0.415. If we also knew "X does not end with 11", that would give us another 0.415 bits of information (since it's independent!). All of our formulae work just fine with non-integer information values, so we can add these to conclude we have 0.830 bits. This being less than 1 means we still haven't constrained the number possibilities as much as knowing a single bit for certain (i.e. 50%). Indeed, 9 out of the 16 possibilities neither start nor end with 11.

Okay, but is this anything more than a cool trick with logarithms?

Yes! This definition of information has loads of real-world applications that legitimize it as a scientific quantity of interest:

*Communication (bandwidth = information per second),

*Data compression (information = how 'incompressible' the output of a data source is),

*Statistical mechanics and physics (entropy = average uncertainty = expected informativity of observing a system),

and of course, artificial intelligence.

Where can I read more?

Eliezer has written a number of posts which involve the information theory of handling multiple random variables at once. So, if you want to learn more about it, Wikipedia is currently a decent source. The general philosophy is to take expected values of the quantities defined here to obtain analogues for random variables, so you're already half-way there.

For something more coherent and in-depth, A Mathematical Theory of Communication, Shannon (1948), which is credited with pioneering modern information theory, impressively remains a fantastic introduction to the subject. Way to go, Shannon!

There's lots more good stuff to learn in that paper alone, but I'll end this post here. What I think is most relevant to LessWrong readers is the awareness of a precise definition of information, and that it can help you think about beliefs and Bayesianism.

Excellent introduction to information theory! I was a bit concerned about you allowing mutual information to be negative and avoiding the "odds amplification by Bayes factor" version of Bayes's Theorem, but then you went right on and justified those.

To the "further reading" section, I would add the great free textbook on the topic by David MacKay.

Now for some nitpicks: Your informativeness function inf(X) also goes by the standard name surprisal or self-information.

Also, there were more things you could have mentioned for the applications section, like

how it makes the "off-weight tennis ball problem" much easier by leading you to choose weighings that maximize the expected informativeness (aka entropy). (The problem is this: given 12 tennis balls, one of which is heavier or lighter than the rest, find which one is different, and whether it's heavier or lighter, using a balance scale only three times.)
how to provide useless answers to people: You don't just give false information, as that has mutual information with the subject matter; you have to visibly make your answers dependent on something random in order to ensure it's not informative.

Yes, I'll edit in those synonyms for reference. In defense of "Informativity" and "inf", they're defined for a single event in section 2.3 of this paper:

http://www.springerlink.com/content/m628x0774718237k/

"Information value" is Shannon's term. I just picked the terms I liked best :)

Nice application to the tennis balls problem! I'd heard that a long time ago, but hadn't though about it since I met information theory.

For something more coherent and in-depth, A Mathematical Theory of Communication, Shannon (1948), which is credited with pioneering modern information theory, impressively remains a fantastic introduction to the subject.

One year later, in 1949, "A Mathematical Theory of Communication" was reprinted as "The Mathematical Theory of Communication". Which shows how rapidly people recognized that this came from The Book.

Thomas and Cover is a good, readable book on info theory.

My favorite neat info theory puzzle asks how can you force a source of biased coin flips to give you unbiased coin flips instead, even if you don't know the bias of the coin. (The glib answer is to zip the bits you get, a more precise answer is that that's what universal source codes do, if the bit sequence is long enough they will compress to entropy, regardless of what distribution generated the bits. It's actually kind of surprising and weird that universal source codes are a thing).

Info theorists reinvented causality in the 90s (the academic coordination problem is hard...):

http://arxiv.org/pdf/1110.0718v1.pdf

(and lots of papers on channels with feedback).

(the academic coordination problem is hard...)

Ain't that the truth, brother.

There is a sign problem when iev(A,B) is defined. You mention that you can get the mutual information of A and B by taking -log_2 of probabilistic evidence pev(A,B) = P(AB) / [P(A)P(B)], but this creates an extra negative sign:

-log_2(pev(A,B)) = -log_2[P(AB) / [P(A)P(B)]] = -[ log_2(P(AB)) - [log_2(P(A)) + log_2(P(B))] ] = -log_2(P(AB)) + log_2(P(A)) + log_2(P(B)) = inf(AB) - inf(A) - inf(B).

A great way to think of pev(A,B) is as a multiplicative measure of mutual evidence, which I'll call mutual probabilistic evidence to be specific. If pev=1 if they're independent, if pev>1 they make each other more likely, and if pev<1 if they make each other less likely. We can say things like probabilistic evidence from independent events can be multiplied.

No. Let A be a proposition we believe initially with probability 1/2. Let B1 and B2 be independent tests of A, such that if you know A or if you know not A, then knowing B1 does not give you any additional information about B2. Suppose B1 and B2 both have probability 9/10 conditional on A, and 1/10 conditional on not A. Then Pev(A,B1) = Pev(A,B2) = 9/5, and the product Pev(A,B1)*Pev(A,B2) = 81/25. But Pev(A,B1 and B2)= 81/41.

The reason this works for likelihood ratios, but not mutual evidence, is that L(A|B) does not depend on P(A), like Pev(A,B) does. The problem with multiplying Pev's is that you have to update P(A) first, that is, you really should multiply P(A,B1|X) by P(A,B2|B1 and X).

Ah, in the multiplicativity/additivity results I should have added "conditionally independent on A". I just made the correction, many thanks.

The proof should be easy to see now that it's stated correctly: you write down the conclusion pev(A,BC)=pev(A,B)·pev(A,C) and observe that it is the quotient of the independence assumptions,

P(BC|A) = P(B|A)·P(C|A) and

P(BC) = P(B)·P(C)

I've also added mention of the independence conditions under which likelihood ratios do multiply; I think the comparison is better this way. Cheers!

Given these assumptions, your theorem holds. But, the condition P(BC) = P(B)·P(C) is problematic. An important point you are glossing over is that in epistemological probability theory, all probabilities are conditional probabilities, and the condition is more formally P(BC|X) = P(B|X)*P(C|X), where X represents a state of belief. The problem is that, since B and C both depend on A, the condition is unstable with respect to P(A|X). While it is possible to come up with contrived examples of A, B, C, and X where this works, it is not the sort of thing I would expect to come up naturally and be actually useful in implementing an epistemology in the real world.

Contrasting with likelihood ratios, the example I gave earlier could represent the situation where you know a coin is biased to land on one side 9/10 of the time when flipped, but you don't know which, A represents that it is biased to show heads, and B1 and B2 represent that it does show heads in two separate trial flips.

X represents a state of belief .... The problem is that, since B and C both depend on A, the condition (1) [P(BC|X) = P(B|X)·P(C|X)] is unstable with respect to P(A|X)

Isn't this equally a problem for likelihoods? If I understand you, the "independent tests" assumptions for likelihood-ratio multiplying are also "unstable with respect to P(A|X)". Explicitly, I mean

Even if P(A|X) stays constant, "changing X", i.e. making a new observation D, can change all of these assumptions, no?

If I've misunderstood, could you please explain in full detail

(0) What you mean by "unstable with respect to P(A|X)",

(1) How the "before and after" assumptions are "unstable", and

(2) How the "independent tests" assumptions are "stable"?

Even if this isn't it, I'm very curious to know if there are epistemological reasons to favor assumptions (1) over assumptions (2)...

Isn't this equally a problem for likelihoods? If I understand you, the "independent tests" assumptions for likelihood-ratio multiplying are also "unstable with respect to P(A|X)". Explicitly, I mean

(2) P(BC|XA) = P(B|XA)·P(C|XA) and P(BC|X~A) = P(B|X~[correction]A)·P(C|X~A).

It is not a problem for likelihood ratios because the inclusion of A or of ~A in the condition of each probability in (2) screens off P(A|X). When you assume that A is true (or when you assume A is false), you no longer need to consider the probability of A when figuring out the conclusions of that assumption.

Even if P(A|X) stays constant, "changing X", i.e. making a new observation D, can change all of these assumptions, no?

Yes, you can come up with scenarios where (2) does not apply, and you cannot multiply likelihood ratios. My point was that there are plausible scenarios where it does apply, and this sort of thing actually happens in real life.

(0) What you mean by "unstable with respect to P(A|X)"

A condition is unstable with respect to R if it can be true for a certain value of R, but this does not imply that it will be true for other values of R.

(1) How the "before and after" assumptions are "unstable",

In order to simultaneously have B and C be independent given X and given XA, they need to have a complicated dependence given X~A, this dependence itself depending on P(A|X).

(2) How the "independent tests" assumptions are "stable"?

As I stated earlier, the condition A or the condition ~A screens off P(A|X).

I would have expected this to be clear if you had applied your questions to my example, and tried to construct a corresponding example in which Pev's can be multiplied.

It is not a problem for likelihood ratios because the inclusion of A or of ~A in the condition of each probability in (2) screens off P(A|X)

That your prose is not accompanied by any equations makes it very effortful to filter all its possible meanings to find the intelligent one you no doubt intend, and impractical to respond. I might disagree that

In order to simultaneously have B and C be independent given X and given XA, they need to have a complicated dependence given X~A, this dependence itself depending on P(A|X).

but I can't be sure, and I have nothing precise to refer back to. (You don't have to oblige me, of course.)

It is not a problem for likelihood ratios because the inclusion of A or of ~A in the condition of each probability in (2) screens off P(A|X)

That your prose is not accompanied by any equations makes it very effortful to filter all its possible meanings to find the intelligent one you no doubt intend, and impractical to respond.

First, that is not a response to your quote from me. In that quote I was in fact pointing to elements of an equation.

I might disagree that

In order to simultaneously have B and C be independent given X and given XA, they need to have a complicated dependence given X~A, this dependence itself depending on P(A|X).

but I can't be sure, and I have nothing precise to refer back to. (You don't have to oblige me, of course.)

By "before" independence, we have

(1) P(B) = P(B|C).

Decompose B into BA or ~AB

(2) P(B|C) = P(AB|C) + P(~AB|C)
(3) P(B|C) = P(A|C)P(B|AC) + P(~A|C)P(B|~AC)

Apply "after" independence to (3)

(4) P(B|C) = P(A|C)P(B|A) + P(~A|C)P(B|~AC)

Combing (1) and (4)

(5) P(B) = P(A|C)P(B|A) + P(~A|C)P(B|~AC)

Solve for P(B|~AC)

(6) P(B|~AC) = (P(B) - P(A|C)P(B|A))/P(~A|C)

Use Bayes' Law substitute out conditional probabilities of A

(7) P(B|~AC) = (P(B) - P(A)(P(C|A)/P(C))P(B|A))/(P(~A)P(C|~A)/P(C))

I believe that (7) qualifies as a complicated dependence given ~A, which itself depends on P(A). This is just the result of my exploration of the issue, but you asked for it.

Also, I would like to reiterate:

I would have expected this to be clear if you had applied your questions to my example, and tried to construct a corresponding example in which Pev's can be multiplied.

Here's a simple example where pev(A,BC)=pev(A,B)pev(A,C):

[EG1] P(A)=1/4, P(B)=P(C)=1/2, P(BC)=1/4, P(AC)=P(BC)=1/16, P(ABC)=1/64.

Then just suppose there are some medical diagnostics with these probabilities, etc. But see below for a comparison of this with your coin scenario.

More generally, let a=P(A)=1/4, b=P(B), c=P(C), x=P(BC), y=P(AC), z=P(AB), t=P(ABC).

The "before and after independence" assumption [BAI] is that x=bc and at=yz

The "independent tests" assumption [IT] is that (x-t)=(b-z)(c-y) and at=yz.

(2) How the "independent tests" assumptions are "stable"?

As I stated earlier, the condition A or the condition ~A screens off P(A|X).

Neither [BAI] nor [IT] is stable with respect to the variable a=P(A), and I see no equation involving anything I'd call "screening off", though there might be one somewhere.

In any case, that does not interest me, because your equation (7) has my attention:

(7) P(B|~AC) = (P(B) - P(A)(P(C|A)/P(C))P(B|A))/(P(~A)P(C|~A)/P(C))

A qualitative distinction between my would-be medical scenario [EG1] and your coin scenario is that medical diagnostics, and particularly their dependence/independence, are not always "causally justified", but two coin flips can be seen as independent because they visibly just don't interact.

I bet your (7) gives a good description of this somehow, or something closely related... But I still have to formalize my thoughts about this.

Let me think about it awhile, and I'll post again if I understand or wonder something more precise.

[EG1] P(A)=1/4, P(B)=P(C)=1/2, P(BC)=1/4, P(AC)=P(BC)=1/16, P(ABC)=1/64.

I assume you meant:

[EG1] P(A)=1/4, P(B)=P(C)=1/2, P(BC)=1/4, P(AC)=P(AB)=1/16, P(ABC)=1/64.

Then just suppose there are some medical diagnostics with these probabilities, etc.

You just glossed over the whole point of this exercise. The problem is that values such as P(ABC) are combined facts about the population and both tests. Try defining the scenario using only facts about the population in isolation (P(A)), about the tests in isolation (P(B|A), P(C|A), P(B|~A), P(C|~A)), and the dependence between the tests (P(B|AC), P(B|A~C), P(B|~AC), P(B|~A~C), [EDIT: removed redundant terms, I got carried away when typing out the permutations]). The point is to demonstrate how you have to contrive certain properties of the dependence between the tests, the conditional probabilities given ~A, to make summary properties you care about work out for a specific separate fact about the population, P(A).

This article gives me a better perspective as to why scientific journals like to publish surprising results rather than unsurprising ones -- they are more informative, for the very reason they are likelier to be wrong.

Yes... if a theory adds to the surprisal of an experimental result, then the experimental result adds precisely the same amount of the surprisal of the theory. That's interesting.

Typo: Next, let B be the event "X=??11", so P(B)=2 -2 ., and recall that A is the event "X=??11"., A should be X=111?

Interesting Log-Odds paper by Brian Lee and Jacob Sanders, November 2011.

It would be interesting to know if they have any particular justification for picking the natural logarithm or if it was simply a matter of using what first came to mind. Other likely candidates seem to be:

log base 2 (a.k.a. bits)
10 * log base 10 (a.k.a. decibels)
log base 10

Roll on the log odds standardisation effort - by the sound of it!

Yeah, as a software engineer, I can estimate log base 2 in my head, so that would be much more natural for me. But I could also see an argument for decibels since it's a unit that's already in use for other things.

Helpfully, log scales are off from each other by a constant factor, so whatever techniques they use ln for can be used just as well for log base 2. Just update any constants to the correct (much prettier) values.

Much like how inches and centimeters are off by a constant factor. Different log bases are analogous to different units.

Hi - I am just puzzled why inf(A|B) = inf(A) - iev(A, B).

Here is my derivation: inf(A|B) = - log(P(A|B)) = - log(P(A)*pev(A, B)) = -log(P(A)) - log(pev(A, B)) = inf(A) + iev(A, B).

Could you please tell me where is wrong? Thank you very much!

This looks a wonderful post. I think I am going to get a notebook and go through it all step-by-step. May have some questions.

Okay, so let's take an example. Suppose there's a disease with prevalence P(D) = 6%, and a test with true positive rate P(+|D) = 90% and false positive rate P(+|~D) = 10%.

We have seen a positive result on the test.

We take the information-theoretic version of Bayes theorem:

inf(D|+) = inf(D) - iev(D,+)

         = inf(D) - inf(D) - inf(+) + inf(D and +)

         = -inf(+) + inf(rate of TRUE positives)

         = log2[ P(+|D)P(D) + P(+|~D)P(~D) ] - log2[P(+|D)P(D)]

inf(D|+) = log2[0.148] - log2[0.9*0.06] ~= 1.45 bits (= 36%)

Now suppose the prevalence of the disease was 70%; then we find

inf(D|+) ~= 0.07 bits (= 95%)

Which makes sense, because the second test merely confirmed what was already likely; hence it is less informative (but even the first is not terribly, due to low prior).

Yeah, I can definitely see the appeal of this method. Great post, thanks!

An even more convenient example:

1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?

inf(C) = -log2(0.01) = 6.64
iev(+,C) = inf(+) + inf(C) - inf(+ & C)
         = - log2(0.01*0.8+0.99*0.096) - log2(0.01) + log2(0.01*0.8)
         = 3.28 + 6.64 - 6.97 = 2.95
inf(C|+) = inf(C) - iev(+,C) = 6.64 - 2.95 = 3.69
P(C|+) = 2^(-3.69) = 7.7%

It works! That was fun.

Hmm. An old book that we should read to learn science...

To the "further reading" section, I would add the great free textbook on the topic by David MacKay.

Now for some nitpicks: Your informativeness function inf(X) also goes by the standard name surprisal or self-information.

Also, there were more things you could have mentioned for the applications section, like

how it makes the "off-weight tennis ball problem" much easier by leading you to choose weighings that maximize the expected informativeness (aka entropy). (The problem is this: given 12 tennis balls, one of which is heavier or lighter than the rest, find which one is different, and whether it's heavier or lighter, using a balance scale only three times.)
how to provide useless answers to people: You don't just give false information, as that has mutual information with the subject matter; you have to visibly make your answers dependent on something random in order to ensure it's not informative.

Yes, I'll edit in those synonyms for reference. In defense of "Informativity" and "inf", they're defined for a single event in section 2.3 of this paper:

http://www.springerlink.com/content/m628x0774718237k/

"Information value" is Shannon's term. I just picked the terms I liked best :)

Nice application to the tennis balls problem! I'd heard that a long time ago, but hadn't though about it since I met information theory.

For something more coherent and in-depth, A Mathematical Theory of Communication, Shannon (1948), which is credited with pioneering modern information theory, impressively remains a fantastic introduction to the subject.

Thomas and Cover is a good, readable book on info theory.

Info theorists reinvented causality in the 90s (the academic coordination problem is hard...):

http://arxiv.org/pdf/1110.0718v1.pdf

(and lots of papers on channels with feedback).

(the academic coordination problem is hard...)

Ain't that the truth, brother.

-log_2(pev(A,B)) = -log_2[P(AB) / [P(A)P(B)]] = -[ log_2(P(AB)) - [log_2(P(A)) + log_2(P(B))] ] = -log_2(P(AB)) + log_2(P(A)) + log_2(P(B)) = inf(AB) - inf(A) - inf(B).

A great way to think of pev(A,B) is as a multiplicative measure of mutual evidence, which I'll call mutual probabilistic evidence to be specific. If pev=1 if they're independent, if pev>1 they make each other more likely, and if pev<1 if they make each other less likely. We can say things like probabilistic evidence from independent events can be multiplied.

Ah, in the multiplicativity/additivity results I should have added "conditionally independent on A". I just made the correction, many thanks.

The proof should be easy to see now that it's stated correctly: you write down the conclusion pev(A,BC)=pev(A,B)·pev(A,C) and observe that it is the quotient of the independence assumptions,

P(BC|A) = P(B|A)·P(C|A) and

P(BC) = P(B)·P(C)

I've also added mention of the independence conditions under which likelihood ratios do multiply; I think the comparison is better this way. Cheers!

X represents a state of belief .... The problem is that, since B and C both depend on A, the condition (1) [P(BC|X) = P(B|X)·P(C|X)] is unstable with respect to P(A|X)

Isn't this equally a problem for likelihoods? If I understand you, the "independent tests" assumptions for likelihood-ratio multiplying are also "unstable with respect to P(A|X)". Explicitly, I mean

Even if P(A|X) stays constant, "changing X", i.e. making a new observation D, can change all of these assumptions, no?

If I've misunderstood, could you please explain in full detail

(0) What you mean by "unstable with respect to P(A|X)",

(1) How the "before and after" assumptions are "unstable", and

(2) How the "independent tests" assumptions are "stable"?

Even if this isn't it, I'm very curious to know if there are epistemological reasons to favor assumptions (1) over assumptions (2)...

Isn't this equally a problem for likelihoods? If I understand you, the "independent tests" assumptions for likelihood-ratio multiplying are also "unstable with respect to P(A|X)". Explicitly, I mean

(2) P(BC|XA) = P(B|XA)·P(C|XA) and P(BC|X~A) = P(B|X~[correction]A)·P(C|X~A).

Even if P(A|X) stays constant, "changing X", i.e. making a new observation D, can change all of these assumptions, no?

(0) What you mean by "unstable with respect to P(A|X)"

A condition is unstable with respect to R if it can be true for a certain value of R, but this does not imply that it will be true for other values of R.

(1) How the "before and after" assumptions are "unstable",

In order to simultaneously have B and C be independent given X and given XA, they need to have a complicated dependence given X~A, this dependence itself depending on P(A|X).

(2) How the "independent tests" assumptions are "stable"?

As I stated earlier, the condition A or the condition ~A screens off P(A|X).

I would have expected this to be clear if you had applied your questions to my example, and tried to construct a corresponding example in which Pev's can be multiplied.

It is not a problem for likelihood ratios because the inclusion of A or of ~A in the condition of each probability in (2) screens off P(A|X)

In order to simultaneously have B and C be independent given X and given XA, they need to have a complicated dependence given X~A, this dependence itself depending on P(A|X).

but I can't be sure, and I have nothing precise to refer back to. (You don't have to oblige me, of course.)

It is not a problem for likelihood ratios because the inclusion of A or of ~A in the condition of each probability in (2) screens off P(A|X)

That your prose is not accompanied by any equations makes it very effortful to filter all its possible meanings to find the intelligent one you no doubt intend, and impractical to respond.

First, that is not a response to your quote from me. In that quote I was in fact pointing to elements of an equation.

I might disagree that

In order to simultaneously have B and C be independent given X and given XA, they need to have a complicated dependence given X~A, this dependence itself depending on P(A|X).

but I can't be sure, and I have nothing precise to refer back to. (You don't have to oblige me, of course.)

By "before" independence, we have

(1) P(B) = P(B|C).

Decompose B into BA or ~AB

(2) P(B|C) = P(AB|C) + P(~AB|C)
(3) P(B|C) = P(A|C)P(B|AC) + P(~A|C)P(B|~AC)

Apply "after" independence to (3)

(4) P(B|C) = P(A|C)P(B|A) + P(~A|C)P(B|~AC)

Combing (1) and (4)

(5) P(B) = P(A|C)P(B|A) + P(~A|C)P(B|~AC)

Solve for P(B|~AC)

(6) P(B|~AC) = (P(B) - P(A|C)P(B|A))/P(~A|C)

Use Bayes' Law substitute out conditional probabilities of A

(7) P(B|~AC) = (P(B) - P(A)(P(C|A)/P(C))P(B|A))/(P(~A)P(C|~A)/P(C))

I believe that (7) qualifies as a complicated dependence given ~A, which itself depends on P(A). This is just the result of my exploration of the issue, but you asked for it.

Also, I would like to reiterate:

I would have expected this to be clear if you had applied your questions to my example, and tried to construct a corresponding example in which Pev's can be multiplied.

Here's a simple example where pev(A,BC)=pev(A,B)pev(A,C):

[EG1] P(A)=1/4, P(B)=P(C)=1/2, P(BC)=1/4, P(AC)=P(BC)=1/16, P(ABC)=1/64.

Then just suppose there are some medical diagnostics with these probabilities, etc. But see below for a comparison of this with your coin scenario.

More generally, let a=P(A)=1/4, b=P(B), c=P(C), x=P(BC), y=P(AC), z=P(AB), t=P(ABC).

The "before and after independence" assumption [BAI] is that x=bc and at=yz

The "independent tests" assumption [IT] is that (x-t)=(b-z)(c-y) and at=yz.

(2) How the "independent tests" assumptions are "stable"?

As I stated earlier, the condition A or the condition ~A screens off P(A|X).

Neither [BAI] nor [IT] is stable with respect to the variable a=P(A), and I see no equation involving anything I'd call "screening off", though there might be one somewhere.

In any case, that does not interest me, because your equation (7) has my attention:

(7) P(B|~AC) = (P(B) - P(A)(P(C|A)/P(C))P(B|A))/(P(~A)P(C|~A)/P(C))

I bet your (7) gives a good description of this somehow, or something closely related... But I still have to formalize my thoughts about this.

Let me think about it awhile, and I'll post again if I understand or wonder something more precise.

[EG1] P(A)=1/4, P(B)=P(C)=1/2, P(BC)=1/4, P(AC)=P(BC)=1/16, P(ABC)=1/64.

I assume you meant:

[EG1] P(A)=1/4, P(B)=P(C)=1/2, P(BC)=1/4, P(AC)=P(AB)=1/16, P(ABC)=1/64.

Then just suppose there are some medical diagnostics with these probabilities, etc.

Yes... if a theory adds to the surprisal of an experimental result, then the experimental result adds precisely the same amount of the surprisal of the theory. That's interesting.

Typo: Next, let B be the event "X=??11", so P(B)=2 -2 ., and recall that A is the event "X=??11"., A should be X=111?

Interesting Log-Odds paper by Brian Lee and Jacob Sanders, November 2011.

log base 2 (a.k.a. bits)
10 * log base 10 (a.k.a. decibels)
log base 10

Roll on the log odds standardisation effort - by the sound of it!

Much like how inches and centimeters are off by a constant factor. Different log bases are analogous to different units.

Hi - I am just puzzled why inf(A|B) = inf(A) - iev(A, B).

Here is my derivation: inf(A|B) = - log(P(A|B)) = - log(P(A)*pev(A, B)) = -log(P(A)) - log(pev(A, B)) = inf(A) + iev(A, B).

Could you please tell me where is wrong? Thank you very much!

This looks a wonderful post. I think I am going to get a notebook and go through it all step-by-step. May have some questions.

Okay, so let's take an example. Suppose there's a disease with prevalence P(D) = 6%, and a test with true positive rate P(+|D) = 90% and false positive rate P(+|~D) = 10%.

We have seen a positive result on the test.

We take the information-theoretic version of Bayes theorem:

inf(D|+) = inf(D) - iev(D,+)

         = inf(D) - inf(D) - inf(+) + inf(D and +)

         = -inf(+) + inf(rate of TRUE positives)

         = log2[ P(+|D)P(D) + P(+|~D)P(~D) ] - log2[P(+|D)P(D)]

inf(D|+) = log2[0.148] - log2[0.9*0.06] ~= 1.45 bits (= 36%)

Now suppose the prevalence of the disease was 70%; then we find

inf(D|+) ~= 0.07 bits (= 95%)

Which makes sense, because the second test merely confirmed what was already likely; hence it is less informative (but even the first is not terribly, due to low prior).

Yeah, I can definitely see the appeal of this method. Great post, thanks!

An even more convenient example:

1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?

inf(C) = -log2(0.01) = 6.64
iev(+,C) = inf(+) + inf(C) - inf(+ & C)
         = - log2(0.01*0.8+0.99*0.096) - log2(0.01) + log2(0.01*0.8)
         = 3.28 + 6.64 - 6.97 = 2.95
inf(C|+) = inf(C) - iev(+,C) = 6.64 - 2.95 = 3.69
P(C|+) = 2^(-3.69) = 7.7%

It works! That was fun.

Hmm. An old book that we should read to learn science...

65

Information theory and the symmetry of updating beliefs

65

The beautiful symmetry of Bayesian updating

Odds and log odds: a short comparison.

Further discussion of information

65

65