PhilGoetz comments on Open thread, Mar. 9 - Mar. 15, 2015 - Less Wrong Discussion
Basic question about bits of evidence vs. bits of information:
I want to know the value of a random bit. I'm collecting evidence about the value of this bit.
First off, it seems weird to say "I have 33 bits of evidence that this bit is a 1." What is a bit of evidence, if it takes an infinite number of bits of evidence to get 1 bit of information?
Second, each bit of evidence gives you a likelihood multiplier of 2. E.g., a piece of evidence with a 4:1 likelihood ratio in favor of the bit being a 1 gives you 2 bits of evidence about the value of that bit; independent evidence with a 2:1 likelihood ratio gives you 1 bit of evidence.
But that means a one-bit evidence-giver is someone who is right 2/3 of the time. Why 2/3?
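Checking that arithmetic with a minimal sketch in Python (the variable names are mine):

    from math import log2

    # One bit of evidence = a likelihood ratio of 2:1.
    likelihood_ratio = 2
    print(log2(likelihood_ratio))  # 1.0 bit of evidence

    # Starting from even prior odds (1:1), posterior odds are 2:1,
    # i.e. the evidence-giver is right with probability 2/3.
    posterior_odds = 1 * likelihood_ratio
    print(posterior_odds / (posterior_odds + 1))  # 0.666...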
Finally, if you knew nothing about the bit, and had the probability distribution Q = (P(1)=.5, P(0)=.5), and a one-bit evidence-giver gave you 1 bit saying it was a 1, you now have the distribution P = (2/3, 1/3). The KL divergence D(P||Q) (log base 2) is only 0.0817, so it looks like you've gained .08 bits of information from your 1 bit of evidence. ???
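That divergence, as a sketch (the helper kl_bits is my own name, not a library function):

    from math import log2

    def kl_bits(p, q):
        # KL divergence D(p || q) in bits; p and q are lists of probabilities.
        return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    Q = [0.5, 0.5]        # know nothing about the bit
    P = [2/3, 1/3]        # after 1 bit of evidence that it's a 1
    print(kl_bits(P, Q))  # ~0.0817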
I think I was wrong to say that 1 bit of evidence = a likelihood multiplier of 2.
If you have a signal S, and P(x|S) = 1 while P(x|~S) = .5, then the likelihood multiplier is 2 and you get 1 bit of information, as computed by KL divergence. That signal did in fact require an infinite amount of evidence to make P(x|S) = 1, I think, so it's a theoretical signal found only in math problems, like a frictionless surface in physics.
If you have a signal S, and P(x|S) = .5 while P(x|~S) = .25, then the likelihood multiplier is 2, but you get only .2075 bits of information.
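Both cases, reusing the kl_bits helper above and treating P(x|~S) as the baseline distribution (my reading of the computation; the numbers match):

    # Signal 1: P(x|S) = 1, P(x|~S) = .5 -- likelihood multiplier 2
    print(kl_bits([1.0, 0.0], [0.5, 0.5]))    # 1.0 bit

    # Signal 2: P(x|S) = .5, P(x|~S) = .25 -- likelihood multiplier 2
    print(kl_bits([0.5, 0.5], [0.25, 0.75]))  # ~0.2075 bits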
There's a discussion of a similar question on stats.stackexchange.com. It appears that the sum, over a series of observations x, of the log likelihood ratio

    log( P(x | model 2) / P(x | model 1) )
approximates the information gain from changing from model 1 to model 2, but not on a term-by-term basis. The approximation relies on the empirical frequencies of the observations, over the entire series, being close to model 2's distribution.
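A small simulation of that claim (the coin biases 0.5 and 0.75 are just numbers I picked): drawing observations from model 2, the average log likelihood ratio per observation converges to D(model 2 || model 1), so the sum converges to n times the information gain.

    import random
    from math import log2

    random.seed(0)
    p1, p2 = 0.5, 0.75   # P(heads) under model 1 and model 2

    def log_lr(heads):
        # log2 of P(x | model 2) / P(x | model 1) for one coin flip
        return log2((p2 if heads else 1 - p2) / (p1 if heads else 1 - p1))

    # Observations drawn from model 2
    n = 100_000
    avg = sum(log_lr(random.random() < p2) for _ in range(n)) / n

    # Exact expected value: D(model 2 || model 1)
    kl = p2 * log2(p2 / p1) + (1 - p2) * log2((1 - p2) / (1 - p1))
    print(avg, kl)  # both ~0.1887 bits per observation

Note that individual terms of the sum can even be negative (a tails flip favors model 1), which is why the approximation doesn't hold term by term.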