Followup To: Foundations of Probability
In the previous post, we reviewed reasons why having probabilities is a good idea. These foundations defined probabilities as numbers following certain rules, like the product rule and the rule that probabilities of mutually exclusive events sum to at most 1. These probabilities have to hang together as a coherent whole. But the requirement that probabilities hang together a certain way doesn't actually tell us what numbers to assign.
I can say a coin flip has P(heads)=0.5, or I can say it has P(heads)=0.999; both are perfectly valid probabilities, as long as P(tails) is consistent. This post will be about how to actually get to the numbers.
If the probabilities aren't fully determined by our desiderata, what do we need to determine the probabilities? More desiderata!
Our final desideratum is motivated by the perspective that our probability is based on some state of information. This is acknowledged explicitly in Cox's scheme, but is also just a physical necessity for any robot we build. Thus we add our new desideratum: Assign probabilities that are consistent with the information you have, but don't make up any extra information. It turns out this is enough to let us put numbers to the probabilities.
In its simplest form, this desideratum is a symmetry principle. If you have the exact same information about two events, you should assign them the same probability - giving them different probabilities would be making up extra information. So if your background information is "Flip a coin; the mutually exclusive and exhaustive events are heads and tails," there is a symmetry between the labels "heads" and "tails," which given our new desideratum lets us assign each P=0.5.
Sometimes, though, we need to pull out the information theory. Using the requirement that merely regrouping the same possibilities shouldn't create information, we can derive a function called "information entropy" (for the full derivation, see chapter 11 of Jaynes). The entropy of a probability distribution measures how uncertain you are. If I flip a coin and don't know the outcome, I have one bit of entropy. If I flip two coins, I have two bits of entropy. In this way, the entropy is like the amount of information you're "missing" about the coin flips.
The mathematical expression for information entropy is the negative of the sum of each probability multiplied by its log: Entropy = -Sum( P(x)·Log(P(x)) ), where the sum runs over mutually exclusive events x. Assigning probabilities is all about maximizing the entropy while obeying the constraints of our prior information.
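To make the formula concrete, here's a minimal sketch in Python (using log base 2, so entropy comes out in bits; the function name and example numbers are just for illustration):

```python
import math

def entropy(probs):
    """Shannon entropy -sum(p * log2(p)) of a distribution over
    mutually exclusive events; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))        # one fair coin: 1.0 bit
print(entropy([0.25] * 4))        # two fair coins: 2.0 bits
print(entropy([0.999, 0.001]))    # a near-certain coin: ~0.011 bits
```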
Suppose we roll a 4-sided die. Our starting information consists of our knowledge that there are sides numbered 1 to 4 (events 1, 2, 3, and 4 are exhaustive), and the die will land on just one of these sides (they're mutually exclusive). This lets us write our information entropy as -P(1)·Log(P(1)) - P(2)·Log(P(2)) - P(3)·Log(P(3)) - P(4)·Log(P(4)).
Finding the probabilities is a maximization problem, subject to the constraints of our prior information. For the simple 4-sided die, our information just says that the probabilities have to add to 1. The fact that the entropy function is concave down already tells us that the maximum is at the most even split possible - each side has a 1/4 chance of showing.
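Here's a quick numerical check of that claim, as a sketch using scipy's general-purpose constrained optimizer (nothing here is specific to dice):

```python
import numpy as np
from scipy.optimize import minimize

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)    # keep the log well-defined near 0
    return float(np.sum(p * np.log2(p)))

# Maximize entropy subject only to the probabilities summing to 1.
result = minimize(
    neg_entropy,
    x0=[0.1, 0.2, 0.3, 0.4],      # any starting guess
    bounds=[(0.0, 1.0)] * 4,
    constraints=[{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}],
)
print(result.x)                    # ~[0.25, 0.25, 0.25, 0.25]
```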
That was pretty commonsensical. To showcase the power of maximizing information entropy, we can add an extra constraint.
If we have additional knowledge that the average roll of our die is 3, then we want to maximize -P(1)·Log(P(1)) - P(2)·Log(P(2)) - P(3)·Log(P(3)) - P(4)·Log(P(4)), given that the probabilities sum to 1 and the average roll is 3. We can either plug in the constraints and set partial derivatives to zero, or we can use a maximization technique like Lagrange multipliers.
When we do this (again, more details in Jaynes ch. 11), it turns out that the probability distribution is shaped like an exponential curve. Which was unintuitive to me - my intuition likes straight lines. But it makes sense if you look at the partial derivative of the information entropy: setting -(1 + Log(P(x))) equal to the Lagrange multiplier terms, which are a constant plus a multiple of x, makes Log(P(x)) linear in x - that is, P(x) is exponential in x. The steepness of the exponential controls how far the average roll is shifted.
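Here's a sketch of that calculation in Python, assuming the exponential form P(x) ∝ exp(λ·x) that the Lagrange-multiplier algebra produces, and solving numerically for the λ that makes the average roll equal 3:

```python
import numpy as np
from scipy.optimize import brentq

x = np.array([1, 2, 3, 4])

def mean_roll(lam):
    p = np.exp(lam * x)
    p /= p.sum()
    return p @ x

# Find the multiplier that shifts the average roll from 2.5 up to 3.
lam = brentq(lambda l: mean_roll(l) - 3.0, -5.0, 5.0)
p = np.exp(lam * x)
p /= p.sum()
print(p)               # ~[0.120, 0.182, 0.277, 0.421]
print(p[1:] / p[:-1])  # constant ratio between neighbors: an exponential
```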
The need for this extra desideratum has not always been obvious. People are able to intuitively figure out that a fair coin lands heads with probability 0.5. Seeing that their intuition is so useful, some people include that intuition as a fundamental part of their method of probability. The counter to this is to focus on constructing a robot, which only has those intuitions we can specify unambiguously.
Another alternative to assigning probabilities based on maximum entropy is to pick a standard prior and use that. Sometimes this works wonderfully - it would be silly to rederive the binomial distribution every time you run into a coin-flipping problem. But sometimes people will use a well-known prior even if it doesn't match the information they have, just because their procedure is "use a well-known prior." The only way to be safe from that mistake and from interminable disputes over "which prior is right" is to remember that a prior is only correct insofar as it captures some state of information.
Next post, we will finally get to the problem of logical uncertainty, which will shake our foundations a bit. But I really like the principle of not making up information - even a robot that can't do hard math problems can aspire to not make up information.
Part of the sequence Logical Uncertainty
Previous Post: Foundations of Probability
Next post: Logic as Probability
Okay, I have an answer for you.
In doing the Bayesian updating method, you assumed that the die has some weights, and that the different possible weightings of the die are events in event-space. This assumption is a very good one for a physical die, and the nature of the assumption is most obvious from the Kolmogorov and Savage perspectives.
Then, when translating the information that the expected roll was 5/2, you translated it as "weight 1 + 2·weight 2 + 3·weight 3 = 5/2." (Note that this is not necessary! If you're symmetrically uncertain about the weights, the expected roll can still be 5/2. Frequentist intuitions are so sneaky :P )
What does the maximum entropy principle say if we give it that same information? The exact same answer you got! It maximizes entropy over those different possibilities in event-space, and the constraint that the expected roll computed from the weights is 5/2 is interpreted in just the way you'd expect, leaving a straight line of allowed weight vectors in event-space, each assigned equal probability. Thus, maxent gives the same answer as Bayes' theorem for this question, and it certainly seems like it did so given the same information you used for Bayes' theorem.
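Here's a sketch of that weight-space calculation for a 3-sided die, assuming "equal probability along the line" means a uniform density over the allowed segment; the predictive probability of each face is then just the expected weight:

```python
import numpy as np

# Solving w1 + w2 + w3 = 1 and w1 + 2*w2 + 3*w3 = 5/2 with all w >= 0
# leaves the segment w = (t - 1/2, 3/2 - 2t, t) for t in [1/2, 3/4].
# The direction vector (1, -2, 1) is constant, so uniform in t is
# uniform along the segment.
t = np.linspace(0.5, 0.75, 100_001)
w = np.stack([t - 0.5, 1.5 - 2 * t, t], axis=1)

print(w.mean(axis=0))   # expected weights ~[0.125, 0.25, 0.625]
```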
Since it didn't give the same answer before, this means we're solving a different set of equations. Different equations means different information.
The state of information that I use in the post is different because we have no knowledge that the probabilities come from some physical process with different weights. No physical events at all are entangled with the probabilities. It's obvious why this is unintuitive - any die has some physical weights underlying it. So calling our unknown number "the roll of a die" is actually highly misleading. My bad on that one - it looks like christopherj's concerns about the example being unrealistic were totally legit.
However, that doesn't mean that we'll never see our maximum entropy result in the physical world. Suppose that I started out not knowing that the expected roll of the die was 5/2. Then someone offers to repeat not just "rolling the die," but the whole experiment with an equivalent state of knowledge, many times over. After 1000 repeats of experiments with the same state of knowledge, they check: if the average roll was really close to 5/2, they'll stop, but if it wasn't, they'll try again until it is.
Since the probability of each face given my state of knowledge is 1/3, I expect a repeat of many experiments with the same state of knowledge to be like rolling a fair die many times, then only keeping ensembles with average 5/2. Then, if I look at this ensemble, which represents your state of knowledge except for happening to have average roll 5/2, I will see a maximum entropy distribution of rolls. (proof left as an exercise :P ) This physical process encapsulates the information stated in the post, in a way that rolling a die whose weights are different physical events does not.
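As a sketch of that thought experiment (assuming a 3-sided die with faces 1 to 3, and using small ensembles so the conditioning has a workable acceptance rate; the agreement improves as the ensemble size grows):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 12, 200_000

# Roll many ensembles from the fair (1/3 each) state of knowledge...
rolls = rng.integers(1, 4, size=(trials, n))
# ...and keep only the ensembles whose average roll is exactly 5/2.
kept = rolls[rolls.sum(axis=1) == 30]

counts = np.bincount(kept.ravel(), minlength=4)[1:]
print(counts / counts.sum())   # close to ~[0.116, 0.268, 0.616], the
                               # maxent distribution with mean 5/2
```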