Oscar_Cunningham comments on Putting in the Numbers - Less Wrong
Comments (32)
I've never been able to understand this.
Surely the correct course of action in this situation is to have a prior for the possible biases of the die, say the uniform prior on {x in R^4 : x1+x2+x3+x4=1, xi>=0 for all i}, and then update Bayesianly by restricting to the subset where the average is 3. Then to find the distribution for the outcomes of the die we integrate over this.
I'm pretty sure this doesn't give the same distribution as maxent, and I can't think of a prior that would. (I think my suggested prior gives the "straight lines" distribution that you wanted!)
So when is each of these procedures appropriate? I agree that maxent is a good way to assign priors, but I think that when you have data you should use it by updating, rather than by remaking your prior.
I don't think there's anything that says a maximum entropy prior is what you get if you construct a maximum entropy prior for a weaker subset of assumptions, and then update based on the complement.
EDIT: Jaynes elaborates on the relationship between Bayes and maximum entropy priors here (warning, pdf).
Okay, I have an answer for you.
In doing the Bayesian updating method, you assumed that the die has some weights, and that the die's possible weightings are events in event-space. This assumption is a very good one for a physical die, and the nature of the assumption is most obvious from the Kolmogorov and Savage perspectives.
Then, when translating the information that the expected roll was 5/2, you translated it as "the sum of weight 1 + 2 * weight 2 + 3 * weight 3 is equal to 5/2." (Note that this is not necessary! If you're symmetrically uncertain about the weights, the expected roll can still be 5/2. Frequentist intuitions are so sneaky :P )
What does the maximum entropy principle say if we give it that same information? The exact same answer you got! It maximizes entropy over those different possibilities in event-space, and the constraint that the weighted sum of the weights is 5/2 is interpreted in just the way you'd expect, leaving a straight line of possibilities in event-space with equal weights. Thus, maxent gives the same answer as Bayes' theorem for this question, and it certainly seems like it did so given the same information you used for Bayes' theorem.
Since it didn't give the same answer before, this means we're solving a different set of equations. Different equations means different information.
The state of information that I use in the post is different because we have no knowledge that the probabilities come from some physical process with different weights. No physical events at all are entangled with the probabilities. It's obvious why this is unintuitive - any die has some physical weights underlying it. So calling our unknown number "the roll of a die" is actually highly misleading. My bad on that one - it looks like christopherj's concerns about the example being unrealistic were totally legit.
However, that doesn't mean that we'll never see our maximum entropy result in the physical world. Suppose that I started out not knowing that the expected roll of the die was 5/2. Then someone offered to repeat not just "rolling the die," but experiments with equivalent states of knowledge, many times. After 1000 repeats of experiments with the same state of knowledge, if the average roll is really close to 5/2, they'll stop; if it isn't, they'll try again until it is.
Since the probability given my state of knowledge is 1/3, I expect a repeat of many experiments with the same state of knowledge to be like rolling a fair die many times, then only keeping ensembles with average 5/2. Then, if I look at this ensemble that represents your state of knowledge except for happening to have average roll 5/2, I will see a maximum entropy distribution of rolls. (proof left as an exercise :P ) This physical process encapsulates the information stated in the post, in a way that rolling a die whose weights are different physical events does not.
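For anyone who wants to try the exercise numerically, here is a rough sketch. Rejection sampling 1000 fair rolls until they average 5/2 would essentially never finish, so instead this sums over every way the face counts can come out with mean exactly 5/2 and weights each by its number of roll sequences. Conditioning on exactly 5/2 (rather than "really close to"), the three-sided die, and the function names are my assumptions, not anything from the post.

```python
from fractions import Fraction
from math import factorial, sqrt

def conditioned_face_frequencies(n_rolls=1000):
    """Expected share of faces 1, 2, 3 in n_rolls fair rolls, given mean exactly 5/2."""
    assert n_rolls % 2 == 0, "need an even number of rolls for a mean of exactly 5/2"
    half = n_rolls // 2
    fact_n = factorial(n_rolls)
    total = 0
    weighted = [0, 0, 0]
    for n3 in range(n_rolls + 1):
        # n1 + n2 + n3 = N and n1 + 2*n2 + 3*n3 = 5N/2 fix n1 and n2 once n3 is chosen.
        n2 = 3 * half - 2 * n3
        n1 = n_rolls - n2 - n3
        if n1 < 0 or n2 < 0:
            continue
        # Every sequence of a fair die is equally likely, so the weight of a
        # count vector is just its multinomial coefficient.
        ways = fact_n // (factorial(n1) * factorial(n2) * factorial(n3))
        total += ways
        for i, n in enumerate((n1, n2, n3)):
            weighted[i] += ways * n
    return [float(Fraction(w, total * n_rolls)) for w in weighted]

def maxent_three_sided(mean=2.5):
    """Maximum entropy distribution on faces 1, 2, 3 with the given mean (p_i proportional to r**i)."""
    # The mean constraint reduces to (3 - mean) r^2 + (2 - mean) r + (1 - mean) = 0.
    a, b, c = 3 - mean, 2 - mean, 1 - mean
    r = (-b + sqrt(b * b - 4 * a * c)) / (2 * a)
    z = 1 + r + r * r
    return [1 / z, r / z, r * r / z]

if __name__ == "__main__":
    print("conditioned ensemble:", conditioned_face_frequencies(1000))
    print("maxent:              ", maxent_three_sided(2.5))
```

By exchangeability, the expected share of a face in the conditioned ensemble is also the probability that any single roll in it shows that face, so the two printed lines are directly comparable.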
Wanna check? :)
I'll work in the easier case 1 dimension down. Say we have a die which rolls a 1, 2 or a 3, and we know it averages to 5/2.
Then {x in R^3 : x1+x2+x3=1, xi>=0 for all i} is an equilateral triangle, which we put a uniform distribution on. Then the points where the mean roll is 5/2 lie on a straight line from (1/4,0,3/4) to (0,1/2,1/2). By some kind of linearity argument the averages over this line (with the uniform weighting from our uniform prior) are just the average of (1/4,0,3/4) and (0,1/2,1/2). This gives (1/8,2/8,5/8).
On the other hand we know that maxent gives a geometric sequence. But (1/8,2/8,5/8) isn't geometric.
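A quick way to check both numbers, as a sketch: the "Bayes" answer below just averages the two endpoints of the segment (taking the uniform weighting along the line as given, which is itself an assumption about how to condition on a measure-zero set), and the maxent answer solves for the ratio r of the geometric sequence. The function names are mine.

```python
from fractions import Fraction as F
from math import sqrt

def endpoint_average(a, b):
    """Mean of a uniform distribution along the straight segment from a to b."""
    return tuple((x + y) / 2 for x, y in zip(a, b))

def maxent_three_sided(mean):
    """Maxent distribution on faces 1, 2, 3 with the given mean: p_i proportional to r**i."""
    # (1 + 2r + 3r^2) / (1 + r + r^2) = mean  =>  (3-mean) r^2 + (2-mean) r + (1-mean) = 0
    a, b, c = 3 - mean, 2 - mean, 1 - mean
    r = (-b + sqrt(b * b - 4 * a * c)) / (2 * a)
    z = 1 + r + r * r
    return (1 / z, r / z, r * r / z)

def is_geometric(p, tol=1e-12):
    """A three-term sequence is geometric exactly when p2^2 == p1 * p3."""
    return abs(float(p[1]) ** 2 - float(p[0]) * float(p[2])) < tol

bayes = endpoint_average((F(1, 4), F(0), F(3, 4)), (F(0), F(1, 2), F(1, 2)))
maxent = maxent_three_sided(2.5)

if __name__ == "__main__":
    print("endpoint average:", bayes, "geometric?", is_geometric(bayes))
    print("maxent:          ", maxent, "geometric?", is_geometric(maxent))
```

For mean 5/2 the quadratic is r^2 - r - 3 = 0, so r = (1 + sqrt(13))/2, and the maxent probabilities come out near (0.116, 0.268, 0.616) rather than (0.125, 0.25, 0.625).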
This may help. Abstract:
Thanks, that's interesting. But if we know that the expected roll is 2, then that must lie somewhere on the straight line between (1/2,0,1/2) and (0,1,0). This doesn't mean we should average those to claim that the correct distribution given that information is (1/4,1/2,1/4), rather than the uniform distribution!
I'll think about this some more - Cyan's link also goes into the problem a bit.
I know a handful of people who have built / are building PhDs on dealing with scoring approximation rules based on how they handle distributions that are sampled from all possible distributions that satisfy some characteristics. The impression I get is that the rabbit hole is pretty deep; I haven't read through it all but here are some places to start: Montiel: [1] [2], Hammond: [1] [2].
It seems to me that P1=(1/4,1/2,1/4) is "more robust" than P2=(1/3,1/3,1/3) in some way- suppose you remove y from x1 and add it proportionally to x2 and x3. The result would be closer to 2 for P1 than for P2, especially if y is a fraction of x1 rather than a flat amount.
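Here is a small sketch of that perturbation, in case anyone wants to poke at it. I'm reading "add it proportionally" as splitting y between x2 and x3 in proportion to their current sizes; that reading and the helper names are assumptions.

```python
from fractions import Fraction as F

def mean_roll(p):
    """Expected roll of a die with faces 1, 2, 3 and probabilities p."""
    return sum(face * prob for face, prob in zip((1, 2, 3), p))

def shift_mass_from_first(p, y):
    """Remove y from x1 and hand it to x2 and x3 in proportion to their sizes."""
    x1, x2, x3 = p
    rest = x2 + x3
    return (x1 - y, x2 + y * x2 / rest, x3 + y * x3 / rest)

P1 = (F(1, 4), F(1, 2), F(1, 4))
P2 = (F(1, 3), F(1, 3), F(1, 3))

if __name__ == "__main__":
    flat = F(1, 10)   # a flat amount removed from x1
    frac = F(1, 10)   # or a fraction of x1
    for name, p in (("P1", P1), ("P2", P2)):
        print(name, "flat:", mean_roll(shift_mass_from_first(p, flat)),
              "fractional:", mean_roll(shift_mass_from_first(p, frac * p[0])))
```

Under this reading the mean moves by 4y/3 for P1 and by 3y/2 for P2, and the gap widens when y is a fraction of x1, since x1 is smaller in P1.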
But it also seems to me like having a uniform distribution across all possible distributions is kind of silly. Do I really think that (1/10,4/5,1/10) is just as likely as (1/3,1/3,1/3)? I suspect it's possible to have a prior which results in maxent posteriors, but it might be the case that the prior depends on what sort of update you provide (i.e. it only works when you know the variance, and not when you know the mean) and it might not exist for some updates.
Well, "a uniform distribution across possible distributions" is kinda nonsense. There is a single correct distribution for our starting information, which is (1/3,1/3,1/3); the "distribution across possible distributions" is just a delta function there.
Any non-delta "distribution over distributions" is laden with some model of what's going on in the die, and is a distribution over parts of that model. Maybe there's some subtle effect of singling out the complete, uniform model rather than integrating over some ensemble.
Whoa, you think the only correct interpretation of "there's a die that returns 1, 2, or 3" is to be absolutely certain that it's fair? Or what do you think a delta function in the distribution space means?
(This will have effects, and they will not be subtle.)
One of the classic examples of this is three interpretations of "randomly select a point from a circle." You could do this by selecting an angle for a radius uniformly, then selecting a point on that radius uniformly along its length. Or you could do those two steps, and then select a point along the associated chord uniformly at random. Or you could select x and y uniformly at random in a square bounding the circle, and reject any point outside the circle. Only the last one will make all areas in the circle equally likely- the first method will make areas near the center more likely and the second method will make areas near the edge more likely (if I remember correctly).
But I think that it generally is possible to reach consensus on what criterion you want (such as "pick a method such that any area of equal size has equal probability of containing the point you select.") and then it's obvious what sort of method you want to use. (There's a non-rejection sampling way to get the equal area method for the circle, by the way.) And so you probably need to be clever about how you parameterize your distributions, and what priors you put on those parameters, and eventually you do have hyperparameters that functionally have no uncertainty. (This is, for example, seeing a uniform as a beta(1,1), where you don't have a distribution on the 1s.) But I think this is a reasonable way to go about things.
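The circle methods are easy to compare by simulation. The sketch below samples from each one on the unit disk and reports how often the point lands within distance 1/2 of the center, which should be 1/4 for an equal-area method. I'm assuming the "associated chord" is the chord perpendicular to the radius at the chosen point, and that the non-rejection method meant is the usual one of taking the square root of a uniform draw for the radius; all function names are mine.

```python
import math
import random

def polar_naive(rng):
    """Uniform angle, then uniform distance along that radius."""
    theta = rng.uniform(0, 2 * math.pi)
    r = rng.random()
    return r * math.cos(theta), r * math.sin(theta)

def chord(rng):
    """As polar_naive, then a uniform point on the chord perpendicular to the radius there."""
    theta = rng.uniform(0, 2 * math.pi)
    d = rng.random()
    half_length = math.sqrt(1 - d * d)
    s = rng.uniform(-half_length, half_length)
    # Point on the radius plus an offset along the perpendicular direction.
    return (d * math.cos(theta) - s * math.sin(theta),
            d * math.sin(theta) + s * math.cos(theta))

def rejection(rng):
    """Uniform point in the bounding square, rejected until it falls inside the disk."""
    while True:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            return x, y

def polar_sqrt(rng):
    """Non-rejection equal-area method: radius is the square root of a uniform draw."""
    theta = rng.uniform(0, 2 * math.pi)
    r = math.sqrt(rng.random())
    return r * math.cos(theta), r * math.sin(theta)

def inner_fraction(sampler, n=100_000, seed=0):
    """Share of sampled points within distance 1/2 of the center."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if math.hypot(*sampler(rng)) < 0.5)
    return hits / n

if __name__ == "__main__":
    for sampler in (polar_naive, chord, rejection, polar_sqrt):
        print(sampler.__name__, inner_fraction(sampler))
```

The inner disk has a quarter of the area, so a result above 1/4 means the method favors the center and a result below 1/4 means it favors the edge.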
In a separate comment, Kurros worries about cases with "no preferred parameterisation of the problem". I have the same worry as both of you, I think. I guess I'm less optimistic about the resolution. The parameterization seems like an empirical rabbit that Jaynes and other descendants of the Principle of Insufficient Reason are trying to pull out of an a priori hat. (See also Seidenfeld <pdf> section 3 on re-partitioning the sample space.)
I'd appreciate it if someone could assuage - or aggravate - this concern. Preferably without presuming quite as much probability and statistics knowledge as Seidenfeld does (that one went somewhat over my head, toward the end).
I haven't been able to follow this whole thread of conversation, but I think it's pretty clear you're talking about different things here.
I thought so too, which is why I asked him what he thought a delta function in the distribution space meant.
Right; but putting a delta function there means you're infinitely certain that's what it is, because you give probability 0 to all other possibilities.
Knowing that the die is completely biased, but not which side it is biased towards, would be represented by three delta functions, at (1,0,0), (0,1,0), and (0,0,1), each with a coefficient of (1/3). This is very different from the uniform case and the delta at (1/3,1/3,1/3) case, as you can see by calculating the posterior distribution for observing that the die rolled a 1.
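To make that calculation concrete, here is a sketch of the predictive distribution for the next roll under the three priors, before and after seeing a single 1. It treats the uniform prior over the triangle as a Dirichlet(1,1,1), which is the standard identification; the function names are mine.

```python
from fractions import Fraction as F

def predictive_from_mixture(components, observed=None):
    """Next-roll distribution for a prior made of weighted delta functions.

    components is a list of (weight, (p1, p2, p3)); observed is a face 1..3 or None.
    """
    if observed is not None:
        # Bayes: reweight each delta by the likelihood it gives the observed face.
        components = [(w * p[observed - 1], p) for w, p in components]
    total = sum(w for w, _ in components)
    return tuple(sum(w * p[i] for w, p in components) / total for i in range(3))

def predictive_from_dirichlet(alpha=(1, 1, 1), observed=None):
    """Next-roll distribution for a Dirichlet prior; (1, 1, 1) is uniform on the triangle."""
    counts = list(alpha)
    if observed is not None:
        counts[observed - 1] += 1
    total = sum(counts)
    return tuple(F(c, total) for c in counts)

third = F(1, 3)
fair_delta = [(F(1), (third, third, third))]
corner_deltas = [(third, (F(1), F(0), F(0))),
                 (third, (F(0), F(1), F(0))),
                 (third, (F(0), F(0), F(1)))]

if __name__ == "__main__":
    for obs in (None, 1):
        print("observed:", obs)
        print("  delta at fair:  ", predictive_from_mixture(fair_delta, obs))
        print("  uniform prior:  ", predictive_from_dirichlet((1, 1, 1), obs))
        print("  three corners:  ", predictive_from_mixture(corner_deltas, obs))
```

All three priors give (1/3, 1/3, 1/3) for the first roll. After observing a 1, the delta at the fair die stays put, the uniform prior moves to (1/2, 1/4, 1/4), and the three-corner prior jumps to (1, 0, 0).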
Okay, and you were just trying to make sure that Manfred knows that all this probability-of-distributions speech you're speaking isn't, as he seems to think, about the degree-of-belief-in-my-current-state-of-ignorance distribution for the first roll. Gotcha.
Okay... but do we agree that the degree-of-belief distribution for the first roll is (1/3, 1/3, 1/3), whether it's a fair die or a die completely biased in an unknown way?
Because I'm pretty sure that's what Manfred's talking about when he says
and I think him going on to say
was a mistake, because you were talking about different things.
EDIT:
Ah. Yes. Okay. I am literally saying only things that you know, aren't I. My bad.
It's not about if the die is fair - my state of information is fair. Of that it is okay to be certain. Also, I think I figured it out - see my recent reply to Oscar's parent comment.