After reading the relevant chapters of Jaynes' book I was quite amazed by the Principle of Maximum Entropy, a powerful method for choosing prior distributions. However, it immediately raised a large number of questions.

I have recently read two quite intriguing (and very well-written) papers by Jos Uffink on this matter:

Can the maximum entropy principle be explained as a consistency requirement?

The constraint rule of the maximum entropy principle

I was wondering what you think about the principle of maximum entropy and its justifications.


I am also a big fan of MEP; my PhD thesis was on a related topic. But I don't believe that MEP actually gets around the Fundamental Difficulty of Bayesian statistics, which is the subjectivity of the choice of the prior. It simply repackages the subjectivity in another form. To give a simple example, imagine you had a data set of 6-sided die outcomes:

4, 1, 4, 4, 3, 2, 5, 6, 4, 2, 3, ...

Now one thing you might do is calculate the mean of this data set. Let's say the mean is 4.5, as in the example described in Section 2 of the paper. Then we apply the MEP and we get a distribution. You might say: great, we've done statistics with no subjectivity!

I say, not so fast. You actually did do something subjective: you decided that the mean was the key statistic that should be taken into account. But why? Let's say instead of the mean, we counted the number of outcomes where X <= 2, and we found that 40/100 outcomes satisfied this criterion. The use of this statistic would result in an entirely different probability distribution: specifically, one in which P(X=1) = P(X=2) = 20%, while P(X=3) = P(X=4) = P(X=5) = P(X=6) = 15%. Alternatively, you could use BOTH these statistics, and get another distinct distribution. Indeed, this ability to combine statistical information from many distinct sources is in my view the strength of the method.
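Here is a minimal numerical sketch of the point (assuming numpy and scipy are available; the values 4.5 and 0.4 are the ones from above, and the `maxent` helper is just for the sketch). The same machinery is used in both runs; only the choice of statistic differs, and the two resulting distributions disagree.

```python
# Sketch: maximize entropy over die outcomes {1,...,6} subject to a chosen
# constraint, to see how the choice of statistic changes the distribution.
import numpy as np
from scipy.optimize import minimize

outcomes = np.arange(1, 7)

def maxent(constraint_fns, targets):
    """Maximize -sum(p*log p) subject to sum(p) = 1 and E[f_i(X)] = c_i."""
    def neg_entropy(p):
        return np.sum(p * np.log(p))
    cons = [{"type": "eq", "fun": lambda p: p.sum() - 1.0}]
    for f, c in zip(constraint_fns, targets):
        cons.append({"type": "eq",
                     "fun": lambda p, f=f, c=c: np.dot(p, f(outcomes)) - c})
    res = minimize(neg_entropy, np.full(6, 1 / 6), constraints=cons,
                   bounds=[(1e-9, 1.0)] * 6, method="SLSQP")
    return res.x

# Constraint 1: the mean, E[X] = 4.5.
p_mean = maxent([lambda x: x.astype(float)], [4.5])
# Constraint 2: the tail frequency, P(X <= 2) = 0.4.
p_tail = maxent([lambda x: (x <= 2).astype(float)], [0.4])

print(np.round(p_mean, 3))  # increasing, roughly exponential weights
print(np.round(p_tail, 3))  # [0.2, 0.2, 0.15, 0.15, 0.15, 0.15]
```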

Anyway, that's where the subjectivity comes in: from the choice of which statistics to use.

Notice that in the connection to thermodynamics, the "correct" statistic - average energy - is given to us by an external physical theory, not MEP.

Another issue with MEP is that it does not contain any intrinsic method to prevent overfitting. If you measure thousands of statistics about a data set, then you will get a very complex distribution, but if the data set has only a few hundred samples, then you've just overfit it.

I don't recall Jaynes ever using MEP to fit rules to data, though maybe I've just forgotten.

Have you done stuff with the minimum message length prior over rules?

Very interesting. I agree that the MEP does not solve everything (though Solomonoff induction does).

The use of the mean is a premise. That is, assuming you know the mean, the Maximum Entropy distribution is the correct distribution. If you know some other measure, then you can find the ME distribution that has that measure. If you don't know anything about the distribution, then the Maximum Entropy principle still works by giving you the flat prior. If this is over all reals, it's the "improper" prior, but it's still the correct one.

Another issue with MEP is that it does not contain any intrinsic method to prevent overfitting.

The MEP doesn't work if you assume you know statistics that you don't. Using a thousand statistics from a data sample should not be done because what you measure from the data sample aren't exactly the statistics from the true distribution. If you use the statistics that you do know, then the MEP is exactly the non-overfitting principle--it has exactly the information that you gave it.

The difficulty is in actually knowing any given statistic. Assuming you know one for the sake of actually getting anything done is where subjectivity comes in.

The MEP doesn't work if you assume you know statistics that you don't. Using a thousand statistics from a data sample should not be done because what you measure from the data sample aren't exactly the statistics from the true distribution.

Right, but what people use the MEP for in practice is to do statistical modeling: one has a data set of outcomes and attempts to build a statistical model of it. So you never know any statistic - even the mean - with absolute confidence.

In the phrase "the correct one", I have a problem with the word "the". See the discussion of the Bertrand paradox in krey's links.

For a specific example: I want to set a prior (an improper prior is okay!) for a constant in an Arrhenius equation for a chemical reaction. Oversimplified, the equation looks like "r = A * exp(T/T0)". Oversimplify more, and pretend that T0 is known but I know nothing about A. Do I set a flat prior on A? But what if I instead chose to write the equation as "r = exp(T/T0 + a)"? It's the same equation; A = exp(a). But the flat prior on a is not equivalent to the flat prior on A. Which do I choose?
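A quick numerical illustration of the mismatch (pure numpy; the range chosen for a is arbitrary, just to make the point): a flat prior on a = log(A) implies a density proportional to 1/A on A, which is not flat.

```python
# Sketch: a flat prior on a = log(A) is not a flat prior on A.
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(-3.0, 3.0, size=1_000_000)  # flat prior on a (range arbitrary)
A = np.exp(a)                                # implied prior on A

# Estimate the implied density of A well inside the range exp(-3)..exp(3).
hist, edges = np.histogram(A, bins=np.linspace(0.1, 10.0, 50), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# If p(A) were flat, hist would be constant; instead hist * A is roughly
# constant, i.e. p(A) is proportional to 1/A.
print(np.round(hist * centers, 2))
```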

You actually did do something subjective: you decided that the mean was the key statistic that should be taken into account.

I don't think your decision to calculate a mean has anything to do with Maximum Entropy methods. One way of calculating the maximum entropy distribution is by using whatever moments of your data you have available, such as a mean. Or maybe you have only saved moments of the data in your data collection efforts. You can calculate the maxent distribution with those moments as constraints on the resulting distribution. I didn't have call to use the method, but my recollection is that Jaynes described the method in general terms for whatever informational constraints you have.
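For reference, a sketch of that general recipe as I understand it: with moment constraints E[f_i(X)] = c_i, the maximum entropy distribution takes the exponential-family form p(x) proportional to exp(sum_i lambda_i f_i(x)), with the lambda_i chosen so the constraints hold. For a single mean constraint on a die this reduces to a one-dimensional root-find (assumes scipy; the bracket for the root-find is illustrative):

```python
# Sketch: exponential-family form of the maxent distribution under a single
# moment constraint E[X] = 4.5 on a die; solve for the Lagrange multiplier
# by one-dimensional root-finding.
import numpy as np
from scipy.optimize import brentq

x = np.arange(1, 7)

def implied_mean(lam):
    w = np.exp(lam * x)          # p(x) proportional to exp(lam * x)
    return np.dot(w / w.sum(), x)

# Choose lam so that the implied mean matches the constraint.
lam = brentq(lambda l: implied_mean(l) - 4.5, -5.0, 5.0)
p = np.exp(lam * x) / np.exp(lam * x).sum()
print(lam, np.round(p, 4))       # increasing weights with mean 4.5
```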

Another issue with MEP is that it does not contain any intrinsic method to prevent overfitting.

Some people have talked about calculating maxent distributions while taking the uncertainty of the moments into account, an uncertainty which would increase for the higher-order moments. I'm not sure if that would prevent the kind of thing you're worried about. In the usual case of calculating the maxent distribution from moments, I don't know that you get any different result than just using more moments as if they were accurate. Has anyone compared the two?
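One way such moment uncertainty is sometimes handled (a "soft-constraint" or regularized maxent; this is my guess at what is meant, and the sigma values below are purely illustrative) is to trade entropy off against squared deviation from the measured moments rather than enforcing them exactly:

```python
# Sketch: "soft" maxent -- penalize deviation from measured moments, scaled by
# an assumed uncertainty for each moment, instead of enforcing them exactly.
import numpy as np
from scipy.optimize import minimize

x = np.arange(1, 7)
features = [x.astype(float), (x ** 2).astype(float)]  # first two moments
measured = [4.5, 22.0]   # measured moment values (illustrative)
sigmas = [0.1, 2.0]      # assumed uncertainty of each measured moment

def objective(p):
    neg_entropy = np.sum(p * np.log(p))
    penalty = sum(((np.dot(p, f) - c) / s) ** 2
                  for f, c, s in zip(features, measured, sigmas))
    return neg_entropy + 0.5 * penalty

cons = [{"type": "eq", "fun": lambda p: p.sum() - 1.0}]
res = minimize(objective, np.full(6, 1 / 6), constraints=cons,
               bounds=[(1e-9, 1.0)] * 6, method="SLSQP")
print(np.round(res.x, 4))  # looser sigmas pull the result back toward uniform
```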

On the other hand, I'm not sure what overfitting means when you are assigning a probability distribution to represent your state of knowledge. To what do you think you've overfit?

Jaynes' book circulated for a while on the internet in draft form, and a number of chapters never made it to the final published version. One of those was a generalization of maximum entropy to dynamic systems, where he called the thing to be maximized "gauge", I believe. I think it was an exponential raised to a matrix, but it's pretty hazy for me at this point.

Did anything ever come of that?

I thought I saw a paper discussing it a while ago, published maybe around 2007.