The Principle of Maximum Entropy

krey

After reading the related chapters of Jaynes' book, I was fairly amazed by the Principle of Maximum Entropy, a powerful method for choosing prior distributions. However, it immediately raised a large number of questions.

I have recently read two quite intriguing (and very well-written) papers by Jos Uffink on this matter:

Can the maximum entropy principle be explained as a consistency requirement?

The constraint rule of the maximum entropy principle

I was wondering what you think about the principle of maximum entropy and its justifications.

I am also a big fan of MEP; my PhD thesis was on a related topic. But I don't believe that MEP actually gets around the Fundamental Difficulty of Bayesian statistics, which is the subjectivity of the choice of the prior. It simply repackages the subjectivity in another form. To give a simple example, imagine you had a data set of outcomes from a 6-sided die:

4, 1, 4, 4, 3, 2, 5, 6, 4, 2, 3, ...

Now one thing you might do is calculate the mean of this data set. Let's say the mean is 4.5, as in the example described in Section 2 of the paper. Then we apply the MEP and we get a distribution. You might say: great, we've done statistics with no subjectivity!

I say, not so fast. You actually did do something subjective: you decided that the mean was the key statistic that should be taken into account. But why? Let's say that instead of the mean, we counted the number of outcomes where X <= 2, and we found that 40 / 100 outcomes satisfied this criterion. Using this statistic would result in an entirely different probability distribution, specifically one in which P(X=1) = P(X=2) = 20%, while P(X=3) = P(X=4) = P(X=5) = P(X=6) = 15%. Alternatively, you could use BOTH these statistics and get yet another distinct distribution. Indeed, this ability to combine statistical information from many distinct sources is, in my view, the strength of the method.
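To make the comparison concrete, here is a minimal sketch of the two calculations, assuming a fair-looking die with faces 1 to 6 and nothing else known. The function names (`maxent_given_mean`, `maxent_given_low_mass`) and the bisection search are my own illustration, not anything from Jaynes or the papers. For a mean constraint the maxent solution has the form p_i proportional to exp(lam * i), so the only work is finding lam.

```python
import math

FACES = [1, 2, 3, 4, 5, 6]


def maxent_given_mean(target_mean, lo=-50.0, hi=50.0, iters=200):
    """Maxent distribution on FACES with a fixed mean.

    The solution is p_i proportional to exp(lam * i). The mean is
    increasing in lam, so lam can be found by bisection.
    """
    def dist(lam):
        # Subtract the largest exponent for numerical stability.
        m = max(lam * x for x in FACES)
        w = [math.exp(lam * x - m) for x in FACES]
        z = sum(w)
        return [wi / z for wi in w]

    def mean(p):
        return sum(x * pi for x, pi in zip(FACES, p))

    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(dist(mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)


def maxent_given_low_mass(mass_low):
    """Maxent distribution given only P(X <= 2) = mass_low.

    Nothing distinguishes faces within each group, so the mass is
    spread uniformly over {1, 2} and over {3, 4, 5, 6}.
    """
    return [mass_low / 2] * 2 + [(1 - mass_low) / 4] * 4


p_mean = maxent_given_mean(4.5)
p_low = maxent_given_low_mass(0.4)
print([round(p, 4) for p in p_mean])
print([round(p, 4) for p in p_low])
```

The first distribution rises geometrically toward 6, while the second is flat within each group (the 20% / 15% split above). Same data, different chosen statistic, different answer.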

Anyway, that's where the subjectivity comes in: from the choice of which statistics to use.

Notice that in the connection to thermodynamics, the "correct" statistic - average energy - is given to us by an external physical theory, not MEP.

Another issue with MEP is that it does not contain any intrinsic method to prevent overfitting. If you measure thousands of statistics about a data set, then you will get a very complex distribution, but if the data set has only a few hundred samples, then you've just overfit it.
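A toy version of this, as a sketch under my own assumptions: take the eleven die outcomes visible above as the whole sample, and use one indicator statistic per face as the constraints. The constraints P(X = k) = freq_k then leave exactly one feasible distribution, so "maximizing entropy" simply hands back the sample histogram, noise included. The variable names are hypothetical.

```python
import math
from collections import Counter

# The visible prefix of the die data above, treated here as a tiny sample.
rolls = [4, 1, 4, 4, 3, 2, 5, 6, 4, 2, 3]


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability outcomes."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


counts = Counter(rolls)

# One constraint per face pins the distribution to the empirical frequencies,
# so the maxent "solution" is just the histogram of the sample.
p_saturated = [counts[k] / len(rolls) for k in range(1, 7)]
p_uniform = [1 / 6] * 6

print([round(p, 3) for p in p_saturated])
print(entropy(p_saturated), entropy(p_uniform))
```

With as many statistics as degrees of freedom, the method has no room left to smooth anything; whatever protection against overfitting you want has to come from how you pick the constraints.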

Very interesting. I agree that the MEP does not solve everything (though Solomonoff induction does).

The use of the mean is a premise. That is, assuming you know the mean, the Maximum Entropy distribution is the correct distribution. If you know some other measure, then you can find the ME distribution that has that measure. If you don't know anything about the distribution, then the Maximum Entropy principle still works by giving you the flat prior. If this is over all reals, it's the "improper" prior, but it's still the correct one.
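For reference, the standard form of this for a finite set of outcomes x_1, ..., x_n can be written as below. The notation (f_k for the constrained statistics, F_k for their known values, lambda_k for the Lagrange multipliers) is my own choice for illustration. With no constraints at all (m = 0) the exponent vanishes and the result is the flat distribution; over all reals the same expression cannot be normalized, which is the "improper" case.

```latex
\max_{p}\; H(p) = -\sum_{i=1}^{n} p_i \log p_i
\quad \text{subject to} \quad
\sum_{i=1}^{n} p_i = 1, \qquad
\sum_{i=1}^{n} p_i f_k(x_i) = F_k \quad (k = 1, \dots, m)

p_i = \frac{1}{Z(\lambda)} \exp\!\Big(-\sum_{k=1}^{m} \lambda_k f_k(x_i)\Big),
\qquad
Z(\lambda) = \sum_{i=1}^{n} \exp\!\Big(-\sum_{k=1}^{m} \lambda_k f_k(x_i)\Big)

m = 0 \;\Longrightarrow\; p_i = \frac{1}{n}
```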

Another iss

... (read more)

2 · Manfred · 14y

I don't recall Jaynes ever using MEP to put rules to data, though maybe I've just forgotten. Have you done stuff with the minimum message length prior over rules?

0 · buybuydandavis · 14y

I don't think your decision to calculate a mean has anything to do with Maximum Entropy methods. One way of calculating the maximum entropy distribution is to use whatever moments of your data you have available, such as a mean. Or maybe you have saved only moments of the data in your data collection efforts. You can calculate the maxent distribution with those moments as constraints on the resulting distribution. I didn't have call to use the method, but my recollection is that Jaynes described it in general terms, for whatever informational constraints you have. [...]

Some people have talked about calculating maxent distributions that take the uncertainty of the moments into account, an uncertainty that would increase for the higher-order moments. I'm not sure if that would prevent the kind of thing you're worried about. In the usual case, calculating the maxent distribution from moments, I don't know that you get any different result than by just using more moments as if they were accurate. Has anyone compared the two?

On the other hand, I'm not sure what overfitting means when you are assigning a probability distribution to represent your state of knowledge. To what do you think you've overfit?
