I am also a big fan of MEP; my PhD thesis was on a related topic. But I don't believe that MEP actually gets around the Fundamental Difficulty of Bayesian statistics, which is the subjectivity of the choice of the prior. It simply repackages the subjectivity in another form. To give a simple example, imagine you had a data set of 6-sided die outcomes:
4, 1, 4, 4, 3, 2, 5, 6, 4, 2, 3, ...
Now one thing you might do is calculate the mean of this data set. Let's say the mean is 4.5, as in the example described in Section 2 of the paper. Then we apply the MEP and we get a distribution. You might say: great, we've done statistics with no subjectivity!
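For concreteness, here is a minimal sketch of that calculation, assuming the only constraint is that the mean equals 4.5. It relies on the standard result that the maximum-entropy distribution under a mean constraint has the form p(x) proportional to exp(lam * x). The function name, the bisection bounds, and the iteration count are illustrative choices, not anything taken from the paper.

```python
import math

def maxent_die_with_mean(target_mean, faces=(1, 2, 3, 4, 5, 6)):
    """Maximum-entropy distribution over `faces` with the given mean.

    Assumes target_mean lies strictly between min(faces) and max(faces).
    The solution has the form p(x) proportional to exp(lam * x); the mean
    is increasing in lam, so lam can be found by bisection.
    """
    def dist(lam):
        weights = [math.exp(lam * x) for x in faces]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(lam):
        return sum(x * p for x, p in zip(faces, dist(lam)))

    lo, hi = -50.0, 50.0  # illustrative bracket, wide enough for a die
    for _ in range(200):
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)

if __name__ == "__main__":
    for face, p in zip(range(1, 7), maxent_die_with_mean(4.5)):
        print(face, round(p, 4))
```

With a mean of 4.5 the probabilities rise geometrically from face 1 to face 6, and a mean of 3.5 gives back the uniform distribution.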
I say, not so fast. You actually did do something subjective: you decided that the mean was the key statistic that should be taken into account. But why? Let's say instead of the mean, we counted the number of outcomes where X <= 2, and we found that 40 / 100 outcomes satisfied this criterion. The use of this statistic would result in an entirely different probability distribution, specifically one in which P(X==1) = P(X==2) = 20%, while P(X==3) = P(X==4) = P(X==5) = P(X==6) = 15%. Alternatively, you could use BOTH these statistics, and get another distinct distribution. Indeed, this ability to combine statistical information from many distinct sources is in my view the strength of the method.
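The 20% / 15% split can be checked with a short sketch, assuming the only constraint is P(X <= 2) = 40/100. When the sole constraint fixes the total probability of an event, entropy is maximized by spreading that probability evenly inside the event and the remainder evenly outside it. The function name here is hypothetical.

```python
def maxent_die_with_event_prob(event_faces, event_prob, faces=(1, 2, 3, 4, 5, 6)):
    """Maximum-entropy distribution when only P(X in event_faces) is fixed.

    Assumes event_faces is a non-empty proper subset of faces. The
    probability mass is uniform within the event and uniform outside it.
    """
    inside = [x for x in faces if x in event_faces]
    outside = [x for x in faces if x not in event_faces]
    p_in = event_prob / len(inside)
    p_out = (1.0 - event_prob) / len(outside)
    return {x: (p_in if x in event_faces else p_out) for x in faces}

if __name__ == "__main__":
    dist = maxent_die_with_event_prob({1, 2}, 40 / 100)
    for face, p in dist.items():
        print(face, round(p, 4))
```

Unlike the mean-constrained case, this distribution is flat within each group, which illustrates how much the result depends on which statistic is chosen.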
Anyway, that's where the subjectivity comes in: from the choice of which statistics to use.
Notice that in the connection to thermodynamics, the "correct" statistic - average energy - is given to us by an external physical theory, not MEP.
Another issue with MEP is that it does not contain any intrinsic method to prevent overfitting. If you measure thousands of statistics about a data set, then you will get a very complex distribution, but if the data set has only a few hundred samples, then you've just overfit it.
Very interesting. I agree that the MEP does not solve everything (though Solomonoff induction does).
The use of the mean is a premise. That is, assuming you know the mean, the Maximum Entropy distribution is the correct distribution. If you know some other measure, then you can find the ME distribution that has that measure. If you don't know anything about the distribution, then the Maximum Entropy principle still works by giving you the flat prior. If this is over all reals, it's the "improper" prior, but it's still the correct one.
...Another iss
After having read the related chapters of Jaynes' book, I was fairly amazed by the Principle of Maximum Entropy, a powerful method for choosing prior distributions. However, it immediately raised a large number of questions.
I have recently read two quite intriguing (and very well-written) papers by Jos Uffink on this matter:
Can the maximum entropy principle be explained as a consistency requirement?
The constraint rule of the maximum entropy principle
I was wondering what you think about the principle of maximum entropy and its justifications.