Related to: Occam's Razor
If the Razor is defined as "On average, a simpler hypothesis should be assigned a higher prior probability than a more complex hypothesis," or, stated another way, "As the complexity of your hypotheses goes to infinity, their probability goes to zero," then it can be proven from a few assumptions.
1) The hypotheses are described by a language that has a finite number of different words, and each hypothesis is expressed by a finite number of these words. Note that this allows for natural languages such as English, but also for computer programming languages and so on. The proof in this post is valid in all of these cases.
2) A complexity measure is assigned to hypotheses in such a way that some hypotheses are as simple as possible; these are assigned the complexity measure of 1, while hypotheses considered to be more complex are assigned higher integer values such as 2, 3, 4, and so on. Note that apart from this, we can define the complexity measure in any way we like: for example, as the number of words used by the hypothesis, or as the length of the shortest program that can output the hypothesis in a given programming language (e.g. the hypotheses might be stated in English but their complexity measured according to a programming language; this is the approach Eliezer Yudkowsky takes in the linked article). Many other definitions are possible. The proof is valid for every definition that satisfies the conditions laid out here.
3) The complexity measure should also be defined in such a way that only a finite number of hypotheses are assigned the measure of 1, a finite number the measure of 2, a finite number the measure of 3, and so on. Note that this condition is not difficult to satisfy; it holds for both of the definitions mentioned in condition 2, and in fact for any reasonable definition of simplicity and complexity. The proof would not be valid without this condition, precisely because if simplicity were understood in a way that allowed infinitely many hypotheses of minimal complexity, the Razor would not hold for that understanding of simplicity.
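To fix notation for the proof that follows (a summary of mine, with one simplifying assumption: that at least one hypothesis of each complexity exists, so that each average is defined), write $n_k$ for the number of hypotheses assigned complexity $k$ and $x_k$ for the average prior probability of those hypotheses. Condition 3 says that every $n_k$ is finite, and the "limit" form of the Razor quoted at the top is the claim that

$$x_k \to 0 \quad \text{as} \quad k \to \infty.$$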
The Razor follows of necessity from these three conditions. For any data to be explained, there will in general be infinitely many mutually exclusive hypotheses that could fit the data. Suppose we assign prior probabilities to all of these hypotheses. Given condition 3, it will be possible to find the average probability of the hypotheses of complexity 1 (call it x1), the average probability of the hypotheses of complexity 2 (call it x2), the average probability of the hypotheses of complexity 3 (call it x3), and so on. Now consider the infinite sum "x1 + x2 + x3 + ...". Since all of these values are positive (and non-zero, since zero is not a probability), either the sum converges to a positive value or it diverges to positive infinity. In fact it converges, to a value no greater than 1: if we multiplied each term of the series by the number of hypotheses with the corresponding complexity, the resulting series would sum to the total prior probability of all our mutually exclusive hypotheses, which probability theory requires to be no more than 1 (and exactly 1 if the hypotheses are exhaustive); and since each of those multipliers is at least 1, the original series is bounded by this one.
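In symbols, using the notation above (and again assuming each complexity class is non-empty): because the hypotheses are mutually exclusive, their priors sum to at most 1, so

$$\sum_{k=1}^{\infty} n_k x_k \le 1 \quad\text{and}\quad n_k \ge 1 \;\Longrightarrow\; \sum_{k=1}^{\infty} x_k \le \sum_{k=1}^{\infty} n_k x_k \le 1.$$

The series of averages therefore converges, and in particular its terms must satisfy $x_k \to 0$ as $k \to \infty$, which is already the "limit" form of the Razor stated at the top.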
Now, x1 is a positive real number. So in order for this series to converge, there can be only a finite number of later terms in the series equal to or greater than x1. There will therefore be some complexity value, y1, such that all hypotheses with a complexity value greater than y1 have an average probability of less than x1. Likewise for x2: there will be some complexity value y2 such that all hypotheses with a complexity value greater than y2 have an average probability of less than x2. Leaving the full derivation for the reader (a sketch is given below), it also follows that there is some complexity value z1 such that all hypotheses with a complexity value greater than z1 have a lower probability than any hypothesis with a complexity value of 1, some other complexity value z2 such that all hypotheses with a complexity value greater than z2 have a lower probability than any hypothesis of complexity value 2, and so on.
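One way to fill in the step left to the reader (my sketch, not spelled out in the post): let $m_1 > 0$ be the smallest prior probability among the finitely many hypotheses of complexity 1; it is positive because zero is not a probability. Any single hypothesis of complexity $k$ has a prior no greater than the combined prior of its complexity class,

$$\Pr(H) \le n_k x_k,$$

and since $\sum_k n_k x_k$ converges, $n_k x_k \to 0$. So there is some $z_1$ with $n_k x_k < m_1$ for all $k > z_1$, meaning every hypothesis of complexity greater than $z_1$ has a lower prior than any hypothesis of complexity 1. Replacing $m_1$ with the minimum prior among the complexity-2 hypotheses gives $z_2$, and so on.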
From this it is clear that, on average and in the limit as complexity tends to infinity, hypotheses with a greater complexity value have a lower prior probability, which is what our definition of the Razor asserts.
N.B. I have edited the beginning and end of the post to clarify the meaning of the theorem, according to some of the comments. However, I didn't remove anything because it would make the comments difficult to understand for later readers.
I'm completely flummoxed by the level of discussion here in the comments to Unknowns's post. When I wrote a post on logic and most commenters confused truth and provability, that was understandable because not everyone can be expected to know mathematical logic. But here we see people who don't know how to sum or reorder infinite series, don't know what a uniform distribution is, and talk about "the 1/infinity kind of zero". This is a rude wakeup call. If we want to discuss issues like Bayesianism, quantum mechanics or decision theory, we need to take every chance to fill the gaps in our understanding of math.
To answer your question: [0,1] does have the same cardinality as all reals, so in the set-theoretic sense they're equivalent. But they are more than just sets: they come equipped with an additional structure, a "measure". A probability distribution can only be defined as uniform with regard to some measure. The canonical measure of the whole [0,1] is 1, so you can set up a uniform distribution that says the probability of each measurable subset is the measure of that subset (alternatively, the integral of the constant function f(x)=1 over that subset). But the canonical measure of the whole real line is infinite, so you cannot build a uniform distribution with respect to that.
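Concretely (a standard example, not specific to this thread): with Lebesgue measure $\lambda$ on $[0,1]$, the uniform distribution is

$$\Pr(A) = \lambda(A) = \int_A 1\,dx, \qquad \text{e.g. } \Pr([0.2, 0.5]) = 0.3,$$

whereas a "uniform" density on the whole real line would have to be some constant $c$ with $\int_{-\infty}^{\infty} c\,dx = 1$, and no constant works: the integral is $0$ if $c = 0$ and infinite otherwise.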
After you're comfortable with the above idea, we can add another wrinkle: even though we cannot set up a uniform distribution over all reals, we can in some situations use a uniform prior over all reals. Such things are called improper priors and rely on the lucky fact that the arithmetic of Bayesian updating doesn't require the integral of the prior to be 1, so in well-behaved situations even an "improper" prior distribution can give rise to a "proper" posterior distribution that integrates to 1.
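A standard textbook illustration (my example, assuming a normal likelihood with known variance $\sigma^2$): take the flat improper prior $p(\mu) \propto 1$ over all reals. After observing $x_1, \dots, x_n$, Bayes' rule gives

$$p(\mu \mid x_1, \dots, x_n) \;\propto\; p(\mu) \prod_{i=1}^{n} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) \;\propto\; \exp\!\left(-\frac{n(\mu - \bar{x})^2}{2\sigma^2}\right),$$

which is proportional to a normal density in $\mu$ and so normalizes to a proper posterior, $\mu \mid x_1, \dots, x_n \sim \mathcal{N}(\bar{x}, \sigma^2/n)$, even though the prior itself has an infinite integral.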
Gosh, now I don't know whether to feel bad or not for asking that question.
But I guess 'no, it's not just cardinality that matters but measure' is a good answer. Is there any quick easy explanation of measure and its use in probability?
(I have yet to learn anything from Wikipedia on advanced math topics. You don't hear bad things about Wikipedia's math topics, but as Bjarne said of C++, complaints mean there are users.)