Another way of putting this is that all probability distributions over an infinite enumerated set are biased towards elements that occur earlier, regardless of the principle of enumeration.
Your proof can't be right because it doesn't use the concept of "complexity" in any non-trivial manner. If we replace "complexity" with "flubbity", so that there's only a finite number of hypotheses for any given "flubbity", your proof will still go through (see the sketch below).
For some actual work on justifying Occam's Razor, see Kevin T. Kelly's Ockham Efficiency results. Kelly unpacks "complexity" in a nontrivial way that isn't just description length.
ETA: see this earlier discussion about learning and simplicity; it's very relevant to your topic of interest.
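To make the "flubbity" substitution above concrete, here is a toy sketch of my own (an assumed prior and an arbitrary random labelling standing in for "flubbity", nothing from the commenter): any labelling that puts only finitely many hypotheses at each value supports the same averaging argument.

```python
# Toy numeric sketch of the "flubbity" substitution: the argument only uses the fact
# that finitely many hypotheses share each label, so an arbitrary labelling works too.
import random
random.seed(0)

M = 5000                                           # truncated hypothesis space for the demo
prior = [2.0 ** (-(n + 1)) for n in range(M)]      # some prior over hypotheses; sums to ~1
order = list(range(M))
random.shuffle(order)                              # an arbitrary labelling, unrelated to the prior

# Give "flubbity" f to the next f hypotheses in the shuffled order, so each
# flubbity value is shared by only finitely many hypotheses.
groups, f, i = {}, 1, 0
while i < M:
    groups[f] = order[i:i + f]
    i += f
    f += 1

avg = {f: sum(prior[h] for h in hs) / len(hs) for f, hs in groups.items()}
print(round(sum(avg.values()), 4), "<= 1")                  # the per-flubbity averages form a convergent series
print(len([g for g in avg if avg[g] >= 0.01]), "<= 100")    # at most 1/0.01 flubbity values can average >= 0.01
```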
This really doesn't seem to have formally proven anything, and so at least the title should not be "a proof of Occam's razor". And it doesn't seem like you've done anything more interesting or useful here than, say, Solomonoff induction, which can already be seen as a formalization of Occam's razor.
Maybe if you made your proof more rigorous and explained why you think the statement you're trying to prove is relevant to anyone (or should properly be called Occam's razor) then that would help.
To expand on cousin_it's point, this is way too weak because it allows for arbitrary encodings, like a complexity value of 1 for "The lady down the street is a witch; she did it."
And if you start with that complexity prior, it's difficult to see how evidence would rescue you from it.
I think the treatment of complexity is OK in this argument. However, I don't think it proves Occam's Razor. With the possible exception of cases within set theory, the complexity of an explanation of any phenomenon is always finite. (Counterexample if you disagree?) There is no finite integer which is 'almost' infinite.
So, given xn - the average probability among explanations of complexity n - it's quite possible for there to be a (finite) positive number of integers m > n for which xm > xn.
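A concrete toy assignment (my own numbers, not the commenter's) showing such a finite inversion:

```python
# Toy prior: one hypothesis of complexity 1, two of complexity 2, one of complexity 3,
# and the remaining mass of 0.2 spread over complexities greater than 3.
priors_by_complexity = {1: [0.5], 2: [0.05, 0.05], 3: [0.2]}
x = {k: sum(ps) / len(ps) for k, ps in priors_by_complexity.items()}
print(x)   # {1: 0.5, 2: 0.05, 3: 0.2} -- x3 > x2, yet the averages must still vanish in the limit
```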
It seems like what you're saying is: "If we have an infinite number of hypotheses, some of them must have infinitesimally small prior probability". That doesn't require a proof.
(I upvoted this article in spite of not learning much from it, since I love this kind of discussion. Why not read some papers about PAC learning, MDL, Algorithmic Info. Theory, or VC theory and compare/contrast for us the view of Occam implicit in each development?)
Clearly, as complexity goes to infinity the probability goes to 0. (That is, for any epsilon > 0 there exists n such that all hypotheses with complexity > n have probability less than epsilon, since otherwise there would be infinitely many hypotheses with probability > epsilon, and taking more than 1/epsilon of them would produce a collection of hypotheses with total probability greater than 1. A numeric sketch appears below.)
So as the post says, "on average" Occam's razor must be true. (As must any other Razor using a different measure of complexity, or flubbity, or whatever. However, message length seems ...
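A numeric sketch of that epsilon-argument under an assumed toy prior (my own, with k hypotheses at each complexity k, each given probability proportional to 2^(-k)):

```python
# For a toy prior (assumed for the demo), find the complexity threshold beyond which
# every individual hypothesis has probability below a given epsilon.
K = 60
weight = {k: k * 2.0 ** (-k) for k in range(1, K + 1)}   # total unnormalised weight at complexity k
Z = sum(weight.values())
p_one = {k: 2.0 ** (-k) / Z for k in weight}             # probability of a single complexity-k hypothesis

for eps in (1e-2, 1e-4, 1e-8):
    n = min(k for k in p_one if all(p_one[m] < eps for m in p_one if m > k))
    print(eps, "->", n)   # all hypotheses of complexity greater than n fall below eps
```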
The part about there being sets of infinitely many mutually exclusive hypotheses needs to be marked as an assumption, since I can construct languages where it does not hold. Also, you need to strengthen this assumption by saying that every hypothesis is part of at least one such set.
You also need to somehow get from "the probability of a statement decreases monotonically with its complexity" to "the probability of a statement decreases exponentially with its complexity".
Maybe this has already been considered, but has anyone noted that even without introducing complexity at all, you can at least say that some programs are more likely than others? There is simply no uniform distribution on a countably infinite set: no matter which probability measure you pick or how you label the set, there can be at most a finite number of programs with the same (nonzero) probability.
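For what it's worth, a minimal sketch of the counting fact behind this (my own illustration, nothing from the comment): at most floor(1/p) programs can share any given nonzero probability p, which is also why no uniform distribution on a countably infinite set can exist.

```python
# At most floor(1/p) programs can share the same nonzero probability p: if more did,
# their probabilities alone would sum to more than 1.
from fractions import Fraction

def max_sharers(p: Fraction) -> int:
    return int(Fraction(1, 1) / p)

for p in (Fraction(1, 2), Fraction(1, 10), Fraction(1, 10**6)):
    print(p, "->", max_sharers(p))   # 2, 10, 1000000 -- always finite, never "all of them"
```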
Hopefully there's someone still here after 10 years XD.
Oh look, if we define the complexity as "the date when the hypothesis was published", then I can say that the prior probability that our Earth stands on top of a whale, on top of a turtle, on top of an elephant is the highest, because this hypothesis is the oldest. And Occam's razor becomes "don't propose new hypotheses". Trinitrotrololol)
I find it funny that it works even in the continuous case: suppose we have a probability density defined on R^n (or any other set). Then whatever bijection F: R <--> R^n we apply, the integral of the probability density along that path has to converge, so p(F(x)) goes to zero faster than 1/x. :)
Also, look: suppose the "real" universe is a random point x from some infinite set X. Let's say we are considering a finite set of hypotheses H. The probability that a random hypothesis h ∈ H is the closest to x is 1/|H|. So the larger H is, the less likely it is that a
...I don't think this proves anything at a practical level, given that it requires us to deal with hypotheses of arbitrarily high complexity. It breaks down when dealing with finite values.
Suppose there is a finite complexity value W, beyond which no more complex hypotheses can actually be imagined by any being in our universe. Physics implies that there is such a W. Nothing in your argument implies that z1<W, though.
Now consider the infinite sum “x1 + x2 + x3…” Since all of these values are positive (and non-zero, since zero is not a probability), either the sum converges to a positive value, or it diverges to positive infinity. In fact, it will converge to a value less than 1, since if we had multiplied each term of the series by the number of hypotheses with the corresponding complexity, it would have converged to exactly 1—because probability theory demands that the sum of all the probabilities of all our mutually exclusive hypotheses should be exactly 1.
x1, x2,...
There does not appear to be any a priori ordering of probability by complexity value, and unless you assume this the conclusion is improperly drawn. If you assume it, the argument is circular.
If the issue I'm missing is that the conclusion should read:
hypotheses with a greater complexity value tend to have a lower prior probability
Then that would explain my confusion. It does make the conclusion pretty weak, though, as it's unclear whether infinitely many possibilities even exist. If they don't, the proof collapses.
I'd written out more reasoning, but I think this works ...
Hmmm... I'd say that the problem is formulated in a way that allows for some maths, but in the process loses some of its essence.
For example, the 'proof' only talks about the length of the hypothesis, because that can be easily quantified. However, that does not rule out very short pseudo-explanations like 'god did it' or 'it's magic'. To rule those out, in practice one would need to introduce some non-mathematical criterion for what language is acceptable -- but that would make it very hard to reason about mathematically.
As some commenters didn't like or couldn't understand the original phrasing of the theorem, I have added a phrase interpreting it. I didn't remove the original phrasing, because doing so would make the previous comments difficult to understand.
Occam's Razor is based on empirical observations.
You can't prove it - indeed, you can easily imagine worlds where it is not true.
Define the "simplicity" of a hypothesis as 1/"complexity". If you can construct a probability distribution based on the "simplicity" of a hypothesis, then, by the same argument as above, you can prove that you should prefer the hypothesis with less "simplicity".
Related to: Occam's Razor
If the Razor is defined as, “On average, a simpler hypothesis should be assigned a higher prior probability than a more complex hypothesis,” or stated in another way, "As the complexity of your hypotheses goes to infinity, their probability goes to zero," then it can be proven from a few assumptions.
1) The hypotheses are described by a language that has a finite number of different words, and each hypothesis is expressed by a finite number of these words. Note that this allows for natural languages such as English, but also for computer programming languages and so on. The proof in this post is valid in all of these cases.
2) A complexity measure is assigned to hypotheses in such a way that there are, or may be, some hypotheses which are as simple as possible; these are assigned the complexity measure of 1, while hypotheses considered more complex are assigned higher integer values such as 2, 3, 4, and so on. Note that apart from this, we can define the complexity measure in any way we like, for example as the number of words used by the hypothesis, or as the length of the shortest program which can output the hypothesis in a given programming language (e.g. the language of the hypotheses might be English but their simplicity measured according to a programming language; Eliezer Yudkowsky takes this approach in the linked article). Many other definitions would be possible. The proof is valid for every definition that satisfies the conditions laid out here.
3) The complexity measure should also be defined in such a way that there are a finite number of hypotheses given the measure of 1, a finite number given the measure of 2, a finite number given the measure of 3, and so on. Note that this condition is not difficult to satisfy; it would be satisfied by either of the definitions mentioned in condition 2, and in fact by any reasonable definition of simplicity and complexity. The proof would not be valid without this condition, precisely because if complexity were understood in such a way as to allow an infinite number of hypotheses to share a single complexity value, the Razor would not be valid for that understanding of complexity.
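To see that condition 3 does real work, here is a counterexample of my own construction (not from the post): if infinitely many hypotheses h1, h2, h3, … were allowed to share complexity 1 alongside a single hypothesis g of complexity 2, a perfectly coherent prior could still favour the more complex hypothesis over every simpler one.

```latex
p(h_n) = 2^{-(n+1)} \;\; (n = 1, 2, 3, \dots), \qquad p(g) = \tfrac{1}{2},
\qquad \sum_{n=1}^{\infty} 2^{-(n+1)} + \tfrac{1}{2} = 1,
\qquad p(g) = \tfrac{1}{2} > \tfrac{1}{4} \ge p(h_n) \text{ for every } n .
```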
The Razor follows of necessity from these three conditions. To explain any data, there will in general be infinitely many mutually exclusive hypotheses which could fit the data. Suppose we assign prior probabilities to all of these hypotheses. Given condition 3, it will be possible to find the average probability for hypotheses of complexity 1 (call it x1), the average probability for hypotheses of complexity 2 (call it x2), the average probability for hypotheses of complexity 3 (call it x3), and so on.

Now consider the infinite sum "x1 + x2 + x3…" Since all of these values are positive (and non-zero, since zero is not a probability), either the sum converges to a positive value, or it diverges to positive infinity. In fact, it converges to a value no greater than 1: if we had multiplied each term of the series by the number of hypotheses with the corresponding complexity, the result would have converged to exactly 1, because probability theory demands that the sum of the probabilities of all our mutually exclusive hypotheses be exactly 1.
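As a sanity check on this averaging step, here is a small numerical sketch with a toy prior of my own (not anything from the post): k hypotheses at each complexity value k, truncated at complexity 200 and renormalised.

```python
# Toy prior (assumed): k mutually exclusive hypotheses at complexity k, with
# unnormalised weights i * 2**(-k) for i = 1..k, truncated at complexity K and
# renormalised so that all probabilities sum to 1.
K = 200
weights = {k: [i * 2.0 ** (-k) for i in range(1, k + 1)] for k in range(1, K + 1)}
Z = sum(sum(ws) for ws in weights.values())                 # normalising constant
prior = {k: [w / Z for w in ws] for k, ws in weights.items()}

x = {k: sum(ps) / len(ps) for k, ps in prior.items()}       # x_k: average probability at complexity k
print(sum(len(prior[k]) * x[k] for k in prior))             # ~1.0: the probabilities of all hypotheses sum to 1
print(sum(x.values()))                                      # ~0.375 < 1: the series x1 + x2 + ... converges
print(x[1], x[10], x[50])                                   # the averages fall toward zero
```

With this toy prior the averages sum to about 0.375, while weighting each average by the number of hypotheses at that complexity recovers a total probability of 1, just as the argument above requires.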
Now, x1 is a positive real number. So in order for this series to converge, there can be only a finite number of later terms in the series equal to or greater than x1. There will therefore be some complexity value, y1, such that for every complexity value greater than y1, the average probability of hypotheses of that complexity is less than x1. Likewise for x2: there will be some complexity value y2 such that for every complexity value greater than y2, the average probability of hypotheses of that complexity is less than x2. Leaving the derivation for the reader, it also follows that there is some complexity value z1 such that all hypotheses with a complexity value greater than z1 have a lower probability than any hypothesis with a complexity value of 1, some other complexity value z2 such that all hypotheses with a complexity value greater than z2 have a lower probability than any hypothesis of complexity value 2, and so on.
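One way to fill in the derivation left to the reader (my own sketch, not the author's, arguing through the tail of the series of total rather than average probabilities): write N_k for the finite number of hypotheses of complexity k, and let m1 be the smallest probability among the finitely many hypotheses of complexity 1, which is positive since zero is not a probability. Then:

```latex
\sum_{k} N_k x_k \le 1
\;\Longrightarrow\;
\exists\, z_1 \text{ such that } \sum_{k > z_1} N_k x_k < m_1
\;\Longrightarrow\;
p(h) \le N_k x_k \le \sum_{j > z_1} N_j x_j < m_1
\quad \text{for every hypothesis } h \text{ of complexity } k > z_1 .
```

Repeating the argument with m2, the minimum over the complexity-2 hypotheses, gives z2, and so on.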
From this it is clear that on average, or as the complexity tends to infinity, hypotheses with a greater complexity value have a lower prior probability, which was our definition of the Razor.
N.B. I have edited the beginning and end of the post to clarify the meaning of the theorem, in response to some of the comments. However, I didn't remove anything, because doing so would make the comments difficult for later readers to understand.