Non-trivial probability distributions for priors and Occam's razor

2 Post author: JoshuaZ 11 January 2011 03:59AM

Assume we have a countable set of hypotheses described in some formal way, with some prior distribution such that 1) our prior for each hypothesis is non-zero and 2) our formal description system has only a finite number of hypotheses of any fixed length. Then, I claim that under just this set of weak constraints, our hypotheses are in a situation that informally acts a lot like Occam's razor. In particular, let h(n) be the probability mass assigned to "a hypothesis with description length at least n is correct." (ETA: fixed from earlier statement) Then, as n goes to infinity, h(n) goes to zero. So, looking at the large scale, complicated hypotheses must have low probability. This suggests that one doesn't need any appeal to computability or anything similar to accept some form of Occam's razor. One only needs a countable hypothesis space, no hypothesis with probability zero or one, and a non-stupid way of writing down hypotheses.
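A minimal numerical sketch of the claim (the hypothesis language here is binary strings, and the prior weights are invented for illustration; any positive, normalizable weights would do):

```python
# Toy hypothesis space: all binary strings up to some length, so there
# are 2**n hypotheses of length n (finitely many per length, as required).
# Each hypothesis gets an arbitrary positive weight, then the weights are
# normalized so the prior sums to 1.
def toy_prior(max_len):
    weights = {}
    for n in range(1, max_len + 1):
        for i in range(2 ** n):
            h = format(i, "0{}b".format(n))        # a length-n hypothesis
            weights[h] = 1.0 / (n + i + 1) ** 2     # arbitrary positive weight
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

def tail_mass(prior, n):
    """h(n): total prior mass on hypotheses with description length >= n."""
    return sum(p for h, p in prior.items() if len(h) >= n)

prior = toy_prior(12)
masses = [tail_mass(prior, n) for n in range(1, 13)]

# h(1) is all the mass, and h(n) is non-increasing as n grows; in the
# untruncated countable space it must tend to 0, since the series sums to 1.
assert abs(masses[0] - 1.0) < 1e-9
assert all(a >= b - 1e-12 for a, b in zip(masses, masses[1:]))
```

The truncation at `max_len` is only so the sketch terminates; the argument itself needs nothing beyond convergence of the total probability.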

A few questions: 1) Am I correct in seeing this as Occam-like or is this just an indication that I'm using too weak a notion of Occam's razor?   

2) Is this point novel? I'm not as familiar with the Bayesian literature as other people here so I'm hoping that someone can point out if this point has been made before.

ETA: This was apparently a point made by Unknowns in an earlier thread which I totally forgot but probably read at the time. Thanks also for the other pointers.

 

Comments (23)

Comment author: Zack_M_Davis 11 January 2011 04:49:36AM 3 points [-]

Is this point novel?

Compare Rob Zahra's April 2009 comment.

Comment author: cousin_it 11 January 2011 05:29:01AM *  1 point [-]

Thanks, that's a very good overview that I haven't seen before.

Comment author: Manfred 11 January 2011 05:21:38PM *  2 points [-]

There is a problem - length isn't the only way to order hypotheses. If there were some other method of ordering that fulfilled the right conditions (even just some obvious function of length, e.g. length times the character value of the first letter), it could be used instead to order the hypotheses, and probability would then be decreasing in that parameter instead. You could give bonus points for mentioning the Flying Spaghetti Monster if you wanted, and it would be a perfectly fine ordering.

Comment author: Sniffnoy 11 January 2011 10:30:06PM 1 point [-]

That doesn't in any way contradict this; it just demonstrates that limits of sequences are invariant under a permutation of that sequence.
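A quick numerical sketch of that invariance (the geometric prior and the shuffle are invented for illustration): the tail mass of a convergent sequence of probabilities still tends to zero no matter how the terms are reordered.

```python
import random

# A prior over a countably infinite hypothesis space, truncated here:
# p_i = 2**-(i+1) for i = 0..N-1, so the masses sum to just under 1.
N = 50
probs = [2.0 ** -(i + 1) for i in range(N)]

random.seed(0)
shuffled = probs[:]
random.shuffle(shuffled)  # an arbitrary alternative ordering of the same hypotheses

def tails(seq):
    """Tail mass remaining after position n, for each n."""
    return [sum(seq[n:]) for n in range(len(seq) + 1)]

# Under both orderings the tail mass is monotone non-increasing and
# reaches 0 - reordering changes which hypotheses count as "early",
# not the limit.
for seq in (probs, shuffled):
    t = tails(seq)
    assert all(a >= b for a, b in zip(t, t[1:]))
    assert t[-1] == 0.0
```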

Comment author: Manfred 12 January 2011 12:58:44AM *  0 points [-]

It quite thoroughly contradicts this, since it means that this isn't a proof of Occam's razor, merely a proof that there is some ordering of hypotheses along which probability decreases.

Comment author: Sniffnoy 12 January 2011 01:41:18AM 1 point [-]

Yes, it demonstrates that this statement is too weak to be really called "Occam's razor". But it doesn't contradict the statement. Is this standard, to use "contradict" to mean "contradict the intended connotation of"? That seems confusing.

Comment author: jsteinhardt 13 January 2011 04:56:45AM 1 point [-]

I feel like, at the very least, a demonstration that the argument, while technically correct, doesn't have the real-world implications it was intended to have should be taken as an invalidation of the point. Far too often, someone clings to the fact that their argument is technically correct even though it is practically irrelevant.

Not that anyone is doing that in this thread, as far as I can tell. But it's something to watch out for.

Comment author: Unnamed 11 January 2011 05:22:17AM 2 points [-]

Comment author: JoshuaZ 11 January 2011 05:26:34AM 0 points [-]

Oh doh! I even posted in that thread. I have zero recall of reading that but I see comments made by me in the thread so I must have read it...gah. Now I feel silly.

Comment author: jimrandomh 11 January 2011 02:12:28PM *  1 point [-]

This is certainly an interesting line of investigation, and as far as I know it's still unsolved. There are still some things that need to be proven to get to an Occam-like distribution, which I mentioned the last time this came up.

Specifically, I haven't seen any good justification for the assumption that there is a finite number of correct hypotheses (or any variant on this assumption; there are lots to choose from), since there could be hypothesis-schemas that generate infinitely many true hypotheses. A full analysis of this question would probably have to account for those, in a way that assigns all the hypotheses from a particular schema the complexity of that schema. I also haven't seen anyone argue from "prior probabilities decrease monotonically" to "prior probabilities decrease exponentially"; it seems natural since the size of the space of hypotheses of length n increases exponentially with n, but I don't know how to prove that they don't instead decrease faster or slower than exponentially, or what other assumptions may be necessary.
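A small sketch of the rate question (the block masses below are invented for illustration): give the set of length-n hypotheses total mass proportional to 1/n^2 in one prior and to 2^-n in another. Both satisfy the original post's assumptions, but h(n) falls off polynomially in the first and exponentially in the second, so those assumptions alone don't pin down the rate.

```python
N = 10_000  # truncation of the countable hypothesis space, for illustration

# Total mass assigned to the block of hypotheses of length exactly n,
# under two different priors (each normalized to sum to 1):
poly = [1.0 / n ** 2 for n in range(1, N + 1)]
total = sum(poly)
poly = [q / total for q in poly]

expo = [2.0 ** -n for n in range(1, N + 1)]
total = sum(expo)
expo = [q / total for q in expo]

def h(block_masses, n):
    """Tail mass on hypotheses of length >= n (lengths are 1-indexed)."""
    return sum(block_masses[n - 1:])

# Both tails shrink toward 0, but the polynomial tail dominates the
# exponential one by a factor that grows without bound:
ratios = [h(poly, n) / h(expo, n) for n in (5, 10, 20)]
assert ratios[0] < ratios[1] < ratios[2]
```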

Comment author: jsteinhardt 13 January 2011 04:58:27AM 1 point [-]

You don't need finite, only countable. Assuming that our conception of the universe is approximately correct, we are only capable of generating a countable set of hypotheses.

Comment author: jimrandomh 13 January 2011 12:42:20PM 0 points [-]

Huh? There are only countably many statements in any string-based language, so countability alone admits decidedly non-Occamian scenarios like every statement being true, or every statement being true with the same probability.

Comment author: jsteinhardt 14 January 2011 06:34:26AM 0 points [-]

I'm confused. What is the countable set of hypotheses you are considering? My claim is merely that if you have hypotheses H1, H2, ..., then p(Hi) > 1/n for at most n-1 values of i. This can be thought of as a weak form of Occam's razor.

In what sense is "every statement being true" a choice of a countable set of hypotheses?

I think maybe the issue is that we are using hypothesis in a different sense. In my case a hypothesis is a complete model of the world, so it is not possible for multiple hypotheses to be true. You can marginalize out / observe a bunch of variables to talk about a subset of the world, but your hypotheses should still be mutually exclusive.

Comment author: benelliott 11 January 2011 06:22:55PM 1 point [-]

I think the hypotheses are assumed to be mutually exclusive. For example you could have a long list of possible sets of laws of physics, at most one is true of this universe.

Comment author: jimrandomh 11 January 2011 06:41:32PM 1 point [-]

Right, that's another way of stating the same assumption. But we usually apply Occam's razor to statements in languages that admit sets of non-mutually-exclusive hypotheses of infinite size. So you'd need to somehow collapse or deduplicate those in a way that makes them finite.

Comment author: benelliott 11 January 2011 07:00:21PM *  1 point [-]

I find I mostly apply Occam's razor to mutually exclusive hypotheses, e.g. explanation A of phenomenon X is better than explanation B because it is simpler.

Comment author: jsteinhardt 11 January 2011 05:00:31AM *  1 point [-]

People are definitely aware of this in the literature, but still, congratulations on re-deriving it (and consider reading more of the literature if this stuff interests you). Another version: at most n hypotheses can have probability greater than 1/n. This doesn't even reference description length.

Comment author: benelliott 11 January 2011 06:17:39PM 2 points [-]

Actually, at most (n-1) hypotheses can have probability greater than 1/n.
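The bound is just pigeonhole applied to a probability distribution - if n hypotheses each had probability greater than 1/n, the total would exceed 1 - and it is easy to check on an arbitrary prior (the prior below is made up for the demonstration):

```python
import random

random.seed(1)

# An arbitrary normalized prior over 1000 mutually exclusive hypotheses.
weights = [random.random() for _ in range(1000)]
total = sum(weights)
prior = [w / total for w in weights]

def count_above(prior, threshold):
    """Number of hypotheses with probability strictly greater than threshold."""
    return sum(1 for p in prior if p > threshold)

# At most n - 1 hypotheses can have probability greater than 1/n,
# since n of them would already sum past 1.
for n in (2, 5, 10, 100, 1000):
    assert count_above(prior, 1.0 / n) <= n - 1
```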

Comment author: Matt_Simpson 11 January 2011 11:38:30PM *  0 points [-]

I vaguely remember a similar point being made by Jaynes in PT:TLoS

Comment author: Perplexed 11 January 2011 04:25:01PM 0 points [-]

I must be missing something. Your argument makes no sense at all to me. It is very simple to construct a countable set of hypotheses of various lengths, all of which have prior probability exactly 1/2. The hypotheses denote events in a universe consisting of an infinite sequence of coin flips, for example.

Furthermore, I must be misreading your definition of h(n), because as I read it, h(n) goes to 1 as n goes to infinity. I.e. it becomes overwhelmingly likely that at least one shorter-than-n hypothesis is correct.

Comment author: JoshuaZ 11 January 2011 07:10:31PM *  0 points [-]

I must be missing something. Your argument makes no sense at all to me. It is very simple to construct a countable set of hypotheses of various lengths, all of which have prior probability exactly 1/2. The hypotheses denote events in a universe consisting of an infinite sequence of coin flips, for example.

Hypotheses in this sense are exclusive. If you prefer, consider hypotheses to be descriptors of all possible data one will get from some infinite string of bits.

Furthermore, I must be misreading your definition of h(n), because as I read it, h(n) goes to 1 as n goes to infinity. I.e. it becomes overwhelmingly likely that at least one shorter-than-n hypothesis is correct.

Sorry, yes, h(n) should be for statements with length at least n, not at most n.

Comment author: Manfred 11 January 2011 05:14:53PM *  0 points [-]

Insert "mutually exclusive" where appropriate :D

Yes, the argument is confused, but I think that's only the writing, not the idea. I think this may not be as general as it could be, though - it would be nice if Occam's razor applied for other conditions.

Oh, wait - it doesn't quite work. I should probably write my own reply for that bit.