Non-trivial probability distributions for priors and Occam's razor

2 Post author: JoshuaZ 11 January 2011 03:59AM

Assume we have a countable set of hypotheses described in some formal way, with some prior distribution such that 1) our prior for each hypothesis is non-zero and 2) our formal description system has only a finite number of hypotheses of any fixed length. Then, I claim that under just this set of weak constraints, our hypotheses are in a situation that informally acts a lot like Occam's razor. In particular, let h(n) be the probability mass assigned to "a hypothesis with description length at least n is correct." (ETA: fixed from earlier statement) Then, as n goes to infinity, h(n) goes to zero. So, looking at the large scale, complicated hypotheses must have low probability. This suggests that one doesn't need any appeal to computability or anything similar to accept some form of Occam's razor. One only needs a countable hypothesis space, no hypothesis with probability zero or one, and a non-stupid way of writing down hypotheses.
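A minimal numerical sketch of the claim (the hypothesis language here is binary strings, and the prior weights are invented for illustration; any positive, normalizable weights would do):

```python
# Toy hypothesis space: all binary strings up to some length, so there
# are 2**n hypotheses of length n (finitely many per length, as required).
# Each hypothesis gets an arbitrary positive weight, then the weights are
# normalized so the prior sums to 1.
def toy_prior(max_len):
    weights = {}
    for n in range(1, max_len + 1):
        for i in range(2 ** n):
            h = format(i, "0{}b".format(n))        # a length-n hypothesis
            weights[h] = 1.0 / (n + i + 1) ** 2     # arbitrary positive weight
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

def tail_mass(prior, n):
    """h(n): total prior mass on hypotheses with description length >= n."""
    return sum(p for h, p in prior.items() if len(h) >= n)

prior = toy_prior(12)
masses = [tail_mass(prior, n) for n in range(1, 13)]

# h(1) is all the mass, and h(n) is non-increasing as n grows; in the
# untruncated countable space it must tend to 0, since the series sums to 1.
assert abs(masses[0] - 1.0) < 1e-9
assert all(a >= b - 1e-12 for a, b in zip(masses, masses[1:]))
```

The truncation at `max_len` is only so the sketch terminates; the argument itself needs nothing beyond convergence of the total probability.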

A few questions: 1) Am I correct in seeing this as Occam-like or is this just an indication that I'm using too weak a notion of Occam's razor?   

2) Is this point novel? I'm not as familiar with the Bayesian literature as other people here so I'm hoping that someone can point out if this point has been made before.

ETA: This was apparently a point made by Unknowns in an earlier thread which I totally forgot but probably read at the time. Thanks also for the other pointers.

 

Comments (23)

Comment author: Zack_M_Davis 11 January 2011 04:49:36AM 3 points [-]

Is this point novel?

Compare Rob Zahra's April 2009 comment.

Comment author: cousin_it 11 January 2011 05:29:01AM *  1 point [-]

Thanks, that's a very good overview that I haven't seen before.

Comment author: Manfred 11 January 2011 05:21:38PM *  2 points [-]

There is a problem - length isn't the only way to order hypotheses. If there were some other method of ordering that fulfilled the right conditions (even just some obvious function of length, e.g. length times the character value of the first letter), it could be used instead to order the hypotheses, and probability would then be decreasing in that parameter instead. You could give bonus points for mentioning the Flying Spaghetti Monster if you wanted, and it would be a perfectly fine ordering.

Comment author: Sniffnoy 11 January 2011 10:30:06PM 1 point [-]

That doesn't in any way contradict this; it just demonstrates that limits of sequences are invariant under a permutation of that sequence.
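A quick numerical sketch of that invariance (the geometric prior and the shuffle are invented for illustration): the tail mass of a convergent sequence of probabilities still tends to zero no matter how the terms are reordered.

```python
import random

# A prior over a countably infinite hypothesis space, truncated here:
# p_i = 2**-(i+1) for i = 0..N-1, so the masses sum to just under 1.
N = 50
probs = [2.0 ** -(i + 1) for i in range(N)]

random.seed(0)
shuffled = probs[:]
random.shuffle(shuffled)  # an arbitrary alternative ordering of the same hypotheses

def tails(seq):
    """Tail mass remaining after position n, for each n."""
    return [sum(seq[n:]) for n in range(len(seq) + 1)]

# Under both orderings the tail mass is monotone non-increasing and
# reaches 0 - reordering changes which hypotheses count as "early",
# not the limit.
for seq in (probs, shuffled):
    t = tails(seq)
    assert all(a >= b for a, b in zip(t, t[1:]))
    assert t[-1] == 0.0
```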

Comment author: Manfred 12 January 2011 12:58:44AM *  0 points [-]

It quite thoroughly contradicts this, since it means that this isn't a proof of Occam's razor, merely a proof that there is some ordering of hypotheses along which probability decreases.

Comment author: Sniffnoy 12 January 2011 01:41:18AM 1 point [-]

Yes, it demonstrates that this statement is too weak to be really called "Occam's razor". But it doesn't contradict the statement. Is this standard, to use "contradict" to mean "contradict the intended connotation of"? That seems confusing.

Comment author: jsteinhardt 13 January 2011 04:56:45AM 1 point [-]

I feel like, at the very least, a demonstration that the argument, while technically correct, doesn't have the real-world implications it was intended to have should be taken as an invalidation of the point. Far too often, someone clings to the fact that their argument is technically correct even though it is practically irrelevant.

Not that anyone is doing that in this thread, as far as I can tell. But it's something to watch out for.

Comment author: Unnamed 11 January 2011 05:22:17AM 2 points [-]

Comment author: JoshuaZ 11 January 2011 05:26:34AM 0 points [-]

Oh doh! I even posted in that thread. I have zero recall of reading that but I see comments made by me in the thread so I must have read it...gah. Now I feel silly.

Comment author: jimrandomh 11 January 2011 02:12:28PM *  1 point [-]

This is certainly an interesting line of investigation, and as far as I know it's still unsolved. There are still some things that need to be proven to get to an Occam-like distribution, which I mentioned the last time this came up.

Specifically, I haven't seen any good justification for the assumption that there is a finite number of correct hypotheses (or any variant on this assumption; there are lots to choose from), since there could be hypothesis-schemas that generate infinitely many true hypotheses. A full analysis of this question would probably have to account for those, in a way that assigns all the hypotheses from a particular schema the complexity of that schema. I also haven't seen anyone argue from "prior probabilities decrease monotonically" to "prior probabilities decrease exponentially"; it seems natural since the size of the space of hypotheses of length n increases exponentially with n, but I don't know how to prove that they don't instead decrease faster or slower than exponentially, or what other assumptions may be necessary.
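A small sketch of the rate question (the block masses below are invented for illustration): give the set of length-n hypotheses total mass proportional to 1/n^2 in one prior and to 2^-n in another. Both satisfy the original post's assumptions, but h(n) falls off polynomially in the first and exponentially in the second, so those assumptions alone don't pin down the rate.

```python
N = 10_000  # truncation of the countable hypothesis space, for illustration

# Total mass assigned to the block of hypotheses of length exactly n,
# under two different priors (each normalized to sum to 1):
poly = [1.0 / n ** 2 for n in range(1, N + 1)]
total = sum(poly)
poly = [q / total for q in poly]

expo = [2.0 ** -n for n in range(1, N + 1)]
total = sum(expo)
expo = [q / total for q in expo]

def h(block_masses, n):
    """Tail mass on hypotheses of length >= n (lengths are 1-indexed)."""
    return sum(block_masses[n - 1:])

# Both tails shrink toward 0, but the polynomial tail dominates the
# exponential one by a factor that grows without bound:
ratios = [h(poly, n) / h(expo, n) for n in (5, 10, 20)]
assert ratios[0] < ratios[1] < ratios[2]
```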

Comment author: jsteinhardt 13 January 2011 04:58:27AM 1 point [-]

You don't need finite, only countable. Assuming that our conception of the universe is approximately correct, we are only capable of generating a countable set of hypotheses.

Comment author: jimrandomh 13 January 2011 12:42:20PM 0 points [-]

Huh? There are only countably many statements in any string-based language, so countability alone admits decidedly non-Occamian scenarios like every statement being true, or every statement being true with the same probability.

Comment author: jsteinhardt 14 January 2011 06:34:26AM 0 points [-]

I'm confused. What is the countable set of hypotheses you are considering? My claim is merely that if you have hypotheses H1, H2, ..., then p(Hi) > 1/n for at most n-1 values of i. This can be thought of as a weak form of Occam's razor.

In what sense is "every statement being true" a choice of a countable set of hypotheses?

I think maybe the issue is that we are using hypothesis in a different sense. In my case a hypothesis is a complete model of the world, so it is not possible for multiple hypotheses to be true. You can marginalize out / observe a bunch of variables to talk about a subset of the world, but your hypotheses should still be mutually exclusive.

Comment author: benelliott 11 January 2011 06:22:55PM 1 point [-]

I think the hypotheses are assumed to be mutually exclusive. For example you could have a long list of possible sets of laws of physics, at most one is true of this universe.

Comment author: jimrandomh 11 January 2011 06:41:32PM 1 point [-]

Right, that's another way of stating the same assumption. But we usually apply Occam's razor to statements in languages that admit sets of non-mutually-exclusive hypotheses of infinite size. So you'd need to somehow collapse or deduplicate those in a way that makes them finite.

Comment author: benelliott 11 January 2011 07:00:21PM *  1 point [-]

I find I mostly apply Occam's razor to mutually exclusive hypotheses, e.g. explanation A of phenomenon X is better than explanation B because it is simpler.

Comment author: jsteinhardt 11 January 2011 05:00:31AM *  1 point [-]

People are definitely aware of this in the literature, but still, congratulations on re-deriving it (and consider reading more of the literature if this stuff interests you). Another version: at most n hypotheses can have probability greater than 1/n. This doesn't even reference description length.

Comment author: benelliott 11 January 2011 06:17:39PM 2 points [-]

Actually, at most (n-1) hypotheses can have probability greater than 1/n.
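The bound is just pigeonhole applied to a probability distribution - if n hypotheses each had probability greater than 1/n, the total would exceed 1 - and it is easy to check on an arbitrary prior (the prior below is made up for the demonstration):

```python
import random

random.seed(1)

# An arbitrary normalized prior over 1000 mutually exclusive hypotheses.
weights = [random.random() for _ in range(1000)]
total = sum(weights)
prior = [w / total for w in weights]

def count_above(prior, threshold):
    """Number of hypotheses with probability strictly greater than threshold."""
    return sum(1 for p in prior if p > threshold)

# At most n - 1 hypotheses can have probability greater than 1/n,
# since n of them would already sum past 1.
for n in (2, 5, 10, 100, 1000):
    assert count_above(prior, 1.0 / n) <= n - 1
```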

Comment author: Matt_Simpson 11 January 2011 11:38:30PM *  0 points [-]

I vaguely remember a similar point being made by Jaynes in PT:TLoS

Comment author: Perplexed 11 January 2011 04:25:01PM 0 points [-]

I must be missing something. Your argument makes no sense at all to me. It is very simple to construct a countable set of hypotheses of various lengths, all of which have prior probability exactly 1/2. The hypotheses denote events in a universe consisting of an infinite sequence of coin flips, for example.

Furthermore, I must be misreading your definition of h(n), because as I read it, h(n) goes to 1 as n goes to infinity. I.e. it becomes overwhelmingly likely that at least one shorter-than-n hypothesis is correct.

Comment author: JoshuaZ 11 January 2011 07:10:31PM *  0 points [-]

I must be missing something. Your argument makes no sense at all to me. It is very simple to construct a countable set of hypotheses of various lengths, all of which have prior probability exactly 1/2. The hypotheses denote events in a universe consisting of an infinite sequence of coin flips, for example.

Hypotheses in this sense are exclusive. If you prefer, consider hypotheses to be descriptors of all possible data one will get from some infinite string of bits.

Furthermore, I must be misreading your definition of h(n), because as I read it, h(n) goes to 1 as n goes to infinity. I.e. it becomes overwhelmingly likely that at least one shorter-than-n hypothesis is correct.

Sorry, yes, h(n) should be for statements with length at least n, not at most n.

Comment author: Manfred 11 January 2011 05:14:53PM *  0 points [-]

Insert "mutually exclusive" where appropriate :D

Yes, the argument is confused, but I think that's only the writing, not the idea. I think this may not be as general as it could be, though - it would be nice if Occam's razor applied for other conditions.

Oh, wait - it doesn't quite work. I should probably write my own reply for that bit.