Wei_Dai comments on Open Thread: September 2009 - Less Wrong

2 Post author: AllanCrossman 01 September 2009 10:54AM

You are viewing a comment permalink. View the original post to see all comments and the full post content.

Comments (179)

You are viewing a single comment's thread. Show more comments above.

Comment author: cousin_it 07 September 2009 12:27:11PM *  2 points [-]

Just found this note in Shalizi's notebooks which casts an interesting shadow on the Solomonoff prior:

The technical results say that a classification rule is simple if it has a short description, measured in bits. (That is, we are in minimum description length land, or very close to it.) The shorter the description, the tighter the bound on the generalization error. I am happy to agree that this is a reasonable (if language-dependent) way of defining "simplicity" for classifier rules. However, so far as I can tell, this really isn't what makes the proofs work.

What actually makes the proofs go is the fact that there are not very many binary strings of length m for small m. Finding a classifier with a short description means that you have searched over a small space. It's the restriction of the search space which really does all the work. It's not the simplicity of the rules which matters, but the fact that simple rules are scarce in the space of all possible rules. If confines oneself to elaborate, baroque rules, but sticks to a particular style of baroque elaboration, one obtains the same effect.

For context, see the Wikipedia page on PAC learning or this lecture by Scott Aaronson.

Comment author: Wei_Dai 11 September 2009 06:57:41PM *  2 points [-]

Looks like I almost missed a very interesting discussion. Hope I'm not too late in joining it.

As far as I can tell, you still need the Solomonoff prior for decision making. Kevin T. Kelly's Ockham Efficiency Theorem says that by using Ockham's Razor you minimize reversals of opinion prior to finding the true theory, but that seems irrelevant when you have to bet on something, especially since even after you've found the truth using Ockham's Razor, you don't know that you've found it.

Also, I think there's a flaw in Shalizi's argument:

Suppose one of these rules correctly classifies all the data.

But if you're working with a sparse random subset of the rules, why would any of them correctly classify all the data? In algorithmic information theory, the set of rules is universal so one of them is guaranteed to fit the data (assuming the input is computable, which may not be a good assumption but that's a separate issue).

Comment author: cousin_it 15 September 2009 10:55:14AM *  0 points [-]

All good points if the universe has important aspects that are computable, which seems uncontroversial to me. Thanks for the link, I'd lost it sometime ago, that thread is pretty epic.