Johnicholas comments on Open Thread: September 2009 - Less Wrong
No, let me try nailing this jelly to the wall once again. The fact that description length is only defined up to an additive constant is a weakness of MDL, but that weakness isn't relevant to my question at all! Even if we had some globally unique variant of MDL derived from some nice mathematical idea, learning theory still wouldn't use description lengths, and would be perfectly happy with rules that have long descriptions, as long as we delineate a small set of those rules in advance. To my mind this casts doubt on the importance of MDL.
Consider this alternative characterization. Someone wants to fit a polynomial to some data. They pre-selected a small, sparse set of polynomials, each of which is in general ridiculously complex. Against all odds, they get a good fit to the training data. The theorem says that, because they examined only a handful of polynomials rather than lots and lots of them, they have very probably not fallen into the trap of overfitting. Therefore, the good fit to the training data can be expected to generalize to the real data.
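The story above can be sketched in a few lines of Python. Everything here is hypothetical (the set size, the degree, the toy data are made up for illustration): each candidate polynomial is individually very complex, but only the size of the pre-selected set matters for the overfitting argument.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a small, pre-selected set of high-degree polynomials.
# Each rule is individually "complex" (degree 20), but the SET is tiny,
# which is the only thing the learning-theory story cares about.
K = 10          # size of the pre-selected rule set
degree = 20     # each rule has a long description
candidates = [rng.normal(size=degree + 1) for _ in range(K)]

# Toy training data (any fixed sample would do for the illustration).
x = np.linspace(-1.0, 1.0, 30)
y = np.sin(3 * x) + 0.1 * rng.normal(size=x.size)

def train_error(coeffs):
    """Mean squared error of one candidate polynomial on the training data."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Pick the candidate with the smallest training error.  If that error happens
# to be small, the theorem says it probably generalizes -- because only K
# rules were examined, not because any rule has a short description.
best = min(candidates, key=train_error)
print("best training MSE over", K, "candidates:", train_error(best))
```

Note that nothing in this sketch ever measures a description length; the argument would go through identically for any other pre-selected family of ten rules.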
Shalizi is saying that this story is fine as far as it goes - it's just not Occam's Razor.
Good characterization. It's worth noting that learning theory never gives any kind of guarantee that you will actually find a function that provides a good fit to the training data; it only tells you that if you do, and the function comes from a low-complexity (i.e. small, pre-selected) set, it will probably generalize well.
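For a finite rule set, the standard guarantee being alluded to is the Hoeffding inequality plus a union bound: with probability at least 1 - delta, every rule's true error is within epsilon of its training error, where epsilon depends on the number of rules K and the sample size N, but not on how long each rule's description is. A minimal sketch (the specific numbers are illustrative, not from the thread):

```python
import math

def uniform_bound(K, N, delta=0.05):
    """Hoeffding + union bound over a finite set of K rules, losses in [0, 1]:
    with probability >= 1 - delta, every rule's true error is within the
    returned epsilon of its training error."""
    return math.sqrt(math.log(2 * K / delta) / (2 * N))

# The bound depends only on the NUMBER of rules K, never on the length of
# any rule's description -- which is the point made above.
print(uniform_bound(K=10, N=1000))      # small pre-selected set: tight bound
print(uniform_bound(K=10**6, N=1000))   # huge set: much looser bound
```

Because epsilon grows only logarithmically in K, a small pre-selected set of arbitrarily complicated rules gets essentially the same guarantee as a small set of simple ones.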