Open Thread: September 2009

AllanCrossman

Open Thread: September 2009 — LessWrong

No, let me try nailing this jelly to the wall once again. The definition-only-up-to-a-constant is a weakness of MDL, but this weakness isn't relevant to my question at all! Even if we had some globally unique variant of MDL derived from some nice mathematical idea, learning theory still doesn't use description lengths, and would be perfectly happy with rules that have long descriptions as long as we delineate a small set of those rules. To my mind this casts doubt on the importance of MDL.

DanB17y00

> learning theory still doesn't use description lengths, and would be perfectly happy with rules that have long descriptions as long as we delineate a small set of those rules

Any delineation of a small set of rules leads immediately to a short description length for the rules. You just need to encode the index of the rule in the set, costing about log2(N) bits for a set of size N.
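To make the index-encoding point concrete, here is a minimal sketch. It assumes the set of N rules is fixed in advance and known to both sender and receiver, so a rule can be named by its position alone. The function names `index_code_length_bits` and `encode_index` are made up for illustration.

```python
def index_code_length_bits(num_rules: int) -> int:
    """Bits in a fixed-length code that names one rule out of num_rules.

    Equals ceil(log2(num_rules)), computed with integer arithmetic.
    """
    if num_rules < 1:
        raise ValueError("need at least one rule")
    return (num_rules - 1).bit_length()


def encode_index(index: int, num_rules: int) -> str:
    """Binary codeword for the rule at position `index` in the agreed set."""
    if not 0 <= index < num_rules:
        raise ValueError("index out of range")
    width = index_code_length_bits(num_rules)
    # A set with a single rule needs no bits at all.
    return format(index, "b").zfill(width) if width else ""


if __name__ == "__main__":
    # However baroque each rule is, naming one of 1024 takes 10 bits.
    print(index_code_length_bits(1024))
    print(encode_index(5, 1024))
```

The codeword length depends only on N, not on how complicated the individual rules are. That holds only relative to the pre-agreed set, whose own description is not counted here.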

Note that MDL is not the same as algorithmic information theory (definition-up-to-a-constant comes up in AIT, not MDL), though they're of course related.

0Richard_Kennaway17y

I think its uncomputability already does that. When you make a computable version by limiting attention to some framework of descriptive capabilities smaller than universal computation, different choices of that framework will give you different measures of simplicity. What is simple in one framework may seem elaborate and baroque in another. Or as some military strategist once put it: "To the foot-soldier, the strategy of a general may seem obscure, shrouded in shadows and fog, but to the general himself, his way is as plain as if he were marching his army down a broad, straight highway."

4Johnicholas17y

Consider this alternative characterization. Someone wants to fit a polynomial to some data. They have pre-selected a sparse set of polynomials, which are in general ridiculously complex. Against all odds, they get a good fit to the training data. This theorem says that, because they haven't examined lots and lots of polynomials, they are very unlikely to have fallen into the trap of overfitting. Therefore, the good fit to the training data can be expected to generalize to the real data. Shalizi is saying that this story is fine as far as it goes - it's just not Occam's Razor.
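For reference, the textbook form of this kind of guarantee is the finite-class bound obtained from Hoeffding's inequality plus a union bound. This is a sketch under assumptions and may not be the exact statement the thread has in mind. It assumes i.i.d. samples, a loss bounded in [0, 1], and a set H of N rules fixed before seeing the data. The symbols m (sample size), R (true risk), \hat{R}_m (training risk), \epsilon and \delta are introduced here for illustration.

```latex
\Pr\Big[\exists h \in H : \big|\hat{R}_m(h) - R(h)\big| > \epsilon\Big]
  \;\le\; 2N e^{-2m\epsilon^2}

% Setting the right-hand side equal to \delta and solving for \epsilon:
% with probability at least 1 - \delta, for every h \in H,
\big|\hat{R}_m(h) - R(h)\big|
  \;\le\; \sqrt{\frac{\ln N + \ln(2/\delta)}{2m}}
```

Only the number of rules N enters the bound. The complexity of any individual polynomial in the set does not appear.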
