cousin_it comments on Open Thread: September 2009 - Less Wrong

2 Post author: AllanCrossman 01 September 2009 10:54AM


Comment author: cousin_it 07 September 2009 02:03:09PM *  0 points [-]

Not necessarily. (Or did I misread your comment?) The particular style can have an arbitrarily long/complex description, and learning will still work as long as the class of described rules is small enough. This observation seems to imply that algorithmic simplicity doesn't play the central role I'd imagined it to play. This is precisely the point where I'd like to hear LW's informed replies.
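The bound being appealed to here can be made concrete. In the realizable finite-class case, the number of samples needed depends only on ln |H|, the log of the number of rules in the class, and never on how long any individual rule's description is. A minimal sketch, with made-up illustrative numbers:

```python
import math

def pac_sample_bound(num_hypotheses, epsilon, delta):
    """Realizable-case PAC bound for a finite class H: with at least
    (1/epsilon) * (ln|H| + ln(1/delta)) samples, any hypothesis that is
    consistent with the training data has true error at most epsilon,
    with probability at least 1 - delta.  Only |H| appears; the
    description length of individual hypotheses never does."""
    return math.ceil((math.log(num_hypotheses) + math.log(1 / delta)) / epsilon)

# A class of 1000 rules needs the same number of samples whether each
# rule is a one-liner or an arbitrarily baroque program.
print(pac_sample_bound(1000, epsilon=0.05, delta=0.01))  # 231
```

Doubling the class size to 2000 only adds ln(2)/epsilon, about 14 samples here, which is why a small class of long-description rules is still cheaply learnable.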

Comment author: RichardKennaway 07 September 2009 02:08:47PM 1 point [-]

Not necessarily. The particular style can have an arbitrarily long description, and learning will still work as long as the class of described rules is small.

Fixed overheads are ignored in the definition of Kolmogorov complexity.

A "particular style of baroque elaboration" is simply a program through whose eyes certain rules look short; those same rules look elaborate and baroque through the eyes of another program.

Comment author: cousin_it 07 September 2009 02:16:32PM *  0 points [-]

The PAC-learning scenario doesn't have any parameter that goes to infinity, so I'm not sure why you dismiss "fixed" overheads :-)

Comment author: RichardKennaway 07 September 2009 03:00:29PM *  0 points [-]

The PAC-learning scenario doesn't have any parameter that goes to infinity, so I'm not sure why you dismiss "fixed" overheads :-)

Once you've chosen them, they're fixed, and don't run off to infinity.

This definition-only-up-to-a-constant is one of the weaknesses of minimum description length. (The other is its uncomputability. Shalizi somewhere else remarks that in discussions of algorithmic complexity, it is traditional to solemnly take out Kolmogorov complexity, exhibit its theoretical properties, remark on its uncomputability, and put it away again before turning to practical matters.)

This is precisely the point where I'd like to hear LW's informed replies.

FWIW, this is mine, informed or otherwise. Anyone else have light to shed?

Comment author: cousin_it 07 September 2009 03:22:58PM *  1 point [-]

No, let me try nailing this jelly to the wall once again. The definition-only-up-to-a-constant is a weakness of MDL, but this weakness isn't relevant to my question at all! Even if we had some globally unique variant of MDL derived from some nice mathematical idea, learning theory still doesn't use description lengths, and would be perfectly happy with rules that have long descriptions as long as we delineate a small set of those rules. To my mind this casts doubt on the importance of MDL.

Comment author: Johnicholas 07 September 2009 04:32:23PM *  3 points [-]

Consider this alternative characterization. Someone wants to fit a polynomial to some data. They pre-selected a sparse set of polynomials, which are in general ridiculously complex. Against all odds, they get a good fit to the training data. This theorem says that, because they examined only a handful of polynomials rather than lots and lots of them, they are very unlikely to have fallen into the trap of overfitting. Therefore, the good fit to the training data can be expected to generalize to unseen data.
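The story can be sketched as code. This is an illustrative toy (the candidate polynomials and data are made up): each candidate is individually a high-degree, "ridiculously complex" polynomial, but because only three hypotheses are ever examined, a low training error is unlikely to be an overfitting artifact.

```python
# Hypothetical pre-selected "sparse set" of polynomials, each given by
# its coefficient list (constant term first).  The individual members
# are complicated; the point is only that there are few of them.
CANDIDATES = [
    [3.1, 0.0, -2.7, 0.0, 0.0, 0.0, 0.0, 1.9],                      # degree 7
    [-1.2, 4.4, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0, -0.8],           # degree 9
    [0.5, 0.0, 0.0, 2.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.1],   # degree 11
]

def evaluate(coeffs, x):
    """Evaluate a polynomial at x by Horner's method."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def mean_squared_error(coeffs, data):
    """Average squared training error of one candidate on (x, y) pairs."""
    return sum((evaluate(coeffs, x) - y) ** 2 for x, y in data) / len(data)

def best_fit(data):
    """Pick the candidate with the lowest training error.  Only three
    hypotheses are examined, regardless of their polynomial degree."""
    return min(CANDIDATES, key=lambda c: mean_squared_error(c, data))
```

If the data really were generated by one of the three candidates, `best_fit` recovers it exactly, and the theorem then licenses expecting that fit to generalize.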

Shalizi is saying that this story is fine as far as it goes - it's just not Occam's Razor.

Comment author: Daniel_Burfoot 07 September 2009 06:09:13PM 1 point [-]

Good characterization. It's worth noting that learning theory never gives any kind of guarantee that you will actually find a function that provides a good fit to the training data; it just tells you that if you do, and the function comes from a low-complexity set, it will probably generalize well.

Comment author: Daniel_Burfoot 07 September 2009 06:18:26PM 0 points [-]

learning theory still doesn't use description lengths, and would be perfectly happy with rules that have long descriptions as long as we delineate a small set of those rules

Any delineation of a small set of rules leads immediately to a short description length for the rules. You just need to encode the index of the rule in the set, costing log(N) bits for a set of size N.
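A sketch of that index-coding observation (illustrative numbers only):

```python
import math

def index_cost_bits(set_size):
    """Bits needed to name one element of a finite set by its index:
    ceil(log2(N)).  This is a valid description length for any rule in
    the set, no matter how baroque the rule itself looks written out."""
    return math.ceil(math.log2(set_size))

print(index_cost_bits(1000))  # 10 bits suffice for any of 1000 rules
```

So delineating a small set is itself a compression scheme: the set plays the role of the fixed overhead, and each rule's incremental description is just its index.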

Note that MDL is not the same as algorithmic information theory (definition-up-to-a-constant comes up in AIT, not MDL), though they're of course related.

Comment author: cousin_it 07 September 2009 06:54:08PM *  0 points [-]

You just need to encode the index of the rule in the set, costing log(N) bits for a set of size N.

I see everyone's assuming that some things, by their nature, always go to infinity (e.g. number of samples) while others stay constant (e.g. rule set). This is a nice convention, but not always realistic - and it certainly wasn't mentioned in the original formulation of the problem of learning, where everything is finite. If you really want things to grow, why don't you then allow the set itself to be specified by increasingly convoluted algorithms as N goes to infinity? Like, exponential in N? :-) Learning can still work - in theory, it'll work just as well as a simple rule set - but you'll have a hard time explaining that with MDL.

(If there's some theoretical justification why this kind of outrageous bloatware won't be as successful as simple algorithms, I'd really like to hear it...)

Comment author: Johnicholas 07 September 2009 07:14:14PM 1 point [-]

You might find what you're looking for in Kevin T. Kelly's work:

http://www.andrew.cmu.edu/user/kk3n/ockham/Ockham.htm

Dr. Shalizi mentioned it as an alternative formalization of Occam's Razor.

Comment author: cousin_it 07 September 2009 07:55:19PM *  0 points [-]

Johnicholas, thanks. This link and your other comments in those threads are very close to what I'm looking for, though this realization took me some time.

Comment author: RichardKennaway 07 September 2009 05:05:19PM *  0 points [-]

To my mind this casts doubt on the importance of MDL.

I think its uncomputability already does that. When you make a computable version by limiting attention to some framework of descriptive capabilities smaller than universal computation, different choices of that framework will give you different measures of simplicity. What is simple in one framework may seem elaborate and baroque in another. Or as some military strategist once put it:

"To the foot-soldier, the strategy of a general may seem obscure, shrouded in shadows and fog, but to the general himself, his way is as plain as if he were marching his army down a broad, straight highway."