Daniel_Burfoot comments on The prior of a hypothesis does not depend on its complexity - Less Wrong

Post author: cousin_it 26 August 2010 01:20PM




Comment author: Daniel_Burfoot 27 August 2010 03:08:00AM 1 point

The complexity of "A&B&C&D" is roughly equal to the complexity of "A||B||C||D", but we know for certain that the former hypothesis can never be more probable than the latter,

I think you are trying to make formal arguments based on informal definitions, which leads to confusion. KC refers to program lengths, but if A, B, and C are programs, then what is A||B||C? More careful definitions would lead to a more traditional conclusion.

For one way to formalize the A/B/C/D point, imagine you are trying to solve the UCI Census Income problem, where one attempts to predict whether a person's income is greater than $50k based on various other attributes. Let A/B/C/D refer to binary predictors such as (is-married?), (is-religious?), (has-college-degree?), and (is-military?). The goal is to find a combination of these that predicts income well. Now it is of course true that (A||B||C||D) is more probable than (A&B&C&D) for any given person. But there is no prior reason to believe that the OR-hypothesis has more predictive power than the AND-hypothesis, just as a KC-style analysis would indicate.
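The distinction drawn here between event probability and predictive power can be illustrated with a toy simulation (a sketch using synthetic random attributes, not the actual UCI Census data): the OR of four binary attributes always fires at least as often as the AND, yet when the target happens to be generated by an AND-like rule, the AND-hypothesis is the far better predictor.

```python
import random

random.seed(0)

def simulate(target_rule, n=10_000):
    """Score the OR- and AND-hypotheses as predictors of a binary target."""
    or_fires = and_fires = or_correct = and_correct = 0
    for _ in range(n):
        a, b, c, d = (random.random() < 0.5 for _ in range(4))
        y = target_rule(a, b, c, d)
        or_pred = a or b or c or d
        and_pred = a and b and c and d
        or_fires += or_pred
        and_fires += and_pred
        or_correct += (or_pred == y)
        and_correct += (and_pred == y)
    return or_fires, and_fires, or_correct / n, and_correct / n

# Target generated by an AND-like rule: the AND-hypothesis predicts it
# perfectly, even though the OR-event is far more probable for any one person.
or_f, and_f, or_acc, and_acc = simulate(lambda a, b, c, d: a and b and c and d)
assert or_f >= and_f     # P(A||B||C||D) >= P(A&B&C&D), always
assert and_acc > or_acc  # ...but higher event probability != more predictive power
```

Swapping in an OR-like target rule reverses the accuracy ordering, while the fires-count inequality still holds; which hypothesis predicts well depends on the world, not on which event is more probable.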

Comment author: cousin_it 27 August 2010 08:53:26AM 1 point

I think you are trying to make formal arguments based on informal definitions, leading to confusion. KC refers to program lengths, but if A,B,C are programs, then what is A||B||C?

Whaa? I found that paragraph more confused than anything I wrote :-) Kolmogorov complexity is defined for arbitrary bit strings, which need not be programs but can be (say) statements about the world expressed in some formal language. Yes, the definition of KC quantifies over programs that generate those bit strings; so what?

I cannot parse your second paragraph, because I don't understand why "predictive power" should be related to the prior probability of a hypothesis.

Comment author: Daniel_Burfoot 28 August 2010 04:00:16AM 1 point

why "predictive power" should be related to prior probability of a hypothesis.

To solve the learning problem as described, you must ask: what is the prior probability that a given rule/hypothesis will correctly predict income? It is trivially true that the OR-hypothesis selects a larger set of people, but there is no reason to believe it will predict income more accurately.

Since you don't buy the KC idea, do you also refuse to accept the more general idea of capacity control/regularization/MDL as a (the) way to prevent overfitting and achieve generalization? In the standard setting of the learning problem, it seems inevitable that some method of penalizing complexity is necessary for generalization.
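The two-part MDL idea invoked here can be made concrete with a toy code-length calculation (a sketch; the specific bit costs and error counts below are illustrative assumptions, not from the thread). A hypothesis is charged for its own description plus the cost of encoding where it mispredicts the training labels, so a lookup table that fits the training data exactly can still lose to a short rule that makes a few errors:

```python
import math

def mdl_score(complexity_bits, train_errors, n):
    """Two-part code length: L(H) + L(D | H).
    Each training error is encoded by pointing at the mispredicted
    example (~log2(n) bits) plus one bit for the corrected label."""
    return complexity_bits + train_errors * (math.log2(n) + 1)

n = 1000  # training examples

# Hypothesis A: a short rule (say, "income > 50k iff married AND
# college-degree"), describable in ~10 bits, mislabeling 30 of 1000 examples.
simple = mdl_score(complexity_bits=10, train_errors=30, n=n)

# Hypothesis B: a lookup table memorizing all 1000 labels -- zero training
# errors, but its description is as long as the data itself (~1000 bits).
memorizer = mdl_score(complexity_bits=1000, train_errors=0, n=n)

# The complexity penalty does what regularization is supposed to do:
# the simple, slightly-wrong rule achieves the shorter total code.
assert simple < memorizer
```

Under this scoring, fitting the training set perfectly buys nothing if the hypothesis description grows as fast as the data it explains, which is one way to state why some complexity penalty seems unavoidable for generalization.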

Comment author: cousin_it 30 August 2010 11:46:00AM 0 points

I thought about it some more and it seems you're right. In learning problems we need some weighting of hypotheses to prevent overfitting; description length has no obvious downsides, so we can just use it and be happy. Now I just need to shoehorn Islam into the "learning problem" framework, to understand why our prior for it should be low...

Comment author: Vladimir_Nesov 30 August 2010 02:44:48PM 0 points

This isn't about prior though.