Comment author:gwern
01 June 2012 03:28:25AM
4 points

Fortunately, it is the happy case that, to the best of my knowledge, no experiments thus far contradict Bayesianism, and not for the lack of trying, which is as much proof as physically possible.

Foundational issues like Bayesianism run into the old philosophy of science problems with a vengeance: which part of the total assortment of theory and observation do you choose to throw out? If someone proves a paradox in Bayesianism, do you shrug and start looking at alternatives - or do you 'defy the evidence' and patiently wait for an E.T. Jaynes to come along and explain how the paradox stems from taking an improper limit or failing to take into account prior information etc.?

Comment author:royf
01 June 2012 04:39:03AM
0 points

(I'll adopt the seemingly rationalist trait of never taking questions as rhetorical, though both your questions strongly have that flavor.)

A central part of the modern scientific method is due to Popper, who gave an essentially Bayesian answer to your first question. However, science wouldn't fall apart if it turned out that priors aren't a physical reality. Occam's razor is non-Bayesian, and it alone accounts for a large portion of our scientific intuitions. The bottom line is that the scientific method doesn't have to be itself true in order to be effective in discovering truths and discarding falsehoods.

The concept of "proving a paradox" is unclear to me (almost a paradox in itself...). Paradoxes are mirages. Also, it seems that you have some specific piece of scientific history in mind, but I'm uncertain which.

Luckily, we did have Jaynes and others to promote what I believe to be both a compelling mathematical framework and a physical reality. Before them, well, it would be wishful to think I could hold on to Bayesian ideas in the face of apparent paradoxes. The shoulders of giants etc.

Comment author:aspera
08 October 2012 06:05:09PM
2 points

Occam's Razor is non-Bayesian? Correct me if I'm wrong, but I thought it falls naturally out of Bayesian model comparison, from the normalization factors, or "Occam factors." As I remember, the argument is something like: given two models with independent parameters {A} and {A,B}, P(AB model) \propto P(AB are correct) and P(A model) \propto P(A is correct). Then P(AB model) <= P(A model).

Even if the argument is wrong, I think the result ends up being that more plausible models tend to have fewer independent parameters.
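To make the "Occam factor" idea concrete, here is a minimal sketch using a toy example that is not from the thread: a coin-flip comparison between a model with no free parameter (a fair coin) and a model with one free parameter (unknown bias with a uniform prior). The function names and the choice of models are assumptions made for illustration only. The flexible model has to spread its probability over every possible bias, so it is penalized when the data look unremarkable and rewarded only when the data are strongly lopsided.

```python
from math import factorial

def evidence_fixed(k, n, p=0.5):
    """Probability of one particular sequence with k heads in n flips
    when the bias p is fixed in advance (no free parameters)."""
    return p**k * (1 - p)**(n - k)

def evidence_uniform(k, n):
    """Probability of the same sequence averaged over a uniform prior on p.
    The integral of p^k (1-p)^(n-k) over [0, 1] equals k!(n-k)!/(n+1)!."""
    return factorial(k) * factorial(n - k) / factorial(n + 1)

def bayes_factor(k, n):
    """Evidence ratio: values above 1 favor the simpler (fixed) model."""
    return evidence_fixed(k, n) / evidence_uniform(k, n)

if __name__ == "__main__":
    print(bayes_factor(5, 10))  # balanced data: ratio above 1
    print(bayes_factor(9, 10))  # lopsided data: ratio below 1
```

With 5 heads in 10 flips the simpler model comes out ahead, while with 9 heads in 10 the extra parameter earns its keep. The penalty comes from the averaging over the prior, not from any explicit rule about counting parameters.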

Comment author:royf
08 October 2012 08:33:11PM
1 point

You're not really wrong. The thing is that "Occam's razor" is a conceptual principle, not one mathematically defined law. A certain (subjectively very appealing) formulation of it does follow from Bayesianism.

> P(AB model) \propto P(AB are correct) and P(A model) \propto P(A is correct). Then P(AB model) <= P(A model).

Your math is a bit off, but I understand what you mean. If we have two sets of models, with no prior information to discriminate between their members, then the prior gives less probability to each model in the larger set than in the smaller one.

More generally, if deciding that model 1 is true gives you more information than deciding that model 2 is true, that means that the maximum entropy given model 1 is lower than that given model 2, which in turn means (under the maximum entropy principle) that model 1 was a priori less likely.
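One way to spell this out, as a sketch under simplifying assumptions that are not in the comment itself: suppose there are N possible outcomes, no prior information beyond that (so the maximum-entropy prior is uniform), and each model M_i simply asserts that the outcome lies in a subset S_i. The symbols N, S_i, H_max and I_i are introduced here for illustration.

```latex
\begin{align*}
H_{\max}(M_i) &= \log_2 |S_i|
  && \text{(maximum entropy once } M_i \text{ is assumed)} \\
I_i &= \log_2 N - \log_2 |S_i|
  && \text{(information gained by learning } M_i\text{)} \\
P(M_i) &= \frac{|S_i|}{N} = 2^{-I_i}
  && \text{(prior probability of } M_i\text{)} \\
I_1 > I_2 &\implies H_{\max}(M_1) < H_{\max}(M_2) \implies P(M_1) < P(M_2)
\end{align*}
```

In this restricted setting the three statements (more information gained, lower maximum entropy, lower prior probability) are the same fact written three ways.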

Anyway, this is all beside the discussion that inspired my previous comment. My point was that even without Popper and Jaynes to enlighten us, science was making progress using other methods of rationality, among which are a myriad of non-Bayesian interpretations of Occam's razor.

Comment author:Decius
08 October 2012 08:44:24PM
0 points

How does deciding one model is true give you more information? Did you mean "If a model allows you to make more predictions about future observations, then it is a priori less likely"?

Comment author:royf
08 October 2012 09:43:09PM
0 points

> How does deciding one model is true give you more information?

Let's assume a strong version of Bayesianism, which entails the maximum entropy principle. So our belief is the one that has the maximum entropy, among those consistent with our prior information. If we now add the information that some model is true, this generally invalidates our previous belief, making the new maximum-entropy belief one of lower entropy. The reduction in entropy is the amount of information you gain by learning the model. In a way, this is a cost we pay for "narrowing" our belief.

The upside of it is that it tells us something useful about the future. Of course, not all information regarding the world is relevant for future observations. The part that doesn't help control our anticipation is failing to pay rent, and should be evicted. The part that does inform us about the future may be useful enough to be worth the cost we pay in taking in new information.
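As a rough numerical illustration of the entropy reduction described above, here is a small sketch. It assumes the simplest possible setting, which is not spelled out in the comment: a finite set of outcomes, a maximum-entropy belief that is uniform over whatever outcomes remain allowed, and a "model" that does nothing more than rule some outcomes out. The helper names are hypothetical.

```python
from math import log2

def entropy(belief):
    """Shannon entropy in bits of a belief given as {outcome: probability}."""
    return -sum(p * log2(p) for p in belief.values() if p > 0)

def max_entropy_belief(allowed):
    """With no constraint other than the allowed outcomes,
    the maximum-entropy belief is uniform over them."""
    allowed = list(allowed)
    return {o: 1 / len(allowed) for o in allowed}

def information_gained(all_outcomes, allowed_by_model):
    """Entropy before learning the model minus entropy after, in bits."""
    before = entropy(max_entropy_belief(all_outcomes))
    after = entropy(max_entropy_belief(allowed_by_model))
    return before - after

if __name__ == "__main__":
    outcomes = range(8)
    print(information_gained(outcomes, [0, 1]))        # narrow model
    print(information_gained(outcomes, [0, 1, 2, 3]))  # broader model
```

The narrower model costs more bits to accept, which is the "cost" in the sense used above. Whether that cost is worth paying depends on how much the model helps with future observations, which this sketch does not attempt to measure.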

Comment author:Decius
11 October 2012 04:58:37AM
0 points

At what point does the decision "This is true" diverge from the observation "There is very strong evidence for this", other than in cases where the model is accepted as true despite a lack of strong evidence?

I'm not discussing the case where a model goes from unknown to known. How does deciding to believe a model give you more information than knowing what the model is and the reason for the model? To better model an actual agent, one could replace all of the knowledge about why the model is true with the value of the strength of the supporting knowledge.

How does deciding that things always fall down give you more information than observing things fall down?

I believe the idea was to ask "hypothetically, if I found out that this hypothesis was true, how much new information would that give me?"

You'll have two or more hypotheses, and one of them is the one that would (hypothetically) give you the least amount of new information. That one should be considered the "simplest" hypothesis (assuming a certain definition of "simplest", and a certain definition of "information").

## Comments (77)


I'll expand on all of this in my sequence on reinforcement learning.


Crystal clear. Sorry to distract from the point.