lukeprog comments on Model Combination and Adjustment - LessWrong

49 Post author: lukeprog 17 July 2013 08:31PM




Comment author: lukeprog 14 July 2013 05:33:59AM *  16 points [-]

It's an interesting exercise to look for the Bayes structure in this (and other) advice.

Yup! Practical advice is best when it's backed by deep theories.

Monteith et al. (2011) (linked in the OP) is an interesting read on the subject. They discuss a puzzle: why does the theoretically optimal Bayesian method for dealing with multiple models (that is, Bayesian model averaging) tend to underperform ad-hoc methods (e.g. "bagging" and "boosting") in empirical tests? It turns out that "Bayesian model averaging struggles in practice because it accounts for uncertainty about which model is correct but still operates under the assumption that only one of them is." The solution is simply to modify the Bayesian model averaging process so that it integrates over combinations of models rather than over individual models. (They call this Bayesian model combination, to distinguish it from "normal" Bayesian model averaging.) In their tests, Bayesian model combination beats out bagging, boosting, and "normal" Bayesian model averaging.
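A toy sketch of the difference, in Python (the numbers and the coin setup are my own illustration, not from the paper): two component models each fix a heads-probability, the true process is their 50/50 blend, and we compare a posterior over the two individual models (BMA) with a posterior over a grid of mixture weights (BMC).

```python
import math
import random

random.seed(0)

# Two component models of a coin's heads-probability; the true process
# (p = 0.5) is their 50/50 blend, not either model alone.  All numbers
# here are illustrative choices.
P_TRUE, P1, P2 = 0.5, 0.8, 0.2
flips = [random.random() < P_TRUE for _ in range(1000)]
h = sum(flips)
t = len(flips) - h

def log_lik(p):
    """Log-likelihood of h heads and t tails under heads-probability p."""
    return h * math.log(p) + t * math.log(1 - p)

def posterior(log_liks):
    """Normalize log-likelihoods (uniform prior) via log-sum-exp."""
    m = max(log_liks)
    ws = [math.exp(ll - m) for ll in log_liks]
    z = sum(ws)
    return [w / z for w in ws]

# Bayesian model averaging: posterior over the two individual models.
bma_post = posterior([log_lik(P1), log_lik(P2)])
bma_pred = bma_post[0] * P1 + bma_post[1] * P2

# Bayesian model combination: posterior over a grid of mixture weights w,
# where each weight defines a blended predictor w*P1 + (1-w)*P2.
grid = [i / 10 for i in range(11)]
preds = [w * P1 + (1 - w) * P2 for w in grid]
bmc_post = posterior([log_lik(p) for p in preds])
bmc_pred = sum(q * p for q, p in zip(bmc_post, preds))

print(f"heads frequency: {h / len(flips):.3f}")
print(f"BMA prediction:  {bma_pred:.3f}")  # often pulled toward one component
print(f"BMC prediction:  {bmc_pred:.3f}")  # stays near the true 0.5
```

Because BMA's posterior odds between the two models grow like 4^(heads − tails), a random fluctuation can make it pile nearly all its weight on one corner; BMC's hypothesis space contains the true 50/50 blend, so its posterior concentrates there instead.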

Comment author: [deleted] 14 July 2013 03:54:55PM 11 points [-]

Bayesian model averaging struggles in practice because it accounts for uncertainty about which model is correct but still operates under the assumption that only one of them is.

Wait, what? That sounds significant. What does more than one model being correct mean?

Speculation before I read the paper:

I guess that's like modelling a process as the superposition of sub-processes? That would give the model more degrees of freedom with which to fit the data. Would we expect that to do strictly better than the mutual exclusion assumption, or does it require more data to overcome the degrees of freedom?

  • If a single theory is correct, the mutex assumption will update toward it faster by giving it a higher prior, and the probability-distribution-over-averages will get there more slowly, but still assigns a substantial prior to theories close to the true one.

  • On the other hand, if a combination is a better model, either because the true process is a superposition, or because we are modelling something outside of our model-space, then a combination will be better able to express it. So the mutex assumption will be forced to put all its weight on a bad nearby theory, effectively updating in the wrong direction, whereas the combination won't lose as much because it contains more accurate models. I wonder if the combination will beat the mutex assumption at every step?

Also interesting to note that the mutex assumption is a subset of the model space of the combination assumption, so if you are unsure which is correct, you can just add more weight to the mutex models in the combination prior and use that.
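That subset observation can be made concrete (my own toy sketch, not from the paper): the single-model hypotheses are the corners of the weight simplex, so a combination prior can embed the mutex assumption by putting extra mass on the corner weights. The posterior then recovers a corner when a single model generated the data, and an interior blend when a mixture did.

```python
import math
import random

random.seed(1)

# Two component heads-probability models; grid of mixture weights w gives
# blended predictors w*P1 + (1-w)*P2.  All numbers are illustrative.
P1, P2 = 0.8, 0.2
grid = [i / 10 for i in range(11)]
preds = [w * P1 + (1 - w) * P2 for w in grid]

# Prior: extra mass on the two single-model corners (the "mutex" models),
# with the rest spread evenly over the interior blends.
CORNER_MASS = 0.4  # illustrative choice
prior = [CORNER_MASS / 2 if w in (0.0, 1.0)
         else (1 - CORNER_MASS) / (len(grid) - 2) for w in grid]

def weight_posterior(p_true, n=5000):
    """Posterior over mixture weights after n flips from heads-prob p_true."""
    flips = [random.random() < p_true for _ in range(n)]
    h = sum(flips)
    t = n - h
    logs = [math.log(pr) + h * math.log(p) + t * math.log(1 - p)
            for pr, p in zip(prior, preds)]
    m = max(logs)
    ws = [math.exp(x - m) for x in logs]
    z = sum(ws)
    return [w / z for w in ws]

# Data from a single true model -> posterior piles onto that corner weight.
post_corner = weight_posterior(0.8)
# Data from the 50/50 blend -> posterior piles onto the interior weight.
post_blend = weight_posterior(0.5)

print("favoured weight (corner data):", grid[post_corner.index(max(post_corner))])
print("favoured weight (blend data): ", grid[post_blend.index(max(post_blend))])
```

So "add more weight to the mutex models in the combination prior" is just a choice of `CORNER_MASS`; nothing about the update rule changes.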

Now I'll read the paper. Let's see how I did.

Comment author: [deleted] 14 July 2013 05:16:52PM *  9 points [-]

Yup. Exactly what I thought.

when the Data Generating Model (DGM) is not one of the component models in the ensemble, BMA tends to converge to the model closest to the DGM rather than to the combination closest to the DGM [9]. He also empirically noted that, in the cases he studied, when the DGM is not one of the component models of an ensemble, there usually existed a combination of models that could more closely replicate the behavior of the DGM than could any individual model on its own.

Versus my

if a combination is a better model, either because the true process is a superposition, or we are modelling something outside of our model-space, then a combination will be better able to express it. So mutex assumption will be forced to put all weight on a bad nearby theory,

Comment author: fractalman 14 July 2013 08:45:19PM 1 point [-]

"What does more than one model being correct mean?"

Maybe something like string theory? The 5 lesser theories look totally different... and then turn out to transform into one another when you fiddle with the coupling constant.

Comment author: [deleted] 14 July 2013 10:56:21PM 2 points [-]

Seeing the words “string” and “fiddle” on top of each other primed me to think of their literal meanings, which I wouldn't otherwise have consciously thought of.

Comment author: CronoDAS 15 July 2013 07:41:25AM 1 point [-]

"Bayesian model averaging struggles in practice because it accounts for uncertainty about which model is correct but still operates under the assumption that only one of them is."

Perhaps they should say "the assumption that exactly one model is perfectly correct"?
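In symbols (my notation, not the paper's), that restatement is the difference between the two predictive distributions:

```latex
% BMA: posterior over which single model M_k is the (one) correct one
p(y \mid D) \;=\; \sum_k p(y \mid M_k, D)\, p(M_k \mid D)

% BMC: posterior over mixture weights w on the simplex, so each
% "hypothesis" is a blend of models rather than a single model
p(y \mid D) \;=\; \int p(y \mid w, D)\, p(w \mid D)\, dw,
\qquad p(y \mid w, D) \;=\; \sum_k w_k\, p(y \mid M_k, D)
```

Both average over hypotheses; the difference is that BMA's hypothesis space contains only the corners (exactly one model perfectly correct), while BMC's contains every convex combination.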