I agree with Radford Neal: model averaging and Bayes factors are very sensitive to the prior specification of the models. If you absolutely have to do model averaging, methods such as PSIS-LOO or WAIC that focus on the predictive distribution are much better. If you had two identical models where one simply had a 10 times broader uniform prior, then their posterior predictive distributions would be identical but their Bayes factor would be 1/10, so a model average (assuming a uniform prior on p(M_i)) would favor the narrow prior by a factor of 10, whereas the predictive approach would correctly conclude that they describe the data equally well and thus weight the models equally.
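To make this concrete, here is a minimal numerical sketch (a toy example of my own, not from the original post): a normal model with known sigma and a uniform prior on the mean, fit once with a narrow prior and once with a prior 10 times broader. The Bayes factor comes out near 1/10 while the posterior predictive densities are essentially identical.

```python
# Toy illustration: identical likelihoods, uniform priors of width w and 10w.
import numpy as np
from scipy import stats
from scipy.integrate import quad

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=20)   # data with known sigma = 1
y_new = 0.5                                    # held-out point to predict
mu_hat = y.mean()
ll_max = stats.norm(mu_hat, 1.0).logpdf(y).sum()

def rel_lik(mu):
    """Likelihood of y as a function of mu, rescaled so its maximum is ~1;
    the rescaling constant cancels in Bayes factors and normalized predictives."""
    return np.exp(stats.norm(mu, 1.0).logpdf(y).sum() - ll_max)

def marginal_likelihood(half_width):
    """p(y | M), up to a common constant, under mu ~ Uniform(-half_width, half_width)."""
    val, _ = quad(lambda mu: rel_lik(mu) / (2 * half_width),
                  -half_width, half_width, points=[mu_hat])
    return val

def posterior_predictive(half_width, y_tilde):
    """p(y_tilde | y, M) = integral of N(y_tilde | mu, 1) against the posterior of mu."""
    num, _ = quad(lambda mu: stats.norm(mu, 1.0).pdf(y_tilde) * rel_lik(mu),
                  -half_width, half_width, points=[mu_hat])
    den, _ = quad(rel_lik, -half_width, half_width, points=[mu_hat])
    return num / den

narrow, broad = 5.0, 50.0                      # the broad prior is 10x wider
print("Bayes factor (broad vs narrow):",
      marginal_likelihood(broad) / marginal_likelihood(narrow))   # ~0.1
print("predictive density, narrow prior:", posterior_predictive(narrow, y_new))
print("predictive density, broad prior: ", posterior_predictive(broad, y_new))
```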
Finally, model averaging is usually conceptually wrong and can be avoided by building a larger model that encompasses all the potential models, such as a hierarchical model that partially pools between the group-level and subject-level models. Gelman's 8-schools data is a good example: there are 8 schools and two simple models, one with 1 parameter (all schools are the same) and one with 8 (every school is a special snowflake), and then the hierarchical model with 9 parameters, one for each school and one for how much to pool the estimates towards the group mean. Gelman's radon dataset is also good for learning about hierarchical models.
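For reference, here is a minimal sketch of that 8-schools hierarchical model written with PyMC (assumed to be installed); the estimates and standard errors are the standard values from Rubin (1981) / Gelman et al.'s BDA, and this sketch also gives the group-level mean its own prior.

```python
# 8-schools hierarchical model: per-school effects partially pooled toward a group mean.
import numpy as np
import pymc as pm

y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])         # per-school estimates
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 10.])    # their standard errors

with pm.Model() as eight_schools:
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)       # shared group-level mean
    tau = pm.HalfCauchy("tau", beta=5.0)           # the "how much to pool" parameter:
                                                   # tau -> 0 recovers the 1-parameter
                                                   # model, large tau the 8-parameter one
    theta = pm.Normal("theta", mu=mu, sigma=tau, shape=8)    # per-school effects
    pm.Normal("obs", mu=theta, sigma=sigma, observed=y)      # measurement model
    idata = pm.sample(2000, tune=2000, target_accept=0.95)   # NUTS; the centered form
                                                   # may want a non-centered reparameterization
```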
That seems to be a bit of a conundrum: we need $p(M_i \mid D)$ but we can't compute it? If we can't compute $p(M_i \mid D)$, then what hope is there for statistics?
I can't say what O'Hagan had in mind, but the reasons I have to be skeptical of results involving Bayesian model averaging are that model averaging makes sense only if you've been very, very careful in setting up the models, and you've also been very, very careful in specifying the prior distributions these models use for their parameters. For some problems, being very, very careful may be beyond the capacity of human intellect.
Regarding the models: For complex problems, it may be that none of the models you have defined represent the real phenomenon well, even approximately. But the posterior model probabilities used in Bayesian model averaging assume that the true model is among those being considered.
If that's true (and the models have reasonable priors over their parameters), then model averaging - and its limit of model selection, when the posterior probability of one model is close to one - is a sensible thing to do. That's because the true model is always the best one to use, regardless of your purpose in doing inference.
However, if you're actually using a set of models that are all grossly inadequate, then which of these terrible models is best to use (or what weights it's best to average them with) depends on your purpose. For example, with non-linear regression models relating y to x, you might be interested in predicting y at new values of x that are negative, or in predicting y at new values of x that are positive. If you've got the true model, it's good for both positive and negative x. But if all you've got are bad models, it may be that the one that's best for negative x is not the same as the one that's best for positive x. Bayesian model averaging takes no account of your purpose, and so can't possibly do the right thing when none of the models are good.
Regarding priors: The problem is not priors for the models themselves (assuming there aren't huge numbers of them), but rather priors for the parameters within each of the models. (Note that different models may have different sets of parameters, so these priors are not necessarily parallel between models.) Once you have a fairly large amount of data, it's often the case that the exact prior for parameters that you choose isn't crucial for inference within a model - the posterior distribution for parameters may vary little over a wide class of reasonable priors (that aren't highly concentrated in some small region). You can often even get away with using an "improper" prior, such as a uniform distribution over the real numbers (which doesn't actually exist, of course).
But for computing model probabilities for use in Bayesian model averaging, the priors used for the parameters of each model are absolutely crucial. Using an overly-vague prior, in which probability is spread over a wide range of parameter values that mostly don't fit the data very well, will give a lower model probability than if you used a more carefully considered prior that puts less probability on parameters that don't fit the data well (and that weren't really plausible even a priori). Using an improper prior for parameters will generally result in the model probability being zero, since there's zero prior probability for parameters that fit the data.
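To see how strong this effect is, here is a small closed-form sketch (again a toy example of my own, not from the answer): a conjugate normal-normal model in which the within-model posterior for the mean is essentially unchanged once the prior standard deviation tau is moderately large, while the marginal likelihood keeps shrinking roughly like 1/tau, heading to zero in the improper-prior limit.

```python
# Conjugate normal-normal model: posterior is stable in tau, marginal likelihood is not.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sigma, n = 1.0, 20
y = rng.normal(loc=1.0, scale=sigma, size=n)

def posterior_and_evidence(tau):
    """theta ~ N(0, tau^2), y_i ~ N(theta, sigma^2): posterior mean/sd of theta and p(y)."""
    post_var = 1.0 / (n / sigma**2 + 1.0 / tau**2)
    post_mean = post_var * (y.sum() / sigma**2)
    cov = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))   # marginal covariance of y
    evidence = stats.multivariate_normal(mean=np.zeros(n), cov=cov).pdf(y)
    return post_mean, np.sqrt(post_var), evidence

for tau in [2.0, 20.0, 200.0, 2000.0]:
    m, s, z = posterior_and_evidence(tau)
    print(f"tau={tau:7.1f}  posterior mean={m:.3f}  sd={s:.3f}  marginal lik={z:.3e}")
```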
Especially when the parameter space is high-dimensional, it can be quite difficult to fully express your prior knowledge about which parameter values are plausible. With a lot of thought, you maybe can do fairly well. But if you need to think hard, how can you tell whether you thought equally hard for each model? And, just thinking equally hard isn't even really enough - you need to have actually pinned down the prior that really expresses your prior knowledge, for every one of the models. Most people doing Bayesian model averaging haven't done that.
The purpose of having priors is to compensate for a lack of data, so that at least you are closer to the true model a posteriori, and to speed up training, since model averaging would take longer than training a single model. Also, it's not that the true model is within the ensemble of models, but that you know beforehand that getting the true model is rather difficult, whether from lack of data or just the sheer complexity of the true model and the size of its parameter space. If you have enough data, playing around with different priors wouldn't make any meaningful difference. I think when people...
Pages 4-5 of the International Society for Bayesian Analysis Newsletter, Vol. 5 No. 2, contain a satirical interview with Thomas Bayes. In part of the interview, they appear to criticise the idea of model averaging.
What’s going on here? I thought Bayesians liked model averaging because it allows us to marginalise over the unknown model:
$$p(y \mid x, D) = \sum_i p(y \mid x, D, M_i)\, p(M_i \mid D)$$

where $M_i$ represents the $i$-th model and $D$ represents the data.
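For what it's worth, here is a minimal sketch of that formula in action (a toy example of my own, not from the newsletter): two conjugate models for the same data, with $p(M_i \mid D)$ computed from the marginal likelihoods under a uniform model prior, and the model-averaged predictive formed as the weighted mixture of the two per-model predictives.

```python
# Model averaging for two simple conjugate models of data with no covariate x:
# M1 fixes the mean at 0, M2 puts a N(0, 1) prior on it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sigma = 1.0
D = rng.normal(loc=0.7, scale=sigma, size=15)      # observed data
n, ybar = len(D), D.mean()

# Marginal likelihoods p(D | M_i)
evidence_m1 = stats.norm(0.0, sigma).pdf(D).prod()                # M1: mean fixed at 0
cov = sigma**2 * np.eye(n) + 1.0 * np.ones((n, n))                # M2: mean ~ N(0, 1)
evidence_m2 = stats.multivariate_normal(np.zeros(n), cov).pdf(D)

# Model posteriors p(M_i | D) with a uniform model prior p(M_i) = 1/2
post_m1 = evidence_m1 / (evidence_m1 + evidence_m2)
post_m2 = 1.0 - post_m1

# Per-model posterior predictives p(y | D, M_i) for a new observation y
post_var = 1.0 / (n / sigma**2 + 1.0)          # posterior variance of the mean under M2
post_mean = post_var * n * ybar / sigma**2
pred_m1 = stats.norm(0.0, sigma)
pred_m2 = stats.norm(post_mean, np.sqrt(sigma**2 + post_var))

# Model-averaged predictive: p(y | D) = sum_i p(y | D, M_i) p(M_i | D)
y_grid = np.linspace(-3, 4, 5)
averaged = post_m1 * pred_m1.pdf(y_grid) + post_m2 * pred_m2.pdf(y_grid)
print("p(M1|D), p(M2|D):", round(post_m1, 3), round(post_m2, 3))
print("model-averaged predictive at a few y values:", np.round(averaged, 4))
```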