Logistic regression quite explicitly computes the maximum-likelihood estimate of a parameter vector.
So it explicitly considers only P(data|model) and doesn't work with a nontrivial prior P(model), and it's widely used.
Suppose that P(model) differs significantly across the relevant models. Do you think that in this case maximizing P(model)*P(data|model), which is proportional to P(model|data), would be worse?
Well, there are a couple of issues here. First, log P(data|model) is a concave function of the parameters for logistic regression, so unless log P(model) is also concave, the maximization may not reach the global optimum.
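To make the concavity point concrete, here is a rough sketch rather than anything definitive. It assumes a Gaussian prior on the parameters (my choice for illustration, with a made-up scale tau), synthetic data, and hypothetical helper names. Because the Gaussian log-prior is concave, adding it to the log-likelihood keeps the objective concave, so plain gradient ascent should reach the global maximum in both the maximum-likelihood and the P(model)*P(data|model) case.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # log P(data | theta) for labels y in {0, 1}; concave in theta.
    z = X @ theta
    return np.sum(y * z - np.logaddexp(0.0, z))

def log_prior(theta, tau):
    # Assumed prior for illustration: theta ~ N(0, tau^2 I),
    # written up to an additive constant. Concave in theta.
    return -0.5 * np.sum(theta ** 2) / tau ** 2

def fit(X, y, tau=None, lr=0.1, steps=5000):
    # Gradient ascent on log P(data|theta), plus log P(theta) if tau is given.
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(steps):
        grad = X.T @ (y - sigmoid(X @ theta))
        if tau is not None:
            grad = grad - theta / tau ** 2
        theta = theta + lr * grad / n
    return theta

# Synthetic data (an assumption, not from the thread).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.5, -2.0, 0.5])
y = (rng.random(200) < sigmoid(X @ true_theta)).astype(float)

tau = 0.5
theta_mle = fit(X, y)           # maximizes P(data|model) only
theta_map = fit(X, y, tau=tau)  # maximizes P(model)*P(data|model)

print("MLE:", theta_mle)
print("MAP:", theta_map)
```

With a prior whose log is not concave, the same loop could stall in a local maximum, which is the concern raised above.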
Secondly, the proper Bayesian thing to do would be to sample from the posterior, not maximize it. For instance, in logistic regression the model is given by a vector of parameters, theta. Suppose that we actually believed that the prior on theta was proportional to exp(-|theta|), where |theta| is the sum of the absolute values of the coordinates of theta. Then maximizin...
Question in title.
This is obviously subjective, but I figure there ought to be some "go-to" paper. Maybe I've even seen it once, but can't find it now and I don't know if there's anything better.
Links to multiple papers with different focus would be welcome. For my current purpose I have a preference for one that aims low and isn't too long.