That's an interesting example, thanks for linking it. I read it carefully, and also some of the Robins/Ritov CODA paper:
http://www.biostat.harvard.edu/robins/coda.pdf
and I think I get it. The example is phrased in the language of sampling/missing data, but for those in the audience familiar w/ Pearl, we can rephrase it as a causal inference problem. After all, causal inference is just another type of missing data problem.
We have a treatment A (a drug), and an outcome Y (death). Doctors assign A to some patients, but not others, based on their baseline covariates C. Then some patients die. The resulting data come from an observational study, and we want to infer from them the effect of the drug on survival, which we can obtain from p(Y | do(A=yes)).
We know in this case that p(Y | do(A=yes)) = sum_C p(Y | A=yes,C) p(C) (this is just what "adjusting for confounders" means).
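To make the adjustment formula concrete, here is a minimal sketch with a binary C and made-up numbers (none of them come from the CODA paper, and the function names are mine). It computes the adjusted quantity, the sum over C of p(Y | A=yes,C) p(C), and, for contrast, the unadjusted p(Y | A=yes), which weights by p(C | A=yes) instead of p(C).

```python
# Toy numbers, invented purely for illustration.
p_c = {0: 0.5, 1: 0.5}              # p(C)
p_a_given_c = {0: 0.2, 1: 0.8}      # p(A=yes | C): sicker patients (C=1) get the drug more often
p_y_given_a_c = {0: 0.1, 1: 0.5}    # p(Y=death | A=yes, C)


def adjusted_risk(p_c, p_y_given_a_c):
    """p(Y | do(A=yes)) = sum_C p(Y | A=yes, C) p(C)."""
    return sum(p_y_given_a_c[c] * p_c[c] for c in p_c)


def unadjusted_risk(p_c, p_a_given_c, p_y_given_a_c):
    """p(Y | A=yes) = sum_C p(Y | A=yes, C) p(C | A=yes)."""
    p_a = sum(p_a_given_c[c] * p_c[c] for c in p_c)  # p(A=yes)
    return sum(
        p_y_given_a_c[c] * p_a_given_c[c] * p_c[c] / p_a  # Bayes' rule gives p(C | A=yes)
        for c in p_c
    )


adjusted = adjusted_risk(p_c, p_y_given_a_c)
unadjusted = unadjusted_risk(p_c, p_a_given_c, p_y_given_a_c)
print(f"adjusted: {adjusted:.2f}, unadjusted: {unadjusted:.2f}")
```

With these numbers the adjusted risk is about 0.30 while the unadjusted one is about 0.42, because the patients with C=1 are both more likely to be treated and more likely to die.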
If we then had a parametric model for E[Y | A=yes,C], we could just fit that model and average (this is "likelihood based inference.") Larry and Jamie are worried about the (admittedly adversarial) situation where maybe the relationship between Y and A and C is really complicated, and any specific parametric model we might conceivably use will be wrong, while non-parametric methods may have issues due to the curse of dimensionality in moderate samples. But of course the way we specified the problem, we know p(A | C) exactly, because doctors told us the rule by which they assign treatments.
Something like the Horvitz/Thompson estimator, which uses only this (correct) model, or other estimators that address issues with the H/T estimator by also using the conditional model for Y, may behave better in such settings. But importantly, these methods exploit a part of the model we technically do not need (p(A | C) does not appear anywhere in the above "adjustment for confounders" expression), because in this particular setting it happens to be specified exactly, while the parts of the model we do technically need for likelihood based inference to work are really complicated and hard to get right at moderate samples.
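Here is a rough sketch of the H/T idea on simulated data. Everything in it is an assumption for illustration: a one-dimensional C (the hard case in the example is high-dimensional), a hypothetical assignment rule `propensity`, and a hypothetical outcome model `outcome_prob`. The point is only that the estimator reweights treated patients by the known 1 / p(A=yes | C) and never looks at the outcome model.

```python
import math
import random


def propensity(c):
    """Known assignment rule p(A=yes | C=c); hypothetical, chosen for illustration."""
    return 0.2 + 0.6 * c


def outcome_prob(c):
    """p(Y=1 | A=yes, C=c); used only to simulate data, never by the estimators."""
    return 0.5 + 0.4 * math.sin(2 * math.pi * c)


def simulate(n, rng):
    """Draw (c, a, y) triples; y is only meaningful for treated patients."""
    data = []
    for _ in range(n):
        c = rng.random()                      # C ~ Uniform(0, 1)
        a = rng.random() < propensity(c)      # treatment assigned by the known rule
        y = 1 if (a and rng.random() < outcome_prob(c)) else 0
        data.append((c, a, y))
    return data


def horvitz_thompson(data):
    """Estimate E[Y | do(A=yes)] as (1/n) * sum of A_i * Y_i / p(A=yes | C_i)."""
    return sum(y / propensity(c) for c, a, y in data if a) / len(data)


def naive_treated_mean(data):
    """Unadjusted mean outcome among the treated, for comparison."""
    treated = [y for c, a, y in data if a]
    return sum(treated) / len(treated)


rng = random.Random(0)
data = simulate(20000, rng)
ht = horvitz_thompson(data)
naive = naive_treated_mean(data)
print(f"H/T: {ht:.3f}, naive: {naive:.3f}")
```

The true value here is 0.5, since the sine term averages to zero over C. The H/T estimate should land near it, while the naive mean among the treated is biased downward, because the patients treated most often are the ones with low outcome probability.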
But these kinds of estimators are not Bayesian. Of course, arguably this entire setting is one Bayesians don't worry about (but maybe they should? These settings do come up).
The CODA paper apparently stimulated some subsequent Bayesian activity, e.g.:
http://www.is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/techreport2007_6326[0%5D.pdf
So, things are working as intended :).
I think the claim that any prediction can be interpreted in this minimal and consistent framework, without exceptions whatsoever, is a rather strong one, and I don't think I want to claim much more than that. (I do want to add that if we have a unique framework that is both minimal and complete when it comes to making predictions, then that seems like a very natural choice for Statistics with a capital S.)
I don't think we're going to agree about the importance of computability without more context. I agree that every time I try to build myself a nice Bayesian algorithm I run into the problem of uncomputability, but personally I consider Bayesian statistics to be more of a method of evaluating algorithms than a method for creating them (although Bayesian statistics is by no means limited to this!).
As for your other questions: it is important to note that your issues are issues with Bayesian statistics as much as they are issues with any other form of prediction making. To pick a frequentist algorithm is to pick a prior with a set of hypotheses, i.e. to make Bayes' Theorem computable and provide the unknowns on the r.h.s. above. (As mentioned earlier, you can in theory extract the prior and set of hypotheses from an algorithm by considering which outcome your algorithm would give when it saw a certain set of data, and then inverting Bayes' Theorem to find the unknowns. At least, I think this is possible; it has worked so far.) And indeed picking the prior and set of hypotheses is not an easy task: this is precisely what leads to different competing algorithms in the field of statistics.
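As a toy illustration of the "inverting" idea (a deliberately restricted sketch, not a general recipe): suppose the algorithm estimates a coin's bias, and assume it happens to be a posterior mean under some Beta(a, b) prior, so that it returns (k + a) / (n + a + b) after k heads in n flips. Then two queries are enough to recover a and b. The function names and the Beta assumption are hypothetical.

```python
def laplace_rule(k, n):
    """A 'frequentist-looking' algorithm: add-one smoothing for a coin's bias."""
    return (k + 1) / (n + 2)


def recover_beta_prior(algorithm):
    """Recover (a, b), ASSUMING algorithm(k, n) == (k + a) / (n + a + b)."""
    m0 = algorithm(0, 0)          # equals a / (a + b), the prior mean
    m1 = algorithm(1, 1)          # equals (a + 1) / (a + b + 1)
    s = (1 - m1) / (m1 - m0)      # solves m1 * (s + 1) = m0 * s + 1 for s = a + b
    a = m0 * s
    return a, s - a


a, b = recover_beta_prior(laplace_rule)
print(f"implied prior: Beta({a:.3f}, {b:.3f})")
```

For the add-one rule this recovers the uniform Beta(1, 1) prior. If the algorithm is not of the assumed form, the recovered (a, b) are meaningless.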
Okay, this is the last thing I'll say here until/unless you engage with the Robins and Wasserman post that IlyaShpitser and I have been suggesting you look at. You can indeed pick a prior and hypotheses (and I guess a way to go from posterior to point estimation, e.g., MAP, posterior mean, etc.) so that your Bayesian procedure does the same thing as your non-Bayesian procedure for any realization of the data. The problem is that in the Robins-Ritov example, your prior may need to depend on the data to do this! Mechanically, this is no problem; philosophically, you're updating on the data twice and it's hard to argue that doing this is unproblematic. In other situations, you may need to do other unsavory things with your prior. If the non-Bayesian procedure that works well looks like a Bayesian procedure that makes insane assumptions, why should we look to Bayesianism as a foundation for statistics?
(I may be willing to bite the bullet of poor frequentist performance in some cases for philosophical purity, but I damn well want to make sure I understand what I'm giving up. It is supremely dishonest to pretend there's no trade-off present in this situation. And a Bayes-first education doesn't even give you the concepts to see what you gain and what you lose by being a Bayesian.)