OP will correct me if I am wrong, but I think he is trying to restate the Robins/Wasserman example. You do not need to model f(X), but the point of that example is that you know f, but the conditional model for Y is very very complicated. So you either do a Bayesian approach with a prior and a likelihood for Y, or you just use Horvitz-Thompson with f.
I like to think of that example in causal inference terms: you want to estimate the causal effect p(Y | do(A)) of A on Y when the policy for assigning treatment A, p(A | C), is known exactly, but p(Y | A, C) is super complex. Likelihood-based methods, Bayesian ones included, will use \sum_C p(Y | A, C) p(C). But you can just look at (1/n) \sum_{samples i with A_i = a} Y_i / p(A_i | C_i) to target the same quantity (here, its mean E[Y | do(A = a)]) and avoid modeling p(Y | A, C). But doing that isn't Bayesian.
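To make the inverse-weighting idea concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration, not something taken from the Robins/Wasserman example: the uniform covariate C, the policy p(A = 1 | C) = 0.2 + 0.6 C, and the wiggly outcome model are all hypothetical. The only thing the estimator uses is the known assignment probability; it never fits p(Y | A, C).

```python
import math
import random

def simulate(n, seed=0):
    """Draw (C, A, Y, p(A=1|C)) tuples from a hypothetical data-generating process."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c = rng.random()                     # covariate C ~ Uniform(0, 1) (assumed)
        p_treat = 0.2 + 0.6 * c              # known assignment policy p(A=1 | C) (assumed)
        a = 1 if rng.random() < p_treat else 0
        # Hypothetical, deliberately awkward outcome model; the estimator never sees it.
        y = math.sin(8 * c) + 2.0 * a * c + rng.gauss(0, 0.1)
        rows.append((c, a, y, p_treat))
    return rows

def ipw_mean(rows, a_target=1):
    """Horvitz-Thompson style estimate of E[Y | do(A = a_target)]."""
    total = 0.0
    for _c, a, y, p_treat in rows:
        if a == a_target:
            p = p_treat if a_target == 1 else 1.0 - p_treat
            total += y / p                   # weight by 1 / p(A_i | C_i)
    return total / len(rows)

def naive_mean(rows, a_target=1):
    """Plain average of Y among units with A = a_target (confounded by C)."""
    ys = [y for _c, a, y, _p in rows if a == a_target]
    return sum(ys) / len(ys)

rows = simulate(100_000)
ipw_estimate = ipw_mean(rows, 1)
naive_estimate = naive_mean(rows, 1)
# Under the assumed model: E[Y | do(A=1)] = E[sin(8C)] + 2 E[C] = (1 - cos 8)/8 + 1
true_value = (1 - math.cos(8)) / 8 + 1.0
print(ipw_estimate, naive_estimate, true_value)
```

In this setup the weighted average should land close to the true interventional mean, while the unweighted average among the treated is biased, because units with large C are both more likely to be treated and have larger outcomes.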
See also this:
http://www.biostat.harvard.edu/robins/coda.pdf
I think we talked about this before.
I think these are isomorphic: estimating E[Y] when Y is missing at random conditional on C is the same as estimating E[Y | do(a)] = E[Y | "we assign you to a given C"].
"Causal inference is a missing data problem, and missing data is a causal inference problem."
Or I may be "missing" something. :)
Yes, I think you are missing something (although it is true that causal inference is a missing data problem).
It may be easier to think in terms of the potential outcomes model. Y0 is the outcome under no treatment and Y1 is the outcome under treatment; you only ever observe either Y0 or Y1, depending on whether D=0 or 1. Generally you are trying to estimate E[Y1] or E[Y0] or their difference.
The point is that the quantity Robins and Wasserman are trying to estimate, E[Y], does not depend on the importance sampling distribution, whereas the quantity I am trying to estimate, E[Y|f(X)], does depend on f. Changing f changes the population quantity to be estimated.
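A small simulation can illustrate the distinction. This is only a sketch under one possible reading of the setup: the population model, the two sampling designs, and the choice of f as a threshold indicator on X (so that E[Y | f(X) = 1] is the quantity of interest) are all hypothetical and are not taken from the comment.

```python
import random

def make_population(n, seed=1):
    """Hypothetical population: X ~ Uniform(0, 1), Y = X^2 + small noise."""
    rng = random.Random(seed)
    pop = []
    for _ in range(n):
        x = rng.random()
        y = x * x + rng.gauss(0, 0.05)
        pop.append((x, y))
    return pop

def ht_mean(pop, pi, seed):
    """Horvitz-Thompson estimate of E[Y] when unit i is observed with known prob pi(x_i)."""
    rng = random.Random(seed)
    total = 0.0
    for x, y in pop:
        if rng.random() < pi(x):      # unit is sampled under this design
            total += y / pi(x)        # reweight by the known inclusion probability
    return total / len(pop)

def conditional_mean(pop, f):
    """Average of Y over units with f(X) true, i.e. an estimate of E[Y | f(X) = 1]."""
    selected = [y for x, y in pop if f(x)]
    return sum(selected) / len(selected)

pop = make_population(100_000)

# Two different (assumed) sampling designs, same target E[Y] = 1/3.
ht_design_1 = ht_mean(pop, lambda x: 0.3 + 0.5 * x, seed=2)
ht_design_2 = ht_mean(pop, lambda x: 0.9 - 0.6 * x, seed=3)

# Two different (assumed) choices of f, two different targets.
cond_f1 = conditional_mean(pop, lambda x: x > 0.5)
cond_f2 = conditional_mean(pop, lambda x: x > 0.8)

print(ht_design_1, ht_design_2, cond_f1, cond_f2)
```

Under these assumptions both designs recover roughly the same E[Y], since the sampling distribution is only a tool for estimation. The two conditional means differ from each other, because changing f changes the estimand itself.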
It is true that sometimes people in causal inference are interested in estimating things like E[Y1 - Y0|D], e.g. "the treatment effect on the treated." However, this is still different from my setup because D is a random variable, as opposed to an arbitrary function of the known variables like f(X).