
SilentCal comments on The trouble with Bayes (draft) - Less Wrong Discussion

10 Post author: snarles 19 October 2015 08:50PM


Comments (58)


Comment author: SilentCal 23 October 2015 05:24:38PM 2 points

Accept that the philosophically ideal thing is unattainable in this case, and do the Frequentist thing or the pragmatic-Bayesian thing.

What I actually disagree with in the post is that it seems to be making a philosophical point based on the assumption that the uniform distribution over smooth functions is better subjective Bayesianism than the pragmatic approach. I dispute that premise.

On reflection, I think the point here has to do with logical uncertainty. The argument is that the uniform distribution is 'purer' because it's something that we're more likely to choose before seeing the problem and we should be able to choose our prior before seeing the problem. But this post is a thought experiment, not a real experiment--the only knowledge it gives us is logical knowledge. I think you should be able to update your estimated priors based on new logical knowledge.

Comment author: IlyaShpitser 23 October 2015 06:34:34PM *  2 points

philosophically ideal thing is unattainable in this case

Slightly confused here. Rationality is defined as winning, yes? If your "ideal thing" is not winning it's not rational, and should be dropped like a hot potato. In fact, if it's losing in what sense is it "ideal?"

Posteriors, etc. are tools, that's all.


I think the Robins/Wasserman example is about the interplay between structural assumptions about how the data came to be and statistical inference from this data (specifically, it's about where information lives). In particular, it is about how the classical Bayesian setup tacitly makes certain structural assumptions that lead to all information living in the likelihood function. These assumptions do not hold in the Robins/Wasserman case: most of the information lives in the assignment probability (which is outside the likelihood).

This is similar to how classification problems in machine learning cannot be solved by standard methods if certain tacit assumptions (training and test data are from the same distribution) fail to hold. In that case you need to use not only standard machine learning insights about what makes a good classifier, but also additional insights that correct for the structural differences in the training and test data properly.

Comment author: SilentCal 26 October 2015 10:41:31PM 0 points

In particular, about how the classical Bayesian setup in fact tacitly assumes certain structural assumptions that lead to all information living in the likelihood function. In fact these assumptions do not hold in the Robins/Wasserman case, most of the information lives in the assignment probability (which is outside the likelihood).

I'm having trouble following this (I'm not actually that versed in statistics, and I don't know what you mean by 'assignment probability'). But it seems to me that we only think Horvitz-Thompson is a good answer because of tacit assumptions we hold about the data.

Comment author: IlyaShpitser 26 October 2015 11:08:36PM *  0 points

We have X, let's say baseline facts about a person (X are features we would use to build a classifier in machine learning). We have a probability of a binary event A, conditional on X: p(A | X). If A is 1, we don't see the value of Y. If A is 0, we see the value of Y. p(A=0 | X) is what I call the "assignment probability" and p(A | X) is what the OP calls the "importance sampling distribution." It is also sometimes called "the propensity score."

And yes you are right, Horvitz-Thompson only comes into play because somehow p(A=0 | X) played a very important role in determining the data on X,Y we actually see. But if we were to write the likelihood function for X,Y, the probability p(A | X) would not appear in this function. So any method that just uses the likelihood function will ignore p(A | X). What saves Bayesians is their ability to insert p(A | X) into the prior (they have nowhere else to put it).
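
For concreteness, here is a sketch of the estimator under discussion, written in the notation above. It assumes n i.i.d. draws (X_i, A_i, Y_i), that Y_i is observed exactly when A_i = 0, that A is independent of Y given X, and that p(A=0 | X) > 0. These assumptions are spelled out for the sketch and are not stated in this form in the thread; the symbol \hat{\psi}_{HT} is likewise introduced here for illustration.

```latex
% Horvitz-Thompson estimate of E[Y]: weight each observed Y_i
% by the inverse of its probability of being observed.
\hat{\psi}_{HT} = \frac{1}{n} \sum_{i=1}^{n} \frac{\mathbf{1}\{A_i = 0\}\, Y_i}{p(A = 0 \mid X_i)}

% Unbiasedness under the stated assumptions (A independent of Y given X):
\mathbb{E}\!\left[ \frac{\mathbf{1}\{A = 0\}\, Y}{p(A = 0 \mid X)} \right]
  = \mathbb{E}\!\left[ \frac{\mathbb{E}[\mathbf{1}\{A = 0\} \mid X]\; \mathbb{E}[Y \mid X]}{p(A = 0 \mid X)} \right]
  = \mathbb{E}\big[ \mathbb{E}[Y \mid X] \big]
  = \mathbb{E}[Y]
```

Note that the weights use only p(A=0 | X), which is exactly the quantity that does not appear in the likelihood for X,Y.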

Comment author: SilentCal 27 October 2015 09:14:28PM 0 points

Ah, R&W's pi function.

This is kind of tricky, because it doesn't seem like it should hold information, unless it correlates with R&W's theta (probability of Y = 1).

If pi and theta were guaranteed independent, would Horvitz-Thompson in any meaningful way outperform Sum(Y) / Sum(R), that is, the average observed value of Y in cases where Y is observed?

Comment author: IlyaShpitser 27 October 2015 09:25:08PM 1 point

The reason p(A | X) holds info is that it determines which Ys we see. Say for a moment A was independent of X, so we saw Y if a fair coin came up heads (p(A = 0) = 0.5). Then the Ys we see have the same distribution as the Ys we don't see, because the coin doesn't look at anything about Y to determine whether to come up heads.

But if the coin depends on X, the worry is that the Ys we see may come with particular Xs and not others. So if we just ignore the Ys we don't see, the Ys we actually see, selected according to p(A | X), will give us a biased view of the underlying Y.

Somehow, to correctly deal with this bias, we must involve p(A|X) (explicitly or implicitly).
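
A small simulation can make the fair-coin versus X-dependent-coin contrast concrete. This is only a sketch under made-up assumptions: X is uniform on [0, 1], P(Y=1 | X) = 0.1 + 0.8 X (so the true mean of Y is 0.5), and the X-dependent coin uses P(A=0 | X) = 0.1 + 0.8 X. The function name `simulate` and all of these numbers are hypothetical choices for illustration, not anything taken from Robins/Wasserman or the original post.

```python
import numpy as np


def simulate(n, depends_on_x, seed=0):
    """Return (full-data mean of Y, naive observed mean, Horvitz-Thompson estimate)."""
    rng = np.random.default_rng(seed)

    # Hypothetical data-generating process (assumption for this sketch).
    x = rng.uniform(0.0, 1.0, size=n)
    theta = 0.1 + 0.8 * x              # P(Y = 1 | X); E[Y] = 0.5
    y = rng.binomial(1, theta)

    # P(A = 0 | X): the probability that Y is observed.
    if depends_on_x:
        p_obs = 0.1 + 0.8 * x          # coin depends on X
    else:
        p_obs = np.full(n, 0.5)        # fair coin, independent of X

    observed = rng.random(n) < p_obs   # True where A = 0 (Y is seen)

    # Naive estimate: average of the Ys we happen to see.
    naive = y[observed].mean()

    # Horvitz-Thompson: inverse-probability-weighted sum over observed Ys,
    # divided by the full sample size n.
    ht = np.sum(y[observed] / p_obs[observed]) / n

    return y.mean(), naive, ht


if __name__ == "__main__":
    for depends in (False, True):
        truth, naive, ht = simulate(200_000, depends_on_x=depends)
        print(f"coin depends on X: {depends}")
        print(f"  full-data mean of Y: {truth:.3f}")
        print(f"  naive observed mean: {naive:.3f}")
        print(f"  Horvitz-Thompson:    {ht:.3f}")
```

With the fair coin, both estimates should land near 0.5. With the X-dependent coin in this particular setup, the naive mean is pulled upward because high-X (and hence high-Y) units are seen more often, while the weighted estimate should stay near 0.5.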

Comment author: SilentCal 27 October 2015 09:37:08PM 0 points

Sure. But if we know or suspect any correlation between A and Y, there's nothing strange about the common information between them being expressed in the prior, right?

Granted, H-T will have nice worst-case performance if we're not confident about A and Y being independent, but that reduces to this debate http://lesswrong.com/lw/k9c/can_noise_have_power/.

Comment author: jsteinhardt 29 October 2015 04:08:14AM 2 points

I wrote up a pretty detailed reply to Luke's question: http://lesswrong.com/lw/kd4/the_power_of_noise/