You're looking at Less Wrong's discussion board. This includes all posts, including those that haven't been promoted to the front page yet. For more information, see About Less Wrong.

snarles comments on The trouble with Bayes (draft) - Less Wrong Discussion

10 Post author: snarles 19 October 2015 08:50PM

You are viewing a comment permalink. View the original post to see all comments and the full post content.

Comments (58)

You are viewing a single comment's thread. Show more comments above.

Comment author: snarles 20 October 2015 09:36:50PM *  0 points [-]

I didn't reply to your other comment because although you are making valid points, you have veered off-topic since your initial comment. The question of "which observations to make?" is not a question of inference but rather one of experimental design. If you think this question is relevant to the discussion, it means that you neither understand the original post nor my reply to your initial comment. The questions I am asking have to do with what to infer after the observations have already been made.

Comment author: RichardKennaway 23 October 2015 03:54:45PM 1 point [-]

Ok. So the scenario is that you are sampling only from the population f(X)=1. Can you exhibit a simple example of the scenario in the section "A non-parametric Bayesian approach" with an explicit, simple class of functions g and distribution over them, for which the proposed procedure arrives at a better estimate of E[ Y | f(X)=1 ] than the sample average?

Is the idea that it is intended to demonstrate, simply that prior knowledge about the joint distribution of X and Y would, combined with the sample, give a better estimate than the sample alone?

Comment author: snarles 23 October 2015 04:03:13PM *  0 points [-]

Ok. So the scenario is that you are sampling only from the population f(X)=1.

EDIT: Correct, but you should not be too hung up on the issue of conditional sampling. The scenario would not change if we were sampling from the whole population. The important point is that we are trying to estimate a conditional mean of the form E[Y|f(X)=1]. This is a concept commonly seen in statistics. For example, the goal of non-parametric regression is to estimate a curve defined by f(x) = E[Y|X=x].

Can you exhibit a simple example of the scenario in the section "A non-parametric Bayesian approach" with an explicit, simple class of functions g and distribution over them, for which the proposed procedure arrives at a better estimate of E[ Y | f(X)=1 ] than the sample average?

The example I gave in my first reply (where g(x) is known to be either one of two known functions h(x) or j(x)) can easily be extended into the kind of fully specified counterexample you are looking for: I'm not going to bother to do it, because it's very tedious to write out and it's frankly a homework-level problem.

Is the idea that it is intended to demonstrate, simply that prior knowledge about the joint distribution of X and Y would, combined with the sample, give a better estimate than the sample alone?

The fact that prior information can improve your estimate is already well-known to statisticians. But statisticians disagree on whether or not you should try to model your prior information in the form of a Bayesian model. Some Bayesians have expressed the opinion that one should always do so. This post, along with Wasserman/Robbins/Ritov's paper, provides counterexamples where the full non-parametric Bayesian model gives much worse results than the "naive" approach which ignores the prior.

Comment author: RichardKennaway 23 October 2015 06:53:00PM 0 points [-]

The example I gave in my first reply (where g(x) is known to be either one of two known functions h(x) or j(x)) can easily be extended into the kind of fully specified counterexample you are looking for

That looks like a parametric model. There is one parameter, a binary variable that chooses h or j. A belief about that parameter is a probability p that h is the function. Yes, I can see that updating p on sight of the data may give a better estimate of E[Y|f(X)=1], which is known a priori to be either h(1) or j(1).

I expect it would be similar for small numbers of parameters also, such as a linear relationship between X and Y. Using the whole sample should improve on only looking at the subsample around f(X)=1.

However, in the nonparametric case (I think you are arguing) this goes wrong. The sample size is not large enough to estimate a model that gives a narrow estimate of E[Y|f(X)=1]. Am I understanding you yet?

It seems to me that the problem arises even before getting to the nonparametric case. If a parametric model has too many parameters to estimate from the sample, and the model predictions are everywhere sensitive to all of the parameters (so it cannot be approximated by any simpler model) then trying to estimate E[Y|f(X)=1] by first fitting the model, then predicting from the model, will also not work.

It so clearly will not work that it must be a wrong thing to do. It is not yet clear to me that a Bayesian statistician must do it anyway. The set {Y|f(X)=1} conveys information about E[Y|f(X)=1] directly, independently of the true model (assumed for the purpose of this discussion to be within the model space being considered). Estimating it via fitting a model ignores that information. Is there no Bayesian method of using it?

A partial answer to your question:

So if the prior is weak (as it is in my main post) you don't model the function, and if the prior is strong, you model the function (and therefore make use of all the observations)? Where do you draw the line?

would be that the less the model helps, the less attention you pay it relative to calculating Mean{Y|f(X)=1}. I don't have a mathematical formulation of how to do that though.