
RichardKennaway comments on The trouble with Bayes (draft) - Less Wrong Discussion

10 Post author: snarles 19 October 2015 08:50PM


Comments (58)


Comment author: snarles 19 October 2015 10:43:23PM * 2 points

I do not need to model the process f by which that population was selected, only the behaviour of Y within that population?

There are some (including myself and presumably some others on this board) who see this practice as epistemologically dubious. First, how do you decide which aspects of the problem to incorporate into your model? Why should one only try to model E[Y|f(X)=1] and not the underlying function g(x)=E[Y|x]? If you actually had very strong prior information about g(x), say "I know g(x)=h(x) with probability 1/2 or g(x)=j(x) with probability 1/2", where h(x) and j(x) are known functions, then most statisticians would incorporate g(x) in the model, and data for observations with f(X)=0 might then be informative about whether g(x)=h(x) or g(x)=j(x). So if the prior is weak (as it is in my main post) you don't model the function, and if the prior is strong you model the function (and therefore make use of all the observations)? Where do you draw the line?

I agree, most statisticians would not model g(x) in the cancer example. But is that because they have limited time and resources (and are possibly lazy), and because an overcomplicated model would confuse their audience anyway? Or because they legitimately think that it's an objective mistake to use a model involving g(x)?

Comment author: RichardKennaway 19 October 2015 11:21:53PM * 2 points

Why should one only try to model E[Y|f(X)=1] and not the underlying function g(x)=E[Y|x]?

What would it tell you if you could? The problem is to estimate Y for a certain population. Therefore, look at that population. I am not seeing a reason why one would consider modelling g, so I am at a loss to answer the question, why not model g?

Jaynes and a few others generally write things like E[ Y | I ] or P( Y | I ) where I represents "all of your background knowledge", not further analysed. f(X)=1 is playing the role of I here. It's a placeholder for the stuff we aren't modelling and within which the statistical reasoning takes place.

Suppose f were a very simple function, for example the identity. You are asked to estimate E[ Y | X=1 ]. What do the Bayesian and the frequentist do in this case? They are still only being asked about the population for which X=1. Can either of them get better information about E[ Y | X=1 ] by looking (also) at samples where X is not 1?

The example is a simplification of Wasserman's; I'm not sure if a similar answer can be made there.

BTW, I'm not a statistician, and these aren't rhetorical questions.

ETA: Here's an even simpler example, in which it might be possible to demonstrate mathematically the answer to the question, can better information be obtained about E[ Y | X=1 ] by looking at members of the population where X is not 1? Suppose it is given that X and Y have a bivariate normal distribution, with unknown parameters. You take a sample of 1000, and are given a choice of taking it either from the whole population, or from that sliver for which X is in some range 1 +/- ε for ε very small compared with the standard deviation of X. You then use whatever tools you prefer to estimate E[ Y | X=1 ]. Which method of sampling will allow a better estimate?

ETA2: Here is my own answer to my last question, after looking up some formulas concerning linear regression. Let Y1 be the mean of Y in a sample drawn from a narrow neighbourhood of X=1, and let Y2 be the estimate of E[ Y | X=1 ] obtained by doing linear regression on a sample drawn from the whole population. Both samples have the same size n, assumed large enough to ignore small-sample corrections. Then the ratio of the standard error of Y2 to that of Y1 is sqrt( 1 + k^2 ), where k is the difference between 1 and E[X], in units of the standard deviation of X. So at least for this toy example, a narrow sample always works at least as well as a broad one, and is almost always better. Is this a general fact, or are there equally simple examples where the opposite is found?
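For anyone who wants to check the sqrt( 1 + k^2 ) figure numerically, here is a minimal simulation sketch, not a proof. It relies on the standard large-sample result that the variance of a regression prediction at x0 is s^2 ( 1/n + (x0 - mean(X))^2 / Sxx ), where s^2 is the residual variance, against s^2 / n for the narrow-sample mean. The parameter values (intercept 0.5, slope 2, residual SD 1, E[X]=0, SD[X]=1, n=200, 2000 replications) and all function names are my own illustrative assumptions; with those values k=1, so the ratio should come out near sqrt(2).

```python
import math
import random
import statistics

def narrow_estimate(rng, n, a, b, noise_sd, x0=1.0, eps=0.01):
    """Y1: mean of Y over a sample whose X lies in x0 +/- eps."""
    ys = []
    for _ in range(n):
        x = rng.uniform(x0 - eps, x0 + eps)
        ys.append(a + b * x + rng.gauss(0.0, noise_sd))
    return statistics.fmean(ys)

def broad_estimate(rng, n, a, b, noise_sd, mu_x, sd_x, x0=1.0):
    """Y2: least-squares prediction at x0 from a whole-population sample."""
    xs = [rng.gauss(mu_x, sd_x) for _ in range(n)]
    ys = [a + b * x + rng.gauss(0.0, noise_sd) for x in xs]
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar + slope * (x0 - x_bar)

def standard_error_ratio(mu_x=0.0, sd_x=1.0, n=200, reps=2000, seed=0):
    """Empirical SE(Y2) / SE(Y1) over repeated samples."""
    rng = random.Random(seed)
    a, b, noise_sd = 0.5, 2.0, 1.0  # illustrative values, not from the thread
    narrow = [narrow_estimate(rng, n, a, b, noise_sd) for _ in range(reps)]
    broad = [broad_estimate(rng, n, a, b, noise_sd, mu_x, sd_x)
             for _ in range(reps)]
    return statistics.stdev(broad) / statistics.stdev(narrow)

ratio = standard_error_ratio()
k = (1.0 - 0.0) / 1.0  # (1 - E[X]) / SD[X] for the defaults above
print("simulated ratio:", ratio)
print("sqrt(1 + k^2): ", math.sqrt(1 + k ** 2))
```

Changing mu_x or sd_x changes k; setting mu_x=1.0 gives k=0, in which case the two methods should come out about equally good.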

ETA3: I might have such an example. Suppose that Y given X is distributed as a + bX + ε(X), where ε(X) is a random variable whose mean is always zero but whose variance is high in the neighbourhood of X=1 and low elsewhere. Then a linear regression on a sample from the full population may allow a better estimate of E[ Y | X=1 ] than a sample from the neighbourhood of X=1. A sample that avoids that region may do better still. Intuitively, if there's a lot of noise where you want to look, extrapolate from where there's less noise.
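A rough simulation sketch of this example follows. Everything specific in it is an assumption of mine for illustration: X is standard normal, the noise SD is 5 within 0.1 of X=1 and 0.2 elsewhere, the line is 0.5 + 2X, and each sample has n=200. It compares the empirical standard error of three estimates of E[ Y | X=1 ]: the mean of a narrow sample, a regression on the whole population, and a regression on a sample that skips the noisy region.

```python
import random
import statistics

X0, HALF_WIDTH = 1.0, 0.1  # the noisy region is |X - 1| < 0.1 (assumed)

def noise_sd(x):
    """Noise SD: large near X = 1, small elsewhere (assumed values)."""
    return 5.0 if abs(x - X0) < HALF_WIDTH else 0.2

def draw_y(rng, x, a=0.5, b=2.0):
    """One draw of Y = a + b*x + eps(x), with eps(x) of mean zero."""
    return a + b * x + rng.gauss(0.0, noise_sd(x))

def ols_predict(xs, ys, x0):
    """Ordinary least-squares fit of ys on xs, evaluated at x0."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return y_bar + (sxy / sxx) * (x0 - x_bar)

def narrow_estimate(rng, n):
    """Mean of Y over a sample drawn only from the noisy region."""
    xs = [rng.uniform(X0 - HALF_WIDTH, X0 + HALF_WIDTH) for _ in range(n)]
    return statistics.fmean([draw_y(rng, x) for x in xs])

def broad_estimate(rng, n, avoid_noisy_region=False):
    """Regression estimate; optionally reject points in the noisy region."""
    xs = []
    while len(xs) < n:
        x = rng.gauss(0.0, 1.0)
        if avoid_noisy_region and abs(x - X0) < HALF_WIDTH:
            continue
        xs.append(x)
    ys = [draw_y(rng, x) for x in xs]
    return ols_predict(xs, ys, X0)

def empirical_se(estimator, reps=1000):
    """Standard deviation of an estimator over repeated samples."""
    return statistics.stdev([estimator() for _ in range(reps)])

rng = random.Random(1)
n = 200
se_narrow = empirical_se(lambda: narrow_estimate(rng, n))
se_broad = empirical_se(lambda: broad_estimate(rng, n))
se_avoid = empirical_se(lambda: broad_estimate(rng, n, avoid_noisy_region=True))
print("narrow sample mean:       ", se_narrow)
print("regression, whole sample: ", se_broad)
print("regression, region avoided:", se_avoid)
```

The ordering depends on the linear model actually holding outside the neighbourhood of X=1. If the true regression function were curved, the two regression estimates would be extrapolating and could be biased, which this sketch does not explore.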

But it's not clear to me that this bears on the Bayesian vs. frequentist matter. Both of them are faced with the decision to take a wide sample or a narrow one. The frequentist can't insist that the Bayesian takes notice of structure in the problem that the frequentist chooses to ignore.