Prerequisites

This post requires some knowledge of Bayesian and Frequentist statistics, as well as probability. It is intended to explain one of the more advanced concepts in statistical theory--Bayesian non-consistency--to non-statisticians, and although the level required is much less than would be required to read some of the original papers on the topic[1], some considerable background is still required.

The Bayesian dream

Bayesian methods are enjoying a well-deserved growth of popularity in the sciences. However, most practitioners of Bayesian inference, including most statisticians, see it as a practical tool. Bayesian inference has many desirable properties for a data analysis procedure: it allows for intuitive treatment of complex statistical models, which include models with non-iid data, random effects, high-dimensional regularization, covariance estimation, outliers, and missing data. Problems which have been the subject of Ph. D. theses and entire careers in the Frequentist school, such as mixture models and the many-armed bandit problem, can be satisfactorily handled by introductory-level Bayesian statistics.

A more extreme point of view, the flavor of subjective Bayes best exemplified by Jaynes' famous book [2], and also by an sizable contingent of philosophers of science, elevates Bayesian reasoning to the methodology for probabilistic reasoning, in every domain, for every problem. One merely needs to encode one's beliefs as a prior distribution, and Bayesian inference will yield the optimal decision or inference.

To a philosophical Bayesian, the epistemological grounding of most statistics (including "pragmatic Bayes") is abysmal. The practice of data analysis is either dictated by arbitrary tradition and protocol on the one hand, or consists of users creatively employing a diverse "toolbox" of methods justified by a diverse mixture of incompatible theoretical principles like the minimax principle, invariance, asymptotics, maximum likelihood or *gasp* "Bayesian optimality." The result: a million possible methods exist for any given problem, and a million interpretations exist for any data set, all depending on how one frames the problem. Given one million different interpretations for the data, which one should *you* believe?

Why the ambiguity? Take the textbook problem of determining whether a coin is fair or weighted, based on the data obtained from, say, flipping it 10 times. Keep in mind, a principled approach to statistics decides the rule for decision-making before you see the data. So, what rule whould you use for your decision? One rule is, "declare it's weighted, if either 10/10 flips are heads or 0/10 flips are heads." Another rule is, "always declare it to be weighted." Or, "always declare it to be fair." All in all, there are 10 possible outcomes (supposing we only care about the total) and therefore there are 2^10 possible decision rules. We can probably rule out most of them as nonsensical, like, "declare it to be weighted if 5/10 are heads, and fair otherwise" since 5/10 seems like the fairest outcome possible. But among the remaining possibilities, there is no obvious way to choose the "best" rule. After all, the performance of the rule, defined as the probability you will make the correct conclusion from the data, depends on the unknown state of the world, i.e. the true probability of flipping heads for that particular the coin.

The Bayesian approach "cuts" the Gordion knot of choosing the best rule, by assuming a prior distribution over the unknown state of the world. Under this prior distribution, one can compute the average perfomance of any decision rule, and choose the best one. For example, suppose your prior is that with probability 99.9999%, the coin is fair. Then the best decision rule would be to "always declare it to be fair!"

The Bayesian approach gives you the optimal decision rule for the problem, as soon as you come up with a model for the data and a prior for your model. But when you are looking at data analysis problems in the real world (as opposed to a probability textbook), the choice of model is rarely unambiguous. Hence, for me, the standard Bayesian approach does not go far enough--if there are a million models you could choose from, you still get a million different conclusions as a Bayesian.

Hence, one could argue that a "pragmatic" Bayesian who thinks up a new model for every problem is just as epistemologically suspect as any Frequentist. Only the strongest form of subjective Bayesianism can one escape this ambiguity. The dream for the subjective Bayesian dream is to start out in life with a single model. A single prior. For the entire world. This "world prior" would contain all the entirety of one's own life experience, and the grand total of human knowledge. Surely, writing out this prior is impossible. But the point is that a true Bayesian must behave (at least approximately) as if they were driven by such a universal prior. In principle, having such an universal prior (at least conceptually) solves the problem of choosing models and priors for problems: the priors and models you choose for particular problems are determined by the posterior of your universal prior. For example, why did you decide on a linear model for your economics data? It's because according to your universal posterior, you particular economic data is well-described by such a model with high-probability.

The main practical consequence of the universal prior is that your inferences in one problem should be consistent which your inferences in another, related problem. Even if the subjective Bayesian never writes out a "grand model", their integrated approach to data analysis for related problems still distinguishes their approach from the piecemeal approach of frequentists, who tend to treat each data analysis problem as if it occurs in an isolated universe. (So I claim, though I cannot point to any real example of such a subjective Bayesian.)

Yet, even if the subjective Bayesian ideal could be realized, many philosophers of science (e.g. Deborah Mayo) would consider it just as ambiguous as non-Bayesian approaches, since even if you have an unambiguous proecdure for forming personal priors, your priors are still going to differ from mine. I don't consider this a defect, since my worldview necessarily does differ from yours. My ultimate goal is to make the best decision for myself. That said, such egocentrism, even if rationally motivated, may indeed be poorly suited for a collaborative enterprise like science.

For me, the most far more troublesome objection to the "Bayesian dream" is the question, "How would actually you go about constructing this prior that represents all of your beliefs?" Looking in the Bayesian literature, one does not find any convincing examples of any user of Bayesian inference managing to actually encode all (or even a tiny portion) of their beliefs in the form of the prior--in fact, for the most part, we see alarmingly little thought or justification being put into the construction of the priors.

Nevertheless, I myself remained one of these "hardcore Bayesians", at least from a philosophical point of view, ever since I started learning about statistics. My faith in the "Bayesian dream" persisted even after spending three years in the Ph. D. program in Stanford (a department with a heavy bias towards Frequentism) and even after I personally started doing research in frequentist methods. (I see frequentist inference as a poor man's approximation for the ideal Bayesian inference.) Though I was aware of the Bayesian non-consistency results, I largely dismissed them as mathematical pathologies. And while we were still a long way from achieving universal inference, I held the optimistic view that improved technology and theory might one day finally make the "Bayesian dream" achievable. However, I could not find a way to ignore one particular example on Wasserman's blog[3], due to its relevance to very practical problems in causal inference. Eventually I thought of an even simpler counterexample, which devastated my faith in the possibility of constructing a universal prior. Perhaps a fellow Bayesian can find a solution to this quagmire, but I am not holding my breath.

The root of the problem is the extreme degree of ignorance we have about our world, the degree of surprisingness of many true scientific discoveries, and the relative ease with which we accept these surprises. If we consider this behavior rational (which I do), then the subjective Bayesian is obligated to construct a prior which captures this behavior. Yet, the diversity of possible surprises the model must be able to accommodate makes it practically impossible (if not mathematically impossible) to construct such a prior. The alternative is to reject all possibility of surprise, and refuse to update any faster than a universal prior would (extremely slowly), which strikes me as a rather poor epistemological policy.

In the rest of the post, I'll motivate my example, sketch out a few mathematical details (explaining them as best I can to a general audience), then discuss the implications.

Introduction: Cancer classification

Biology and medicine are currently adapting to the wealth of information we can obtain by using high-throughput assays: technologies which can rapidly read the DNA of an individual, measure the concentration of messenger RNA, metabolites, and proteins. In the early days of this "large-scale" approach to biology which began with the Human Genome Project, some optimists had hoped that such an unprecedented torrent of raw data would allow scientists to quickly "crack the genetic code." By now, any such optimism has been washed away by the overwhelming complexity and uncertainty of human biology--a complexity which has been made clearer than ever by the flood of data--and replaced with a sober appreciation that in the new "big data" paradigm, making a discovery becomes a much easier task than understanding any of those discoveries.

Enter the application of machine learning to this large-scale biological data. Scientists take these massive datasets containing patient outcomes, demographic characteristics, and high-dimensional genetic, neurological, and metabolic data, and analyze them using algorithms like support vector machines, logistic regression and decision trees to learn predictive models to relate key biological variables, "biomarkers", to outcomes of interest.

To give a specific example, take a look at this abstract from the Shipp. et. al. paper on detecting survival rates for cancer patients [4]:

Diffuse large B-cell lymphoma (DLBCL), the most common lymphoid malignancy in adults, is curable in less than 50% of patients. Prognostic models based on pre-treatment characteristics, such as the International Prognostic Index (IPI), are currently used to predict outcome in DLBCL. However, clinical outcome models identify neither the molecular basis of clinical heterogeneity, nor specific therapeutic targets. We analyzed the expression of 6,817 genes in diagnostic tumor specimens from DLBCL patients who received cyclophosphamide, adriamycin, vincristine and prednisone (CHOP)-based chemotherapy, and applied a supervised learning prediction method to identify cured versus fatal or refractory disease. The algorithm classified two categories of patients with very different five-year overall survival rates (70% versus 12%). The model also effectively delineated patients within specific IPI risk categories who were likely to be cured or to die of their disease. Genes implicated in DLBCL outcome included some that regulate responses to B-cell−receptor signaling, critical serine/threonine phosphorylation pathways and apoptosis. Our data indicate that supervised learning classification techniques can predict outcome in DLBCL and identify rational targets for intervention.

The term "supervised learning" refers to any algorithm for learning a predictive model for predicting some outcome Y(could be either categorical or numeric) from covariates or features X. In this particular paper, the authors used a relatively simple linear model (which they called "weighted voting") for prediction.

A linear model is fairly easy to interpret: it produces a single "score variable" via a weighted average of a number of predictor variables. Then it predicts the outcome (say "survival" or "no survival") based on a rule like, "Predict survival if the score is larger than 0." Yet, far more advanced machine learning models have been developed, including "deep neural networks" which are winning all of the image recognition and machine translation competitions at the moment. These "deep neural networks" are especially notorious for being difficult to interpret. Along with similarly complicated models, neural networks are often called "black box models": although you can get miraculously accurate answers out of the "box", peering inside won't give you much of a clue as to how it actually works.

Now it is time for the first thought experiment. Suppose a follow-up paper to the Shipp paper reports dramatically improved prediction for survival outcomes of lymphoma patients. The authors of this follow-up paper trained their model on a "training sample" of 500 patients, then used it to predict the five-year outcome of chemotherapy patients, on a "test sample" of 1000 patients. It correctly predicts the outcome ("survival" vs "no survival") on 990 of the 1000 patients.

Question 1: what is your opinion on the predictive accuracy of this model on the population of chemotherapy patients? Suppose that publication bias is not an issue (the authors of this paper designed the study in advance and committed to publishing) and suppose that the test sample of 1000 patients is "representative" of the entire population of chemotherapy patients.

Question 2: does your judgment depend on the complexity of the model they used? What if the authors used an extremely complex and counterintuitive model, and cannot even offer any justification or explanation for why it works? (Nevertheless, their peers have independently confirmed the predictive accuracy of the model.)

A Frequentist approach

The Frequentist answer to the thought experiment is as follows. The accuracy of the model is a probability p which we wish to estimate. The number of successes on the 1000 test patients is Binomial(p, 1000). Based on the data, one can construct a confidence interal: say, we are 99% confident that the accuracy is above 83%. What does 99% confident mean? I won't try to explain, but simply say that in this particular situation, "I'm pretty sure" that the accuracy of the model is above 83%.

A Bayesian approach

The Bayesian interjects, "Hah! You can't explain what your confidence interval actually means!" He puts a uniform prior on the probability p. The posterior distribution of p, conditional on the data, is Beta(991, 11). This gives a 99% credible interval that p is in [0.978, 0.995]. You can actually interpret the interval in probabilistic terms, and it gives a much tighter interval as well. Seems like a Bayesian victory...?

A subjective Bayesian approach

As I have argued before, a Bayesian approach which comes up with a model after hearing about the problem is bound to suffer from the same inconsistency and arbitariness as any non-Bayesian approach. You might assume a uniform distribution for p in this problem... but yet another paper comes along with a similar prediction model? You would need a join distribution for the current model and the new model. What if a theory comes along that could help explain the success of the current method? The parameter p might take a new meaning in this context.

So as a subjective Bayesian, I argue that slapping a uniform prior on the accuracy is the wrong approach. But I'll stop short of actually constructing a Bayesian model of the entire world: let's say we want to restrict our attention to this particular issue of cancer prediction. We want to model the dynamics behind cancer and cancer treatment in humans. Needless to say, the model is still ridiculously complicated. However, I don't think it's out of reach of the efforts of a well-funded, large collaborative effort of scientists.

Roughly speaking, the model can be divided into a distribution over theories of human biology, and conditional on the theory of biology, a course-grained model of an individual patient. The model would not include every cell, every molecule, etc., but it would contain many latent variables in addition to the variables measured in any particular cancer study. Let's call the variables actually measured in the study, X, and also the survival outcome, Y.

Now here is the epistemologically correct way to answer the thought experiment. Take a look at the X's and Y's of the patients in the training and test set. Update your probabilistic model of human biology based on the data. Then take a look at the actual form of the classifier: it's a function f() mapping X's to Y's. The accuracy of the classsifer is no longer parameter: it's a quantity Pr[f(X) = Y] which has a distribution under your posterior. That is, for any given "theory of human biology", Pr[f(X) = Y] has a fixed value: now, over the distribution of possible theories of human biology (based on the data of the current study as well as all previous studies and your own beliefs) Pr[f(X) = Y] has a distribution, and therefore, an average. But what will this posterior give you? Will you get something similar to the interval [0.978, 0.995] you got from the "practical Bayes" approach?

Who knows? But I would guess in all likelihood not. My guess you would get a very different interval from [0.978, 0.995], because in this complex model there is no direct link from the empirical success rate of prediction, and the quantity Pr[f(X) = Y]. But my intuition for this fact comes from the following simpler framework.

A non-parametric Bayesian approach

Instead of reasoning about a gand Bayesian model of biology, I now take a middle ground, and suggesting that while we don't need to capture the entire latent dynamics of cancer, we should at the very least we should try to include the X's and the Y's in the model, instead of merely abstracting the whole experiment as a Binomial trial (as did the frequentist and pragmatic Bayesian.) Hence we need a prior over joint distributions of (X, Y). And yes, I do mean a prior distribution over probability distributions: we are saying that (X, Y) has some unknown joint distribution, which we treat as being drawn at random from a large collection of distributions. This is therefore a non-parametric Bayes approach: the term non-parametric means that the number of the parameters in the model is not finite.

Since in this case Y is a binary outcome, a joint distribution can be decomposed as a marginal distribution over X, and a function g(x) giving the conditional probability that Y=1 given X=x. The marginal distribution is not so interesting or important for us, since it simple reflects the composition of the population of patients. For the purpose of this example, let us say that the marginal is known (e.g., a finite distribution over the population of US cancer patients). What we want to know is the probability of patient survival, and this is given by the function g(X) for the particular patient's X. Hence, we will mainly deal with constructing a prior over g(X).

To construct a prior, we need to think of intuitive properties of the survival probability function g(x). If x is similar to x', then we expect the survival probabilities to be similar. Hence the prior on g(x) should be over random, smooth functions. But we need to choose the smoothness so that the prior does not consist of almost-constant functions. Suppose for now that we choose particular class of smooth functions (e.g. functions with a certain Lipschitz norm) and choose our prior to to be uniform over functions of that smoothness. We could go further and put a prior on the smoothness hyperparameter, but for now we won't.

Now, although I assert my faithfulness to the Bayesian ideal, I still want to think about how whatever prior we choose would allow use to answer some simple though experiments. Why is that? I hold that the ideal Bayesian inference should capture and refine what I take to be "rational behavior." Hence, if a prior produces irrational outcomes, I reject that prior as not reflecting my beliefs.

Take the following thought experiment: we simply want to estimate the expected value of Y, E[Y]. Hence, we draw 100 patients independently with replacement from the population and record their outcomes: suppose the sum is 80 out of 100. The Frequentist (and prgamatic Bayesian) would end up concluding that with high probability/confidence/whatever, the expected value of Y is around 0.8, and I would hold that an ideal rationalist come up with a similar belief. But what would our non-parametric model say? We would draw a random function g(x) conditional on our particular observations: we get a quantity E[g(X)] for each instantiation of g(x): the distribution of E[g(X)]'s over the posterior allows us to make credible intervals for E[Y].

But what do we end up getting? Either one of two things happens. Either you choose too little smoothness, and E[g(X)] ends up concentrating at around 0.5, no matter what data you put into the model. This is the phenomenon of Bayesian non-consistency, and a detailed explanation can be found in several of the listed references: but to put it briefly, sampling at a few isolated points gives you too little information on the rest of the function. This example is not as pathological as the ones used in the literature: if you sample infinitely many points, you will eventually get the posterior to concentrate around the true value of E[Y], but all the same, the convergence is ridiculously slow. Alternatively, use a super-high smoothness, and the posterior of E[g(X)] has a nice interval around the sample value just like in the Binomial example. But now if you look at your posterior draws of g(x), you'll notice the functions are basically constants. Putting a prior on smoothness doesn't change things: the posterior on smoothness doesn't change, since you don't actually have enough data to determine the smoothness of the function. The posterior average of E[g(X)] is no longer always 0.5: it gets a little bit affected by the data, since within the 10% mass of the posterior corresponding to the smooth prior, the average of E[g(X)] is responding to the data. But you are still almost as slow as before in converging to the truth.

At the time that I started thinking about the above "uniform sampling" example, I was stil convinced of a Bayesian resolution. Obviously, using a uniform prior over smooth functions is too naive: you can tell by seeing that the prior distribution over E[g(X)] is already highly concentrated around 0.5. How about a hierarchical model, where first we draw a parameter p from the uniform distribution, and then draw g(x) from the uniform distribution over smooth functions with mean value equal to p? This gets you non-constant g(x) in the posterior, while your posteriors of E[g(X)] converge to the truth as quickly as in the Binomial example. Arguing backwards, I would say that such a prior comes closer to capturing my beliefs.

But then I thought, what about more complicated problems than computing E[Y]? What if you have to compute the expectation of Y conditional on some complicated function of X taking on a certain value: i.e. E[Y|f(X) = 1]? In the frequentist world, you can easily compute E[Y|f(X)=1] by rejection sampling: get a sample of individuals, average the Y's of the individuals whose X's satisfy f(X) = 1. But how could you formulate a prior that has the same property? For a finite collection of functions f, {f1,...,f100}, say, you might be able to construct a prior for g(x) so that the posterior for E[g(X)|fi = 1] converges to the truth for every i in {1,..,100}. I don't know how to do so, but perhaps you know. But the frequentist intervals work for every function f! Can you construct a prior which can do the same?

I am happy to argue that a true Bayesian would not need consistency for every possible f in the mathematical universe. It is cool that frequentist inference works for such a general collection: but it may well be unnecessary for the world we live in. In other words, there may be functions f which are so ridiculous, that even if you showed me that empirically, E[Y|f(X)=1] = 0.9, based on data from 1 million patients, I would not believe that E[Y|f(X)=1] was close to 0.9. It is a counterintuitive conclusion, but one that I am prepared to accept.

Yet, the set of f's which are not so ridiculous, which in fact I might accept to be reasonable based on conventional science, may be so large as to render impossible the construction of a prior which could accommodate them all. But the Bayesian dream makes the far stronger demand that our prior capture not just our current understanding of science but to match the flexibility of rational thought. I hold that given the appropriate evidence, rationalists can be persuaded to accept truths which they could not even imagine beforehand. Thinking about how we could possibly construct a prior to mimic this behavior, the Bayesian dream seems distant indeed.

Discussion

To be updated later... perhaps responding to some of your comments!

 

[1] Diaconis and Freedman, "On the Consistency of Bayes Estimates"

[2] ET Jaynes, Probability: the Logic of Science

[3] https://normaldeviate.wordpress.com/2012/08/28/robins-and-wasserman-respond-to-a-nobel-prize-winner/

[4] Shipp et al. "Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning." Nature

New Comment
58 comments, sorted by Click to highlight new comments since: Today at 11:07 PM

Thanks for writing this post! I think it contains a number of insightful points.

You seem to be operating under the impression that subjective Bayesians think you Bayesian statistical tools are always the best tools to use in different practical situations? That's likely true of many subjective Bayesians, but I don't think it's true of most "Less Wrong Bayesians." As far as I'm concerned, Bayesian statistics is not intended to handle logical uncertainty or reasoning under deductive limitation. It's an answer to the question "if you were logically omniscient, how should you reason?"

You provide examples where a deductively limited reasoner can't use Bayesian probability theory to get to the right answer, and where designing a prior that handles real-world data in a reasonable way is wildly intractable. Neat! I readily concede that deductively limited reasoners need to make use of a grab-bag of tools and heuristics depending on the situation. When a frequentist tool gets the job done fastest, I'll be first in line to use the frequentist tool. But none of this seems to bear on the philosophical question to which Bayesian probability is intended as an answer.

If someone does not yet have an understanding of thermodynamics and is still working hard to build a perpetual motion machine, then it may be quite helpful to teach them about the Carnot heat engine, as the theoretical ideal. Once it comes time for them to actually build an engine in the real world, they're going to have to resort to all sorts of hacks, heuristics, and tricks in order to build something that works at all. Then, if they come to me and say "I have lost faith in the Carnot heat engine," I'll find myself wondering what they thought the engine was for.

The situation is similar with Bayesian reasoning. For the masses who still say "you're entitled to your own opinion" or who use one argument against an army, it is quite helpful to tell them: Actually, the laws of reasoning are known. This is something humanity has uncovered. Given what you knew and what you saw, there is only one consistent assignment of probabilities to propositions. We know the most accurate way for a logically omniscient reasoner to reason. If they then go and try to do accurate reasoning, while under strong deductive limitations, they will of course find that they need to resort to all sorts of hacks, heuristics, and tricks, to reason in a way that even works at all. But if seeing this, they say "I have lost faith in Bayesian probability theory," then I'll find myself wondering about what they thought the framework was for.

From your article, I'm pretty sure you understand all this, in which case I would suggest that if you do post something like this to main, you consider a reframing. The Bayesians around these parts will very likely agree that (a) constructing a Bayesian prior that handles the real world is nigh impossible; (b) tools labeled "Bayesian" have no particular superpowers; and (c) when it comes time to solving practical real-world problems under deductive limitations, do whatever works, even if that's "frequentist".

Indeed, the Less Wrong crowd is likely going to be first in line to admit that constructing things-kinda-like-priors that can handle induction in the real world (sufficient for use in an AI system) is a massive open problem which the Bayesian framework sheds little light on. They're also likely to be quick to admit that Bayesian mechanics fails to provide an account of how deductively limited reasoners should reason, which is another gaping hole in our current understanding of 'good reasoning.'

I agree with you that deductively limited reasoners shouldn't pretend they're Bayesians. That's not what the theory is there for. It's there as a model of how logically omniscient reasoners could reason accurately, which was big news, given how very long it took humanity to think of themselves as anything like a reasoning engine designed to acquire bits of mutual information with the environment one way or another. Bayesianism is certainly not a panacea, though, and I don't think you need to convince too many people here that it has practical limitations.

That said, if you have example problems where a logically omniscient Bayesian reasoner who incorporates all your implicit knowledge into their prior would get the wrong answers, those I want to see, because those do bear on the philosophical question that I currently see Bayesian probability theory as providing an answer to--and if there's a chink in that armor, then I want to know :-)

As for the Robins / Wasserman example, here's my initial thoughts. I'm not entirely sure I'm understanding their objection correctly, but at a first pass, nothing seems amiss. I'll start by gameifying their situation, which helps me understand it better. Their situation seems to work as follows: Imagine an island with a d-dimensional surface (set d=2 for easy visualization). Anywhere along the island, we can dig for treasure, but only if that point on the island is unoccupied. At the beginning of the game, all points on the island are occupied. But people sometimes leave the points with uniform probability, in which case the point can be acquired and whoever acquires it can dig for treasure at that point. (The Xi variables on the blog are points on the island that become unoccupied during the game; we assume this is a uniformly random process.)

We're considering investing in a given treasure-digging company that's going to acquire land and dig on this island. At each point on the island, there is some probability of it having treasure. What we want to know, so that we can decide whether to invest, is how much treasure is on the island. We will first observe the treasure company acquire n points of land and dig there, and then we will decide whether to invest. (The Yi variables are the probability of treasure at the corresponding Xi. There is some function theta(x) which determines the probability of treasure at x. We want to estimate the unconditional probability that there is treasure anywhere on the island, this is psi, which is the integral of theta(x) dx.)

However, the company tries to hide facts about whether or not they actually struck treasure. What we do is, we hire a spy firm. Spies aren't perfect, though, and some points are harder to spy on than others (if they're out in the open, or have little cover, etc.) For each point on the island, there is some probability of the spies succeeding at observing the treasure diggers. We, fortunately, know exactly how likely the spies are to succeed at any given point. If the spies succeed in their observation, they tell us for sure whether the diggers found treasure. (The successes of the spies are the Ri variables. pi(x) is the probability of successfully spying at point x.)

To summarize, we have three series of variables Xi, Yi, and Ri. All are i.i.d. Yi and Ri are conditionally independent given Xi. The Xi are uniformly distributed. There is some function theta(x) which tells us how likely the there is to be treasure at any given point, and there's some other function pi(x) which tells us how likely the spies are to successfully observe x. Our task is to estimate psi, the probability of treasure at any random point on the island, which is the integral of theta(x) dx.

The game works as follows: n points x1..xn open on the island, and we observe that those points were acquired by the treasure diggers, and for some of them we send out our spy agency to maybe learn theta(xi). Robins and Wasserman argue something like the following (afaict):

"You observe finitely many instances of theta(x). But the surface of the island is continuous and huge! You've observed a teeny tiny fraction of Y-probabilities at certain points, and you have no idea how theta varies across the space, so you've basically gained zero information about theta and therefore psi."

To which I say: Depends on your prior over theta. If you assume that theta can vary wildly across the space, then observing only finitely many theta(xi) tells you almost nothing about theta in general, to be sure. In that case, you learn almost nothing by observing finitely many points -- nor should you! If instead you assume that the theta(xi) do give you lots of evidence about theta in general, then you'll end up with quite a good estimate of psi. If your prior has you somewhere in between, then you'll end up with an estimate of psi that's somewhere in between, as you should. The function pi doesn't factor in at all unless you have reason to believe that pi and theta are correlated (e.g. it's easier to spy on points that don't have treasure, or something), but Robins and Wasserman state explicitly that they don't want to consider those scenarios. (And I'm fine with assuming that pi and theta are uncorrelated.)

(The frequentist approach takes pi into account anyway and ends up eventually concentrating its probability mass mostly around one point psi in the space of possible psi values, causing me to frown very suspiciously, because we were assuming that pi doesn't tell us anything about psi.)

Robins and Wasserman then argue that the frequentist approach gives the following guarantee: No matter what function theta(x) determines the probability of treasure at x, they only need to observe finitely many points before their estimate for psi is "close" to the true psi (which they define formally). They argue that Bayesians have a very hard time generating a prior that has this property. (They note that it is possible to construct a prior that yields an estimate similar to the frequentist estimate, but that this requires torturing the prior until it gives a frequentist answer, at which point, why not just become a frequentist?)

I say, sure, it's hard (though not impossible) for a Bayesian to get that sort of guarantee. But nothing is amiss here! Two points:

(a) They claim that it's disconcerting that the theta(xi) don't give a Bayesian much information about theta. They admit that there are priors on theta that allow you to get information about theta from finitely many theta(xi), but protest that these theta are pretty weird ("very very very smooth") if the dimensionality d of the island is very high. In which case I say, if you think that the theta(xi) can't tell you much about theta, then you shouldn't be learning about theta when you learn about the various theta(xi)! In fact, I'm suspicious of anyone who says they can, under these assumptions.

Also, I'm not completely convinced that "the observations are uninformative about theta" implies "the observations are uninformative about psi" -- I acknowledge that from theta you can compute psi, and thus in some sense theta is the "only unknown," but I think you might be able to construct a prior where you learn little about theta but lots about psi. (Maybe the i.i.d. assumption rules this possibility out? I'm not sure yet, I haven't done the math.) But assume we either don't have any way of getting information about psi except by integrating theta, or that we don't have a way of doing it except one that looks "tortured" (because otherwise their argument falls through anyway). That brings us to my second point:

(b) They ask for the property that, no matter what theta is the true theta, you, after only finitely many trials, assign very high probability to the true value of psi. That's a crazy demand! What if the true theta is one where learning finitely many theta(xi) doesn't give you any information about theta? If we have a theta such that my observations are telling me nothing about it, then I don't want to be slowly concentrating all my probability mass on one particular value of psi; that would be mad. (Unless the observations are giving me information about psi via some mechanism other than information about theta, which we're assuming is not the case.)

If the game is really working like they say it is, then the frequentist is often concentrating probability around some random psi for no good reason, and when we actually draw random thetas and check who predicted better, we'll see that they actually converged around completely the wrong values. Thus, I doubt the claim that, setting up the game exactly as given, the frequentist converges on the "true" value of psi. If we assume the frequentist does converge on the right answer, then I strongly suspect either (1) we should be using a prior where the observations are informative about psi even if they aren't informative about theta or (2) they're making an assumption that amounts to forcing us to use the "tortured" prior. I wouldn't be too surprised by (2), given that their demand on the posterior is a very frequentist demand, and so asserting that it's possible to zero in on the true psi using this data in finitely many steps for any theta may very well amount to asserting that the prior is the tortured one that forces a frequentist-looking calculation. They don't describe the "tortured prior" in the blog post, so I'm not sure what else to say here ¯\_(ツ)_/¯

There are definitely some parts of the argument I'm not following. For example, they claim that for simple functions pi, the Bayesian solution obviously works, but there's no single prior on theta which works for any pi no matter how complex. I'm very suspicious about this, and I wonder whether they mean is there's no sane prior which works for any pi, and that that's the place they're slipping the "but you can't be logically omniscient!" objection in, at which point yes, Bayesian reasoning is not the right tool. Unfortunately, I don't have any more time to spend digging at this problem. By and large, though, my conclusion is this:

If you set the game up as stated, and the observations are actually giving literally zero data about psi, then I will be sticking to my prior on psi, thankyouverymuch. If a frequentist assumes they can use pi to update and zooms off in one direction or another, then they will be wrong most of the time. If you also say the frequentist is performing well then I deny that the observations were giving no info. (By the time they've converged, the Bayesian must also have data on theta, or at least psi.) If it's possible to zero in on the true value of psi after finitely many observations, then I'm going to have to use a prior that allows me to do so, regardless of whether or not it appears tortured to you :-)

(Thanks to Benya for helping me figure out what the heck was going on here.)

If the game is really working like they say it is, then the frequentist is often concentrating probability around some random psi for no good reason, and when we actually draw random thetas and check who predicted better, we'll see that they actually converged around completely the wrong values. Thus, I doubt the claim that, setting up the game exactly as given, the frequentist converges on the "true" value of psi. If we assume the frequentist does converge on the right answer, then I strongly suspect either (1) we should be using a prior where the observations are informative about psi even if they aren't informative about theta or (2) they're making an assumption that amounts to forcing us to use the "tortured" prior. I wouldn't be too surprised by (2),

The frequentist result does converge, and it is possible to make up a very artificial prior which allows you to converge to psi. But the fact that you can make up a prior that gives you the frequentist answer is not surprising.

A useful perspective is this: there are no Bayesian methods, and there are no frequentist methods. However, there are Bayesian justifications for methods ("it does well based in the average case") and frequentist justifications ("it does well asymptotically or in a minimax sense") for methods. If you construct a prior in order to converge to psi asymptotically, then you may be formally using Bayesian machinery, but the justification you could possibly give for your method is completely frequentist.

I understand the "no methods only justifications" view, but it's much less comforting when you need to ultimately build a reliable reasoning system :-)

I remain mostly unperturbed by this game. You made a very frequentist demand. From a Bayesian perspective, your demand is quite a strange one. If you force me to achieve it, then yeah, I may end up doing frequentist-looking things.

In attempts to steel-man the Robins/Wasserman position, it seems the place I'm supposed to be perturbed is that I can't even achieve the frequentist result unless I'm willing to make my prior for theta depend on pi, which seems to violate the spirit of Bayesian inference?

Ah, and now I think I see what's going on! The game that corresponds to a Bayesian desire for this frequentist property is not the game listed; it's the variant where theta is chosen adversarially by someone who doesn't want you to end up with a good estimate for psi. (Then the Bayesian wants a guarantee that they'll converge for every theta.) But those are precisely the situations where the Bayesian shouldn't be ignoring pi; the adversary will hide as much contrary data as they can in places that are super-difficult for the spies to observe.

Robins and Wasserman say "once a subjective Bayesian queries the randomizer (who selected pi) about the randomizer’s reasoned opinions concerning theta (but not pi) the Bayesian will have independent priors." They didn't show their math on this, but I doubt this point carries their objection. If I ask the person who selected pi how theta was selected, and they say "oh, it was selected in response to pi to cram as much important data as possible into places that are extraordinarily difficult for spies to enter," then I'm willing to buy that after updating (which I will do) I now have a distribution over theta that's independent of pi. But this new distribution will be one where I'll eventually converge to the right answer on this particular pi!

So yeah, if I'm about to start playing the treasure hunting game, and then somebody informs me that theta was actually chosen adversarially after pi was chosen, I'm definitely going to need to update on pi. Which means that if we add an adversary to the game, my prior must depend on pi. Call it forced if you will; but it seems correct to me that if you tell me the game might be adversarial (thus justifying your frequentist demand) then I will expect theta to sometimes be dependent on pi (in the most inconvenient possible way).

You made a very frequentist demand.

I don't think this is right. In the R/W example they are interested in some number. Statisticians are always interested in some number or other! A frequentist will put an interval around this number with some properties. A Bayesian will try to construct a setup where the posterior ends up concentrating around this number. The point is, it takes a Bayesian (who ignores relevant info) forever to get there, while it does not take the frequentist forever. It is not a frequentist demand that you get to the right answer in a reasonable number of samples, that's a standard demand we place on statisticial inference!

What's going wrong here for Bayesians is they are either ignoring information (which is always silly), or doing an extremely unnatural setup to not ignore information. Frequentists are quite content to exploit information outside the likelihood, Bayesians are forbidden from doing so by their framework (except in the prior of course).

Ah, and now I think I see what's going on!

I don't think this example is adversarial (in the sense of somewhat artificial constructions people do to screw up a particular algorithm). This is a very natural problem that comes up constantly. You don't have to carefully pick your assignment probability to screw up the Bayesian, either, almost any such probability would work in this example (unless it's an independent coin flip, then R/W point out Bayesians have a good solution).

In fact, I could give you an infinite family of such examples, if you wanted, by just converting causal inference problems into the R/W setup where lots of info lives outside the likelihood function.

You can't really say "oh I believe in the likelihood principle," and then rule out examples where the principle fails as unnatural or adversarial. Maybe the principle isn't so good.


I don't understand at all this business with "logical omniscience" and how it's supposed to save you.

If the Bayesian's ignoring information, then you gave them the wrong prior. As far as I can tell, the objection is that the prior over theta which doesn't ignore the information depends on pi, and intuitions say that Bayesians should think that pi should be independent from theta. But if theta can be chosen in response to pi, then the Bayesian prior over theta had better depend on pi.

I wasn't saying that this problem is "adversarial" in the "you're punishing Bayesians therefore I don't have to win" way; I agree that that would be a completely invalid argument. I was saying "if you want me to succeed even when theta is chosen by someone who doesn't like me after pi is chosen, I need a prior over theta which depends on pi." Then everything works out, except that Robins and Wasserman complain that this is torturing Bayesiansim to give a frequentist answer. To that, I shrug. You want me to get the frequentist result ("no matter which theta you pick I converge") then the result will look frequentist. Not much surprise there.

This is a very natural problem that comes up constantly.

You realize that the Bayesian gets the right answer way faster than the frequentist in situations where theta is discrete, or sufficiently smooth, or parametric, right? I doubt you find problems like this where theta is non-parametric and utterly discontinuous "naturally" or "constantly". But even if you do, the Bayesian will still succeed with a prior over theta that is independent of pi, except when the pi is so complicated and theta that is so discontinuous and so precisely tailored to hiding information in places that pi makes it very very difficult to observe that the only way you can learn theta is by knowing that it's been tailored to that particular pi. (The frequentist is essentially always assuming that theta is tailored to pi in this way, because they're essentially acting like theta might have been selected by an adversary, because that's what you do if you want to converge in all cases.) And even in that case the Bayesian can succeed by putting a prior on theta that depends on pi. What's the problem?

Imagine there's a game where the two of us will both toss an infinite number of uncorrelated fair coins, and then check which real numbers are encoded by these infinite bit sequences. Using any sane prior, I'll assign measure zero to the event "we got the same real number." If you're then like "Aha! But what if my coin actually always returns the same result as yours?" then I'm going to shrug and use a prior which assigns some non-zero probability to a correlation between our coins.

Robins and Wasserman's game is similar. We're imagining a non-parametric theta that's very difficult to learn about, which is like the first infinite coin sequence (and their example does require that it encode infinite information). Then we also imagine that there's some function pi which makes certain places easier or harder to learn about, which is like the second coin sequence. Robins and Wasserman claim, roughly, that for some finite set of observations and sufficiently complicated pi, a reasonable Bayesian will place ~zero probability on theta just happening to hide all its terrible discontinuities in that pi in just such a way that the only way you can learn theta is by knowing that it is one of the thetas that hides its information in that particular pi; this would be like the coin sequences coinciding. Fine, I agree that under sane priors and for sufficiently complex functions pi, that event has measure zero -- if theta is as unstructured as you say, it would take an infinite confluence of coincident events to make it one of the thetas that happens to hide all its important information precisely such that this particular pi makes it impossible to learn.

If you then say "Aha! Now I'm going to score you by your performance against precisely those thetas that hide in that pi!" then I'm going to shrug and require a prior which assigns some non-zero probability to theta being one of the thetas that hides its info in pi.

That normally wouldn't require any surgery to the intuitive prior (I place positive but small probability on any finite pair of sequences of coin tosses being identical), but if we're assuming that it actually takes an infinite confluence of coincident events for theta to hide its info in pi and you still want to measure me against thetas that do this, then yeah, I'm going to need a prior over theta that depends on pi. You can cry "that's violating the spirit of Bayes" all you want, but it still works.

And in the real world, I do want a prior which can eventually say "huh, our supposedly independent coins have come up the same way 2^trillion times, I wonder if they're actually correlated?" or which can eventually say "huh, this theta sure seems to be hiding lots of very important information in the places that pi makes it super hard to observe, I wonder if they're actually correlated?" so I'm quite happy to assign some (possibly very tiny) non-zero prior probability on a correlation between the two of them. Overall, I don't find this problem perturbing.

You can't really say "oh I believe in the likelihood principle," and then rule out examples where the principle fails as unnatural or adversarial.

I agree completely!

but it still works

Sure, as long as you shrug and do what works, we have nothing to discuss :).


I do agree that the insight that makes this go through is basically Frequentist, regardless of setup. All the magic happened in the prior before you started.

This comment isn't directly related to the OP, but lately my faith in Bayesian probability theory as an ideal for reasoning (under logical omniscience) has been dropping a bit, due to lack of progress on the problems of understanding what one's ideal ultimate prior represents and how it ought to be constructed or derived. It seems like one way that Bayesian probability theory could ultimately fail to be a suitable ideal for reasoning is if those problems turn out to be unsolvable.

(See http://lesswrong.com/lw/1iy/what_are_probabilities_anyway/ and http://lesswrong.com/lw/mln/aixi_can_be_arbitrarily_bad/ for more details about the kind of problems I'm talking about.)

I'm not sure how this would be failing, except in the sense that we knew from the beginning that it would fail.

Any mathematical formalization is an imperfect expression of real life. And any formalization of anything, mathematical or not, is imperfect, since all words (including mathematical terms) are vague words without a precise meaning. (Either you define a word by other words, which are themselves imprecise; or you define a word by pointing at stuff or by giving examples, which is not a precise way to define things.)

Any mathematical formalization is an imperfect expression of real life.

I think there may have been a misunderstanding here. When So8res and I used the word "ideal" we meant "normative ideal", something we should try to approximate in order to be more rational, or at least progress towards figuring out how a more rational version of ourselves would reason, not just a simplified mathematical formalism of something in real life. So Bayesian probability theory might qualify as a reasonable formalization of real world reasoning, but still fail to be a normative ideal if it doesn't represent progress towards figuring out how people ideally ought to reason.

It could represent progress towards figuring out how people ought to reason, in the sense of leaving us better off than we were before, without being able to give a perfect answer that will resolve completely and forever everything about how people ought to reason. And it seems to me that it does do that (leave us better off) in the way So8res was talking about, by at least giving us an analogy to compare our reasoning to.

Yeah, I also have nontrivial odds on "something UDTish is more fundamental than Bayesian inference" / "there are no probabilities only values" these days :-)

Sorry, I meant to imply that my faith in UDT has been dropping a bit too, due to lack of progress on the question of whether the UDT-equivalent of the Bayesian prior just represents subjective values or should be based on something objective like whether some universes has more existence than others (i.e., the "reality fluid" view), and also lack of progress on creating a normative ideal for such a "prior". (There seems to have been essentially no progress on these questions since "What Are Probabilities, Anyway?" was written about six years ago.)

I mostly agree here, though I'm probably less perturbed by the six year time gap. It seems to me like most of the effort in this space has been going towards figuring out how to handle logical uncertainty and logical counterfactuals (with some reason to believe that answers will bear on the question of how to generate priors), with comparatively little work going into things like naturalized induction that attack the problem of priors more directly.

Can you say any more about alternatives you've been considering? I can easily imagine a case where we look back and say "actually the entire problem was about generating a prior-like-thingy" but I have a harder time visualizing different tacts altogether (that don't eventually have some step that reads "then treat observations like Bayesian evidence").

Can you say any more about alternatives you've been considering?

Not much to say, unfortunately. I tried looking at some frequentist ideas for inspiration, but didn't find anything that seemed to have much bearing on the kind of philosophical problems we're trying to solve here.

Great comment, mind if I quote you later on? :)

That said, if you have example problems where a logically omniscient Bayesian reasoner who incorporates all your implicit knowledge into their prior would get the wrong answers, those I want to see, because those do bear on the philosophical question that I currently see Bayesian probability theory as providing an answer to--and if there's a chink in that armor, then I want to know :-)

It is well known where there might be chinks in the armor, which is what happens when two logically omniscient Bayesians sit down to play a a game of Poker? Bayesian game theory is still in a very developmental stage (in fact, I'm guessing it's one of the things MIRI is working on) and there could be all kinds of paradoxes lurking in wait to supplement the ones we've already encountered (e.g. two-boxing.)

Sure! I would like to clarify, though, that by "logically omniscient" I also meant "while being way larger than everything else in the universe." I'm also readily willing to admit that Bayesian probability theory doesn't get anywhere near solving decision theory, that's an entirely different can of worms where there's still lots of work to be done. (Bayesian probability theory alone does not prescribe two-boxing, in fact; that requires the addition of some decision theory which tells you how to compute the consequences of actions given a probability distribution, which is way outside the domain of Bayesian inference.)

Bayesian reasoning is an idealized method for building accurate world-models when you're the biggest thing in the room; two large open problems are (a) modeling the world when you're smaller than the universe and (b) computing the counterfactual consequences of actions from your world model. Bayesian probability theory sheds little light on either; nor is it intended to.

I personally don't think it's that useful to consider cases like "but what if there's two logically omniscient reasoners in the same room?" and then demand a coherent probability distribution. Nevertheless, you can do that, and in fact, we've recently solved that problem (Benya and Jessica Taylor will be presenting it at LORI V next week, in fact); the answer, assuming the usual decision-theoretic assumptions, is "they play Nash equilibria", as you'd expect :-)

Cool, I will take a look at the paper!

You seem to be operating under the impression that subjective Bayesians think you Bayesian statistical tools are always the best tools to use in different practical situations? That's likely true of many subjective Bayesians, but I don't think it's true of most "Less Wrong Bayesians."

I suspect that there's a large amount of variation in what "Less Wrong Bayesians" believe. It also seems that at least some treating it more as an article of faith or tribal allegiance than anything else. See for example some of the discussion here.

Update from the author:

Thanks for all of the comments and corrections! Based on your feedback, I have concluded that the article is a little bit too advanced (and possibly too narrow in focus) to be posted in the main section of the site. However, it is clear that there is a lot of interest in the general subject. Therefore, rather than posting this article to main, I think it would be more productive to write a "Philosophy of Statistics" sequence which would provide the necessary background for this kind of post.

I enjoyed your article, and learned things from you. I want to encourage you to post more on stats subjects here.

I've read up to the introduction, I'll comment as I continue.
I've found three problems so far:

  • it's not true that for objective Bayesians (the subjectives are those of the de Finetti school) any model and any prior is equally valid. The logical analysis of the problem and of the background information is the defining feature of the discipline, indeed since the inference step is reduced to the application of the product and negation rules.
    For example, in the problem you pose, we can analyze the background information and notice that: 1. we suppose that each outcome is independent; 2. we know that the coin does indeed have a head and a tail; 3. we know nothing else about the coin. These three observations alone are sufficient to decide for a single model and a single prior.
    Choosing a different model or a different prior means starting from a different background information, and that amounts to answering questions about a problem that was not posed in the first place.

  • objective Bayesianism is just the logically correct way (as per Cox theorem and further amendations) to assign probabilities to logical formulae. There's nothing in the discipline that forces anyone to find a universal model, and since one can do model comparison just as 'easily', any Bayesian can live happily in a many-models environment. What would be cool to have is a universal logical analysis tool, that is something that inputs a verbal description of the problem and outputs the most general model that is warranted by the description. The MAXENT princple is right now our best attempt at coming up with such a tool.

  • universal models already do exists, they are called universal semi-measures and the most famous of those is the Solomonoff prior. This also means that it's true that there's not a single universal model, as you said, but you can also show that any such model differs only in a finite initial 'segment', matching different initial information encoded in the universal Turing machine used to measure the Kolmogorov complexity.

I will go ahead and answer your first three questions

  1. Objective Bayesians might have "standard operating procedures" for common problems, but I bet you that I can construct realistic problems where two Objective Bayesians will disagree on how to proceed. At the very least the Objective Bayesians need an "Objective Bayesian manifesto" spelling out what are the canonical procedures. For the "coin-flipping" example, see my response to RichardKennaway where I ask whether you would still be content to treat the problem as coin-flipping if you had strong prior infromation on g(x).

  2. MaxENT is not invariant to parameterization, and I'm betting that there are examples where it works poorly. Far from being a "universal principle" it ends up being yet another heuristic joining the ranks of asymptotic optimality, minimax, minimax relative to oracle, etc. Not to say these are bad principles--each of them is very useful, but when and where to use them is still subjective.

  3. That would be great if you could implement a Solomonoff prior. It is hard to say whether implementing an approximate algorithmic prior which doesn't produce garbage is easier or harder than encoding the sum total of human scientific knowledge and heuristics into a Bayesian model, but I'm willing to bet that it is. (This third bet is not a serious bet, the first two are.)

For a technical article of this length and complexity, an abstract is always a good idea.

Beta(1000,2)

Was that meant to be Beta(1000,10)? (With appropriately updated probabilities as a result?)

Good catch, it should be Beta(991, 11). The prior is uniform = Beta(1,1 ) and the data is (990 successes, 10 fails)

Yup, sorry, I made two separate mistakes there (misparameterizing the beta distribution by alpha+beta,beta rather than alpha,beta, and the off-by-one error) -- but at least my wrong parameters were less wrong than yours :-).

It looks like you didn't replace all the distributions with the update?

Still reading, quick note:

tradion

Should be tradition?

You're violating Jaynes's Infinity Commandment:

Never introduce an infinity into a probability problem except as the limit of finite processes!

Hence we need a prior over joint distributions of (X, Y). And yes, I do mean a prior distribution over probability distributions: we are saying that (X, Y) has some unknown joint distribution, which we treat as being drawn at random from a large collection of distributions. This is therefore a non-parametric Bayes approach: the term non-parametric means that the number of the parameters in the model is not finite.

Non-parametric methods are limits of finite processes. Or, more precisely, they are rules that work for any finite data set you have. Think about using histograms to approximate a density empirically, for any dataset we have a finite number of bins, but the number of parameters depends on the size of the data. That's basically what "non-parametric" means.


Please keep your religious language out of my statistics, thank you.

It is worth noting that the issue of non-consistency is just as troublesome in the finite setting. In fact, in one of Wasserman's examples he uses a finite (but large) space for X.

There are a couple of things I'm not understanding here.

Firstly, the example of the cancer survival test seems to have some inconsistency. The fitted model is said to give the right answer in 990 out of 1000 test cases. Where do you subsequently get the Beta(1000,2) distribution from? I am not seeing the source of that 2. And given that the model is right on exactly 99% of the test cases, how is the imaginary Bayesian coming up with a clearly wrong interval [0.996,0.9998]?

Secondly, in the later example of estimating E[ Y | f(X)=1 ], the method foisted on the Bayesian appears to involve estimating the whole of the function f. This seems to me an obviously misguided approach to the problem, whatever one's views on statistical argument. Why cannot the Bayesian say, with the frequentist, it doesn't matter what f is, I have been asked about the population for which f(X)=1. I do not need to model the process f by which that population was selected, only the behaviour of Y within that population? And then proceed in the usual way.

OP will correct me if I am wrong, but I think he is trying to restate the Robins/Wasserman example. You do not need to model f(X), but the point of that example is that you know f, but the conditional model for Y is very very complicated. So you either do a Bayesian approach with a prior and a likelihood for Y, or you just use Horvitz-Thompson with f.

I like to think of that example using causal inference: you want to estimate the causal effect p(Y | do(A)) of A on Y when the policy for assigning treatment A: p(A | C) is known exactly, but p(Y | A, C) is super complex. Likelihood-based methods like being Bayesian will use \sum_C p(Y | A, C) p(C). But you can just look at \sum{samples i} Yi 1/p(A | C) to get the same thing and avoid modeling p(Y | A,C). But doing that isn't Bayesian.

See also this:

http://www.biostat.harvard.edu/robins/coda.pdf

I think we talked about this before.

My example is very similar to the Robbins/Wasserman example, but you end up drawing different conclusions. Robbins/Wasserman show that you can't make sense of importance sampling in a Bayesian framework. My example shows that you can't make sense of "conditional sampling" in a Bayesian framework. The goal of importance sampling is to estimate E[Y], while the goal of conditional sampling is to estimate E[Y|event] for some event.

We did talk about this before, that's how I first learnt of the R/W example.

I think these are isomorphic, estimating E[Y] if Y is missing at random conditional on C is the same as estimating E[Y | do(a)] = E[Y | "we assign you to a given C"].

"Causal inference is a missing data problem, and missing data is a causal inference problem."


Or I may be "missing" something. :)

Yes, I think you are missing something (although it is true that causal inference is a missing data problem).

It may be easier to think in terms of the potential outcomes model. Y0 is the outcome is no treatment, Y1 is the outcome of treatment, you only ever observe either Y0 or Y1, depending on whether D=0 or 1. Generally you are trying to estimate E[Y1] or E[Y0] or their difference.

The point is that the quantity Robbins and Wasserman are trying to estimate, E[Y], does not depend on the importance sampling distribution. Whereas the quantity I am trying to estimate, E[Y|f(X)], does depend on f. Changing f changes the population quantity to be estimated.

It is true that sometimes people in causal inference are interested in estimating things like E[Y1 - Y0|D], " e.g. the treatment effect on the treated." However this is still different from my setup because D is a random variable, as opposed to an arbitrary function of the known variables like f(X).

Not following. By "importance sampling distribution" do you mean the distribution that tells you whether Y is missing or not? If so changing this distribution will change what you have to do to estimate E[Y] in the Robins/Wasserman case. For example, if you change the distributiion to just depend on an independent coin flip you move from "MAR" to "MCAR" (in causal inference from "conditional ignorablity" to "ignorability.") Then your procedure depends on this distribution (but your target does not, this is true). Similarly "p(y | do(a))" does not change, but the functional of the observed data equal to "p(y | do(a))" will change if you change the treatment assignment distribution.

(Btw, people do versions of ETT where D is complicated and not a simple treatment event. Actually I have something in a recent draft of mine called "effect of treatment on the indirectly treated" that's like that).

By "importance sampling distribution" do you mean the distribution that tells you whether Y is missing or not?

Right. You could say the cases of Y1|D=1 you observe in the population are an importance sample from Y1, the hypothetical population that would result if everyone in the population were treated. E[Y1], the quantity to be estimated, is the mean of this hypothetical population. The importance sampling weights are q(x) = Pr[D=1|x]/p(x) where p(x) is the marginal distribution (ie you invert these weights to get the average), the importance sampling distribution is the conditional density of X|D=1.

Still slightly confused.

I think Robins and Ritov has a theorem (cited in your blog link) claiming to get E[Y] if Y is MAR you need to incorporate info about 1/p(x) somewhere into your procedure (?the prior?) or you don't get uniform consistency. Is your claim that you can get around this via some hierarchical model, e.g.:

How about a hierarchical model, where first we draw a parameter p from the uniform distribution, and then draw g(x) from the uniform distribution over smooth functions with mean value equal to p? This gets you non-constant g(x) in the posterior, while your posteriors of E[g(X)] converge to the truth as quickly as in the Binomial example. Arguing backwards, I would say that such a prior comes closer to capturing my beliefs.

Is this just intuition or did you write this up somewhere? That sounds very interesting.


Why did you start thinking about conditional sampling at all? If estimating E[Y] via importance sampling/inverse weights/covariate adjustment is already something of a difficulty for Bayesians, why think about E[Y | event]? Isn't that trivially at least as hard?

The confusion may come from mixing up my setup and Robins/Ritov's setup. There is no missing data in my setup.

I could write up my intuition for the hierarchical model. It's an almost trivial result if you don't assume smoothness, since for any x1,...,xn the parameters g(x1)...g(xn) are conditionally independent given p and distributed as F(p), where F is the maximum entropy Beta with mean p (I don't know the form of the parameters alpha(p) and beta(p) off-hand). Smoothness makes the proof much more difficult, but based on high-dimensional intuition one can be sure that it won't change the result substantially.

It is quite possible that estimating E[Y] and E[Y|event] are "equivalently hard", but they are both interesting problems with different quite different real-world applications. The reason I chose to write about estimating E[Y|event] is because I think it is easier to explain than importance sampling.

I do not need to model the process f by which that population was selected, only the behaviour of Y within that population?

There are some (including myself and presumably some others on this board) who see this practice as epistemologically dubious. First, how do you decide which aspects of the problem to incorporate into your model? Why should one only try to model E[Y|f(X)=1] and not the underlying function g(x)=E[Y|x]? If you actually had very strong prior information about g(x), say that "I know g(x)=h(x) with probability 1/2 or g(x) = j(x) with probability 1/2" where h(x) and j(x) are known functions, then in that case most statisticians would incorporate the underlying function g(x) in the model; and in that case, data for observations with f(X)=0 might be informative for whether g(x) = h(x) or g(x) = j(x). So if the prior is weak (as it is in my main post) you don't model the function, and if the prior is strong, you model the function (and therefore make use of all the observations)? Where do you draw the line?

I agree, most statisticians would not model g(x) in the cancer example. But is that because they have limited time and resources (and are possibly lazy) and because using an overcomplicated model would confuse their audience, anyways? Or because they legitimately think that it's an objective mistake to use a model involving g(x)?

Why should one only try to model E[Y|f(X)=1] and not the underlying function g(x)=E[Y|x]?

What would it tell you if you could? The problem is to estimate Y for a certain population. Therefore, look at that population. I am not seeing a reason why one would consider modelling g, so I am at a loss to answer the question, why not model g?

Jaynes and a few others generally write things like E[ Y | I ] or P( Y | I ) where I represents "all of your background knowledge", not further analysed. f(X)=1 is playing the role of I here. It's a placeholder for the stuff we aren't modelling and within which the statistical reasoning takes place.

Suppose f was a very simple function, for example, the identity. You are asked to estimate E[ Y | X=1 ]. What do the Bayesian and the frequentist do in this case? They are still only being asked about the population for which X=1. Can either of them get better information about E[ Y | X=1 ] by looking (also) at samples where X is not 1?

The example is a simplification of Wasserman's; I'm not sure if a similar answer can be made there.

BTW, I'm not a statistician, and these aren't rhetorical questions.

ETA: Here's an even simpler example, in which it might be possible to demonstrate mathematically the answer to the question, can better information be obtained about E[ Y | X=1 ] by looking at members of the population where X is not 1? Suppose it is given that X and Y have a bivariate normal distribution, with unknown parameters. You take a sample of 1000, and are given a choice of taking it either from the whole population, or from that sliver for which X is in some range 1 +/- ε for ε very small compared with the standard deviation of X. You then use whatever tools you prefer to estimate E[ Y | X=1 ]. Which method of sampling will allow a better estimate?

ETA2: Here is my own answer to my last question, after looking up some formulas concerning linear regression. Let Y1 be the mean of Y in a sample drawn from a narrow neighbourhood of X=1, and let Y2 be the estimate of E[ Y | X=1 ] obtained by doing linear regression on a sample drawn from the whole population. Both samples have the same size n, assumed large enough to ignore small-sample corrections. Then the ratio of the standard error of Y2 to that of Y1 is sqrt( 1 + k^2 ), where k is the difference between 1 and E[X], in units of the standard deviation of X. So at least for this toy example, a narrow sample always works at least as well as a broad one, and is almost always better. Is this a general fact, or are there equally simple examples where the opposite is found?

ETA3: I might have such an example. Suppose that the distribution of Y|X is a + bX + ε(X), where ε(X) is a random variable whose mean is always zero but whose variance is high in the neighbourhood of X=1 and low elsewhere. Then a linear regression on a sample from the full population may allow a better estimate of E[Y|X] than a sample from the neighbourhood of X=1. A sample that avoids that region may do better still. Intuitively, if there's a lot of noise where you want to look, extrapolate from where there's less noise.

But it's not clear to me that this bears on the Bayesian vs. frequentist matter. Both of them are faced with the decision to take a wide sample or a narrow one. The frequentist can't insist that the Bayesian takes notice of structure in the problem that the frequentist chooses to ignore.

There are some (including myself and presumably some others on this board) who see this practice as epistemologically dubious. First, how do you decide which aspects of the problem to incorporate into your model?

That question must be directed at both the Bayesian and the frequentist. In my other comment I gave two toy examples, in one of which looking at a wider sample is provably inferior to looking only at f(X)=1, and one in which the reverse is the case. Anyone faced with the problem of estimating E[Y|f(X)=1] needs to decide, somehow, what observations to make.

How do a Bayesian or a frequentist make that decision?

I didn't reply to your other comment because although you are making valid points, you have veered off-topic since your initial comment. The question of "which observations to make?" is not a question of inference but rather one of experimental design. If you think this question is relevant to the discussion, it means that you neither understand the original post nor my reply to your initial comment. The questions I am asking have to do with what to infer after the observations have already been made.

Ok. So the scenario is that you are sampling only from the population f(X)=1. Can you exhibit a simple example of the scenario in the section "A non-parametric Bayesian approach" with an explicit, simple class of functions g and distribution over them, for which the proposed procedure arrives at a better estimate of E[ Y | f(X)=1 ] than the sample average?

Is the idea that it is intended to demonstrate, simply that prior knowledge about the joint distribution of X and Y would, combined with the sample, give a better estimate than the sample alone?

Ok. So the scenario is that you are sampling only from the population f(X)=1.

EDIT: Correct, but you should not be too hung up on the issue of conditional sampling. The scenario would not change if we were sampling from the whole population. The important point is that we are trying to estimate a conditional mean of the form E[Y|f(X)=1]. This is a concept commonly seen in statistics. For example, the goal of non-parametric regression is to estimate a curve defined by f(x) = E[Y|X=x].

Can you exhibit a simple example of the scenario in the section "A non-parametric Bayesian approach" with an explicit, simple class of functions g and distribution over them, for which the proposed procedure arrives at a better estimate of E[ Y | f(X)=1 ] than the sample average?

The example I gave in my first reply (where g(x) is known to be either one of two known functions h(x) or j(x)) can easily be extended into the kind of fully specified counterexample you are looking for: I'm not going to bother to do it, because it's very tedious to write out and it's frankly a homework-level problem.

Is the idea that it is intended to demonstrate, simply that prior knowledge about the joint distribution of X and Y would, combined with the sample, give a better estimate than the sample alone?

The fact that prior information can improve your estimate is already well-known to statisticians. But statisticians disagree on whether or not you should try to model your prior information in the form of a Bayesian model. Some Bayesians have expressed the opinion that one should always do so. This post, along with Wasserman/Robbins/Ritov's paper, provides counterexamples where the full non-parametric Bayesian model gives much worse results than the "naive" approach which ignores the prior.

The example I gave in my first reply (where g(x) is known to be either one of two known functions h(x) or j(x)) can easily be extended into the kind of fully specified counterexample you are looking for

That looks like a parametric model. There is one parameter, a binary variable that chooses h or j. A belief about that parameter is a probability p that h is the function. Yes, I can see that updating p on sight of the data may give a better estimate of E[Y|f(X)=1], which is known a priori to be either h(1) or j(1).

I expect it would be similar for small numbers of parameters also, such as a linear relationship between X and Y. Using the whole sample should improve on only looking at the subsample around f(X)=1.

However, in the nonparametric case (I think you are arguing) this goes wrong. The sample size is not large enough to estimate a model that gives a narrow estimate of E[Y|f(X)=1]. Am I understanding you yet?

It seems to me that the problem arises even before getting to the nonparametric case. If a parametric model has too many parameters to estimate from the sample, and the model predictions are everywhere sensitive to all of the parameters (so it cannot be approximated by any simpler model) then trying to estimate E[Y|f(X)=1] by first fitting the model, then predicting from the model, will also not work.

It so clearly will not work that it must be a wrong thing to do. It is not yet clear to me that a Bayesian statistician must do it anyway. The set {Y|f(X)=1} conveys information about E[Y|f(X)=1] directly, independently of the true model (assumed for the purpose of this discussion to be within the model space being considered). Estimating it via fitting a model ignores that information. Is there no Bayesian method of using it?

A partial answer to your question:

So if the prior is weak (as it is in my main post) you don't model the function, and if the prior is strong, you model the function (and therefore make use of all the observations)? Where do you draw the line?

would be that the less the model helps, the less attention you pay it relative to calculating Mean{Y|f(X)=1}. I don't have a mathematical formulation of how to do that though.

As far as I can tell it all goes off the rails when you try using a uniform distribution over functions. There's no way you actually believe all smooth random functions are equally likely--for instance, linear, quadratic, exponential, and approximate-step-function effects are probably all more likely than sinusoidal ones.

The way I see this, the demands of subjective Bayesianism as interpreted in the post are impractical. The example calculation makes its structure compatible with those demands, but at the cost of having absurd content.

On the other hand, the power of the prior isn't always bad. If one measured variable is 'phase of the moon at time of first kiss' and another is 'exposure to ionizing radiation', we should be able to express the fact that one is more likely to have an effect than the other.

So what is your Bayesian solution to the Robins/Wasserman example?

Accept that the philosophically ideal thing is unattainable in this case, and do the Frequentist thing or the pragmatic-Bayesian thing.

What I actually disagree with in the post is that it seems to be making a philosophical point based on the assumption that the uniform distribution over smooth functions is better subjective Bayesianism than the pragmatic approach. I dispute that premise.

On reflection, I think the point here has to do with logical uncertainty. The argument is that the uniform distribution is 'purer' because it's something that we're more likely to choose before seeing the problem and we should be able to choose our prior before seeing the problem. But this post is a thought experiment, not a real experiment--the only knowledge it gives us is logical knowledge. I think you should be able to update your estimated priors based on new logical knowledge.

philosophically ideal thing is unattainable in this case

Slightly confused here. Rationality is defined as winning, yes? If your "ideal thing" is not winning it's not rational, and should be dropped like a hot potato. In fact, if it's losing in what sense is it "ideal?"

Posteriors, etc. are tools, that's all.


I think the Robins/Wasserman example is about the interplay of structural assumptions of how data came to be, and statistical inference from this data (specificallly its about where information lives). In particular, about how the classical Bayesian setup in fact tacitly assumes certain structural assumptions that lead to all information living in the likelihood function. In fact these assumptions do not hold in the Robins/Wasserman case, most of the information lives in the assignment probability (which is outside the likelihood).

This is similar to how classification problems in machine learning cannot be solved by standard methods if certain tacit assumptions (training and test data are from the same distribution) fail to hold. In that case you need to use not only standard machine learning insights about what makes a good classifier, but also additional insights that correct for the structural differences in the training and test data properly.

In particular, about how the classical Bayesian setup in fact tacitly assumes certain structural assumptions that lead to all information living in the likelihood function. In fact these assumptions do not hold in the Robins/Wasserman case, most of the information lives in the assignment probability (which is outside the likelihood).

I'm having trouble following this (i'm not actually that versed in statistics, and I don't know what you mean by 'assignment probability'. But it seems to me that we only think Horwitz-Thompson is a good answer because of tacit assumptions we hold about the data.

We have X, let's say baseline facts about a person (X are features we would use to build a classifier in machine learning). We have a probability of a binary event A, conditional on X: p(A | X). If A is 1, we don't see the value of Y. If A is 0, we see the value of Y. p(A=0 | X) is what I call the "assignment probability" and p(A | X) is what the OP calls the "importance sampling distribution." It is also sometimes called "the propensity score."

And yes you are right, Horvitz-Thompson only comes into play because somehow p(A=0 | X) played a very important role in determining the data on X,Y we actually see. But if we were to write the likelihood function for X,Y, the probability p(A | X) would not appear in this function. So any method that just uses the likelihood function will ignore p(A | X). What saves Bayesians is their ability to insert p(A | X) into the prior (they have nowhere else to put it).

Ah, R&W's pi function.

This is kind of tricky, because it doesn't seem like it should hold information, unless it correlates with R&W's theta (probability of Y = 1).

If pi and theta were guaranteed independent, would Horwitz-Thompson in any meaningful way outperform Sum(Y) / Sum(R), that is, the average observed value of Y in cases where Y is observed?

The reason p(A | X) holds info is because it determines what Y we see. Say for a moment A was independent of X, so we saw Y if a fair coin came up heads (p(A = 0) = 0.5). Then the Ys we see are the same as the Ys we don't see, because the coin doesn't look at anything about Y to determine whether to come up heads.

But if the coin depends on X, the worry is the Ys we see may have particular Xs and not others. So if we just ignore the Ys we don't see, we will get a biased view of the underlying Y based on the Ys we actually see based on P(A|X).

Somehow, to correctly deal with this bias, we must involve p(A|X) (explicitly or implicitly).

Sure. But if we know or suspect any correlation between A and Y, there's nothing strange about the common information between them being expressed in the prior, right?

Granted, H-T will have nice worst-case performance if we're not confident about A and Y being independent, but that reduces to this debate http://lesswrong.com/lw/k9c/can_noise_have_power/.

I wrote up a pretty detailed reply to Luke's question: http://lesswrong.com/lw/kd4/the_power_of_noise/