
Comment author: So8res 24 October 2015 03:48:49PM *  4 points [-]

Sure! I would like to clarify, though, that by "logically omniscient" I also meant "while being way larger than everything else in the universe." I'm also readily willing to admit that Bayesian probability theory doesn't get anywhere near solving decision theory; that's an entirely different can of worms where there's still lots of work to be done. (Bayesian probability theory alone does not prescribe two-boxing, in fact; that requires the addition of some decision theory which tells you how to compute the consequences of actions given a probability distribution, which is way outside the domain of Bayesian inference.)

Bayesian reasoning is an idealized method for building accurate world-models when you're the biggest thing in the room; two large open problems are (a) modeling the world when you're smaller than the universe and (b) computing the counterfactual consequences of actions from your world model. Bayesian probability theory sheds little light on either; nor is it intended to.

I personally don't think it's that useful to consider cases like "but what if there are two logically omniscient reasoners in the same room?" and then demand a coherent probability distribution. Nevertheless, you can do that, and in fact, we've recently solved that problem (Benya and Jessica Taylor will be presenting it at LORI V next week); the answer, assuming the usual decision-theoretic assumptions, is "they play Nash equilibria", as you'd expect :-)

Comment author: snarles 25 October 2015 06:25:48PM 1 point [-]

Cool, I will take a look at the paper!

Comment author: So8res 23 October 2015 11:44:22PM *  9 points [-]

Thanks for writing this post! I think it contains a number of insightful points.

You seem to be operating under the impression that subjective Bayesians think that Bayesian statistical tools are always the best tools to use in different practical situations? That's likely true of many subjective Bayesians, but I don't think it's true of most "Less Wrong Bayesians." As far as I'm concerned, Bayesian statistics is not intended to handle logical uncertainty or reasoning under deductive limitation. It's an answer to the question "if you were logically omniscient, how should you reason?"

You provide examples where a deductively limited reasoner can't use Bayesian probability theory to get to the right answer, and where designing a prior that handles real-world data in a reasonable way is wildly intractable. Neat! I readily concede that deductively limited reasoners need to make use of a grab-bag of tools and heuristics depending on the situation. When a frequentist tool gets the job done fastest, I'll be first in line to use the frequentist tool. But none of this seems to bear on the philosophical question to which Bayesian probability is intended as an answer.

If someone does not yet have an understanding of thermodynamics and is still working hard to build a perpetual motion machine, then it may be quite helpful to teach them about the Carnot heat engine, as the theoretical ideal. Once it comes time for them to actually build an engine in the real world, they're going to have to resort to all sorts of hacks, heuristics, and tricks in order to build something that works at all. Then, if they come to me and say "I have lost faith in the Carnot heat engine," I'll find myself wondering what they thought the engine was for.

The situation is similar with Bayesian reasoning. For the masses who still say "you're entitled to your own opinion" or who use one argument against an army, it is quite helpful to tell them: Actually, the laws of reasoning are known. This is something humanity has uncovered. Given what you knew and what you saw, there is only one consistent assignment of probabilities to propositions. We know the most accurate way for a logically omniscient reasoner to reason. If they then go and try to do accurate reasoning, while under strong deductive limitations, they will of course find that they need to resort to all sorts of hacks, heuristics, and tricks, to reason in a way that even works at all. But if, seeing this, they say "I have lost faith in Bayesian probability theory," then I'll find myself wondering what they thought the framework was for.

From your article, I'm pretty sure you understand all this, in which case I would suggest that if you do post something like this to main, you consider a reframing. The Bayesians around these parts will very likely agree that (a) constructing a Bayesian prior that handles the real world is nigh impossible; (b) tools labeled "Bayesian" have no particular superpowers; and (c) when it comes time to solving practical real-world problems under deductive limitations, do whatever works, even if that's "frequentist".

Indeed, the Less Wrong crowd is likely going to be first in line to admit that constructing things-kinda-like-priors that can handle induction in the real world (sufficient for use in an AI system) is a massive open problem which the Bayesian framework sheds little light on. They're also likely to be quick to admit that Bayesian mechanics fails to provide an account of how deductively limited reasoners should reason, which is another gaping hole in our current understanding of 'good reasoning.'

I agree with you that deductively limited reasoners shouldn't pretend they're Bayesians. That's not what the theory is there for. It's there as a model of how logically omniscient reasoners could reason accurately, which was big news, given how very long it took humanity to think of themselves as anything like a reasoning engine designed to acquire bits of mutual information with the environment one way or another. Bayesianism is certainly not a panacea, though, and I don't think you need to convince too many people here that it has practical limitations.

That said, if you have example problems where a logically omniscient Bayesian reasoner who incorporates all your implicit knowledge into their prior would get the wrong answers, those I want to see, because those do bear on the philosophical question that I currently see Bayesian probability theory as providing an answer to--and if there's a chink in that armor, then I want to know :-)

Comment author: snarles 24 October 2015 02:53:07PM 3 points [-]

Great comment, mind if I quote you later on? :)

That said, if you have example problems where a logically omniscient Bayesian reasoner who incorporates all your implicit knowledge into their prior would get the wrong answers, those I want to see, because those do bear on the philosophical question that I currently see Bayesian probability theory as providing an answer to--and if there's a chink in that armor, then I want to know :-)

It is well known where there might be chinks in the armor: what happens when two logically omniscient Bayesians sit down to play a game of poker? Bayesian game theory is still at a very early stage of development (in fact, I'm guessing it's one of the things MIRI is working on), and there could be all kinds of paradoxes lurking in wait to supplement the ones we've already encountered (e.g. two-boxing).

Comment author: So8res 24 October 2015 02:42:36AM *  7 points [-]

As for the Robins / Wasserman example, here are my initial thoughts. I'm not entirely sure I'm understanding their objection correctly, but at a first pass, nothing seems amiss. I'll start by gamifying their situation, which helps me understand it better. Their situation seems to work as follows: Imagine an island with a d-dimensional surface (set d=2 for easy visualization). Anywhere along the island, we can dig for treasure, but only if that point on the island is unoccupied. At the beginning of the game, all points on the island are occupied. But people sometimes leave the points with uniform probability, in which case the point can be acquired and whoever acquires it can dig for treasure at that point. (The Xi variables on the blog are points on the island that become unoccupied during the game; we assume this is a uniformly random process.)

We're considering investing in a given treasure-digging company that's going to acquire land and dig on this island. At each point on the island, there is some probability of it having treasure. What we want to know, so that we can decide whether to invest, is how much treasure is on the island. We will first observe the treasure company acquire n points of land and dig there, and then we will decide whether to invest. (The Yi variables are the probability of treasure at the corresponding Xi. There is some function theta(x) which determines the probability of treasure at x. We want to estimate the unconditional probability that there is treasure anywhere on the island, this is psi, which is the integral of theta(x) dx.)

However, the company tries to hide facts about whether or not they actually struck treasure. What we do is, we hire a spy firm. Spies aren't perfect, though, and some points are harder to spy on than others (if they're out in the open, or have little cover, etc.) For each point on the island, there is some probability of the spies succeeding at observing the treasure diggers. We, fortunately, know exactly how likely the spies are to succeed at any given point. If the spies succeed in their observation, they tell us for sure whether the diggers found treasure. (The successes of the spies are the Ri variables. pi(x) is the probability of successfully spying at point x.)

To summarize, we have three series of variables Xi, Yi, and Ri. All are i.i.d. Yi and Ri are conditionally independent given Xi. The Xi are uniformly distributed. There is some function theta(x) which tells us how likely there is to be treasure at any given point, and there's some other function pi(x) which tells us how likely the spies are to successfully observe x. Our task is to estimate psi, the probability of treasure at any random point on the island, which is the integral of theta(x) dx.
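(To make the moving parts concrete, here's a minimal simulation sketch of that summary. The particular theta and pi below are arbitrary choices for illustration, and I'm treating each Yi as a Bernoulli draw with success probability theta(Xi), which is how I read the original Robins/Wasserman setup.)

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 10_000   # island dimension and number of opened points (arbitrary)

def theta(x):
    # probability of treasure at point x -- an arbitrary smooth example
    return 0.2 + 0.6 * np.sin(np.pi * x[:, 0]) * np.sin(np.pi * x[:, 1])

def pi_spy(x):
    # probability that the spies succeed at point x -- also arbitrary,
    # known to us exactly and bounded away from zero
    return 0.1 + 0.8 * x[:, 0]

X = rng.uniform(size=(n, d))      # opened points, uniform over the island
Y = rng.binomial(1, theta(X))     # did the diggers strike treasure there?
R = rng.binomial(1, pi_spy(X))    # did our spies observe that dig?
# We only ever get to see Y[i] on the points where R[i] == 1.

# The estimand: psi = integral of theta(x) dx, approximated here by Monte Carlo.
psi = theta(rng.uniform(size=(500_000, d))).mean()
print(f"true psi ~= {psi:.3f}")
```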

The game works as follows: n points x1..xn open on the island, and we observe that those points were acquired by the treasure diggers, and for some of them we send out our spy agency to maybe learn theta(xi). Robins and Wasserman argue something like the following (afaict):

"You observe finitely many instances of theta(x). But the surface of the island is continuous and huge! You've observed a teeny tiny fraction of Y-probabilities at certain points, and you have no idea how theta varies across the space, so you've basically gained zero information about theta and therefore psi."

To which I say: Depends on your prior over theta. If you assume that theta can vary wildly across the space, then observing only finitely many theta(xi) tells you almost nothing about theta in general, to be sure. In that case, you learn almost nothing by observing finitely many points -- nor should you! If instead you assume that the theta(xi) do give you lots of evidence about theta in general, then you'll end up with quite a good estimate of psi. If your prior has you somewhere in between, then you'll end up with an estimate of psi that's somewhere in between, as you should. The function pi doesn't factor in at all unless you have reason to believe that pi and theta are correlated (e.g. it's easier to spy on points that don't have treasure, or something), but Robins and Wasserman state explicitly that they don't want to consider those scenarios. (And I'm fine with assuming that pi and theta are uncorrelated.)

(The frequentist approach takes pi into account anyway and ends up eventually concentrating its probability mass mostly around one point psi in the space of possible psi values, causing me to frown very suspiciously, because we were assuming that pi doesn't tell us anything about psi.)
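(For concreteness: the frequentist estimator in question is, if I'm remembering the blog post correctly, the Horvitz-Thompson estimator, which reweights each spied-on dig by 1/pi(x) and counts the unobserved digs as zero. A minimal self-contained sketch, reusing the same arbitrary theta and pi as in the earlier snippet:)

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
theta  = lambda x: 0.2 + 0.6 * np.sin(np.pi * x[:, 0]) * np.sin(np.pi * x[:, 1])
pi_spy = lambda x: 0.1 + 0.8 * x[:, 0]   # the known spy-success probability

X = rng.uniform(size=(n, 2))
Y = rng.binomial(1, theta(X))
R = rng.binomial(1, pi_spy(X))

# Horvitz-Thompson: weight each observed dig by 1/pi(x); unobserved digs contribute zero.
# Unbiased for psi because E[R*Y/pi(X) | X] = pi(X)*theta(X)/pi(X) = theta(X).
psi_hat = np.mean(R * Y / pi_spy(X))
print(f"Horvitz-Thompson estimate of psi: {psi_hat:.3f}")
```

Because pi is known and bounded away from zero, this estimate converges to the true psi at the usual 1/sqrt(n) rate no matter what theta is -- which is exactly the guarantee described next -- and it never consults a prior over theta at all.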

Robins and Wasserman then argue that the frequentist approach gives the following guarantee: No matter what function theta(x) determines the probability of treasure at x, they only need to observe finitely many points before their estimate for psi is "close" to the true psi (which they define formally). They argue that Bayesians have a very hard time generating a prior that has this property. (They note that it is possible to construct a prior that yields an estimate similar to the frequentist estimate, but that this requires torturing the prior until it gives a frequentist answer, at which point, why not just become a frequentist?)

I say, sure, it's hard (though not impossible) for a Bayesian to get that sort of guarantee. But nothing is amiss here! Two points:

(a) They claim that it's disconcerting that the theta(xi) don't give a Bayesian much information about theta. They admit that there are priors on theta that allow you to get information about theta from finitely many theta(xi), but protest that these theta are pretty weird ("very very very smooth") if the dimensionality d of the island is very high. In which case I say, if you think that the theta(xi) can't tell you much about theta, then you shouldn't be learning about theta when you learn about the various theta(xi)! In fact, I'm suspicious of anyone who says they can, under these assumptions.

Also, I'm not completely convinced that "the observations are uninformative about theta" implies "the observations are uninformative about psi" -- I acknowledge that from theta you can compute psi, and thus in some sense theta is the "only unknown," but I think you might be able to construct a prior where you learn little about theta but lots about psi. (Maybe the i.i.d. assumption rules this possibility out? I'm not sure yet, I haven't done the math.) But assume we either don't have any way of getting information about psi except by integrating theta, or that we don't have a way of doing it except one that looks "tortured" (because otherwise their argument falls through anyway). That brings us to my second point:

(b) They ask for the property that, no matter what theta is the true theta, you, after only finitely many trials, assign very high probability to the true value of psi. That's a crazy demand! What if the true theta is one where learning finitely many theta(xi) doesn't give you any information about theta? If we have a theta such that my observations are telling me nothing about it, then I don't want to be slowly concentrating all my probability mass on one particular value of psi; that would be mad. (Unless the observations are giving me information about psi via some mechanism other than information about theta, which we're assuming is not the case.)

If the game is really working like they say it is, then the frequentist is often concentrating probability around some random psi for no good reason, and when we actually draw random thetas and check who predicted better, we'll see that they actually converged around completely the wrong values. Thus, I doubt the claim that, setting up the game exactly as given, the frequentist converges on the "true" value of psi. If we assume the frequentist does converge on the right answer, then I strongly suspect either (1) we should be using a prior where the observations are informative about psi even if they aren't informative about theta or (2) they're making an assumption that amounts to forcing us to use the "tortured" prior. I wouldn't be too surprised by (2), given that their demand on the posterior is a very frequentist demand, and so asserting that it's possible to zero in on the true psi using this data in finitely many steps for any theta may very well amount to asserting that the prior is the tortured one that forces a frequentist-looking calculation. They don't describe the "tortured prior" in the blog post, so I'm not sure what else to say here ¯\_(ツ)_/¯

There are definitely some parts of the argument I'm not following. For example, they claim that for simple functions pi, the Bayesian solution obviously works, but there's no single prior on theta which works for any pi no matter how complex. I'm very suspicious about this, and I wonder whether what they mean is that there's no sane prior which works for any pi, and that that's the place they're slipping the "but you can't be logically omniscient!" objection in, at which point yes, Bayesian reasoning is not the right tool. Unfortunately, I don't have any more time to spend digging at this problem. By and large, though, my conclusion is this:

If you set the game up as stated, and the observations are actually giving literally zero data about psi, then I will be sticking to my prior on psi, thankyouverymuch. If a frequentist assumes they can use pi to update and zooms off in one direction or another, then they will be wrong most of the time. If you also say the frequentist is performing well then I deny that the observations were giving no info. (By the time they've converged, the Bayesian must also have data on theta, or at least psi.) If it's possible to zero in on the true value of psi after finitely many observations, then I'm going to have to use a prior that allows me to do so, regardless of whether or not it appears tortured to you :-)

(Thanks to Benya for helping me figure out what the heck was going on here.)

Comment author: snarles 24 October 2015 02:43:38PM 2 points [-]

If the game is really working like they say it is, then the frequentist is often concentrating probability around some random psi for no good reason, and when we actually draw random thetas and check who predicted better, we'll see that they actually converged around completely the wrong values. Thus, I doubt the claim that, setting up the game exactly as given, the frequentist converges on the "true" value of psi. If we assume the frequentist does converge on the right answer, then I strongly suspect either (1) we should be using a prior where the observations are informative about psi even if they aren't informative about theta or (2) they're making an assumption that amounts to forcing us to use the "tortured" prior. I wouldn't be too surprised by (2),

The frequentist result does converge, and it is possible to make up a very artificial prior which allows you to converge to psi. But the fact that you can make up a prior that gives you the frequentist answer is not surprising.

A useful perspective is this: there are no Bayesian methods, and there are no frequentist methods. However, there are Bayesian justifications for methods ("it does well in the average case") and frequentist justifications ("it does well asymptotically or in a minimax sense"). If you construct a prior in order to converge to psi asymptotically, then you may be formally using Bayesian machinery, but the only justification you could possibly give for your method is completely frequentist.

Comment author: RichardKennaway 23 October 2015 03:54:45PM 1 point [-]

Ok. So the scenario is that you are sampling only from the population f(X)=1. Can you exhibit a simple example of the scenario in the section "A non-parametric Bayesian approach" with an explicit, simple class of functions g and distribution over them, for which the proposed procedure arrives at a better estimate of E[ Y | f(X)=1 ] than the sample average?

Is the idea that it is intended to demonstrate, simply that prior knowledge about the joint distribution of X and Y would, combined with the sample, give a better estimate than the sample alone?

Comment author: snarles 23 October 2015 04:03:13PM *  0 points [-]

Ok. So the scenario is that you are sampling only from the population f(X)=1.

EDIT: Correct, but you should not be too hung up on the issue of conditional sampling. The scenario would not change if we were sampling from the whole population. The important point is that we are trying to estimate a conditional mean of the form E[Y|f(X)=1]. This is a concept commonly seen in statistics. For example, the goal of non-parametric regression is to estimate a curve defined by g(x) = E[Y|X=x].

Can you exhibit a simple example of the scenario in the section "A non-parametric Bayesian approach" with an explicit, simple class of functions g and distribution over them, for which the proposed procedure arrives at a better estimate of E[ Y | f(X)=1 ] than the sample average?

The example I gave in my first reply (where g(x) is known to be either one of two known functions h(x) or j(x)) can easily be extended into the kind of fully specified counterexample you are looking for; I'm not going to bother to do it, because it's very tedious to write out and it's frankly a homework-level problem.
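That said, here is a minimal simulation sketch of the shape of that construction. Everything in it -- the two candidate curves h and j, taking X | f(X)=1 to be uniform on [0,1], and Y being a Bernoulli draw with success probability g(X) -- is made up purely for illustration; it is not the fully specified counterexample, just the outline of one.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical ingredients: on the region f(X)=1, X ~ Uniform(0,1) and
# Y ~ Bernoulli(g(X)), where g is known in advance to be one of two curves.
h = lambda x: 0.3 + 0.2 * x   # if g = h, then E[Y | f(X)=1] = 0.40
j = lambda x: 0.5 + 0.2 * x   # if g = j, then E[Y | f(X)=1] = 0.60
n, n_reps = 10, 20_000

sq_err_bayes, sq_err_mean = [], []
for _ in range(n_reps):
    g, truth = (h, 0.40) if rng.random() < 0.5 else (j, 0.60)   # 50/50 prior over {h, j}
    x = rng.uniform(size=n)
    y = rng.binomial(1, g(x))

    # Posterior over the two candidates from the Bernoulli likelihood.
    ll_h = np.sum(y * np.log(h(x)) + (1 - y) * np.log(1 - h(x)))
    ll_j = np.sum(y * np.log(j(x)) + (1 - y) * np.log(1 - j(x)))
    p_h = 1.0 / (1.0 + np.exp(ll_j - ll_h))

    sq_err_bayes.append((p_h * 0.40 + (1 - p_h) * 0.60 - truth) ** 2)
    sq_err_mean.append((y.mean() - truth) ** 2)

print(f"MSE, posterior over the two candidates: {np.mean(sq_err_bayes):.4f}")
print(f"MSE, naive sample average:              {np.mean(sq_err_mean):.4f}")
```

With these made-up numbers the posterior-weighted estimate comes out with noticeably lower MSE, because the prior information reduces the problem to choosing between two known means instead of estimating one from ten noisy observations.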

Is the idea that it is intended to demonstrate, simply that prior knowledge about the joint distribution of X and Y would, combined with the sample, give a better estimate than the sample alone?

The fact that prior information can improve your estimate is already well-known to statisticians. But statisticians disagree on whether or not you should try to model your prior information in the form of a Bayesian model. Some Bayesians have expressed the opinion that one should always do so. This post, along with Wasserman/Robins/Ritov's paper, provides counterexamples where the full non-parametric Bayesian model gives much worse results than the "naive" approach which ignores the prior.

Comment author: snarles 23 October 2015 03:55:27PM 6 points [-]

Update from the author:

Thanks for all of the comments and corrections! Based on your feedback, I have concluded that the article is a little bit too advanced (and possibly too narrow in focus) to be posted in the main section of the site. However, it is clear that there is a lot of interest in the general subject. Therefore, rather than posting this article to main, I think it would be more productive to write a "Philosophy of Statistics" sequence which would provide the necessary background for this kind of post.

Comment author: IlyaShpitser 21 October 2015 12:18:06AM *  2 points [-]

Still slightly confused.

I think Robins and Ritov have a theorem (cited in your blog link) claiming that, to get E[Y] when Y is MAR, you need to incorporate info about 1/p(x) somewhere into your procedure (the prior?) or you don't get uniform consistency. Is your claim that you can get around this via some hierarchical model, e.g.:

How about a hierarchical model, where first we draw a parameter p from the uniform distribution, and then draw g(x) from the uniform distribution over smooth functions with mean value equal to p? This gets you non-constant g(x) in the posterior, while your posteriors of E[g(X)] converge to the truth as quickly as in the Binomial example. Arguing backwards, I would say that such a prior comes closer to capturing my beliefs.

Is this just intuition or did you write this up somewhere? That sounds very interesting.


Why did you start thinking about conditional sampling at all? If estimating E[Y] via importance sampling/inverse weights/covariate adjustment is already something of a difficulty for Bayesians, why think about E[Y | event]? Isn't that trivially at least as hard?

Comment author: snarles 21 October 2015 12:55:56AM 2 points [-]

The confusion may come from mixing up my setup and Robins/Ritov's setup. There is no missing data in my setup.

I could write up my intuition for the hierarchical model. It's an almost trivial result if you don't assume smoothness, since for any x1,...,xn the parameters g(x1)...g(xn) are conditionally independent given p and distributed as F(p), where F is the maximum entropy Beta with mean p (I don't know the form of the parameters alpha(p) and beta(p) off-hand). Smoothness makes the proof much more difficult, but based on high-dimensional intuition one can be sure that it won't change the result substantially.
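Here is a quick simulation sketch of that non-smooth case, with one illustrative choice plugged in where I don't recall the exact form: I use Beta(c*p, c*(1-p)) for F(p), with an arbitrary concentration c, since any choice with conditional mean p gives the same qualitative behaviour. I'm also reading the setup as each Yi being a single Bernoulli draw with success probability g(xi). Then P(Yi = 1 | p) = E[g(xi) | p] = p and the Yi are conditionally independent given p, so the posterior on p -- and hence on E[g(X)] -- is exactly the Beta posterior from the Binomial example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

n = 200
c = 5.0                                # arbitrary Beta concentration (my assumption)
p_true = rng.uniform()                 # top-level parameter: p ~ Uniform(0, 1)
x = rng.uniform(size=n)                # the design points (their values don't matter here)
g = rng.beta(c * p_true, c * (1 - p_true), size=n)   # g(x_i) | p, i.i.d. with mean p
y = rng.binomial(1, g)                 # one Bernoulli observation per point

# Marginally P(y_i = 1 | p) = p and the y_i are conditionally independent given p,
# so the posterior on p is the ordinary Beta posterior from the Binomial example.
posterior = stats.beta(1 + y.sum(), 1 + n - y.sum())
lo, hi = posterior.ppf([0.025, 0.975])
print(f"true p = {p_true:.3f}, 95% posterior interval for E[g(X)]: ({lo:.3f}, {hi:.3f})")
```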

It is quite possible that estimating E[Y] and E[Y|event] are "equivalently hard", but they are both interesting problems with quite different real-world applications. The reason I chose to write about estimating E[Y|event] is because I think it is easier to explain than importance sampling.

Comment author: RichardKennaway 20 October 2015 08:11:23PM 0 points [-]

There are some (including myself and presumably some others on this board) who see this practice as epistemologically dubious. First, how do you decide which aspects of the problem to incorporate into your model?

That question must be directed at both the Bayesian and the frequentist. In my other comment I gave two toy examples, in one of which looking at a wider sample is provably inferior to looking only at f(X)=1, and one in which the reverse is the case. Anyone faced with the problem of estimating E[Y|f(X)=1] needs to decide, somehow, what observations to make.

How does a Bayesian or a frequentist make that decision?

Comment author: snarles 20 October 2015 09:36:50PM *  0 points [-]

I didn't reply to your other comment because although you are making valid points, you have veered off-topic since your initial comment. The question of "which observations to make?" is not a question of inference but rather one of experimental design. If you think this question is relevant to the discussion, it means that you understand neither the original post nor my reply to your initial comment. The questions I am asking have to do with what to infer after the observations have already been made.

Comment author: IlyaShpitser 20 October 2015 07:00:38PM *  1 point [-]

Not following. By "importance sampling distribution" do you mean the distribution that tells you whether Y is missing or not? If so, changing this distribution will change what you have to do to estimate E[Y] in the Robins/Wasserman case. For example, if you change the distribution to just depend on an independent coin flip, you move from "MAR" to "MCAR" (in causal inference, from "conditional ignorability" to "ignorability"). Then your procedure depends on this distribution (but your target does not, this is true). Similarly "p(y | do(a))" does not change, but the functional of the observed data equal to "p(y | do(a))" will change if you change the treatment assignment distribution.

(Btw, people do versions of ETT where D is complicated and not a simple treatment event. Actually I have something in a recent draft of mine called "effect of treatment on the indirectly treated" that's like that).

Comment author: snarles 20 October 2015 07:29:02PM *  1 point [-]

By "importance sampling distribution" do you mean the distribution that tells you whether Y is missing or not?

Right. You could say the cases of Y1|D=1 you observe in the population are an importance sample from Y1, the hypothetical population that would result if everyone in the population were treated. E[Y1], the quantity to be estimated, is the mean of this hypothetical population. The importance sampling weights are q(x) = Pr[D=1|x]/p(x), where p(x) is the marginal distribution (i.e. you invert these weights to get the average); the importance sampling distribution is the conditional density of X|D=1.
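Spelled out, under the usual consistency, ignorability, and positivity assumptions, the identity being leaned on is

$$
E[Y_1] \;=\; E\!\left[\frac{D\,Y}{\Pr[D=1 \mid X]}\right],
\qquad
\hat{E}[Y_1] \;=\; \frac{1}{n}\sum_{i=1}^{n} \frac{D_i\,Y_i}{\Pr[D_i=1 \mid X_i]},
$$

i.e. the treated outcomes you do observe get reweighted by the inverse of the propensity Pr[D=1|x], which is the "invert these weights" step above.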

Comment author: MrMind 20 October 2015 07:41:07AM 2 points [-]

I've read up to the introduction; I'll comment as I continue.
I've found three problems so far:

  • it's not true that for objective Bayesians (the subjectivists are those of the de Finetti school) any model and any prior are equally valid. The logical analysis of the problem and of the background information is the defining feature of the discipline, especially since the inference step reduces to the application of the product and negation rules.
    For example, in the problem you pose, we can analyze the background information and notice that: 1. we suppose that the outcomes are independent; 2. we know that the coin does indeed have a head and a tail; 3. we know nothing else about the coin. These three observations alone are sufficient to decide on a single model and a single prior.
    Choosing a different model or a different prior means starting from a different background information, and that amounts to answering questions about a problem that was not posed in the first place.

  • objective Bayesianism is just the logically correct way (as per Cox's theorem and further amendments) to assign probabilities to logical formulae. There's nothing in the discipline that forces anyone to find a universal model, and since one can do model comparison just as 'easily', any Bayesian can live happily in a many-models environment. What would be cool to have is a universal logical analysis tool, that is, something that inputs a verbal description of the problem and outputs the most general model that is warranted by the description. The MaxEnt principle is right now our best attempt at coming up with such a tool.

  • universal models already do exist: they are called universal semi-measures, and the most famous of these is the Solomonoff prior. This also means that it's true that there's not a single universal model, as you said, but you can also show that any two such models differ only in a finite initial 'segment', matching the different initial information encoded in the universal Turing machine used to measure Kolmogorov complexity.

Comment author: snarles 20 October 2015 06:53:20PM *  3 points [-]

I will go ahead and answer your first three questions

  1. Objective Bayesians might have "standard operating procedures" for common problems, but I bet you that I can construct realistic problems where two Objective Bayesians will disagree on how to proceed. At the very least the Objective Bayesians need an "Objective Bayesian manifesto" spelling out what the canonical procedures are. For the "coin-flipping" example, see my response to RichardKennaway where I ask whether you would still be content to treat the problem as coin-flipping if you had strong prior information on g(x).

  2. MaxEnt is not invariant to parameterization (see the sketch at the end of this comment), and I'm betting that there are examples where it works poorly. Far from being a "universal principle", it ends up being yet another heuristic, joining the ranks of asymptotic optimality, minimax, minimax relative to an oracle, etc. Not to say these are bad principles--each of them is very useful, but when and where to use them is still subjective.

  3. That would be great if you could implement a Solomonoff prior. It is hard to say whether implementing an approximate algorithmic prior which doesn't produce garbage is easier or harder than encoding the sum total of human scientific knowledge and heuristics into a Bayesian model, but I'm willing to bet that it is. (This third bet is not a serious bet; the first two are.)
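Regarding point 2, here is a quick sketch of the parameterization issue, using an arbitrary reparameterization phi = theta^2: applying MaxEnt to theta on [0,1] with no constraints gives the uniform density, but pushing that answer through the reparameterization does not give the density MaxEnt would have assigned to phi directly.

```python
import numpy as np

rng = np.random.default_rng(4)
theta = rng.uniform(size=1_000_000)   # the MaxEnt (uniform) density for theta on [0, 1]
phi = theta ** 2                      # a mere relabeling of the parameter: phi = theta^2

# If MaxEnt were parameterization-invariant, phi would also look uniform on [0, 1].
hist, _ = np.histogram(phi, bins=10, range=(0, 1), density=True)
print(np.round(hist, 2))   # heavily skewed toward 0, not the flat density MaxEnt assigns to phi
```

So the output of MaxEnt depends on which coordinate you treat as the "natural" one, and that choice is itself a subjective input.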

Comment author: CronoDAS 20 October 2015 05:43:56PM 1 point [-]

You're violating Jaynes's Infinity Commandment:

Never introduce an infinity into a probability problem except as the limit of finite processes!

Hence we need a prior over joint distributions of (X, Y). And yes, I do mean a prior distribution over probability distributions: we are saying that (X, Y) has some unknown joint distribution, which we treat as being drawn at random from a large collection of distributions. This is therefore a non-parametric Bayes approach: the term non-parametric means that the number of the parameters in the model is not finite.

Comment author: snarles 20 October 2015 06:43:57PM 2 points [-]

It is worth noting that the issue of non-consistency is just as troublesome in the finite setting. In fact, in one of Wasserman's examples he uses a finite (but large) space for X.
