
Eliezer_Yudkowsky comments on Bayesian Flame

Post author: cousin_it 26 July 2009 04:49PM (37 points)


Comment author: Eliezer_Yudkowsky 26 July 2009 07:58:58PM 4 points

"Coverage guarantees" is a frequentist concept. Can you explain where Bayesians fail by Bayesian lights? In the real world, somewhere?

Comment author: Cyan 26 July 2009 10:24:01PM 3 points

How about this: a Bayesian will always predict that she is perfectly calibrated, even though she knows the theorems proving she isn't.

Comment author: Eliezer_Yudkowsky 26 July 2009 11:56:58PM 7 points

A Bayesian will have a probability distribution over possible outcomes, some of which give her lower scores than her probabilistic expectation of average score, and some of which give her higher scores than this expectation.

I am unable to parse your above claim, and ask for specific math on a specific example. If you know your score will be lower than you expect, you should lower your expectation. If you know something will happen less often than the probability you assign, you should assign a lower probability. This sounds like an inconsistent epistemic state for a Bayesian to be in.

Comment author: Cyan 29 July 2009 02:32:24AM 2 points

I spent some time looking up papers, trying to find accessible ones. The main paper that kicked off the matching prior program is Welch and Peers, 1963, but you need access to JSTOR.

The best I can offer is the following example. I am estimating a large number of positive estimands. I have one noisy observation for each one; the noise is Gaussian with standard deviation equal to one. I have no information relating the estimands; per Jaynes, I give them independent priors, resulting in independent posteriors*. I do not have information justifying a proper prior. Let's say I use a flat prior over the positive real line. No matter the true value of each estimand, the sampling probability of the event "my posterior 90% quantile is greater than the estimand" exceeds 0.9 (for coverage plots of this kind, see Figure 6 of this working paper by D.A.S. Fraser). So the more estimands I analyze, the more sure I am that the intervals from 0 to my posterior 90% quantiles will contain more than 90% of the estimands: the posteriors claim a 90% calibration that the sampling distribution does not deliver.

I don't know if there's an exact matching prior in this problem, but I suspect the problem lacks the structure needed for one.

* This is a place I think Jaynes goes wrong: the quantities are best modeled as exchangeable, not independent. Equivalently, I put them in a hierarchical model. But this only kicks the problem of priors guaranteeing calibration up a level.

Comment author: Eliezer_Yudkowsky 29 July 2009 04:22:55AM 2 points

I'm sorry, but this paper contains more frequentist gibberish than I would really like to work through.

If you could be so kind, please state:

  • what the Bayesian is using as a prior and likelihood function;
  • what distribution the paper assumes the actual parameters are being drawn from, and what real causal process governs the appearance of the evidence.

If the two don't match, then of course the Bayesian posterior distributions, relative to the experimenter's higher knowledge, can appear poorly calibrated.

If the two do match, then the Bayesian should be well-calibrated. Sure looks QED-ish to me.

Comment author: Cyan 29 July 2009 05:08:56AM 6 points

The example doesn't come from the paper; I made it myself. You only need to believe the figure I cited -- don't bother with the rest of the paper.

Call the estimands mu_1 to mu_n; the data are x_1 to x_n. The prior over the mu parameters is flat on the positive orthant of R^n and zero elsewhere. The sampling distribution for x_i is Normal(mu_i, 1). I don't know the distribution the parameters actually follow. The causal process is irrelevant -- I'll stipulate that the sampling distribution is known exactly.

Call the 90% quantiles of my posterior distributions q_i. From the sampling perspective, these are random quantities, being monotonic functions of the data. Their sampling distributions satisfy the inequality Pr(q_i > mu_i | mu_i) > 0.9 for every positive mu_i: truncating the posterior at zero pushes its quantiles upward, so the one-sided intervals overcover. (Figure 6 of the paper I cited plots coverage of this kind.) As n goes to infinity, I become more and more sure that my posterior intervals of the form (0, q_i] do not have the 90% calibration my posteriors claim.

You might cite the improper prior as the source of the problem. However, if the parameter space were unrestricted and the prior flat over all of R^n, the posterior intervals would be correctly calibrated.
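
A minimal Monte Carlo sketch of this setup, assuming NumPy and SciPy: under the prior flat on (0, inf) the posterior is a normal truncated at zero, whose 90% quantile has the closed form q(x) = x + Phi^-1(0.9 + 0.1*Phi(-x)); under the prior flat on all of R it is just x + Phi^-1(0.9).

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    z90 = norm.ppf(0.9)

    def q90_positive(x):
        # Posterior of mu given x ~ N(mu, 1) under a prior flat on (0, inf)
        # is N(x, 1) truncated to mu > 0; solving its CDF for the 0.9 level
        # gives the 90% quantile in closed form.
        return x + norm.ppf(0.9 + 0.1 * norm.cdf(-x))

    for mu in [0.25, 1.0, 3.0, 10.0]:
        x = rng.normal(mu, 1.0, size=500_000)
        cov_pos = np.mean(q90_positive(x) > mu)   # prior flat on (0, inf)
        cov_flat = np.mean(x + z90 > mu)          # prior flat on all of R
        print(f"mu = {mu:5.2f}: Pr(q > mu) = {cov_pos:.3f} (restricted), "
              f"{cov_flat:.3f} (unrestricted)")

The unrestricted column sits at 0.900 for every mu; the restricted column starts near 1.0 for small mu and only approaches 0.9 as mu grows. The credible level and the sampling coverage disagree, which is all the example needs.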

But it really is fair to demand a proper prior. How could we determine that prior? Only by Bayesian updating from some pre-prior state of information to the prior state of information (or equivalently, by logical deduction, provided that the knowledge we update on is certain). Right away we run into the problem that Bayesian updating does not have calibration guarantees in general (and for this, you really ought to read the literature), so it's likely that any proper prior we might justify does not have a calibration guarantee.

Comment author: wedrifid 27 July 2009 01:04:01PM 1 point

How about this: a Bayesian will always predict that she is perfectly calibrated, even though she knows the theorems proving she isn't.

Wanna bet? Literally. Have a Bayesian make a whole bunch of predictions, then offer her bets with payoffs based on the apparent calibration the results reflect. See which bets she accepts and which she refuses.

Comment author: Cyan 27 July 2009 01:22:43PM 1 point

Are you volunteering?

Comment author: wedrifid 27 July 2009 01:43:55PM 0 points

Sure. :)

But let me warn you... I actually predict my calibration to be pretty darn awful.

Comment author: Cyan 27 July 2009 03:00:29PM 0 points

We need a trusted third party.

Comment author: wedrifid 27 July 2009 03:23:27PM 0 points

Find a candidate.

I was about to suggest we could just bet raw ego points by publicly posting here... but then I realised I'd prove my point just by playing.

It should be obvious, by the way, that if the predictions you have me make pertain to black boxes that you construct then I would only bet if the odds gave a money pump. There are few cases in which I would expect my calibration to be superior to what you could predict with complete knowledge of the distribution.

Comment author: Cyan 27 July 2009 03:33:34PM 1 point

It should be obvious, by the way, that if the predictions you have me make pertain to black boxes that you construct then I would only bet if the odds gave a money pump.

Phooey. There goes plan A.

Comment author: wedrifid 27 July 2009 03:56:39PM 0 points

;)

Comment author: Cyan 27 July 2009 04:11:02PM 0 points

Plan B involves trying to use some nasty posterior inconsistency results, so don't think you're out of the woods yet.

Comment author: cousin_it 26 July 2009 08:47:09PM 3 points

Of course not. If you choose to care only about the things Bayes can give you, it's a mathematical fact that you can't do better.

Comment author: wedrifid 26 July 2009 09:22:19PM 6 points

I didn't like the "by Bayesian lights" phrase either. What I take as the relevant part of the question is this:

Can you provide an example of a frequentist concept that can be used to make predictions in the real world for which a Bayesian prediction will fail?

"Bayesian answers don't give coverage guarantees" doesn't demonstrate anything by itself. The question is: could the application of Bayes give a prediction equal to or superior to the prediction about the real world implicit in a coverage guarantee?

If you can provide such an example then you will have proved many people to be wrong in a significant, fundamental way. But I haven't seen anything in this thread or in either of Cyan's which fits that category.

Comment author: cousin_it 26 July 2009 09:32:16PM 2 points

Once again: the real-world performance (as opposed to internal coherence) of the Bayesian method on any given problem depends on the prior you choose for that problem. If you have a well-calibrated prior, Bayes gives well-calibrated results equal or superior to those of any frequentist method. If you don't, science knows no general way to invent a prior that will reliably yield results superior to anything at all, frequentist methods included. For example, Jaynes spent a large part of his life searching for a method to create uninformative priors with maxent, but maxent still doesn't guarantee you anything beyond "cross your fingers".

Comment author: Eliezer_Yudkowsky 26 July 2009 09:33:43PM 14 points

If your prior is screwed up enough, you'll also misunderstand the experimental setup and the likelihood ratios. Frequentism depends on prior knowledge just as much as Bayesianism; it just doesn't have a good formal way of treating it.

Comment author: cousin_it 27 July 2009 06:34:02AM 3 points

I give you some numbers taken from a normal distribution with unknown mean and variance. If you're a frequentist, your honest estimate of the mean will be the sample mean. If you're a Bayesian, it will be some number off to the side, depending on whatever bullshit prior you managed to glean from my words above; you don't have the option of skipping that step, and you don't have the option of devising a prior that will always exactly match the frequentist conclusion, because math doesn't allow it in the general case. (I kinda equivocate on "honest estimate", but refusing to ever give point estimates doesn't speak well of a mathematician anyway.) So nah, Bayesianism depends on priors more, not "just as much".

If tomorrow Bayesians find a good formalization of "uninformative prior" and a general formula for devising them, you'll happily discard your old bullshit prior and go with the flow, thus admitting that your careful analysis of my words about "unknown normal distribution" today wasn't relevant at all. This is the fishiest part, IMO.

(Disclaimer: I am not a crazy-convinced frequentist. I'm a newbie trying to get good answers out of Bayesians, and some of the answers already given in these threads satisfy me perfectly well.)

Comment author: Cyan 27 July 2009 06:57:19AM 9 points

The normal distribution with unknown mean and variance was a bad choice for this example. It's the one case where everyone agrees what the uninformative prior is. (It's flat with respect to the mean and the log-variance.) This uninformative prior is also a matching prior -- posterior intervals are confidence intervals.
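
A minimal simulation sketch of that matching property, assuming NumPy and SciPy: under the prior flat in the mean and log-variance, the marginal posterior of the mean is a Student-t centered at the sample mean, so the central 90% posterior interval is exactly the classical t-interval.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    mu, sigma, n, trials = 5.0, 2.0, 10, 200_000

    x = rng.normal(mu, sigma, size=(trials, n))
    xbar = x.mean(axis=1)
    s = x.std(axis=1, ddof=1)

    # Under the prior flat in (mean, log sigma), the marginal posterior of
    # the mean is xbar + (s / sqrt(n)) * t_{n-1}, so the central 90% credible
    # interval coincides with the frequentist t-interval.
    half = stats.t.ppf(0.95, n - 1) * s / np.sqrt(n)
    print("coverage:", np.mean((xbar - half < mu) & (mu < xbar + half)))  # ~0.900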

Comment author: cousin_it 27 July 2009 07:27:33AM 2 points

I didn't know that was possible, thanks. (Wow, a prior with integral = infinity! One that can't be reached as a posterior after any observation! How'd a Bayesian come by that? But it seems to work regardless.) What would be a better example?

ETA: I believe the point raised in that comment still deserves an answer from Bayesians.

Comment author: wedrifid 27 July 2009 12:55:57PM 1 point

ETA: I believe the point raised in that comment still deserves an answer from Bayesians.

Done, but I think a more useful reply could be given if you provided an actual worked example where a frequentist tool leads you to make a different prediction than the application of Bayes would (and where you prefer the frequentist prediction). Something with numbers in it, and with the frequentist prediction provided.

Comment author: Cyan 27 July 2009 02:42:58PM 3 points

Here's one. There is one data point, distributed according to 0.5*N(0,1) + 0.5*N(mu,1).

Bayes: any improper prior for mu yields an improper posterior (because there's a 50% chance that the data are not informative about mu). Any proper prior has no calibration guarantee.

Frequentist: Neyman's confidence belt construction guarantees valid confidence coverage of the resulting interval. If the datum is close to 0, the interval may be the whole real line. This is just what we want [claims the frequentist, not me!]; after all, when the datum is close to 0, mu really could be anything.
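
A sketch of one concrete version of that belt (equal-tail inversion), assuming NumPy and SciPy: keep every mu for which the datum falls in neither 5% tail of that mu's sampling distribution. Since F_mu(x) is uniformly distributed when mu is true, the set covers the true mu exactly 90% of the time, whatever mu is.

    import numpy as np
    from scipy.stats import norm

    def mixture_cdf(x, mu):
        # CDF at the datum x of 0.5*N(0,1) + 0.5*N(mu,1)
        return 0.5 * norm.cdf(x) + 0.5 * norm.cdf(x - mu)

    def confidence_set(x, mu_grid, alpha=0.10):
        # Equal-tail Neyman inversion: keep mu unless the datum falls in
        # one of its alpha/2 sampling tails.
        F = mixture_cdf(x, mu_grid)
        return mu_grid[(F >= alpha / 2) & (F <= 1 - alpha / 2)]

    mu_grid = np.linspace(-40.0, 40.0, 160_001)
    for x in (0.3, 5.0):
        kept = confidence_set(x, mu_grid)
        print(f"x = {x}: mu kept on [{kept.min():.2f}, {kept.max():.2f}], "
              f"{kept.size / mu_grid.size:.0%} of the grid")

For x = 0.3 the whole grid survives (any datum with |x| < 1.28 keeps the entire real line), while for x = 5.0 the set is the half-line from about 3.72 up: arbitrarily large mu is never rejected, because the datum could always have come from the N(0,1) component.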

Comment author: Erik 27 July 2009 12:39:47PM 1 point

It's called an improper prior. There's been some argument about their use, but they seldom lead to problems. The posteriors usually have much better behavior at infinity, and when they don't, that's the theory telling us that the information doesn't determine a solution to the problem.

The observation that an improper prior cannot be obtained as a posterior distribution is kind of trivial. It is meant to represent a total lack of information w.r.t. some parameter. As soon as you have made an observation, you have more information than that.

Comment author: prase 27 July 2009 03:26:28PM 0 points

Maybe the difference lies in the format of the answers?

  • We know: a set of n outputs of a random number generator with a normal distribution, say {3.2, 4.5, 8.1}.
  • We don't know: the mean m and variance v.
  • Your proposed answer: m = 5.26, v = 6.44.
  • A Bayesian's answer: a probability distribution P(m) over the mean and another distribution Q(v) over the variance.

How does a frequentist get them? If he doesn't have them, what's his confidence in m = 5.26 and v = 6.44? What if the set contains only one number: what is the frequentist's estimate for v? Note that a Bayesian has no problem even if the data set is empty; he simply rests on his priors. If the data set is large, the Bayesian's answer will inevitably converge to a delta function around the frequentist's estimate, no matter what the priors are.

Comment author: cousin_it 27 July 2009 03:36:43PM 1 point

http://www.xuru.org/st/DS.asp

50% confidence interval for mean: 4.07 to 6.46, stddev: 2.15 to 4.74

90% confidence interval for mean: 0.98 to 9.55, stddev: 1.46 to 11.20

If there's only one sample, the calculation fails due to division by n - 1 = 0, so the frequentist says "no answer". The Bayesian says the same if he used the improper prior Cyan mentioned.
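
Those intervals are easy to reproduce (a sketch assuming NumPy and SciPy; it matches the calculator's output to rounding). Per Cyan's comment above, the same intervals are also the Bayesian posterior intervals under the prior flat in the mean and log standard deviation.

    import numpy as np
    from scipy import stats

    x = np.array([3.2, 4.5, 8.1])
    n, xbar, s = len(x), x.mean(), x.std(ddof=1)   # n = 3, xbar = 5.27, s = 2.54

    for level in (0.50, 0.90):
        a = (1 - level) / 2
        half = stats.t.ppf(1 - a, n - 1) * s / np.sqrt(n)   # t-interval for the mean
        sd_lo = np.sqrt((n - 1) * s**2 / stats.chi2.ppf(1 - a, n - 1))
        sd_hi = np.sqrt((n - 1) * s**2 / stats.chi2.ppf(a, n - 1))
        print(f"{level:.0%}: mean {xbar - half:.2f} to {xbar + half:.2f}, "
              f"stddev {sd_lo:.2f} to {sd_hi:.2f}")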

Comment author: wedrifid 27 July 2009 12:48:28PM 1 point

I give you some numbers taken from a normal distribution with unknown mean and variance. If you're a frequentist, your honest estimate of the mean will be the sample mean. If you're a Bayesian, it will be some number off to the side, depending on whatever bullshit prior you managed to glean from my words above; you don't have the option of skipping that step, and you don't have the option of devising a prior that will always exactly match the frequentist conclusion, because math doesn't allow it in the general case. (I kinda equivocate on "honest estimate", but refusing to ever give point estimates doesn't speak well of a mathematician anyway.) So nah, Bayesianism depends on priors more, not "just as much".

A Bayesian does not have the option of 'just skipping that step' by accepting whichever prior was implicitly mandated by Fisher (or whichever other statistician created or insisted upon the use of the particular tool in question). It does not follow that the Bayesian is relying on 'bullshit' more than the frequentist. In fact, when I use the label 'bullshit' I usually mean 'the use of authority or social power mechanisms in lieu of or in direct defiance of reason'. I obviously apply 'bullshit prior' to the frequentist option in this case.

Comment author: cousin_it 27 July 2009 02:25:13PM 2 points

A Bayesian does not have the option of 'just skipping that step' and choosing to accept whichever prior was mandated by Fisher

Why in the world doesn't a Bayesian have that option? I thought you were a free people. :-) How'd you decide to reject those priors in favor of other ones, anyway? As far as I currently understand, there's no universally accepted mathematical way to pick the best prior for every given problem, and no psychologically coherent way to pick it out of your head either, because it ain't there. In addition to that, here's some anecdotal evidence: I never ever heard of a Bayesian agent accepting or rejecting a prior.

Comment author: wedrifid 27 July 2009 02:56:50PM 0 points

That was a partial quote and partial paraphrase of the claim made by cousin_it (hang on, that's you! huh?). I thought the "we are a free people and can use the frequentists' implicit priors whenever they happen to be the best available" claim had been made more than enough times, so I left off that nitpick and focussed on my core gripe with the post in question: the suggestion that using priors because tradition tells you to makes them less 'bullshit'.

I think your inclusion of 'just' allows for the possibility that, of all possible configurations of prior probabilities, the frequentist one happens to be the one worth choosing.

I never ever heard of a Bayesian agent accepting or rejecting a prior.

I'm confused. What do you mean by accepting or rejecting a prior?

Comment author: cousin_it 27 July 2009 03:07:42PM 0 points

Funny as it is, I don't contradict myself. A Bayesian doesn't have the option of skipping the prior altogether, but does have the option of picking priors with frequentist justifications, which option you call "bullshit", though for the life of me I can't tell how you can tell.

Frequentists have valid reasons for their procedures besides tradition: the procedures can be shown to always work, in a certain sense. On the other hand, I know of no Bayesian-prior-generating procedure that can be shown to work in this sense or any other sense.

I'm confused. What do you mean by accepting or rejecting a prior?

Some priors are very bad. If a Bayesian somehow ends up with such a prior, they're SOL because they have no notion of rejecting priors.

Comment author: orthonormal 27 July 2009 07:17:32PM 1 point

Vocabulary nitpick: I believe you wrote "in luew of" in lieu of "in lieu of".

Sorry, couldn't help it. IAWYC, anyhow.

Comment author: wedrifid 27 July 2009 08:11:59PM 0 points

Damn that word and its excessive vowels!