Eliezer_Yudkowsky comments on Bayesian Flame - Less Wrong

37 Post author: cousin_it 26 July 2009 04:49PM

Comments (155)

Comment author: Eliezer_Yudkowsky 26 July 2009 09:33:43PM 14 points [-]

If your prior is screwed up enough, you'll also misunderstand the experimental setup and the likelihood ratios. Frequentism depends on prior knowledge just as much as Bayesianism; it just doesn't have a good formal way of treating it.

Comment author: cousin_it 27 July 2009 06:34:02AM *  3 points [-]

I give you some numbers taken from a normal distribution with unknown mean and variance. If you're a frequentist, your honest estimate of the mean will be the sample mean. If you're a Bayesian, it will be some number off to the side, depending on whatever bullshit prior you managed to glean from my words above - and you don't have the option of skipping that step, and don't have the option of devising a prior that will always exactly match the frequentist conclusion because math doesn't allow it in the general case. (I kinda equivocate on "honest estimate", but refusing to ever give point estimates doesn't speak well of a mathematician anyway.) So nah, Bayesianism depends on priors more, not "just as much".

If tomorrow Bayesians find a good formalization of "uninformative prior" and a general formula to devise them, you'll happily discard your old bullshit prior and go with the flow, thus admitting that your careful analysis of my words about "unknown normal distribution" today wasn't relevant at all. This is the most fishy part IMO.

(Disclaimer: I am not a crazy-convinced frequentist. I'm a newbie trying to get good answers out of Bayesians, and some of the answers already given in these threads satisfy me perfectly well.)

Comment author: Cyan 27 July 2009 06:57:19AM 9 points [-]

The normal distribution with unknown mean and variance was a bad choice for this example. It's the one case where everyone agrees what the uninformative prior is. (It's flat with respect to the mean and the log-variance.) This uninformative prior is also a matching prior -- posterior intervals are confidence intervals.

Comment author: cousin_it 27 July 2009 07:27:33AM *  2 points [-]

I didn't know that was possible, thanks. (Wow, a prior with integral=infinity! One that can't be reached as a posterior after any observation! How'd a Bayesian come by that? But seems to work regardless.) What would be a better example?

ETA: I believe the point raised in that comment still deserves an answer from Bayesians.

Comment author: wedrifid 27 July 2009 12:55:57PM 1 point [-]

ETA: I believe the point raised in that comment still deserves an answer from Bayesians.

Done, but I think a more useful reply could be given if you provided an actual worked example where a frequentist tool leads you to make a different prediction than the application of Bayes would (and where you prefer the frequentist prediction.) Something with numbers in it and with the frequentist prediction provided.

Comment author: Cyan 27 July 2009 02:42:58PM *  3 points [-]

Here's one. There is one data point, distributed according to 0.5*N(0,1) + 0.5*N(mu,1).

Bayes: any improper prior for mu yields an improper posterior (because there's a 50% chance that the data are not informative about mu). Any proper prior has no calibration guarantee.

Frequentist: Neyman's confidence belt construction guarantees valid confidence coverage of the resulting interval. If the datum is close to 0, the interval may be the whole real line. This is just what we want [claims the frequentist, not me!]; after all, when the datum is close to 0, mu really could be anything.
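
The claim about improper priors above can be written out in one line. This is only a sketch: the symbols \pi (the prior on mu) and \varphi (the standard normal density) are notation introduced here, not taken from the comment. Given the single datum x, the unnormalized posterior and its integral are:

```latex
p(\mu \mid x) \;\propto\; \pi(\mu)\left[\tfrac12\,\varphi(x) + \tfrac12\,\varphi(x-\mu)\right],
\qquad
\int_{-\infty}^{\infty} \pi(\mu)\left[\tfrac12\,\varphi(x) + \tfrac12\,\varphi(x-\mu)\right] d\mu
\;\ge\; \tfrac12\,\varphi(x)\int_{-\infty}^{\infty}\pi(\mu)\,d\mu \;=\; \infty .
```

The first term does not depend on mu at all (it is the "datum came from N(0,1)" branch), so whenever the prior has infinite total mass the posterior does too and cannot be normalized.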

Comment author: PhilGoetz 04 August 2009 05:30:16PM 0 points [-]

Can you explain the terms "calibration guarantee", and what "the resulting interval" is? Also, I don't understand why you say there is a 50% chance the data is not informative about mu. This is not a multi-modal distribution; it is blended from N(0,1) and N(mu,1). If mu can be any positive or negative number, then the one data point will tell you whether mu is positive or negative with probability 1.

Comment author: Cyan 04 August 2009 07:55:02PM *  2 points [-]

Can you explain the terms "calibration guarantee"...

By "calibration guarantee" I mean valid confidence coverage: if I give a number of intervals at a stated confidence, then the relative frequency with which the estimated quantities fall within the interval is guaranteed to approach the stated confidence as the number of estimated quantities grows. Here we might imagine a large number of mu parameters and one datum per parameter.
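
The "many mu parameters, one datum per parameter" picture can be tried out by simulation. The sketch below is a deliberately simpler stand-in and not the mixture problem from this thread: it assumes each datum is drawn from N(mu, 1) and uses the textbook interval x ± 1.645 at a stated confidence of 90%. The function name `coverage` and the range the mu values are drawn from are illustrative assumptions.

```python
import random

def coverage(n_params=20000, z=1.645, seed=0):
    """Fraction of intervals [x - z, x + z] that contain the mu that generated x."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_params):
        mu = rng.uniform(-50.0, 50.0)  # arbitrary choice; any way of picking mu would do
        x = rng.gauss(mu, 1.0)         # one datum per parameter
        if x - z <= mu <= x + z:
            hits += 1
    return hits / n_params

print(coverage())  # relative frequency; should be close to the stated 0.90
```

The guarantee holds however the mu values are chosen, which is what distinguishes it from a statement that depends on a prior.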

... and what "the resulting interval" is?

Not easily. The second cousin of this post (a reply to wedrifid) contains a link to a paper on arXiv that gives a bare-bones overview of how confidence intervals can be constructed on page 3. When you've got that far I can tell you what interval I have in mind.

Also, I don't understand why you say there is a 50% chance the data is not informative about mu. This is not a multi-modal distribution; it is blended from N(0,1) and N(mu,1).

I think there's been a misunderstanding somewhere. Let Z be a fair coin toss. If it comes up heads the datum is generated from N(0,1); if it comes up tails, the datum is generated from N(mu,1). Z is unobserved and mu is unknown. The probability distribution of the datum is as stated above. It will be multimodal if the absolute value of mu is greater than 2 (according to some quick plots I made; I did not do a mathematical proof).
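
The "quick plots" check on multimodality can be redone numerically. This is a minimal sketch that counts local maxima of the mixture density on a grid; the function names and the grid step are assumptions made for illustration, and a grid check is not a proof either.

```python
import math

def mixture_pdf(x, mu):
    """Density of 0.5*N(0,1) + 0.5*N(mu,1) at x."""
    def phi(t):
        return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    return 0.5 * phi(x) + 0.5 * phi(x - mu)

def count_modes(mu, step=0.001):
    """Count strict local maxima of the density on a grid covering both components."""
    lo = min(0.0, mu) - 5.0
    hi = max(0.0, mu) + 5.0
    n = int(round((hi - lo) / step))
    ys = [mixture_pdf(lo + i * step, mu) for i in range(n + 1)]
    return sum(1 for i in range(1, n) if ys[i - 1] < ys[i] > ys[i + 1])

for mu in (1.5, 3.0):
    print(mu, count_modes(mu))
```

Values of mu very close to 2 in absolute value sit on the boundary, where the density is nearly flat in the middle and a grid count is not reliable.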

If mu can be any positive or negative number, then the one data point will tell you whether mu is positive or negative with probability 1.

If I observe the datum 0.1, is mu greater than or less than 0?

Comment author: wedrifid 29 July 2009 07:37:31PM 0 points [-]

Thanks Cyan.

I'll get back to you when (and if) I've had time to get my head around Neyman's confidence belt construction, with which I've never had cause to acquaint myself.

Comment author: Cyan 29 July 2009 08:46:59PM *  0 points [-]

This paper has a good explanation. Note that I've left one of the steps (the "ordering" that determines inclusion into the confidence belt) undetermined. I'll tell you the ordering I have in mind if you get to the point of wanting to ask me.

Comment author: wedrifid 29 July 2009 11:33:14PM 0 points [-]

That's a lot of integration to get my head around.

Comment author: Cyan 30 July 2009 12:05:02AM *  0 points [-]

All you need is page 3 (especially the figure). If you understand that in depth, then I can tell you what the confidence belt for my problem above looks like. Then I can give you a simulation algorithm and you can play around and see exactly how confidence intervals work and what they can give you.

Comment author: Erik 27 July 2009 12:39:47PM *  1 point [-]

It's called an improper prior. There's been some argument about their use but they seldom lead to problems. The posteriors usually have much better behavior at infinity and when they don't, that's the theory telling us that the information doesn't determine the solution to the problem.

The observation that an improper prior cannot be obtained as a posterior distribution is kind of trivial. It is meant to represent a total lack of information w.r.t. some parameter. As soon as you have made an observation you have more information than that.

Comment author: prase 27 July 2009 03:26:28PM *  0 points [-]

Maybe the difference lies in the format of answers?

  • We know: set of n outputs of a random number generator with normal distribution. Say {3.2, 4.5, 8.1}.
  • We don't know: mean m and variance v.
  • Your proposed answer: m = 5.26, v = 6.44.
  • A Bayesian's answer: a probability distribution P(m) of the mean and another distribution Q(v) of the variance.

How does a frequentist get them? If he doesn't have them, what's his confidence in m = 5.26 and v = 6.44? What if the set contains only one number - what is the frequentist's estimate for v? Note that a Bayesian has no problem even if the data set is empty; he simply falls back on his priors. If the data set is large, the Bayesian's answer will inevitably converge to a delta function around the frequentist's estimate, no matter what the priors are.

Comment author: cousin_it 27 July 2009 03:36:43PM *  1 point [-]

http://www.xuru.org/st/DS.asp

50% confidence interval for mean: 4.07 to 6.46, stddev: 2.15 to 4.74

90% confidence interval for mean: 0.98 to 9.55, stddev: 1.46 to 11.20

If there's only one sample, the calculation fails due to division by n-1, so the frequentist says "no answer". The Bayesian says the same if he used the improper prior Cyan mentioned.
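
For anyone who wants to check those numbers without the linked calculator, here is a rough sketch for the data set {3.2, 4.5, 8.1} from the parent comment. It assumes the standard t interval for the mean and the equal-tailed chi-square interval for the standard deviation; whether the calculator uses exactly these is not known. With n = 3 there are 2 degrees of freedom, for which both quantile functions have closed forms, so no statistics library is needed. The function names are made up for this example.

```python
import math
import statistics

def t_quantile_df2(p):
    """Quantile of Student's t with 2 degrees of freedom (closed form)."""
    return (2.0 * p - 1.0) / math.sqrt(2.0 * p * (1.0 - p))

def chi2_quantile_df2(p):
    """Quantile of chi-square with 2 degrees of freedom (an exponential with mean 2)."""
    return -2.0 * math.log(1.0 - p)

def intervals_n3(data, conf):
    """Equal-tailed confidence intervals for mean and stddev; only valid for 3 data points."""
    if len(data) != 3:
        raise ValueError("the closed-form quantiles used here assume n - 1 == 2")
    n = 3
    m = statistics.mean(data)
    s = statistics.stdev(data)            # sample stddev, divides by n - 1
    a = (1.0 - conf) / 2.0                # tail probability on each side
    half = t_quantile_df2(1.0 - a) * s / math.sqrt(n)
    ss = (n - 1) * s * s                  # sum of squared deviations
    sd_lo = math.sqrt(ss / chi2_quantile_df2(1.0 - a))
    sd_hi = math.sqrt(ss / chi2_quantile_df2(a))
    return (m - half, m + half), (sd_lo, sd_hi)

for conf in (0.5, 0.9):
    print(conf, intervals_n3([3.2, 4.5, 8.1], conf))
```

By hand calculation this agrees with the figures quoted above to within about 0.01; the small differences in the last digit look like rounding. The same mean interval is what a Bayesian gets as a posterior interval under the flat prior on the mean and log-variance mentioned earlier in the thread.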

Comment author: prase 27 July 2009 03:59:26PM *  0 points [-]

Hm, should I take it that the frequentist assumes a normal distribution for the mean value, with its peak at the estimated 5.26?

If so, then frequentism = bayes + flat prior.

Improper priors are, however, quite tricky; they may lead to paradoxes such as the two-envelope paradox.

Comment author: cousin_it 27 July 2009 04:02:42PM *  0 points [-]

The prior for variance that matches the frequentist conclusion isn't flat. And even if it were, a flat prior for variance implies a non-flat prior for standard deviation and vice versa. :-)
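
The second point is just the change-of-variables rule for densities. As a sketch, writing v for the variance and \sigma = \sqrt{v} for the standard deviation (notation chosen here for illustration):

```latex
p_\sigma(\sigma) \;=\; p_v(\sigma^2)\,\left|\frac{dv}{d\sigma}\right| \;=\; 2\sigma\, p_v(\sigma^2),
\qquad\text{so}\qquad
p_v(v) \propto 1 \;\Rightarrow\; p_\sigma(\sigma) \propto \sigma,
\qquad
p_\sigma(\sigma) \propto 1 \;\Rightarrow\; p_v(v) \propto v^{-1/2}.
```

By the same rule, a prior that is flat in the log-variance gives p_v(v) \propto 1/v and p_\sigma(\sigma) \propto 1/\sigma, so it has the same form in either parameterization.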

Comment author: prase 27 July 2009 04:48:39PM 0 points [-]

Of course, I meant flat distribution of the mean. The variance cannot be negative at least.

Comment author: Cyan 27 July 2009 03:46:27PM 0 points [-]

Using the flat improper prior I was talking about before, when there's only one data point the posterior distribution is improper, so the Bayesian answer is the same as the frequentist's.

Comment author: wedrifid 27 July 2009 12:48:28PM *  1 point [-]

I give you some numbers taken from a normal distribution with unknown mean and variance. If you're a frequentist, your honest estimate of the mean will be the sample mean. If you're a Bayesian, it will be some number off to the side, depending on whatever bullshit prior you managed to glean from my words above - and you don't have the option of skipping that step, and don't have the option of devising a prior that will always exactly match the frequentist conclusion because math doesn't allow it in the general case. (I kinda equivocate on "honest estimate", but refusing to ever give point estimates doesn't speak well of a mathematician anyway.) So nah, Bayesianism depends on priors more, not "just as much".

A Bayesian does not have the option of 'just skipping that step' and choosing to accept whichever prior was mandated by Fisher (or whichever other statistician created or insisted upon the use of the particular tool in question). It does not follow that the Bayesian is relying on 'bullshit' more than the frequentist. In fact, when I use the label 'bullshit' I usually mean 'the use of authority or social power mechanisms in lieu of or in direct defiance of reason'. I obviously apply 'bullshit prior' to the frequentist option in this case.

Comment author: cousin_it 27 July 2009 02:25:13PM *  2 points [-]

A Bayesian does not have the option of 'just skipping that step' and choosing to accept whichever prior was mandated by Fisher

Why in the world doesn't a Bayesian have that option? I thought you were a free people. :-) How'd you decide to reject those priors in favor of other ones, anyway? As far as I currently understand, there's no universally accepted mathematical way to pick the best prior for every given problem and no psychologically coherent way to pick it out of your head either, because it ain't there. In addition to that, here's some anecdotal evidence: I never ever heard of a Bayesian agent accepting or rejecting a prior.

Comment author: wedrifid 27 July 2009 02:56:50PM *  0 points [-]

That was a partial quote and partial paraphrase of the claim made by cousin_it (hang on, that's you! huh?). I thought that the "we are a free people and can use the frequentist implicit priors whenever they happen to be the best available" claim had been made more than enough times so I left off that nitpick and focussed on my core gripe with the post in question. That is, the suggestion that using priors because tradition tells you to makes them less 'bullshit'.

I think your inclusion of 'just' allows for the possibility that of all possible configurations of prior probabilities the frequentist one so happens to be the one worth choosing.

I never ever heard of a Bayesian agent accepting or rejecting a prior.

I'm confused. What do you mean by accepting or rejecting a prior?

Comment author: cousin_it 27 July 2009 03:07:42PM *  0 points [-]

Funny as it is, I don't contradict myself. A Bayesian doesn't have the option of skipping the prior altogether, but does have the option of picking priors with frequentist justifications, which option you call "bullshit", though for the life of me I can't tell how you can tell.

Frequentists have valid reasons for their procedures besides tradition: the procedures can be shown to always work, in a certain sense. On the other hand, I know of no Bayesian-prior-generating procedure that can be shown to work in this sense or any other sense.

I'm confused. What do you mean by accepting or rejecting a prior?

Some priors are very bad. If a Bayesian somehow ends up with such a prior, they're SOL because they have no notion of rejecting priors.

Comment author: wedrifid 27 July 2009 05:14:30PM 4 points [-]

Some priors are very bad. If a Bayesian somehow ends up with such a prior, they're SOL because they have no notion of rejecting priors.

There are two priors for A that a Bayesian is unable to update from: p(A) = 0 and p(A) = 1. If a Bayesian ever assigns p(A) = 0 or p(A) = 1 and is mistaken, then they fail at life. No second chances. Shalizi's hypothetical agent started with the absolute (and insane) belief that the distribution was not a mix of the two Gaussians in question. That did not change through the application of Bayes' rule.

Bayesians cannot reject a prior of 0. They can 'reject' a prior of "That's definitely not going to happen. But if I am faced with overwhelming evidence then I may change my mind a bit." They just wouldn't write that state as p=0 or imply it through excluding it from a simplified model without being willing to review the model for sanity afterward.
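
The point about priors of 0 and 1 can be seen by running Bayes' rule directly. A minimal sketch follows; the function name and the likelihood values are made up for illustration.

```python
def bayes_update(prior, lik_if_true, lik_if_false):
    """Posterior P(A | evidence) from the prior P(A) and the two likelihoods."""
    num = prior * lik_if_true
    den = num + (1.0 - prior) * lik_if_false
    return num / den

# Five pieces of evidence, each 1000 times more likely if A is true.
for prior in (0.0, 1e-9, 0.5):
    p = prior
    for _ in range(5):
        p = bayes_update(p, 0.999, 0.000999)
    print(prior, "->", p)
```

A prior of exactly 0 stays at 0 however strong the evidence, while a prior of one in a billion is pulled close to 1 by the same evidence. That is the difference between "definitely not" and "almost certainly not".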

Comment author: janos 27 July 2009 03:48:23PM 0 points [-]

I am trying to understand the examples on that page, but they seem strange; shouldn't there be a model with parameters, and a prior distribution for those parameters? I don't understand the inferences. Can someone explain?

Comment author: cousin_it 27 July 2009 03:52:58PM *  0 points [-]

Well, the first example is a model with a single parameter. Roughly speaking, the Bayesian initially believes that the true model is either a Gaussian around 1, or a Gaussian around -1. The actual distribution is a mix of those two, so the Bayesian has no chance of ever arriving at the truth (the prior for the truth is zero), instead becoming over time more and more comically overconfident in one of the initial preposterous beliefs.
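
That description can be turned into a small simulation. This is a sketch under the assumptions stated in the comment above (unit-variance Gaussians at +1 and -1, an equal-weight mixture as the true source, and a 50/50 starting prior over the two hypotheses); it is not taken from Shalizi's page, and the function names are invented here.

```python
import math
import random

def posterior_plus(data, prior_plus=0.5):
    """P(model is N(+1,1) | data) when the only alternative considered is N(-1,1)."""
    # log N(x; +1, 1) - log N(x; -1, 1) simplifies to 2 * x
    log_odds = math.log(prior_plus / (1.0 - prior_plus)) + 2.0 * sum(data)
    # numerically stable logistic function
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)

def sample_mixture(n, rng):
    """Draw n points from the true source, 0.5*N(+1,1) + 0.5*N(-1,1)."""
    return [rng.gauss(rng.choice((-1.0, 1.0)), 1.0) for _ in range(n)]

rng = random.Random(0)
data = sample_mixture(10000, rng)
for n in (10, 100, 1000, 10000):
    print(n, posterior_plus(data[:n]))
```

The log odds are twice the running sum of the data, which is a zero-mean random walk under the true mixture. Its typical size grows like the square root of the sample size, so in most runs the posterior ends up extremely close to 0 or 1 even though neither hypothesis is true.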

Comment author: orthonormal 27 July 2009 07:17:32PM 1 point [-]

Vocabulary nitpick: I believe you wrote "in luew of" in lieu of "in lieu of".

Sorry, couldn't help it. IAWYC, anyhow.

Comment author: wedrifid 27 July 2009 08:11:59PM 0 points [-]

Damn that word and its excessive vowels!