Bayesian Flame

37 Post author: cousin_it 26 July 2009 04:49PM

There once lived a great man named E.T. Jaynes. He knew that Bayesian inference is the only way to do statistics logically and consistently, standing on the shoulders of misunderstood giants Laplace and Gibbs. On numerous occasions he vanquished traditional "frequentist" statisticians with his superior math, demonstrating to anyone with half a brain how the Bayesian way gives faster and more correct results in each example. The weight of evidence falls so heavily on one side that it makes no sense to argue anymore. The fight is over. Bayes wins. The universe runs on Bayes-structure.

Or at least that's what you believe if you learned this stuff from Overcoming Bias.

Like I did until two days ago, when Cyan hit me over the head with something utterly incomprehensible. I suddenly had to go out and understand this stuff, not just believe it. (The original intention, if I remember it correctly, was to impress you all by pulling a Jaynes.) Now I've come back and intend to provoke a full-on flame war on the topic. Because if we can have thoughtful flame wars about gender but not math, we're a bad community. Bad, bad community.

If you're like me two days ago, you kinda "understand" what Bayesians do: assume a prior probability distribution over hypotheses, use evidence to morph it into a posterior distribution over same, and bless the resulting numbers as your "degrees of belief". But chances are that you have a very vague idea of what frequentists do, apart from deriving half-assed results with their ad hoc tools.

Well, here's the ultra-short version: frequentist statistics is the art of drawing true conclusions about the real world instead of assuming prior degrees of belief and coherently adjusting them to avoid Dutch books.

And here's an ultra-short example of what frequentists can do: estimate 100 independent unknown parameters from 100 different sample data sets and have 90 of the estimates turn out to be true to fact afterward. Like, fo'real. Always 90% in the long run, truly, irrevocably and forever. No Bayesian method known today can reliably do the same: the outcome will depend on the priors you assume for each parameter. I don't believe you're going to get lucky with all 100. And even if I believed you a priori (ahem) that don't make it true.

(That's what Jaynes did to achieve his awesome victories: use trained intuition to pick good priors by hand on a per-sample basis. Maybe you can learn this skill somewhere, but not from the Intuitive Explanation.)

How in the world do you do inference without a prior? Well, the characterization of frequentist statistics as "trickery" is totally justified: it has no single coherent approach and the tricks often give conflicting results. Most everybody agrees that you can't do better than Bayes if you have a clear-cut prior; but if you don't, no one is going to kick you out. We sympathize with your predicament and will gladly sell you some twisted technology!

Confidence intervals: imagine you somehow process some sample data to get an interval. Further imagine that hypothetically, for any given hidden parameter value, this calculation algorithm applied to data sampled under that parameter value yields an interval that covers it with probability 90%. Believe it or not, this perverse trick works 90% of the time without requiring any prior distribution on parameter values.
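A minimal sketch of that coverage claim, assuming normally distributed data with a known standard deviation (an assumption made purely for illustration; the post does not specify a model). The function name `coverage` and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist

def coverage(theta, n=20, sigma=1.0, level=0.90, trials=10000, seed=0):
    """Fraction of simulated data sets whose interval covers the true theta."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided normal quantile
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        # Sample data under the hidden parameter value, then build the interval.
        xbar = sum(rng.gauss(theta, sigma) for _ in range(n)) / n
        if xbar - half_width <= theta <= xbar + half_width:
            hits += 1
    return hits / trials

# No prior over theta appears anywhere; try any hidden value you like.
for theta in (-3.0, 0.0, 42.0):
    print(theta, coverage(theta))
```

Under these assumptions the printed fractions should sit near 0.90 whatever value of theta is plugged in, which is the whole point of the trick.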

Unbiased estimators: you process the sample data to get a number whose expectation magically coincides with the true parameter value.
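As a rough illustration, assuming normal data and taking the variance as the parameter of interest (both are my choices, not the post's), the sketch below averages two estimators over many simulated samples. The helper names `average_estimate`, `var_unbiased` and `var_biased` are hypothetical.

```python
import random

def var_unbiased(xs):
    """Sample variance with the n - 1 denominator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def var_biased(xs):
    """Same sum of squares, but divided by n."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def average_estimate(estimator, true_var=4.0, n=5, trials=40000, seed=1):
    """Average an estimator over many independent samples of size n."""
    rng = random.Random(seed)
    sd = true_var ** 0.5
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(10.0, sd) for _ in range(n)]
        total += estimator(xs)
    return total / trials

print(average_estimate(var_unbiased))  # close to the true variance, 4.0
print(average_estimate(var_biased))    # close to 4.0 * (n - 1) / n = 3.2
```

Unbiasedness is a statement about the long-run average only; any single estimate can still be far off.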

Hypothesis testing: I give you a black-box random distribution and claim it obeys a specified formula. You sample some data from the box and inspect it. Frequentism allows you to reject truthful claims no more than 10% of the time, guaranteed, no prior in sight. (Thanks Eliezer for calling out the mistake, and conchis for the correction!)
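A small simulation of that guarantee, under the assumption that the claimed formula is a standard normal and that the test is a two-sided z-test on the sample mean (both chosen here for illustration). The name `false_rejection_rate` is hypothetical.

```python
import random
from statistics import NormalDist

def false_rejection_rate(n=30, alpha=0.10, trials=20000, seed=2):
    """How often a truthful claim 'the box is N(0, 1)' gets rejected."""
    rng = random.Random(seed)
    cutoff = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(trials):
        # The claim is true here: data really come from N(0, 1).
        xbar = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        if abs(xbar) * n ** 0.5 > cutoff:
            rejections += 1
    return rejections / trials

print(false_rejection_rate())
```

The printed rate should land near 0.10. Note what is not guaranteed: nothing here bounds how often you are wrong when you do reject, which is the mistake the correction above addresses.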

But this is getting too academic. I ought to throw you dry wood, good flame material. This hilarious PDF from Andrew Gelman should do the trick. Choice quote:

Well, let me tell you something. The 50 states aren't exchangeable. I've lived in a few of them and visited nearly all the others, and calling them exchangeable is just silly. Calling it a hierarchical or multilevel model doesn't change things - it's an additional level of modeling that I'd rather not do. Call me old-fashioned, but I'd rather let the data speak without applying a probability distribution to something like the 50 states which are neither random nor a sample.

As a bonus, the bibliography to that article contains such marvelous titles as "Why Isn't Everyone a Bayesian?" And Larry Wasserman's followup is also quite disturbing.

Another stick for the fire is provided by Shalizi, who (among other things) makes the correct point that a good Bayesian must never be uncertain about the probability of any future event. That's why he calls Bayesians "Often Wrong, Never In Doubt":

The Bayesian, by definition, believes in a joint distribution of the random sequence X and of the hypothesis M. (Otherwise, Bayes's rule makes no sense.) This means that by integrating over M, we get an unconditional, marginal probability for f.

For my final quote it seems only fair to add one more polemical summary of Cyan's point that made me sit up and look around in a bewildered manner. Credit to Wasserman again:

Pennypacker: You see, physics has really advanced. All those quantities I estimated have now been measured to great precision. Of those thousands of 95 percent intervals, only 3 percent contained the true values! They concluded I was a fraud.

van Nostrand: Pennypacker you fool. I never said those intervals would contain the truth 95 percent of the time. I guaranteed coherence not coverage!

Pennypacker: A lot of good that did me. I should have gone to that objective Bayesian statistician. At least he cares about the frequentist properties of his procedures.

van Nostrand: Well I'm sorry you feel that way Pennypacker. But I can't be responsible for your incoherent colleagues. I've had enough now. Be on your way.

There's often good reason to advocate a correct theory over a wrong one. But all this evidence (ahem) shows that switching to Guardian of Truth mode was, at the very least, premature for me. Bayes isn't the correct theory to make conclusions about the world. As of today, we have no coherent theory for making conclusions about the world. Both perspectives have serious problems. So do yourself a favor and switch to truth-seeker mode.

Comments (155)

Comment author: Insert_Idionym_Here 03 December 2011 02:33:44AM 0 points [-]

... What is it that frequentists do, again? I'm a little out of touch.

Comment author: AlexaKhan 28 July 2009 05:49:51PM *  5 points [-]

That's what Jaynes did to achieve his awesome victories: use trained intuition to pick good priors by hand on a per-sample basis.

... as if applying the classical method doesn't require using trained intuition to use the "right" method for a particular kind of problem, which amounts to choosing a prior but doing it implicitly rather than explicitly ...

Our inference is conditional on our assumptions [for example, the prior P(Lambda)]. Critics view such priors as a difficulty because they are `subjective', but I don't see how it could be otherwise. How can one perform inference without making assumptions? I believe that it is of great value that Bayesian methods force one to make these tacit assumptions explicit.

MacKay, Information Theory, Inference, and Learning Algorithms

Comment author: cousin_it 04 August 2009 11:57:18AM *  1 point [-]

Frequentist methods often have mathematical justifications, so Bayesian priors should have them too.

Comment author: Wei_Dai 28 July 2009 04:31:01AM 1 point [-]

I'm surprised that nobody has mentioned the Universal Prior yet. Eliezer also wrote a post on it.

Comment author: janos 27 July 2009 04:56:36PM *  3 points [-]

Since we're discussing (among other things) noninformative priors, I'd like to ask: does anyone know of a decent (noninformative) prior for the space of stationary, bidirectionally infinite sequences of 0s and 1s?

Of course in any practical inference problem it would be pointless to consider the infinite joint distribution, and you'd only need to consider what happens for a finite chunk of bits, i.e. a higher-order Markov process, described by a bunch of parameters (probabilities) which would need to satisfy some linear inequalities. So it's easy to find a prior for the space of mth-order Markov processes on {0,1}; but these obvious (uniform) priors aren't coherent with each other.

I suppose it's possible to normalize these priors so that they're coherent, but that seems to result in much ugliness. I just wonder if there's a more elegant solution.

Comment author: marks 28 July 2009 06:40:29AM *  1 point [-]

I suppose it depends what you want to do. First, I would point out that the set is in bijection with the real numbers (think of two simple injections and then use Cantor–Bernstein–Schroeder), so you can use any prior over the real numbers. The fact that you want to look at infinite sequences of 0s and 1s seems to imply that you are considering a specific type of problem that would demand a very particular meaning of 'non-informative prior'. What I mean by that is that any 'noninformative prior' usually incorporates some kind of invariance: e.g. a uniform prior on [0,1] for a Bernoulli distribution is invariant with respect to the true value being anywhere in the interval.

Comment author: janos 28 July 2009 03:42:44PM 2 points [-]

The purpose would be to predict regularities in a "language", e.g. to try to achieve decent data compression in a way similar to other Markov-chain-based approaches. In terms of properties, I can't think of any nontrivial ones, except the usual important one that the prior assign nonzero probability to every open set; mainly I'm just trying to find something that I can imagine computing with.

It's true that there exists a bijection between this space and the real numbers, but it doesn't seem like a very natural one, though it does work (it's measurable, etc). I'll have to think about that one.

Comment author: marks 29 July 2009 04:11:17AM 1 point [-]

What topology are you putting on this set?

I made the point about the real numbers because it shows that putting a non-informative prior on the infinite bidirectional sequences should be at least as hard as for the real numbers (which is non-trivial).

Usually a regularity is defined in terms of a particular computational model, so if you picked Turing machines (or the variant that works with bidirectional infinite tape, which is basically the same class as infinite tape in one direction), then you could instead begin constructing your prior in terms of Turing machines. I don't know if that helps any.

Comment author: janos 29 July 2009 06:04:34AM *  1 point [-]

Each element of the set is characterized by a bunch of probabilities; for example there is p_01101, which is the probability that elements x_{i+1} through x_{i+5} are 01101, for any i. I was thinking of using the topology induced by these maps (i.e. generated by preimages of open sets under them).

How is putting a noninformative prior on the reals hard? With the usual required invariance, the uniform (improper) prior does the job. I don't mind having the prior be improper here either, and as I said I don't know what invariance I should want; I can't think of many interesting group actions that apply. Though of course 0 and 1 should be treated symmetrically; but that's trivial to arrange.

I guess you're right that regularities can be described more generally with computational models; but I expect them to be harder to deal with than this (relatively) simple, noncomputational (though stochastic) model. I'm not looking for regularities among the models, so I'm not sure how a computational model would help me.

Comment author: marks 05 August 2009 06:00:25AM 0 points [-]

One issue with, say, taking a normal distribution and letting the variance go to infinity (which is the improper prior I normally use) is that the posterior distribution is going to have a finite mean, which may not be a desired property of the resulting distribution.

You're right that there's no essential reason to relate things back to the reals, I was just using that to illustrate the difficulty.

I was thinking about this a little over the last few days and it occurred to me that one model for what you are discussing might actually be an infinite graphical model. The infinite bidirectional sequence here consists of the values of Bernoulli-distributed random variables. Probably the most interesting case for you would be a Markov random field, as the stochastic 'patterns' you were discussing may be described in terms of dependencies between random variables.

Here are three papers I read a little while back on (and related to) something called an Indian Buffet process: (http://www.cs.utah.edu/~hal/docs/daume08ihfrm.pdf) (http://cocosci.berkeley.edu/tom/papers/ibptr.pdf) (http://www.cs.man.ac.uk/~mtitsias/papers/nips07.pdf)

These may not quite be what you are looking for, since they deal with a bound on the extent of the interactions; you probably want to think about probability distributions of binary matrices with an infinite number of rows and columns (which would correspond to an adjacency matrix over an infinite graph).

Comment author: cousin_it 29 July 2009 07:33:12AM *  2 points [-]

Something about this discussion reminds me of a hilarious text:

Now having no reason to otherwise, I decided to assign each of the 64 sequences a prior probability of 1/64 of occurring. Now, of course, You may think otherwise but that is Your business and not My concern. (I, as a Bayesian, have a tendency to capitalise pronouns but I don't care what You think. Strictly speaking, as a new convert to subjectivist philosophy, I don't even care whether you are a Bayesian. In fact it is a bit of mystery as to why we Bayesians want to convert anybody. But then "We" is in any case a meaningless concept. There is only I and I don't care whether this digression has confused You.) I then set about acquiring some experience with the coin. Now as De Finetti (vol 1 p141) points out, "experience, since experience is nothing more than the acquisition of further information - acts always and only in the way we have just described: suppressing the alternatives that turn out to be no longer possible..." (His italics)

Now of the 64 sequences, 32 end in a head. Therefore, before tossing the coin my prevision of the 6th toss was 32/64. I tossed the coin once and it came up heads. I thus immediately suppressed 32 alternative sequences beginning with a tail (which clearly hadn't occurred) leaving 32 beginning with a head of which 16 ended with a head. Thus my prevision for the 6th toss was now 16/32. (Of course, for a single toss the number of heads can only be 0 or 1 but THINK prevision is not prediction anymore than perversion is predilection.) I then tossed the coin and it came up heads. This immediately eliminated 16 sequences, leaving 16 beginning with 2 heads, 8 of which ended in a head. My prevision of the 6th toss was thus 8/16. I carried on like this, obtaining a head on each of the next three goes and amending my prevision to 4/8, 2/4 and 1/2 which is where I then was after the 5th toss having obtained 5 heads in a row.

The moral of this story seems to be, Assume priors over generators, not over sequences. A noninformative prior over the reals will never learn that the digit after 0100 is more likely to be 1, no matter how much data you feed it.
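The arithmetic in the quoted story, and the contrast with a prior over generators, can be sketched in a few lines. The second function uses a uniform prior over the coin's bias (Laplace's rule of succession), which is one choice of "prior over generators" picked for illustration, not something the quoted text prescribes. Both function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def predict_uniform_over_sequences(observed, total_len=6):
    """P(last toss is H | observed prefix), uniform prior over all H/T sequences."""
    prefix = tuple(observed)
    consistent = [s for s in product("HT", repeat=total_len)
                  if s[:len(prefix)] == prefix]
    heads_last = sum(1 for s in consistent if s[-1] == "H")
    return Fraction(heads_last, len(consistent))

def predict_uniform_over_bias(observed):
    """P(next toss is H | observed), uniform prior over the coin's bias."""
    heads = observed.count("H")
    return Fraction(heads + 1, len(observed) + 2)

print(predict_uniform_over_sequences("HHHHH"))  # 1/2
print(predict_uniform_over_bias("HHHHH"))       # 6/7
```

The first predictor returns 1/2 for every possible prefix, so no amount of data moves it; the second one learns from the five heads.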

Comment author: janos 04 August 2009 02:31:12PM *  2 points [-]

Right, that is a good piece. But I'm afraid I was unclear. (Sorry if I was.) I'm looking for a prior over stationary sequences of digits, not just sequences. I guess the adjective "stationary" can be interpreted in two compatible ways: either I'm talking about sequences such that for every possible string w the proportion of substrings of length |w| that are equal to w, among all substrings of length |w|, tends to a limit as you consider more and more substrings (either extending forward or backward in the sequence); this would not quite be a prior over generators, and isn't what I meant.

The cleaner thing I could have meant (and did) is the collection of stationary sequence-valued random variables, each of which (up to isomorphism) is completely described by the probabilities p_w of a string of length |w| coming up as w. These, then, are generators.

Comment author: cousin_it 07 August 2009 12:11:08PM *  0 points [-]

Janos, I spent some days parsing your request and it's quite complex. Cosma Shalizi's thesis and algorithm seem to address your problem in a frequentist manner, but I can't yet work out any good Bayesian solution.

Comment author: PhilGoetz 27 July 2009 04:14:08PM 7 points [-]

Can someone do something I've never seen anyone do - lay out a simple example in which the Bayesian and frequentist approaches give different answers?

Comment author: marks 28 July 2009 06:27:26AM 9 points [-]

I've had some training in Bayesian and Frequentist statistics and I think I know enough to say that it would be difficult to give a "simple" and satisfying example. The reason is that if one is dealing with finite-dimensional statistical models (that is, models whose parameter space is finite-dimensional) and one has chosen a prior for those parameters such that there is non-zero weight on the true values, then the Bernstein-von Mises theorem guarantees that the Bayesian posterior distribution and the maximum likelihood estimate converge to the same probability distribution (although you may need to use improper priors). This covers cases where we consider finite outcomes such as a toss of a coin or rolling a die.
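A toy illustration of that convergence for a coin, assuming a Beta(a, b) prior on the bias (my choice of model; the values a=50, b=5 and the true bias 0.3 are made up, and the prior is deliberately a poor one). The function name is hypothetical.

```python
import random

def posterior_mean_vs_mle(n, a, b, p_true=0.3, seed=3):
    """Simulate n tosses; return (MLE, posterior mean under a Beta(a, b) prior)."""
    rng = random.Random(seed)
    heads = sum(rng.random() < p_true for _ in range(n))
    mle = heads / n
    # The posterior is Beta(a + heads, b + n - heads); this is its mean.
    post_mean = (a + heads) / (a + b + n)
    return mle, post_mean

for n in (10, 1000, 100000):
    mle, post = posterior_mean_vs_mle(n, a=50, b=5)
    print(n, round(mle, 4), round(post, 4))
```

This compares point estimates only. The theorem is a statement about the whole posterior distribution, so treat the sketch as a sanity check on the intuition rather than a demonstration of the theorem.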

I apologize if that's too much jargon, but for really simple models that are easy to specify you tend to get the same answer. Bayesian stats starts to behave differently from frequentist statistics in noticeable ways when you consider infinite outcome spaces. An example here might be where you are considering probability distributions over curves (this arises in my research on speech recognition). In this case even if you have a seemingly sensible prior you can end up in the case where, in the limit of infinite data, you will end up with a posterior distribution that is different from the true distribution.

In practice, if I am learning a Gaussian Mixture Model for speech curves and I don't have much data, then Bayesian procedures tend to be a bit more robust and frequentist procedures end up over-fitting (or being somewhat random). When I start getting more data, frequentist methods tend to be algorithmically more tractable and get better results. So I'll end up with faster computation time and, say, on the task of phoneme recognition I'll make fewer errors.

I'm sorry if I haven't explained it well; the difference in performance wasn't really evident to me until I spent some time actually using them in machine learning. Unfortunately, most of the disadvantages of Bayesian approaches aren't evident for simple statistical problems, but they become all too evident in the case of complex statistical models.

Comment author: PhilGoetz 04 August 2009 05:22:22PM *  1 point [-]

Thanks much!

and one has chosen a prior for those parameters such that there is non-zero weight on the true values then the Bernstein-von Mises theorem guarantees that the Bayesian posterior distribution and the maximum likelihood estimate converge to the same probability distribution (although you may need to use improper priors)

What do "non-zero weight" and "improper priors" mean?

EDIT: Improper priors mean priors that don't sum to one. I would guess "non-zero weight" means "non-zero probability". But then I would wonder why anyone would introduce the term "weight". Perhaps "weight" is the term you use to express a value from a probability density function that is not itself a probability.

Comment author: marks 05 August 2009 05:42:21AM *  1 point [-]

No problem.

Improper priors are generally only considered in the case of continuous distributions, so 'sum' is probably not the right term; 'integrate' is usually used.

I used the term 'weight' to signify an integral because of how I usually intuit probability measures. Say you have a random variable X that takes values in the real line; the probability that it takes a value in some subset S of the real line would be the integral over S with respect to the given probability measure.

There's a good discussion of this way of viewing probability distributions in the wikipedia article. There's also a fantastic textbook on the subject that really has made a world of difference for me mathematically.

Comment author: Cyan 27 July 2009 04:17:09PM 0 points [-]

How about this?

Comment author: RichardKennaway 27 July 2009 09:30:56AM 0 points [-]

Strong evidence can always defeat strong priors, and vice versa.

Is there anything more to the issue than this?

Comment author: marks 28 July 2009 06:33:00AM 0 points [-]

This isn't always the case if the prior puts zero probability weight on the true model. This can be avoided on finite outcome spaces, but for infinite outcome spaces no matter how much evidence you have you may not overcome the prior.

Comment author: RichardKennaway 28 July 2009 11:13:19AM 1 point [-]

I thought that 0 and 1 were Bayesian sins, unattainable +/- infinity on the log-odds scale, and however strong your priors, you never make them that strong.

Comment author: marks 28 July 2009 03:49:46PM *  0 points [-]

In finite dimensional parameter spaces sure, this makes perfect sense. But suppose that we are considering a stochastic process X1, X2, X3, .... where Xn follows a distribution Pn over the integers. Now put a prior on the distribution and suppose that unbeknown to you Pn is the distribution that puts 1/2 probability weight on -n and 1/2 probability weight on n. If the prior on the stochastic process does not put increasing weight on integers with large absolute value, then in the limit the prior puts zero probability weight on the true distribution (and may start behaving strangely quite early on in the process).

Another case is that the true probability model may be too complicated to write down or computationally infeasible to do so (say a Gaussian mixture with 10^(10) mixture components, which is certainly reasonable in a modern high-dimensional database), so one may only consider probability distributions that approximate the true distribution and put zero weight on the true model, i.e. it would be sensible in that case to have a prior that may put zero weight on the true model and you would search only for an approximation.

Comment author: Cyan 26 July 2009 10:15:19PM *  7 points [-]

I didn't mean to rehabilitate frequentism! I only meant to point out that calibration is a frequentist optimality criterion, and one that Bayesian posterior intervals can be proved not to have in general. I view this as a bullet to be bitten, not dodged.

Comment author: Vladimir_Nesov 26 July 2009 10:23:04PM 10 points [-]

It's out of your hands now. Overcoming Bayes!

Comment author: cousin_it 26 July 2009 10:19:56PM *  2 points [-]

Too late. I have already updated to believe that a theory that demands priors can't be complete. Correct, maybe, but not complete. We should work out an approach that works well on more criteria instead of guarding the truth of what we already know.

If Bayes were the complete answer, Jaynes wouldn't have felt the need to invent maxent or generalize the indifference principle. That may be the correct direction of inquiry.

ETA: this was a response to Cyan saying he didn't mean to rehabilitate frequentism. :-)

Comment author: janos 27 July 2009 03:55:30PM 6 points [-]

Updated, eh? Where did your prior come from? :)

Comment author: RolfAndreassen 26 July 2009 10:07:36PM *  6 points [-]

I had another thought on the subject. Consider flipping a coin; a Bayesian says that the 50% estimate of getting tails is just your own inability to predict with sufficient accuracy; a frequentist says that the 50% is a property of the coin - or to be less straw-making about it, a property of large sets of indistinguishable coin-flips. So, ok, in principle you could build a coin-predictor and remove the uncertainty. But now consider an electron passing through a beam splitter. Here there is no method even in principle of predicting which Everett branch you find yourself in. (Given some reasonable assumptions about locality and such.) The coin has hidden variables like the precise location of your thumb and the exact force your muscles apply to it; if you were smart enough, you could tease a prediction out of them. But an electron has no such hidden properties. Is it not reasonable, then, to say that the 50% chance really is a property of the electron, and not the predictor?

Comment author: prase 27 July 2009 02:14:29PM *  0 points [-]

In the end, the electron is found at some definite polarisation. You just don't know which before actually doing the experiment (same as for the coin), and you can't, even in principle, make any observation that tells you the result with more certainty in advance (for the coin you can), at least according to the present model of physics; don't forget that non-local hidden variables are not ruled out. So the difference is that the future of a classical system can be predicted with unlimited certainty from its present state, while that of a quantum system cannot. This doesn't necessarily mean that the future is not determined. One can adopt the viewpoint (I think it was even suggested on OB/LW in Eliezer's posts about timeless physics) that the future is symmetric to the past: it exists in the whole history of the universe, and if we don't know it now, that's our ignorance. I suppose you would agree that not knowing about the electron's past is a matter of our ignorance rather than a property of the electron itself, regardless of whether we are able to calculate it from presently available information, even in principle (i.e. using present theories).

I also think that it has little merit to engage in discussions about terminology, and this one tends in that direction. Practically there's no difference between saying that quantum probabilities are "properties of the system" or "of the predictor". Either we can predict, or not, and that's all that matters. Beware of the clause "in principle", as it often only obscures the debate.

Edit: to formulate it a little differently, predictability is an instance of regularity in the universe, i.e. our ability to compress the data of the whole history of the universe into some brief set of laws and a possibly not so brief set of initial conditions, which is nevertheless a much smaller amount of information than the history of the universe recorded at each point and time instant. As we do not have this huge pack of information and thus can't say to what extent it is compressible, we use theories that are based largely on induction, which itself is a particular bias. We don't even know whether the theories we use apply at every time and place, or to every system universally. Frequentists seem to distinguish this uncertainty, which they largely ignore in practice, from uncertainty as a property of the system. So, as I understand the state of affairs, a frequentist is satisfied with a theory (which is a compression algorithm applicable to the information about the universe) that includes calling a random number generator on some occasions (e.g. when dealing with dice or electrons), and he calls the uncertainty so induced a "property of the system". On the other hand, the uncertainty about the theory itself is a different kind of "meta-uncertainty".

The Bayesian approach seems to me more elegant (and Occam-razor friendly) as it doesn't introduce different sorts of uncertainties. It also fits better with the view of physical laws as compression algorithms, as it doesn't distinguish between data and theories with regard to their uncertainty. One may just accept that the history of the universe needn't be compressible to data available at the moment, and use induction to estimate future states of the world in the same way as one estimates limits of validity of presently formulated physical laws.

Comment author: pengvado 27 July 2009 07:39:49AM *  5 points [-]

The relevant property of the electron+beamsplitter(+everything else) system is that its wavefunction will be evenly split between the two Everett branches. No chance involved. 50% is how much I care about each branch.

And after performing the experiment but before looking at the result, I can continue using the same reasoning: "I have already decohered, but whatever deterministic decision algorithm I apply now will return the same answer in both branches, so I can and should optimize both outcomes at once." Or I can switch to indexical uncertainty: "I am uncertain about which instance I am, even though I know the state of the universe with certainty." These two methods should be equivalent.

If we ever do find some nondeterministic physical law, then you can have your probability as a fundamental property of particles. Maybe. I'm not sure how one would experimentally distinguish "one stochastic world" from "branch both ways" or from "secure pseudo-random number generator" in the absence of any interference pattern to have a precise theory of; but I'm not going to speculate here about what physicists can or can't learn.

Comment author: GuySrinivasan 27 July 2009 07:16:19AM 1 point [-]

I believe the answer to this question is currently "we don't know". But notice that "the electron" doesn't exist, it's a pattern ("just" a pattern? :)) in the wavefunction. A pattern which happens to occur in lots of places, so we call it an electron.

My intuition, IANAP, is that if anything it is more natural to say the 50% belongs somehow to which branch you find yourself in, not the pattern in the wavefunction we call an electron.

Comment author: RolfAndreassen 27 July 2009 11:44:09PM 0 points [-]

Ok, but I don't think that matters for the question of frequentist versus Bayesian. You're still saying that the 50% is a property of something other than your own uncertainty.

Moving the problem to indexical uncertainty seems to me to rely on moving the question in time; you can only do this after you've done the experiment but before you've looked at the measurement. This feels to me like asking a different question.

Comment author: brian_jaress 26 July 2009 09:42:12PM 0 points [-]

I'd like to take advantage of frequentism's return to respectability to ask if anyone knows where I can get a copy of "An Introduction to the Bootstrap" by Efron and Tibshirani.

It's on Google books, but I don't like reading things through Google books. It's for sale on-line, but it costs a lot and shipping takes a while. My university's library is supposed to have it, but the librarians can't find it. My local library hasn't heard of it.

I hardly know any statistics or probability; I've just been borrowing bits and pieces as I need them without worrying about Bayesianism vs. frequentism.

There is a little something that's been bothering me in the back of my mind when I see Eliezer waxing poetic about bayesianism. Maybe this is an ignorant question, but here it is:

If bayesians don't believe in a true probability waiting to be approximated, only in probabilities assigned by a mind, how do they justify seeking additional data? The rules require you to react to new data by moving your assigned probability in a certain way, but, without something desirable that you're moving towards, why is it good to have that new data?

Comment author: Cyan 26 July 2009 10:09:49PM *  1 point [-]

If bayesians don't believe in a true probability waiting to be approximated, only in probabilities assigned by a mind, how do they justify seeking additional data? The rules require you to react to new data by moving your assigned probability in a certain way, but, without something desirable that you're moving towards, why is it good to have that new data?

Collecting new data is not justifiable in general -- the cost of the new data may outweigh the benefit to be gained from it. But let's assume that collecting new data has a negligible cost. As a Bayesian, what you desire is the smallest loss possible. For reasonable loss functions, the smaller the region over which your distribution spreads its uncertainty (that is to say, the smaller its variance) the smaller you expect your loss to be. The law of total variance can be interpreted to say that you expect the variance of the posterior distribution to be smaller than the variance of the prior distribution.* So collect more data!

* law of total variance: prior variance = prior expectation of posterior variance + prior variance of posterior mean. This implies that the prior variance is larger than the prior expectation of posterior variance.
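To make the footnote concrete, here is a minimal sketch that checks the identity exactly for one toy case. The setup is my own assumption, not something from the comment above: a Beta(a, b) prior on a coin's heads frequency and a single observed toss. The helper names are hypothetical.

```python
from fractions import Fraction


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return Fraction(a, a + b)


def beta_var(a, b):
    """Variance of a Beta(a, b) distribution."""
    return Fraction(a * b, (a + b) ** 2 * (a + b + 1))


def total_variance_terms(a, b):
    """Return (prior variance, expected posterior variance, variance of posterior mean).

    Assumed toy model: Beta(a, b) prior on the heads frequency, one toss observed.
    """
    p_heads = beta_mean(a, b)  # prior predictive probability of heads
    # Each outcome: (prior predictive weight, posterior Beta parameters).
    outcomes = [(p_heads, (a + 1, b)), (1 - p_heads, (a, b + 1))]
    exp_post_var = sum(w * beta_var(*ab) for w, ab in outcomes)
    exp_post_mean = sum(w * beta_mean(*ab) for w, ab in outcomes)
    var_post_mean = sum(
        w * (beta_mean(*ab) - exp_post_mean) ** 2 for w, ab in outcomes
    )
    return beta_var(a, b), exp_post_var, var_post_mean


prior_var, exp_post_var, var_post_mean = total_variance_terms(2, 3)
print(prior_var, exp_post_var, var_post_mean)
```

Because the arithmetic uses exact fractions, the first number equals the sum of the other two with no rounding, and the expected posterior variance comes out strictly smaller than the prior variance whenever the posterior mean can move at all.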

Comment author: brian_jaress 26 July 2009 10:32:52PM 0 points [-]

So, more data is good because it makes you more confident? I guess that makes sense, but it still seems strange not to care what you're confident in.

Comment author: Cyan 26 July 2009 10:42:28PM 2 points [-]

In any real problem there is a context and some prior information. Bayes doesn't give this to you -- you give it to Bayes along with the data and turn the crank on the machinery to get the posterior. The things you're confident about are in the context.

Comment author: brian_jaress 27 July 2009 12:27:20AM 0 points [-]

What about changing your mind?

Comment author: Cyan 27 July 2009 01:15:23AM 0 points [-]

In theory, if you can change your mind about something, you have uncertainty about it, and your prior distribution should reflect that. In practice, you abstract the uncertainty away by making some simplifying assumptions, do the analysis conditional on your assumptions, and reserve the right to revisit the assumptions if they don't seem adequate.

Comment author: brian_jaress 27 July 2009 02:53:16AM 1 point [-]

I didn't mean to ask how a bayesian changes his or her mind. I meant to ask how the thing you believe in can be in the context in situations where you change your mind based on new evidence.

Comment author: Cyan 27 July 2009 03:06:43AM *  1 point [-]

Let's say I'm weighing some acrylamide powder on an electronic balance. (Gonna make me some polyacrylamide gel!) The balance is so sensitive that small changes in air pressure register in the last two digits. From what I know about air pressure variations from having done this before, I create a model for the data. Also because I've done this before, I can eyeball roughly how much powder I've got on the balance; this determines my prior distribution before reading the balance. Then I observe some data from the balance readout and update my distribution.

Comment author: brian_jaress 27 July 2009 08:05:26AM 0 points [-]

I can't tell without more information whether that's an example of what I mean by "changing your mind." Here's one that I think definitely qualifies:

Let's say you're going to bet on a coin toss. You only have a small amount of information on the coin, and you decide for whatever reason that there's a 51% chance of getting heads. So you're going to bet on heads. But then you realize that there's a way to get more data.

At this point, I'm thinking, "Gee, I hardly know anything about this coin. Maybe I'm better off betting on tails and I just don't know it. I should get that data."

What I think you're saying about bayesians is that a bayesian would say, "Gee, 51% isn't very high. I'd like to be at least 80% sure. Since I don't know very much yet, it wouldn't take much more to get to 80%. I should get that data so I can bet on heads with confidence."

Which sort of makes sense but is also a little strange.

Comment author: Cyan 27 July 2009 03:29:39PM *  3 points [-]

Technical stuff: under the standard assumption of infinite exchangeability of coin tosses, there exists some limiting relative frequency for coin toss results. (This is de Finetti's theorem.)

Key point: I have a probability distribution for this relative frequency (call it f) -- not a probability of a probability.

You only have a small amount of information on the coin, and you decide for whatever reason that there's a 51% chance of getting heads. So you're going to bet on heads. But then you realize that there's a way to get more data.

Here you've said that my probability density for f is dispersed, but slightly asymmetric. I too can say, "Well, I have an awful lot of probability mass on values of f less than 0.5. I should collect more information to tighten this up."

"Gee, 51% isn't very high. I'd like to be at least 80% sure. Since I don't know very much yet, it wouldn't take much more to get to 80%. I should get that data so I can bet on heads with confidence."

This mixes up f on the one hand with my distribution for f on the other. I can certainly collect data until I'm 80% sure that f is bigger than 0.5 (provided that f really is bigger than 0.5). This is distinct from being 80% sure of getting heads on the next toss.
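A small numerical sketch of that distinction, under assumptions of my own: a uniform Beta(1, 1) prior on f and a made-up record of 12 heads and 8 tails. The function name is hypothetical, and the tail probability uses the standard identity between the Beta CDF (integer parameters) and a binomial sum.

```python
from math import comb


def prob_f_exceeds(a, b, x=0.5):
    """P(f > x) when f ~ Beta(a, b) with integer a and b.

    Uses P(f > x) = P(Binomial(a + b - 1, x) <= a - 1).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a))


heads, tails = 12, 8               # hypothetical data, not from the thread
a, b = 1 + heads, 1 + tails        # Beta(1, 1) prior updated on the tosses

p_next_heads = a / (a + b)         # predictive probability of heads on the next toss
p_f_above_half = prob_f_exceeds(a, b)  # how sure I am that f is bigger than 0.5

print(p_next_heads, p_f_above_half)
```

With these numbers the chance of heads on the next toss is about 0.59, while the probability that f exceeds 0.5 is a bit over 0.8. The two quantities answer different questions, which is the point being made above.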

Comment author: RolfAndreassen 26 July 2009 06:39:28PM 2 points [-]

Perhaps we can try an experiment? We have here, apparently, both Bayesians and frequentists; or at a minimum, people knowledgeable enough to be able to apply both methods. Suppose I generate 25 data points from some distribution whose nature I do not disclose, and ask for estimates of the true mean and standard deviation, from a Bayesian and a frequentist? The underlying analysis would also be welcome. If necessary we could extend this to 100 sets of data points, ask for 95% confidence intervals, and see if the methods are well calibrated. (This does probably require some better method of transferring data than blog comments, though.)

As a start, here is one data set:

617.91 16.8539 83.4021 141.504 545.112 215.863 553.168 414.435 4.71129 609.623 117.189 -102.648 647.449 283.57 286.838 710.811 505.826 79.3366 171.816 105.332 540.313 429.298 -314.32 255.93 382.471

It is possible that this task does not have sufficient difficulty to distinguish between the approaches. If so, how can we add constraints to get different answers?
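For reference, a sketch of the most basic analysis one might run on the data set above. It assumes the points are independent draws from a normal distribution, which is exactly what has not been disclosed, so treat the interval as conditional on that guess. The constant 2.064 is the tabulated 97.5% quantile of Student's t with 24 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# The 25 data points posted above.
data = [
    617.91, 16.8539, 83.4021, 141.504, 545.112, 215.863, 553.168, 414.435,
    4.71129, 609.623, 117.189, -102.648, 647.449, 283.57, 286.838, 710.811,
    505.826, 79.3366, 171.816, 105.332, 540.313, 429.298, -314.32, 255.93,
    382.471,
]

T_CRIT_24 = 2.064  # 97.5% quantile of Student's t, 24 degrees of freedom (table value)

n = len(data)
xbar = mean(data)   # sample mean
s = stdev(data)     # sample standard deviation (n - 1 in the denominator)
half_width = T_CRIT_24 * s / sqrt(n)
interval = (xbar - half_width, xbar + half_width)

print(n, xbar, s, interval)
```

As noted further down the thread, under a normal model with a prior flat in the mean and the log-variance, the 95% posterior interval for the mean is this same interval, so this particular analysis cannot separate the two camps.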

Comment author: marks 28 July 2009 07:23:17AM 1 point [-]

There's a difficulty with your experimental setup in that you implicitly are invoking a probability distribution over probability distributions (since you represent a random choice of a distribution). The results are going to be highly dependent upon how you construct your distribution over distributions. If your outcome space for probability distributions is infinite (which is what I would expect), and you sampled from a broad enough class of distributions then a sampling of 25 data points is not enough data to say anything substantive.

A friend of yours who knows what distributions you're going to select from, though, could incorporate that knowledge into a prior and then use that to win.

So, I predict that for your setup there exists a Bayesian who would be able to consistently win.

But, if you gave much more data and you sampled from a rich enough set of probability distributions that priors would become hard to specify a frequentist procedure would probably win out.

Comment author: RolfAndreassen 28 July 2009 04:37:14PM 2 points [-]

Hmm. I don't know if I'm a very random source of distributions; humans are notoriously bad at randomness, and there are only so many distributions readily available in standard libraries. But in any case, I don't see this as a difficulty; a real-world problem is under no obligation to give you an easily recognised distribution. If Bayesians do better when the distribution is unknown, good for them. And if not, tough beans. That is precisely the sort of thing we're trying to measure!

I don't think, though, that the existence of a Bayesian who can win, based on knowing what distributions I'm likely to use, is a very strong statement. Similarly there exists a frequentist who can win based on watching over my shoulder when I wrote the program! You can always win by invoking special knowledge. This does not say anything about what would happen in a real-world problem, where special knowledge is not available.

Comment author: marks 29 July 2009 04:02:44AM 1 point [-]

You can actually simulate a tremendous number of distributions (and theoretically any to an arbitrary degree of accuracy) by applying an approximate inverse CDF to a standard uniform random variable (see here, for example). So the space of distributions from which you could select to do your test is potentially infinite. We can then think of your selection of a probability distribution as being a random experiment and model your selection process using a probability distribution.
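A minimal sketch of the inverse-CDF trick, using the exponential distribution because its inverse CDF has a closed form. The function name and the rate of 2.0 are illustrative choices, not anything from the linked material.

```python
import math
import random


def sample_exponential(rate, rng):
    """Draw from Exponential(rate) by pushing a uniform draw through the inverse CDF.

    CDF: F(x) = 1 - exp(-rate * x), so F^{-1}(u) = -ln(1 - u) / rate.
    """
    u = rng.random()  # uniform on [0, 1), so 1 - u is in (0, 1] and the log is finite
    return -math.log(1.0 - u) / rate


rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_exponential(2.0, rng) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)

print(sample_mean)  # should land near the true mean 1 / rate = 0.5
```

For distributions without a closed-form inverse, the same idea works with a numerically inverted or tabulated CDF, which is where the "approximate" comes in.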

The issue is that since the outcome space is the space of all computable probability distributions, Bayesians will have consistency problems (another good paper on the topic is here), i.e. the posterior distribution won't converge to the true distribution. So in this particular setup I think Bayesian methods are inferior unless one could devise a good prior over distributions. I suppose if I knew that you didn't know how to sample from arbitrary probability distributions, and I put that in my prior, then I might be able to use Bayesian methods to successfully estimate the probability distribution (the discussion of the Bayesian who knew you personally was meant to be tongue-in-cheek).

In the frequentist case there is a known procedure due to Parzen from the 60's.

All of these are asymptotic results; however, your experiment seems to be focused on very small samples. To the best of my knowledge there aren't many results in this case except under special conditions. I would state that without more constraints on the experimental design I don't think you'll get very interesting results. That said, I am actually really in favor of such evaluations, because people in statistics and machine learning, for a variety of reasons, don't do them, or don't do them on a broad enough scale. Anyway, if you actually are interested in such things you may want to start looking here, since statistics and machine learning both have the tools to properly design such experiments.

Comment author: RolfAndreassen 29 July 2009 05:41:29PM 0 points [-]

The small samples are a constraint imposed by the limits of blog comments; there's a limit to how many numbers I would feel comfortable spamming this place with. If we got some volunteers, we might do a more serious sample size using hosted ROOT ntuples or zipping up some plain ASCII.

I do know how to sample from arbitrary distributions; I should have specified that the space of distributions is those for which I don't have to think for more than a minute or so, or in other words, someone has already coded the CDF in a library I've already got installed. It's not knowledge but work that's the limiting factor. :) Presumably this limits your prior quite a lot already, there being only so many commonly used math libraries.

Comment author: byrnema 26 July 2009 06:08:53PM *  1 point [-]

I think this was a great post for having both context and links and specifically (rather than generally) questioning assumptions the group hasn't visited in a while (if ever).

Comment author: Rune 26 July 2009 05:55:03PM 10 points [-]

Can you give a detailed numerical example of some problem where the Bayesian and Frequentist give different answers, and you feel strongly that the Frequentist's answer is better somehow?

I think you've tried to do that, but I don't fully understand most of your examples. Perhaps if you used numbers and equations, that would help a lot of people understand your point. Maybe expand on your "And here's an ultra-short example of what frequentists can do" idea?

Comment author: cousin_it 26 July 2009 07:41:54PM *  0 points [-]

Short answer: Bayesian answers don't give coverage guarantees.

Long answer: see the comments to Cyan's post.

Comment author: Eliezer_Yudkowsky 26 July 2009 07:58:58PM 4 points [-]

"Coverage guarantees" is a frequentist concept. Can you explain where Bayesians fail by Bayesian lights? In the real world, somewhere?

Comment author: Cyan 26 July 2009 10:24:01PM 3 points [-]

How about this: a Bayesian will always predict that she is perfectly calibrated, even though she knows the theorems proving she isn't.

Comment author: wedrifid 27 July 2009 01:04:01PM 1 point [-]

How about this: a Bayesian will always predict that she is perfectly calibrated, even though she knows the theorems proving she isn't.

Wanna bet? Literally. Have a Bayesian make a whole bunch of predictions and then offer her bets with payoffs based on what apparent calibration the results will reflect. See which bets she accepts and which she refuses.

Comment author: Cyan 27 July 2009 01:22:43PM 1 point [-]

Are you volunteering?

Comment author: wedrifid 27 July 2009 01:43:55PM 0 points [-]

Sure. :)

But let me warn you... I actually predict my calibration to be pretty darn awful.

Comment author: Cyan 27 July 2009 03:00:29PM 0 points [-]

We need a trusted third party.

Comment author: wedrifid 27 July 2009 03:23:27PM 0 points [-]

Find a candidate.

I was about to suggest we could just bet raw ego points by publicly posting here... but then I realised I prove my point just by playing.

It should be obvious, by the way, that if the predictions you have me make pertain to black boxes that you construct then I would only bet if the odds gave a money pump. There are few cases in which I would expect my calibration to be superior to what you could predict with complete knowledge of the distribution.

Comment author: Cyan 27 July 2009 03:33:34PM *  1 point [-]

It should be obvious, by the way, that if the predictions you have me make pertain to black boxes that you construct then I would only bet if the odds gave a money pump.

Phooey. There goes plan A.

Comment author: Eliezer_Yudkowsky 26 July 2009 11:56:58PM 7 points [-]

A Bayesian will have a probability distribution over possible outcomes, some of which give her lower scores than her probabilistic expectation of average score, and some of which give her higher scores than this expectation.

I am unable to parse your above claim, and ask for specific math on a specific example. If you know your score will be lower than you expect, you should lower your expectation. If you know something will happen less often than the probability you assign, you should assign a lower probability. This sounds like an inconsistent epistemic state for a Bayesian to be in.

Comment author: Cyan 29 July 2009 02:32:24AM *  2 points [-]

I spent some time looking up papers, trying to find accessible ones. The main paper that kicked off the matching prior program is Welch and Peers, 1963, but you need access to JSTOR.

The best I can offer is the following example. I am estimating a large number of positive estimands. I have one noisy observation for each one; the noise is Gaussian with standard deviation equal to one. I have no information relating the estimands; per Jaynes, I give them independent priors, resulting in independent posteriors*. I do not have information justifying a proper prior. Let's say I use a flat prior over the positive real line. No matter the true value of each estimand, the sampling probability of the event "my posterior 90% quantile is greater than the estimand" is less than 0.9 (see Figure 6 of this working paper by D.A.S. Fraser). So the more estimands I analyze, the more sure I am that the intervals from 0 to my posterior 90% quantiles will contain less than 90% of the estimands.

I don't know if there's an exact matching prior in this problem, but I suspect it lacks the correct structure.

* This is a place I think Jaynes goes wrong: the quantities are best modeled as exchangeable, not independent. Equivalently, I put them in a hierarchical model. But this only kicks the problem of priors guaranteeing calibration up a level.

Comment author: Eliezer_Yudkowsky 29 July 2009 04:22:55AM 2 points [-]

I'm sorry, but the level of frequentist gibberish in this paper is larger than I would really like to work through.

If you could be so kind, please state:

What the Bayesian is using as a prior and likelihood function;

and what distribution the paper assumes the actual parameters are being drawn from, and what the real causal process is governing the appearance of evidence.

If the two don't match, then of course the Bayesian posterior distributions, relative to the experimenter's higher knowledge, can appear poorly calibrated.

If the two do match, then the Bayesian should be well-calibrated. Sure looks QED-ish to me.

Comment author: Cyan 29 July 2009 05:08:56AM *  6 points [-]

The example doesn't come from the paper; I made it myself. You only need to believe the figure I cited -- don't bother with the rest of the paper.

Call the estimands mu_1 to mu_n; the data are x_1 to x_n. The prior over the mu parameters is flat in the positive subset of R^n, zero elsewhere. The sampling distribution for x_i is Normal(mu_i,1). I don't know the distribution the parameters actually follow. The causal process is irrelevant -- I'll stipulate that the sampling distribution is known exactly.

Call the 90% quantiles of my posterior distributions q_i. From the sampling perspective, these are random quantities, being monotonic functions of the data. Their sampling distributions satisfy the inequality Pr(q_i > mu_i | mu_i) < 0.9. (This is what the figure I cited shows.) As n goes to infinity, I become more and more sure that my posterior intervals of the form (0, q_i] are undercalibrated.

You might cite the improper prior as the source of the problem. However, if the parameter space were unrestricted and the prior flat over all of R^n, the posterior intervals would be correctly calibrated.

But it really is fair to demand a proper prior. How could we determine that prior? Only by Bayesian updating from some pre-prior state of information to the prior state of information (or equivalently, by logical deduction, provided that the knowledge we update on is certain). Right away we run into the problem that Bayesian updating does not have calibration guarantees in general (and for this, you really ought to read the literature), so it's likely that any proper prior we might justify does not have a calibration guarantee.
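For anyone who wants to check the sampling behaviour numerically, here is a rough Monte Carlo sketch for a single estimand. It assumes exactly the model stated above (x ~ Normal(mu, 1), prior flat on mu > 0), in which case the posterior is a normal truncated to the positive half-line and its quantiles have a closed form. The function names, the choice of mu values, and the number of draws are mine.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def posterior_quantile(x, level):
    """Posterior quantile of mu given one observation x, flat prior on mu > 0.

    Posterior CDF at m is (Phi(m - x) - Phi(-x)) / Phi(x); setting it equal to
    `level` and solving for m gives the expression below.
    """
    p = STD_NORMAL.cdf(-x) + level * STD_NORMAL.cdf(x)
    return x + STD_NORMAL.inv_cdf(p)


def sampling_coverage(mu, n_draws=20_000, seed=0):
    """Estimate Pr(q_0.9 > mu | mu) and Pr(q_0.1 < mu | mu) by simulation."""
    rng = random.Random(seed)
    upper_hits = 0
    lower_hits = 0
    for _ in range(n_draws):
        x = rng.gauss(mu, 1.0)
        upper_hits += posterior_quantile(x, 0.9) > mu  # interval (0, q_0.9]
        lower_hits += posterior_quantile(x, 0.1) < mu  # interval (q_0.1, infinity)
    return upper_hits / n_draws, lower_hits / n_draws


for mu in (0.5, 1.0, 2.0):
    print(mu, sampling_coverage(mu))
```

One caution when reading the output: which tail falls short depends on the convention. In this sketch the truncation pushes every posterior quantile upward relative to the unrestricted flat-prior case, so it is the one-sided interval above the 10% quantile that covers less than 90% of the time, while the interval from 0 up to the 90% quantile covers more than 90%. The cited figure may be using the other tail convention, so it is worth checking against it directly.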

Comment author: cousin_it 26 July 2009 08:47:09PM *  3 points [-]

Of course not. If you choose to care only about the things Bayes can give you, it's a mathematical fact that you can't do better.

Comment author: wedrifid 26 July 2009 09:22:19PM 6 points [-]

I didn't like the "by Bayesian lights" phrase either. What I take as the relevant part of the question is this:

Can you provide an example of a frequentist concept that can be used to make predictions in the real world for which a bayesian prediction will fail?

"Bayesian answers don't give coverage guarantees" doesn't demonstrate anything by itself. The question is could the application of Bayes give a prediction equal to or superior to the prediction about the real world implicit in a coverage guarantee?

If you can provide such an example then you will have proved many people to be wrong in a significant, fundamental way. But I haven't seen anything in this thread or in either of Cyan's which fits that category.

Comment author: cousin_it 26 July 2009 09:32:16PM *  2 points [-]

Once again: the real-world performance (as opposed to internal coherence) of the Bayesian method on any given problem depends on the prior you choose for that problem. If you have a well-calibrated prior, Bayes gives well-calibrated results equal or superior to any frequentist methods. If you don't, science knows no general way to invent a prior that will reliably yield results superior to anything at all, not just frequentist methods. For example, Jaynes spent a large part of his life searching for a method to create uninformative priors with maxent, but maxent still doesn't guarantee you anything beyond "cross your fingers".

Comment author: Eliezer_Yudkowsky 26 July 2009 09:33:43PM 14 points [-]

If your prior is screwed up enough, you'll also misunderstand the experimental setup and the likelihood ratios. Frequentism depends on prior knowledge just as much as Bayesianism, it just doesn't have a good formal way of treating it.

Comment author: cousin_it 27 July 2009 06:34:02AM *  3 points [-]

I give you some numbers taken from a normal distribution with unknown mean and variance. If you're a frequentist, your honest estimate of the mean will be the sample mean. If you're a Bayesian, it will be some number off to the side, depending on whatever bullshit prior you managed to glean from my words above - and you don't have the option of skipping that step, and don't have the option of devising a prior that will always exactly match the frequentist conclusion because math doesn't allow it in the general case. (I kinda equivocate on "honest estimate", but refusing to ever give point estimates doesn't speak well of a mathematician anyway.) So nah, Bayesianism depends on priors more, not "just as much".

If tomorrow Bayesians find a good formalization of "uninformative prior" and a general formula to devise them, you'll happily discard your old bullshit prior and go with the flow, thus admitting that your careful analysis of my words about "unknown normal distribution" today wasn't relevant at all. This is the most fishy part IMO.

(Disclaimer: I am not a crazy-convinced frequentist. I'm a newbie trying to get good answers out of Bayesians, and some of the answers already given in these threads satisfy me perfectly well.)

Comment author: wedrifid 27 July 2009 12:48:28PM *  1 point [-]

I give you some numbers taken from a normal distribution with unknown mean and variance. If you're a frequentist, your honest estimate of the mean will be the sample mean. If you're a Bayesian, it will be some number off to the side, depending on whatever bullshit prior you managed to glean from my words above - and you don't have the option of skipping that step, and don't have the option of devising a prior that will always exactly match the frequentist conclusion because math doesn't allow it in the general case. (I kinda equivocate on "honest estimate", but refusing to ever give point estimates doesn't speak well of a mathematician anyway.) So nah, Bayesianism depends on priors more, not "just as much".

A Bayesian does not have the option of 'just skipping that step' and choosing to accept whichever prior was mandated by Fisher (or whichever other statistician created or insisted upon the use of the particular tool in question). It does not follow that the Bayesian is relying on 'Bullshit' more than the frequentist. In fact, when I use the label 'bullshit' I usually mean 'the use of authority or social power mechanisms in lieu of or in direct defiance of reason'. I obviously apply 'bullshit prior' to the frequentist option in this case.

Comment author: orthonormal 27 July 2009 07:17:32PM 1 point [-]

Vocabulary nitpick: I believe you wrote "in luew of" in lieu of "in lieu of".

Sorry, couldn't help it. IAWYC, anyhow.

Comment author: cousin_it 27 July 2009 02:25:13PM *  2 points [-]

A Bayesian does not have the option of 'just skipping that step' and choosing to accept whichever prior was mandated by Fisher

Why in the world doesn't a Bayesian have that option? I thought you were a free people. :-) How'd you decide to reject those priors in favor of other ones, anyway? As far as I currently understand, there's no universally accepted mathematical way to pick the best prior for every given problem and no psychologically coherent way to pick it out of your head either, because it ain't there. In addition to that, here's some anecdotal evidence: I never ever heard of a Bayesian agent accepting or rejecting a prior.

Comment author: Cyan 27 July 2009 06:57:19AM 9 points [-]

The normal distribution with unknown mean and variance was a bad choice for this example. It's the one case where everyone agrees what the uninformative prior is. (It's flat with respect to the mean and the log-variance.) This uninformative prior is also a matching prior -- posterior intervals are confidence intervals.
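A quick simulation sketch of what the matching property buys you in this case. It takes as given the statement above that the 95% posterior interval for the mean under this prior is the usual Student-t interval, and then checks how often that interval covers the true mean over repeated samples of size 25. The true mean and standard deviation below are arbitrary values picked for illustration, and 2.064 is the tabulated 97.5% t quantile for 24 degrees of freedom.

```python
import random
from math import sqrt
from statistics import mean, stdev

T_CRIT_24 = 2.064  # 97.5% quantile of Student's t, 24 degrees of freedom (table value)
SAMPLE_SIZE = 25


def t_interval(sample):
    """95% Student-t interval for the mean of a sample of size 25."""
    half_width = T_CRIT_24 * stdev(sample) / sqrt(len(sample))
    centre = mean(sample)
    return centre - half_width, centre + half_width


def coverage(true_mean, true_sd, n_repeats=4000, seed=1):
    """Fraction of repeated normal samples whose interval contains true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_repeats):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(SAMPLE_SIZE)]
        lo, hi = t_interval(sample)
        hits += lo < true_mean < hi
    return hits / n_repeats


print(coverage(250.0, 270.0))  # hypothetical true values; expect roughly 0.95
```

The coverage should come out near 0.95 whatever true mean and standard deviation are plugged in, which is the frequentist guarantee; the Bayesian reading of the same interval is that it holds 95% of the posterior mass.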

Comment author: cousin_it 27 July 2009 07:27:33AM *  2 points [-]

I didn't know that was possible, thanks. (Wow, a prior with integral=infinity! One that can't be reached as a posterior after any observation! How'd a Bayesian come by that? But seems to work regardless.) What would be a better example?

ETA: I believe the point raised in that comment still deserves an answer from Bayesians.

Comment author: byrnema 26 July 2009 05:52:49PM *  0 points [-]

Being a frequentist who hangs out on a Bayesian forum, I've thought about the difference between the two perspectives. I think the dichotomy is analogous to bottom-up versus top-down thinking; neither one is superior to the other but the usefulness of each waxes and wanes depending upon the current state of a scientific field. I think we need both to develop any field fully.

Possibly my understanding of the difference between a frequentist and Bayesian perspective is different than yours (I am a frequentist after all) so I will describe what I think the difference is here. I think the two POVs can definitely come to the same (true) conclusions, but the algorithm/thought-process feels different.

Consider tossing a fair-coin. Everyone observes that on average, heads comes up 50% of the time. A frequentist sees the coin-tossing as a realization of the abstract Platonic truth that the coin has a 50% chance of coming up heads. A Bayesian, in contrast, believes that the realization is the primary thing ... the flipping of the coin yields the property of having 50% probability of coming up heads as you flip it. So both perspectives require the observation of many flips to ascertain that the coin is indeed fair, but the only difference between the two views is that the frequentist sees the "50% probability of being heads" as something that exists independently of the flips. It's something you discover rather than something you create.

Seen this way, it sounds like frequentists are Platonists and Bayesians are non-Platonists. Abstract mathematicians tend to be Platonists (but not always) and they've lent their bias to the field. Smart Bayesians, on the other hand, tend to be more practical and become experimentalists.

There's definitely a certain rankle between Platonists and non-Platonists. Non-platonists think that Platonists are nuts, and Platonists think that the non-Platonists are too literal.

May we consider the hypothesis that this difference is just a difference in brain hard-wiring? When a Platonist thinks about a coin flipping and the probability of getting heads, they really do perceive this "probability" as existing independently. However, what do they mean by "existing independently"? We learn what words mean from experience. A Platonist has experience of this type of perception and knows what they mean. A non-Platonist doesn't know what is meant and thinks the same thing is meant as what everyone means when they say "a table exists". These types of existence are different, but how can a Bayesian understand the Platonic meaning without the Platonic experience?

A Bayesian should just observe what does exist, and what words the Platonist uses, and redefine the words to match the experience. This translation must be done similarly with all frequentist mathematics, if you are a Bayesian.

Comment author: antibole 27 July 2009 10:19:58PM 2 points [-]

Being a Platonist and a frequentist aren't the same thing, but they correlate because they're both errors in thinking.

The objection to frequentism is that it builds the answer into the solution so the problem actually changes from the original real world problem. This is fine as long as you can test discrepancies between theory and practice, but that's not always going to be possible.

Comment author: PhilGoetz 27 July 2009 04:12:36PM 0 points [-]

"A Bayesian, in contrast, believes that the realization is the primary thing ... the flipping of the coin yields the property of having 50% probability of coming up heads as you flip it."

Thanks for trying to explain the difference, but I have no idea what this means.

Comment author: byrnema 27 July 2009 05:02:12PM *  0 points [-]

What I was thinking about was this: Bayesians and frequentists both agree that if a fair coin is tossed n times (where n is very large) then a string of heads and tails will result, and that the probability of heads being .5 is in some way related to the fact that the number of heads divided by n will approach .5 for large n.

In my mind, the frequentist perspective is that the .5 probability of getting heads exists first, and then the string of heads and tails realize (i.e., make a physical manifestation of) this abstract probability lurking in the background. As though there is a bin of heads and tails somewhere with exactly a 1:1 ratio and each flip picks randomly from this bin. The Bayesian perspective is that there is nothing but the string of heads and tails -- only the string exists, there's no abstract probability that the string is a realization of. No picking from a bin in the sky. Inspecting the string, a Bayesian can calculate the 0.5 probability ... so the 0.5 probability results from the string. So according to me, the philosophical debate boils down to: what comes first, the probability or the string?

I definitely get the impression that the Bayesians in this thread are skeptical of this description of the difference, and seem to prefer describing the difference of the Bayesian view as considering probability a measure of your uncertainty. However, probability is also taught as a measure of uncertainty in classical probability, so I'm skeptical of this dichotomy. (In favor of my view, the name "frequentist" comes from the observation that they believe in a notion of "frequency" -- i.e., that there's a hypothetical distribution "out there" that observed data is being sampled from.)

Perhaps the difference in whether the correct approach is subjective or objective better gets to the heart of the difference. I am leaning towards this hypothesis because I can see how a frequentist can confuse something being objective with that something having an independent "existence".

Comment author: bdwolfhound 09 August 2009 02:57:32PM *  0 points [-]

I have a little difficulty with the notion that the probable outcome of a coin toss is the result of the toss, rather like the collapse of a quantum probability into reality when observed. Looking at the coin before the toss, surely three probabilities may be objectively observed - H, T or E, and the likelihood of the coin coming to rest on its edge dismissed.

Since the coin MUST then end up H or T, the sum of both probabilities is 1; both outcomes are a priori equally likely and have the value 1/2 before the toss. Whether one chooses to believe that the a priori probabilities have actual existence is a metaphysical issue.

Comment author: JGWeissman 26 July 2009 06:16:07PM 4 points [-]

Seen this way, it sounds like frequentists are Platonists and Bayesians are non-Platonists.

Counterexample: I have a Platonic view of mathematical truths, but a Bayesian view of probability.

A frequentist sees the coin-tossing as a realization of the abstract Platonic truth that the coin has a 50% chance of coming up heads.

This does not make sense. For any given coin flip, either the fundamental truth is that the coin will come up heads, or the fundamental truth is that the coin will come up tails. The 50% probability represents my uncertainty about the fundamental truth, which is not a property of the coin.

Comment author: byrnema 26 July 2009 06:40:16PM *  1 point [-]

Counterexample: I have a Platonic view of mathematical truths, but a Bayesian view of probability.

That's interesting. I had imagined that people would be one way or the other about everything. Can anyone else provide datapoints on whether they are Platonic about only a subset of things?

... in order to triangulate closer to whether Platonism is "hard-wired", do you find it possible to be non-Platonic about mathematical truths? Can someone who is non-Platonic think about them Platonically -- is it a choice?

For any given coin flip, either the fundamental truth is that the coin will come up heads, or the fundamental truth is that the coin will come up tails. The 50% probability represents my uncertainty about the fundamental truth, which is not a property of the coin.

See, that's just not the way a frequentist sees it. First, I notice that you are defining "fundamental truth" as what will actually happen in the next coin flip. In contrast, it is more natural to me to think of the "fundamental truth" as being what the probability of heads is, as a property of the coin and the flip, since the outcome isn't determined yet. But that's just asking different questions. So if the question is, what is the truth about the outcome of the next flip, we are talking about empirical reality (an experiment) and my perspective will be more Bayesian.

Comment author: MichaelVassar 27 July 2009 06:14:09AM 2 points [-]

I'm Platonistic in general I suppose, but I see Bayesianism as subjectively objective as a Platonistic truth.

Comment author: gjm 26 July 2009 08:55:34PM 3 points [-]

Can anyone else provide datapoints [...]

I am a Platonist about mathematics by inclination, though I strongly suspect that this inclination is one that I should resist taking too seriously. I am a Bayesian about probability (at least in the following sense: it seems to me that the Bayesian approach subsumes the others, when they are applied correctly). I am mostly Bayesian about statistics, but don't see any reason why you shouldn't compute confidence intervals and unbiased estimators if you want to. I don't think "Platonist" and "frequentist" are at all the same thing, so I don't see any of the above as indicating that I'm (inclined to be) Platonist about some things but not about others.

[...] the fundamental truth [...]

This seems to have prompted a debate about whether The Fundamental Truth is one about the general propensities of the coin, or one about what will happen the next time it's flipped. I don't see why there should be exactly one Fundamental Truth about the coin; I'd have thought there would be either none or many depending on what sort of thing one wishes to count as a "fundamental truth".

Anyway: imagine a precision robot coin-flipper. I hope it's clear that with such a device one could arrange that the next million flips of the coin all come up heads, and then melt it down. So whatever "fundamental truth" there might be about What The Coin Will Do has to be relative to some model of what's going to be done to it. The point of coin-flipping is that it's a sort of randomness magnifier: small variations in what you do to it make bigger differences to what it does, so a small patch of possibility-space gets turned into a somewhat-uniform sampling of a larger patch (caution: Liouville, volume conservation, etc.). And the "fundamental truth" about the coin that you're appealing to is that, plus what it implies about its ability to turn kinda-sorta-slightly-random-ish coin flipping actions into much more random-ish outcomes. To turn that into an actual expectation of (more or less) independent p=1/2 Bernoulli trials, you need to add some assumption about how people actually flip coins, and then the magic of physics means that a wide range of such assumptions all lead to very similar-looking conclusions about what the outcomes are likely to look like.

In other words: an accurate version of the frequentist way of looking at the coin's behaviour starts with some assumption (wherever it happens to come from) about how coins actually get flipped, mixes that with some (not really probabilistic) facts about the coin, and ends up with a conclusion about what the coin is likely to do when flipped, which doesn't depend too sensitively on that assumption we made.

Whereas a Bayesian way of looking at it starts with some assumption (wherever it happens to come from) about what happens when coins get flipped, mixes that with some (not really probabilistic) facts about what the coin has been observed to do and perhaps a bit of physics, and ends up with a conclusion about what the coin is likely to do when flipped in the future, which doesn't depend too sensitively on that assumption we made.

Clearly the philosophical differences here are irreconcilable...
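
The "randomness magnifier" idea above can be illustrated with a toy model. This is only a sketch under loud assumptions: the coin starts heads-up, spins at a fixed angular velocity for a fixed time, and shows heads when the cosine of its final angle is positive. The spin rate, flight time and spreads below are made-up illustrative numbers, not measurements, and the function names are hypothetical.

```python
import math
import random

def lands_heads(omega, t):
    """Deterministic toy coin: starts heads-up, spins at omega rad/s for t seconds.

    Heads is showing whenever the final angle omega * t has positive cosine.
    """
    return math.cos(omega * t) > 0

def heads_fraction(mean_omega, spread, t=0.5, n=100_000, seed=0):
    """Fraction of heads when the flipper's spin rate varies with the given spread."""
    rng = random.Random(seed)
    heads = sum(lands_heads(rng.gauss(mean_omega, spread), t) for _ in range(n))
    return heads / n

# Assumed numbers: a precision robot with almost no variation in spin rate,
# and a human flipper whose spin rate varies a lot from toss to toss.
robot = heads_fraction(mean_omega=240.0, spread=0.01)
human = heads_fraction(mean_omega=240.0, spread=20.0)
print(f"robot: {robot:.3f}  human: {human:.3f}")
```

In this toy setup the same deterministic coin gives the robot the same face every time, while the human's tosses come out close to half heads. The p=1/2 only appears once an assumption about how the coin gets flipped is added.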

Comment author: Vladimir_Nesov 26 July 2009 07:48:23PM *  4 points [-]

since the outcome isn't determined yet

The outcome is determined timelessly, by the properties of the coin-tossing setup. It hasn't happened yet. What came before the coin determines the coin, but in turn is determined by the stuff located further and further in the past from the actual coin-toss. It is a type error to speak of when the outcome is determined.

Comment author: byrnema 26 July 2009 08:17:03PM *  2 points [-]

Whether or not the universe is deterministic is not determined yet. Even if you and I both think that a deterministic universe is more logical, we should accept that certain figures of speech will persist. When I said the toss wasn't determined yet, I meant that the outcome of the toss was not known yet by me. I don't see how your correction adds to the discussion except possibly to make me seem naive, like I've never considered the concept of determinism before.

Comment author: Nick_Tarleton 26 July 2009 08:38:20PM *  2 points [-]

what the probability of heads is, as a property of the coin and the flip

I meant that the outcome of the toss was not known yet by me

Map/territory distinction. As a property of the actual coin and flip, the probability of heads is 0 or 1 (modulo some nonzero but utterly negligible quantum uncertainty); as a property of your state of knowledge, it can be 0.5.

Comment author: byrnema 26 July 2009 09:38:35PM *  0 points [-]

This comment helped things come into better focus for me.

A frequentist believes that there is a probability of flipping heads, as a property of the coin and (yes, certainly) the conditions of the flipping. To a frequentist, this probability is independent of whether the outcome is determined or not and is even independent of what the outcome is. Consider the following sequence of flips: H T T

A frequentist believes that the probability of flipping heads was .5 all along right? The first 'H' and the second 'T' and the third 'T' were just discrete realizations of this probability.

The reason I've been calling this a Platonic perspective is that I think the critical difference in philosophy is the frequentist idea of this non-empirical "probability" existing independent of realizations. The probability of flipping heads for a set of conditions is .5 whether you actually flip the coins or not. However, frequentists agree you must flip the coin to know that the probability was .5.

You might think this perspective is wrong-headed, and from a strict empirical view where you allow no Platonic entities/concepts, it kind of is. But the question I am really interested in is the following: to what extent is this point of view a choice we can be wrong or right about, or a perspective that some (or most?) people have hard-wired in their physical brain? Further, how can you argue that it isn't useful when it demonstrably has been so useful? Perhaps it facilitates or is necessary for some categories of abstract thought.

Comment author: JGWeissman 26 July 2009 09:43:59PM 5 points [-]

But the question I am really interested in is the following: to what extent is this point of view a choice we can be wrong or right about, or a perspective that most people have hard-wired in their physical brain algorithms?

It could be hard-wired and still be right or wrong.

Comment author: byrnema 26 July 2009 10:13:48PM *  0 points [-]

Correct, generally. But how could a perspective be wrong?

I can think of two ways a perspective can be wrong: either because it (a) asserts a fact about external reality that is not true or (b) yields false conclusions about the external world.

(a) Frequentists don't assert anything extra about the empirical world; they assert the use of (and ostensibly, the "existence" of) something symbolic. From the empiricist perspective, it's not really there. Like a little icon floating above or around the actual thing that your cursor doesn't interact with, so it can't be false in the empirical sense.

(b) It would be fascinating if the frequentist perspective yielded false conclusions, and in such a case, is there any doubt that people would develop and embrace new mathematics that avoided such errors? In fact, we already see this happening where physics at extreme scales seems to defy intuition. If someone wanted to propose a new theory of everything I don't think anyone would ever criticize it on the grounds of not being frequentist. I guess the point here is just that it's useful or not.

Later edit: Ok, I finally get it. Maybe the reason we don't understand physics at the extreme scales is because the frequentist approach was evolved (hard-wired) for understanding intermediate physical scales and it's (apparently) beginning to fail. You guys are using empirical philosophy to try and develop a new brand of mathematics that won't have these inborn errors of intuition. So while I argue that frequentism has definitely been productive so far, you argue that it is intrinsically limited based on philosophical principles.

Comment author: JGWeissman 26 July 2009 10:41:47PM 2 points [-]

A perspective can be wrong if it arbitrarily assigns a probability of 1 to an event that has a symmetrical alternative. Read the intro to My Bayesian Enlightenment for Eliezer's description of a frequentist going wrong in this way with respect to the problem of the mathematician with two children, at least one of which is a boy.
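
For reference, the two-children problem can be checked by brute-force enumeration. This is a minimal sketch under the usual textbook assumptions, which are mine and not taken from the linked post: each child is independently a boy or a girl with equal probability, and the only information given is exactly "at least one is a boy". The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# All equally likely families, written as (older child, younger child).
families = list(product("BG", repeat=2))

def prob(event, given):
    """P(event | given) by counting equally likely families."""
    kept = [f for f in families if given(f)]
    return Fraction(sum(1 for f in kept if event(f)), len(kept))

def both_boys(family):
    return family == ("B", "B")

# Conditioning on "at least one is a boy" keeps BB, BG and GB.
p_at_least_one = prob(both_boys, lambda f: "B" in f)
# Conditioning on "the older child is a boy" keeps only BB and BG.
p_older_is_boy = prob(both_boys, lambda f: f[0] == "B")
print(p_at_least_one, p_older_is_boy)
```

The two conditions look similar but keep different sets of families, which is why the answer depends on exactly what was learned and how.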

Comment author: Vladimir_Nesov 26 July 2009 08:46:26PM *  0 points [-]

Giving the "probability" of the actual outcome for the coin flip as ~1 looks like a type error, although it's clear what you are saying. It's more like P(coin is heads|coin is heads), tautologically 1, not really a probability.

Comment author: Nick_Tarleton 26 July 2009 09:30:28PM 0 points [-]

Edited to clarify.

Comment author: Vladimir_Nesov 26 July 2009 10:12:00PM *  0 points [-]

As a property of the actual coin and flip, the probability of heads is 0 or 1 (modulo some nonzero but utterly negligible quantum uncertainty)

This mixes together two different kinds of probability, confusing the situation. There is nothing fuzzy about the events defining the possible outcomes; the fact that there is also indexical uncertainty imposed on your mind while it observes the outcome is from a different problem.

Comment author: Nick_Tarleton 26 July 2009 10:24:31PM *  0 points [-]

Yeah, it just felt like too much work to add "...randomly sampling from future Everett branches according to the Born probabilities" or the like.

Comment author: Vladimir_Nesov 26 July 2009 08:26:06PM *  0 points [-]

When I said the toss wasn't determined yet, I meant that the outcome of the toss was not known yet by me.

Hence it's your uncertainty, which can as well be handled in a deterministic world. And in a deterministic world, I don't know how to parse your sentence:

it is more natural to me to think of the "fundamental truth" as being what the probability of heads is, as a property of the coin and the flip

Comment author: GuySrinivasan 26 July 2009 07:25:03PM 2 points [-]

As a property of the coin and the flip and the environment and the laws of physics, the probability of heads is either 0 or 1. Just because you haven't computed it doesn't mean the answer becomes a superposition of what you might compute, or something.

What you want is something like the result of taking a natural generalization of the exact situation - if the universe is continuous and the system is chaotic enough "round to some precision" works - and then computing the answer in this parameterized space of situations, and then averaging over the parameter.

The problem is that "natural generalization" is pretty hard to define.

Comment author: JGWeissman 26 July 2009 07:09:36PM 3 points [-]

... in order to triangulate closer to whether Platonism is "hard-wired", do you find it possible to be non-Platonic about mathematical truths? Can someone who is non-Platonic think about them Platonically -- is it a choice?

Most of the time I think about math, I do not worry about if it is Platonic or not. It was really only in the context of considering my epistemic uncertainty that 2+2=4 that I needed to consider the nature of the territory I was mapping, and in this context it did not make sense for the territory to be the physical universe.

In contrast, it is more natural to me to think of the "fundamental truth" as being what the probability of heads is, as a property of the coin and the flip, since the outcome isn't determined yet.

You mean, the outcome has not been determined by you, since you have not observed all the physical properties of the coin, the person flipping it, and the environment, and calculated out all the physics that would tell you whether it would land heads or tails. Attaching a probability to the coin is just our way of dealing with the ignorance and lack of computing power that prevents us from finding the exact answer.

Comment author: byrnema 26 July 2009 07:17:30PM -1 points [-]

What is your point? You reiterate the Bayesian perspective, but do you agree that frequentists and Bayesians have different perspectives about this?

I think it boils down to this: you are a frequentist (and I've been using the term Platonist) if you see the 50% probability as a property of the coin and the flip, and you are a Bayesian if you see the 50% probability as just a way of measuring the uncertainty.

(Given your rationale for being Platonic about mathematics, I don't know if you are really a Platonist (in the hard-wired sense).)

Comment author: JGWeissman 26 July 2009 07:40:06PM 4 points [-]

My point is that the view that 50% probability is a fundamental property of the coin is wrong. It is an example of the Mind Projection Fallacy, thinking that because you don't know the result, somehow the universe doesn't either. It is certainly not the case that, when asked about the result of a single coin flip, giving a 50% probability for heads is the best possible answer. One could, in principle, do more investigation, and find that under the current conditions, the coin will come up heads (or tails) with 99% probability, and actually be right 99 times out of a hundred.

I don't like to call this view of the probability as a fundamental property of the coin the frequentist view. It makes more sense to describe their perspective as the probability being a combined property of the coin and a distribution of conditions in which it could be flipped. From this perspective, the mistake of attaching the probability to the coin is that you miss the fact that you are flipping the coin in one particular condition, which will have a definite outcome. The probability comes from uncertainty about which condition from the distribution applies in this case, and of course, limits on computational power.

Comment author: byrnema 26 July 2009 08:27:33PM *  0 points [-]

Are you saying that frequentists are wrong, or just me?

If the former, how can you say that and consider the case closed when frequentists arrive at correct conclusions? What I'm suggesting is that Bayesians are committing the mind projection fallacy when they assert that frequentists are "wrong".

Comment author: JGWeissman 26 July 2009 09:29:31PM 2 points [-]

I am saying that you are wrong, and I am not sure there isn't more to the frequentist view than you are saying, so I am not prepared to figure out if it is right or wrong until I know more about what it is saying.

If the former, how can you say that and consider the case closed when frequentists arrive at correct conclusions?

Like in the Monty Hall problem, where the frequentists will agree to the correct answer after you beat them over the head with a computer simulation?

What I'm suggesting is that Bayesians are committing the mind projection fallacy when they assert that frequentists are "wrong".

Huh? What property of our minds do you think we are projecting onto the territory?
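
The kind of computer simulation mentioned above for the Monty Hall problem might look like the following. It is a minimal sketch assuming the standard rules: three doors, the host knows where the car is, always opens a goat door other than the player's pick, and always offers the switch. The function names are hypothetical.

```python
import random

def play(switch, rng):
    """Play one round; return True if the player ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the player's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def win_rate(switch, n=100_000, seed=0):
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(n)) / n

stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
print(f"stay: {stay_rate:.3f}  switch: {switch_rate:.3f}")
```

Under these rules the staying strategy wins about one time in three and switching about two in three. Change the host's behaviour (for example, a host who opens a random door) and the numbers change, which is the sort of question the setup forces one to ask.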

Comment author: byrnema 27 July 2009 02:14:23AM 1 point [-]

In the Monty Hall problem, intuition tends to insist on the wrong answer, not valid application of frequentist theory.

Just curious -- is the Monty Hall solution intuitively obvious to a "Bayesian", or do they also need to work through the (Bayesian) math in order to be convinced?

Huh? What property of our minds do you think we are projecting onto the territory?

Oops. I meant the typical mind fallacy.

Comment author: JGWeissman 27 July 2009 02:26:14AM 2 points [-]

Just curious -- is the Monty Hall solution intuitively obvious to a "Bayesian", or do they also need to work through the (Bayesian) math in order to be convinced?

For me at least, it is not so much that the solution is intuitively obvious as that setting up the Bayesian math forces me to ask the important questions.

I meant the typical mind fallacy.

Then how do you think we are assuming that others think like us? It seems to me that we notice that others are not thinking like us, and that in this case, the different thinking is an error. I believe that 2+2=4, and if I said that someone was wrong for claiming that 2+2=3, that would not be a typical mind fallacy.

Comment author: Eliezer_Yudkowsky 26 July 2009 05:39:09PM 17 points [-]

a good Bayesian must never be uncertain about the probability of any future event

Who? Whaa? Your probability is your uncertainty.

Comment author: marks 28 July 2009 07:06:50AM 1 point [-]

I think what Shalizi means is that a Bayesian model is never "wrong", in the sense that it is a true description of the current state of the ideal Bayesian agent's knowledge. I.e., if A says an event X has probability p, and B says X has probability q, then they aren't lying even if p!=q. And the ideal Bayesian agent updates that knowledge perfectly by Bayes' rule (where knowledge is defined as probability distributions of states of the world). In this case, if A and B talk with each other then they should probably update, of course.

In frequentist statistics the paradigm is that one searches for the 'true' model by looking through a space of 'false' models. In this case if A says X has probability p and B says X has probability q != p then at least one of them is wrong.

Comment author: orthonormal 26 July 2009 08:21:36PM 4 points [-]

Also, didn't we already cover metauncertainty here?

Comment author: Cyan 26 July 2009 09:24:53PM *  1 point [-]

Yup. Shalizi's point is that once you've taken meta-uncertainty into account (by marginalizing over it), you have a precise and specific probability distribution over outcomes.

Comment author: Eliezer_Yudkowsky 26 July 2009 09:36:14PM 14 points [-]

Well, yes. You have to bet at some odds. You're in some particular state of uncertainty and not a different one. I suppose the game is to make people think that being in some particular state of uncertainty, corresponds to claiming to know too much about the problem? The ignorance is shown in the instability of the estimate - the way it reacts strongly to new evidence.
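
The point about instability can be made concrete with a small sketch. It assumes a coin of unknown bias with a Beta(a, b) prior over that bias, which is my choice of model and not something specified in the thread. Marginalizing over the bias gives a single predictive probability a / (a + b), and observed flips simply add to a and b. The function names and prior parameters are illustrative.

```python
from fractions import Fraction

def predictive_heads(a, b):
    """P(next flip is heads) after marginalizing over a Beta(a, b) belief about the bias."""
    return Fraction(a, a + b)

def update(a, b, heads, tails):
    """Conjugate update of the Beta parameters on observed flips."""
    return a + heads, b + tails

vague = (1, 1)        # assumed: knows almost nothing about this coin
firm = (1000, 1000)   # assumed: has in effect seen many balanced flips already

before = (predictive_heads(*vague), predictive_heads(*firm))
after = (predictive_heads(*update(*vague, 10, 0)),
         predictive_heads(*update(*firm, 10, 0)))
print(before, after)
```

Both agents bet at the same odds before seeing any data, yet ten heads in a row move the vague agent's estimate a long way and the firm agent's hardly at all. The single number is the same; the ignorance shows up in how it responds to evidence.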

Comment author: Cyan 26 July 2009 10:35:19PM *  6 points [-]

I'm with you on this one. What Shalizi is criticizing is essentially a consequence of the desideratum that a single real number shall represent the plausibility of an event. I don't think the methods he's advocating dispense with the desideratum, so I view this as a delicious bullet-shaped candy that he's convinced is a real bullet and is attempting to dodge.

Comment author: Nick_Tarleton 26 July 2009 08:29:33PM *  2 points [-]

Shalizi says "Bayesian agents never have the kind of uncertainty that Rebonato (sensibly) thinks people in finance should have". My guess is that this means (something that could be described as) uncertainty as to how well-calibrated one is, which AFAIK hasn't been explicitly covered here.

Comment author: Eliezer_Yudkowsky 26 July 2009 05:35:47PM 20 points [-]

Hypothesis testing: I give you a black-box random distribution and claim it obeys a specified formula. You sample some data from the box and inspect it. Frequentism often allows you to call me a liar and be wrong no more than 10% of the time, guaranteed, no priors in sight.

Wrong. If all black boxes do obey their specified formulas, then every single time you call the other person a liar, you will be wrong. P(wrong|"false") ~ 1.

I'm thinking you still haven't quite understood here what frequentist statistics do.

It's not perfectly reliable. They assume they have perfect information about experimental setups and likelihood ratios. (Where does this perfect knowledge come from? Can Bayesians get their priors from the same source?)

A Bayesian who wants to report something at least as reliable as a frequentist statistic, simply reports a likelihood ratio between two or more hypotheses from the evidence; and in that moment has told another Bayesian just what frequentists think they have perfect knowledge of, but simply, with far less confusion and error and mathematical chicanery and opportunity for distortion, and greater ability to combine the results of multiple experiments.

And more importantly, we understand what likelihood ratios are, and that they do not become posteriors without adding a prior somewhere.
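
A minimal sketch of what reporting and combining likelihood ratios could look like, assuming two simple hypotheses about a coin (heads probability 0.7 versus 0.5) and two independent experiments. The hypotheses, the counts and the function names are made up for illustration.

```python
from math import comb

def binom_likelihood(p, heads, n):
    """P(exactly `heads` heads in n flips | heads probability p)."""
    return comb(n, heads) * p**heads * (1 - p)**(n - heads)

def likelihood_ratio(p1, p0, heads, n):
    """How much more probable the data are under p1 than under p0."""
    return binom_likelihood(p1, heads, n) / binom_likelihood(p0, heads, n)

# Two independent experiments, each reported as a likelihood ratio.
lr1 = likelihood_ratio(0.7, 0.5, heads=7, n=10)
lr2 = likelihood_ratio(0.7, 0.5, heads=13, n=20)
combined = lr1 * lr2   # independent results combine by multiplication
pooled = likelihood_ratio(0.7, 0.5, heads=20, n=30)

def posterior_prob(prior_prob, lr):
    """Turn a likelihood ratio into a posterior; this step needs a prior."""
    odds = prior_prob / (1 - prior_prob) * lr
    return odds / (1 + odds)

print(combined, posterior_prob(0.5, combined), posterior_prob(0.01, combined))
```

Multiplying the two reported ratios gives the same number as analysing the pooled data, and the same ratio yields different posteriors for readers with different priors. The ratio alone is not a posterior.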

Comment author: cousin_it 26 July 2009 05:45:50PM *  2 points [-]

Thanks for the catch, struck out that part.

Yes, you can get your priors from the same source they get experimental setups: the world. Except this source doesn't provide priors.

ETA: likelihood ratios don't seem to communicate the same info about the world as confidence intervals to me. Can you clarify?

Comment author: conchis 26 July 2009 07:54:57PM *  1 point [-]

Wrong. If all black boxes do obey their specified formulas, then every single time you call the other person a liar, you will be wrong. P(wrong|"false") ~ 1.

Ok, bear with me. cousin_it's claim was that P(wrong|boxes-obey-formulas)<=.1, am I right? I get that P(wrong|"false" & boxes-obey-formulas) ~ 1, so the denial of cousin_it's claim seems to require P("false"|boxes-obey-formulas) > .1? I assumed that the point was precisely that the frequentist procedure will give you P("false"|boxes-obey-formulas)<=.1. Is that wrong?
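
The two conditional probabilities being distinguished here can be separated in a small simulation. This is a sketch under assumptions of my own: every black box claims to be a fair coin and really is one, each box is sampled 100 times, and the test calls "liar" when the head count is far enough from 50 that this would happen at most 10% of the time under the claim. The names are hypothetical.

```python
import random
from math import comb

N = 100  # flips sampled from each box

def tail_prob(c, n=N):
    """Exact P(|heads - n/2| >= c) when the box really is a fair coin."""
    return sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= c) / 2**n

# Smallest cutoff whose false-alarm rate under the claimed formula is at most 10%.
cutoff = next(c for c in range(N) if tail_prob(c) <= 0.10)

def calls_liar(rng):
    """Sample one honest box and apply the test."""
    heads = sum(rng.random() < 0.5 for _ in range(N))
    return abs(heads - N / 2) >= cutoff

rng = random.Random(0)
trials = 20_000
calls = sum(calls_liar(rng) for _ in range(trials))
wrong_calls = calls            # every box is honest, so every call is wrong

call_rate = calls / trials                 # estimates P("false" | boxes obey formulas)
wrong_given_call = wrong_calls / calls     # P(wrong | "false" & boxes obey formulas)
print(cutoff, call_rate, wrong_given_call)
```

The procedure keeps the rate of "liar" calls at or below 10% when the boxes are honest, but every such call is mistaken. The guarantee is about how often the accusation is made, not about how often an accusation, once made, is right.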

Comment author: cousin_it 26 July 2009 09:58:57PM *  2 points [-]

My claim was what Eliezer said, and it was incorrect. Other than that, your comment is correct.

Comment author: conchis 26 July 2009 10:17:36PM 0 points [-]

Ah, I parsed it wrongly. Whoops. Would it be worth replacing it with a corrected claim rather than just striking it?

Comment author: cousin_it 26 July 2009 10:42:06PM *  0 points [-]

Done. Thanks for the help!

Comment author: AllanCrossman 26 July 2009 05:22:25PM 2 points [-]

What does one read to become well versed in this stuff in two days; and how much skill with maths does it require?

Comment author: cousin_it 26 July 2009 05:28:22PM *  2 points [-]

Ouch! Now I see the two days stuff looks like boasting. Don't worry, all my LW posts up to now have contained stupid mathematical mistakes, and chances are people will find errors in this one too :-)

(ETA: sure enough, Eliezer has found one. Luckily it wasn't critical.)

I have a degree in math and competed at the national level in my teens (both in Russia), but haven't done any serious math since I graduated six years ago. The sources for this post were mostly Wikipedia and Google searches on keywords from Wikipedia.

Comment author: AllanCrossman 26 July 2009 05:41:47PM 2 points [-]

My comment was an honest question and was not intended as derogatory...