Cyan comments on What is Bayesianism? - Less Wrong

81 26 February 2010 07:43AM



Comment author: 26 February 2010 12:32:18PM 18 points [-]

Is there a simple explanation of the conflict between Bayesianism and frequentism? I have sort of a feel for it from reading background materials, but a specific example where they yield different predictions would be awesome. Has such an example been posted before?

Comment author: 26 February 2010 08:49:22PM *  7 points [-]

Eliezer's views as expressed in Blueberry's links touch on a key identifying characteristic of frequentism: the tendency to think of probabilities as inherent properties of objects. More concretely, a pure frequentist (a being as rare as a pure Bayesian) treats probabilities as proper only to outcomes of a repeatable random experiment. (The definition of such a thing is pretty tricky, of course.)

What does that mean for frequentist statistical inference? Well, it's forbidden to assign probabilities to anything that is deterministic in your model of reality. So you have estimators, which are functions of the random data and thus random themselves, and you assess how good they are for your purpose by looking at their sampling distributions. You have confidence interval procedures, the endpoints of which are random variables, and you assess the sampling probability that the interval contains the true value of the parameter (and the width of the interval, to avoid pathological intervals that have nothing to do with the data). You have statistical hypothesis testing, which categorizes a simple hypothesis as “rejected” or “not rejected” based on a procedure assessed in terms of the sampling probability of an error in the categorization. You have, basically, anything you can come up with, provided you justify it in terms of its sampling properties over infinitely repeated random experiments.

Comment author: 26 February 2010 09:19:32PM *  7 points [-]

Here is a more general definition of "pure frequentism" (which includes frequentists such as Reichenbach):

Consider an assertion of probability of the form "This X has probability p of being a Y." A frequentist holds that this assertion is meaningful only if the following conditions are met:

1. The speaker has already specified a determinate set X of things that actually exist or will exist, and this set contains "this X".

2. The speaker has already specified a determinate set Y containing all things that have been or will be Ys.

The assertion is true if the proportion of elements of X that are also in Y is precisely p.

A few remarks:

1. The assertion would mean something different if the speaker had specified different sets X and Y, even though X and Y aren't mentioned explicitly in the assertion.

2. If no such sets had been specified in the preceding discourse, the assertion by itself would be meaningless.

3. However, the speaker has complete freedom in what to take as the set X containing "this X", so long as the set does contain it. In particular, the other elements don't have to be exactly like "this X", or be generated by exactly the same repeatable procedure, or anything like that. There are practical constraints on X, though. For example, X should be an interesting set.

4. [ETA:] An important distinction between Bayesianism and frequentism is this: note that, according to the above, the correct probability has nothing to do with the state of knowledge of the speaker. Once the sets X and Y are determined, there is an objective fact of the matter regarding the proportion of things in X that are also in Y. The speaker is objectively right or wrong in asserting that this proportion is p, and that rightness or wrongness has nothing to do with what the speaker knew. It has only to do with the objective frequency of elements of Y among the elements of X.

Comment author: 29 September 2013 07:24:43AM 4 points [-]

I'm sorry to see such wrongheaded views of frequentism here. Frequentists also assign probabilities to events where the probabilistic treatment is based entirely on limited information rather than on a literal randomly generated phenomenon. If Fisher or Neyman were actually read by people purporting to understand frequentist/Bayesian issues, they'd have a radically different idea. Readers of this blog should take it upon themselves to check out some of these vast oversimplifications. And I'm sorry, but Reichenbach's frequentism has very little to do with frequentist statistics. Reichenbach, a philosopher, had the idea that propositions have frequentist probabilities. So scientific hypotheses--which would not be assigned probabilities by frequentist statisticians--could have frequentist probabilities for Reichenbach, even though he didn't think we knew enough yet to judge them. He thought that at some point we'd be able to judge how frequently hypotheses of a given type are true. I think it's a problematic idea, but my point was just to illustrate that some large topics are being misrepresented here, and people are being sold a wrongheaded view. Just in case anyone cares. Sorry to interrupt the conversation. (errorstatistics.com)

Comment author: 30 September 2013 12:24:49AM *  1 point [-]

Do you intend to be replying to me or to Tyrrell McAllister?

Comment author: 27 February 2010 05:47:54AM *  2 points [-]

What does that mean for frequentist statistical inference? Well, it's forbidden to assign probabilities to anything that is deterministic in your model of reality.

Wait - Bayesians can assign probabilities to things that are deterministic? What does that mean?

What would a Bayesian do instead of a T-test?

Comment author: 27 February 2010 10:34:45AM *  13 points [-]

Wait - Bayesians can assign probabilities to things that are deterministic? What does that mean?

Absolutely!

The Bayesian philosophy is that probabilities are about states of knowledge. Probability is reasoning with incomplete information, not about whether an event is "deterministic"; probabilities still make sense in a completely deterministic universe. In a poker game, there are almost surely no quantum events influencing how the deck is shuffled. Classical mechanics, which is deterministic, suffices to predict the ordering of cards. Even so, we have neither sufficient initial conditions (on all the particles in the dealer's body and brain, and any incoming signals) nor the computational power to calculate the ordering of the cards. In this case, we can still use probability theory to figure out the probabilities of various hand combinations, which we can use to guide our betting. Incorporating knowledge of what cards I've been dealt, and what (if any) are public, is straightforward. Incorporating players' actions and reactions is much harder, and not really well enough defined that there is a mathematically correct answer, but clearly we should use that knowledge in determining what types of hands we think our opponents are likely to have. If we count as the dealer shuffles, and see he only shuffled three or four times, we can in principle (given a reasonable mathematical model of shuffling, such as the one Diaconis constructed to show that seven shuffles are needed to randomize a deck) use the correlations left in the deck to give us even more clues about opponents' likely hands.
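The point about deterministic decks can be made concrete with a toy calculation (my own hypothetical setup, not from the comment): given that we hold two known non-ace cards, the ordering of the remaining deck is fully determined, yet from our state of knowledge we can still compute the probability that an opponent's two hidden cards include an ace.

```python
from math import comb

# Deterministic deck, incomplete information: we hold two non-ace cards,
# so 50 cards are unseen, 4 of which are aces. The probability that an
# opponent's two hidden cards include at least one ace is one minus the
# probability that both are drawn from the 46 non-aces.
unseen = 52 - 2                 # cards we haven't seen
non_aces = unseen - 4           # unseen cards that are not aces
p_at_least_one_ace = 1 - comb(non_aces, 2) / comb(unseen, 2)
print(p_at_least_one_ace)       # about 0.155
```

Nothing random has to "happen" for this number to be meaningful; it just summarizes what we don't know about an already-shuffled deck.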

What would a Bayesian do instead of a T-test?

In most cases we'd step back, and ask what you were trying to do, such that a T-test seemed like a good idea.

For those unaware, a t-test is a way of calculating the "likelihood" of the null hypothesis, which measures how likely the data are given that model. If the data are even moderately compatible, Frequentists say "we can't reject it". If they are terribly unlikely, Frequentists say the null hypothesis can be rejected -- that it's worth looking at another model.

From a Bayesian perspective, this is somewhat backwards -- we don't really care how likely the data are given this model, P(D|M) -- after all, we actually got the data. What we really want to know is how useful the model is, now that we have this data. Some simple consistency requirements and scaling constraints mean that this usefulness has to act just like a probability. So let's just call it the probability of the model given the data: P(M|D). A small bit of algebra gives us P(M|D) = P(D|M) * P(M) / P(D), where P(D) is the sum over all models i of P(D|M_i) P(M_i), and P(M_i) is some "prior probability" of each model -- how useful we think that model would be even without any data collected (but, importantly, with some background knowledge).
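As a toy illustration of that formula (the two coin models, the prior, and the data are all made up for the example):

```python
from math import comb

def likelihood(p_heads, heads, tosses):
    """P(D|M): binomial likelihood of the observed data under a coin model."""
    return comb(tosses, heads) * p_heads**heads * (1 - p_heads)**(tosses - heads)

# Two hypothetical models of a coin, given equal prior weight.
models = {"fair (p=0.5)": 0.5, "biased (p=0.8)": 0.8}
prior = {m: 0.5 for m in models}

heads, tosses = 8, 10  # observed data
like = {m: likelihood(p, heads, tosses) for m, p in models.items()}

# P(D) = sum over models of P(D|M_i) P(M_i), the rescaling term.
p_data = sum(like[m] * prior[m] for m in models)
posterior = {m: like[m] * prior[m] / p_data for m in models}
```

Eight heads in ten tosses shifts most of the posterior weight onto the biased model, but the fair model is merely downweighted, not "rejected".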

In this framework, we don't have absolute objective levels of confidence in our theories. All that is absolute and objective is how the data should change our confidence in various theories. We can't just reject a theory if the data don't match well, unless we have a better alternative theory to switch to. In many cases the models can be continuously indexed, such that the index corresponds to a parameter in a unified model; then this becomes parameter estimation -- we get a range of theories with probability densities instead of probabilities, or equivalently, one theory with a probability density on a parameter, and each new batch of data mechanically turns a crank to give us a new probability density on that parameter.
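The crank-turning is especially clean when the prior density is conjugate to the likelihood. A minimal sketch, using a hypothetical coin-bias example with a Beta prior (my choice for illustration, not anything from the comment):

```python
def update_beta(a, b, heads, tails):
    """One turn of the crank: a Beta(a, b) density on the coin's bias,
    combined with binomially distributed data, yields Beta(a + heads, b + tails)."""
    return a + heads, b + tails

a, b = 1, 1                     # Beta(1, 1): a uniform prior density on the bias
a, b = update_beta(a, b, 8, 2)  # observe 8 heads, 2 tails -> Beta(9, 3)
posterior_mean = a / (a + b)    # 9 / 12 = 0.75
```

Each new batch of data just increments the same two numbers, which is exactly the mechanical feel the comment describes.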

There are a couple of unsatisfying bits here:
First it really would be nice to say "this theory is ridiculous because it doesn't explain the data" without any reference to any other theory. But if we know it's the only theory in town, we don't have a choice. If it's not the only theory in town, then how useful it is can really only coherently be measured relative to how useful other theories are.
Second, we need to give "prior probabilities" to our various theories, and the math doesn't give any direct justification for what these should be. However, as long as they aren't crazy, the incoming data will continuously update them, so that the models that seem more useful get weighted as more useful, and the ones that don't get weighted as less useful. This of course means we need reasonable spaces of theories to work over, and we'll only pick a good model if there is a good model in this space. If you eventually realize that "hey, all these models are crappy", there is no good way of expanding the set of models you're willing to consider, though a common approach is to just "start over" with an expanded model space and reallocated prior probabilities. You can't simply pretend that the first analysis was over a subset of the new one, because the rescaling due to the P(D) term depends on the set of models you have. (Though you can handwave that you weren't actually calculating P(M_i|D) but P(M_i|D, {M}) -- the probability of each model given the data, assuming that it is one of these models.)

A sometimes useful shortcut, rather than working directly with the probabilities (and hence needing the rescaling), is to work with the likelihoods -- or, more tractably, their logs. The difference of the log likelihoods of two different theories for some data is a reasonable measure of how much that data should affect their relative ranking. But any given likelihood by itself hasn't much meaning; only comparison with the rest of the set tells you anything useful.
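Sketching that shortcut for the same sort of hypothetical coin comparison (models and counts made up for illustration); note that any term depending only on the data, such as the binomial coefficient, cancels in the difference:

```python
from math import log

def log_likelihood(p_heads, heads, tails):
    """Log P(D|M) for a coin model, up to an additive constant that depends
    only on the data and so cancels when comparing two models."""
    return heads * log(p_heads) + tails * log(1 - p_heads)

# Positive delta: 8 heads and 2 tails shifts the relative ranking
# toward the p = 0.8 model and away from the fair-coin model.
delta = log_likelihood(0.8, 8, 2) - log_likelihood(0.5, 8, 2)
```

No rescaling by P(D) is needed, but delta only ranks these two models against each other; neither log likelihood means anything on its own.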

Comment author: 27 February 2010 01:35:12PM 1 point [-]

Very nice! I'd only replace "useful" with "plausible". (Sure, it's hard to define plausibility, but usefulness is not really the right concept.)

Comment author: 27 February 2010 07:19:00PM *  3 points [-]

"Usefulness" certainly isn't the orthodox Bayesian phrasing. I call myself a Bayesian because I recognize that Bayes's Rule is the right thing to use in these situations. Whether or not the probabilities assigned to hypotheses "actually are" probabilities (whatever that means), they should obey the same mathematical rules of calculation as probabilities.

But precisely because only the manipulation rules matter, I'm not sure it is worth emphasizing that "to be a good Bayesian" you must accord these probabilities the same status as other probabilities. A hardcore Frequentist is not going to be comfortable doing that. Heck, I'm not sure I'm comfortable doing that. Data and event probabilities are things that can eventually be "resolved" to true or false, by looking after the fact. Probability as plausibility makes sense for these things.

But for hypotheses and models, I ask myself "plausibility of what? Being true?" Almost certainly, the "real" model (when that even makes sense) isn't in our space of models. For example, a common, almost necessary, assumption is exchangeability: that any given permutation of the data is equally likely -- effectively that all data points are drawn from the same distribution. Data often doesn't behave like that, instead having a time drift. Coins being tossed develop wear, cards being shuffled and dealt get bent.

I really do prefer to think of some models as being more or less useful. Of course, following this path shades into decision theory: we might want to assign priors according to how "tractable" the models are. Models that just hard-code what the data will be take lots of specification, so they should have lower initial probabilities. Models that take longer to compute data probabilities should similarly have a probability penalty, not simply because they're implausible, but because we don't want to use them unless the data force us to.

Comment author: 27 February 2010 11:25:03PM 5 points [-]

...shades into decision theory...Models that take longer to compute data probabilities should similarly have a probability penalty, not simply because they're implausible, but because we don't want to use them unless the data force us to.

Whoa! that sounds dangerous! Why not keep the beliefs and costs separate and only apply this penalty at the decision theory stage?

Comment author: 27 February 2010 11:32:36PM *  1 point [-]

Well, I did say it shades into decision theory...

Yes, it absolutely is dangerous, and thinking about it more, I agree it should not be done this way. Probability penalties don't scale correctly with the data collected: they're essentially just a fixed offset. A modified utility for using a particular method really is a different thing. If a method is unusable, we shouldn't use it, and methods that trade off accuracy for manageability should be decided on at that level, once we can judge the accuracy -- not earlier.

EDIT: I suppose I was hoping for a valid way of justifying the fact that we throw out models that are too hard to use or analyze -- they never make it into our set of hypotheses in the first place. It's amazing how often conjugate priors "just happen" to be chosen...

Comment author: 27 February 2010 11:12:08PM 2 points [-]

Models that take longer to compute data probabilities should similarly have a probability penalty, not simply because they're implausible, but because we don't want to use them unless the data force us to.

I am much more comfortable leaving probability as it is but using a different term for usefulness.

Comment author: 27 February 2010 08:20:08PM 2 points [-]

But for hypotheses and models, I ask myself "plausibility of what? Being true?"

Plausibility of being true given the prior information. Just as Aristotelian logic gives valid arguments (but not necessarily sound ones), Bayes's theorem gives valid but not necessarily sound plausibility assessments.

following this path shades into decision theory

That's pretty much why I wanted to make the distinction between plausibility and usefulness. One of the things I like about the Cox-Jaynes approach is that it cleanly splits inference and decision-making apart.

Comment author: 27 February 2010 09:12:06PM *  1 point [-]

Plausibility of being true given the prior information.

Okay, sure we can go back to the Bayesian mantra of "all probabilities are conditional probabilities". But our prior information effectively includes the statement that one of our models is the "true one". And that's never the actual case, so our arguments are never sound in this sense, because we are forced to work from prior information that isn't true. This isn't a huge problem, but it in some sense undermines the motivation for finding these probabilities and treating them seriously -- they're conditional probabilities being applied in a case where we know that what is being conditioned on is false. What is the grounding to our actual situation? I like to take the stance that in practice this is still useful -- as an approximation procedure -- sorting through models that are approximately right.

Comment author: 27 February 2010 10:53:11PM *  2 points [-]

And that's never the actual case, so our arguments are never sound in this sense, because we are forced to work from prior information that isn't true.

One does generally resort to non-Bayesian model checking methods. Andrew Gelman likes to include such checks under the rubric of "Bayesian data analysis"; he calls the computing of posterior probabilities and densities "Bayesian inference", a preceding subcomponent of Bayesian data analysis. This makes for sensible statistical practice, but the underpinnings aren't strong. One might consider it an attempt to approximate the Solomonoff prior.

Comment author: 28 February 2010 07:31:41AM 0 points [-]

Yes, in practice people resort to less motivated methods that work well.

I'd really like to see some principled answer that has the same feel as Bayesianism though. As it stands, I have no problem using Bayesian methods for parameter estimation. This is natural because we really are getting pdf(parameters | data, model). But for model selection and evaluation (i.e. non-parametric Bayes) I always feel that I need an "escape hatch" to include new models that the Bayes formalism simply doesn't have any place for.

Comment author: 26 February 2010 09:31:13PM 1 point [-]

the tendency to think of probabilities as inherent properties of objects.

Yeah, this was my intuitive reason for thinking frequentists are a little crazy.

Comment author: 26 February 2010 10:47:05PM *  4 points [-]

On the other hand, it's evidence to me that we're talking about different types of minds. Have we identified whether this aspect of frequentism is a choice, or just the way their minds work?

I'm a frequentist, I think, and when I interrogate my intuition about whether 50% heads / 50% tails is a property of a fair coin, it returns 'yes'. However, I understand that this property is an abstract one, and my intuition doesn't make any different empirical predictions about the coin than a Bayesian would. Thus, what difference does it make if I find it natural to assign this property?

In other words, in what (empirically measurable!) sense could it be crazy?

Comment author: 26 February 2010 11:10:56PM 5 points [-]

Well, the immediate objection is that if you hand the coin to a skilled tosser, the frequencies of heads and tails in the tosses can be markedly different from 50%. If you put this probability in the coin, then you really aren't modeling things in a manner that accords with results. You can, of course, talk instead about a procedure of coin-tossing, which naturally has to specify the coin as well.

Of course, that merely pushes things back a level. If you completely specify the tossing procedure (people have built coin-tossing machines), then you can repeatedly get 100%/0% splits by careful tuning. If you don't know whether it is tuned to 100% heads or 100% tails, is it still useful to describe this situation probabilistically? A hard-core Frequentist "should" say no, as everything is deterministic. Most people are willing to allow that 50% probability is a reasonable description of the situation. To the extent that you do allow this, you are Bayesian. To the extent that you don't, you're missing an apparently valuable technique.

Comment author: 27 February 2010 01:15:43AM *  2 points [-]

The frequentist can account for the biased toss and determinism, in various ways.

My preferred reply would be that the 50/50 is a property of the symmetry of the coin. (Of course, it's a property of an idealized coin. Heck, a real coin can land balanced on its edge.) If someone tosses the coin in a way that biases the coin, she has actually broken the symmetry in some way with her initial conditions. In particular, the tosser must begin with the knowledge of which way she is holding the coin -- if she doesn't know, she can't bias the outcome of the coin.

I understand that Bayesians don't tend to abstract things to their idealized forms ... I wonder to what extent Frequentism does this necessarily. (What is the relationship between Frequentism and Platonism?)

Comment author: 27 February 2010 01:55:12AM *  6 points [-]

The frequentist can account for these things, in various ways.

Oh, absolutely. The typical way is choosing some reference class of idealized experiments that could be done. Of course, the right choice of reference class is just as arbitrary as the right choice of Bayesian prior.

My preferred reply would be that the 50/50 is a property of the symmetry of the coin.

Whereas the Bayesian would argue that the 50/50 property is a symmetry about our knowledge of the coin -- even a coin that you know is biased, but that you have no evidence for which way it is biased.

I understand that Bayesians don't tend to abstract things to their idealized forms

Well, I don't think Bayesians are particularly reluctant to look at idealized forms, it's just that when you can make your model more closely match the situation (without incurring horrendous calculational difficulties) there is a benefit to do so.

And of course, the question is "which idealized form?" There are many ways to idealize almost any situation, and I think talking about "the" idealized form can be misleading. Talking about a "fair coin" is already a serious abstraction and idealization, but it's one that has, of course, proven quite useful.

I wonder to what extent Frequentism does this necessarily. (What is the relationship between Frequentism and Platonism?)

That's a very interesting question.

Comment author: 27 February 2010 08:44:42AM 4 points [-]

What is the relationship between Frequentism and Platonism?

To quote from Gelman's rejoinder that Phil Goetz mentioned,

In a nutshell: Bayesian statistics is about making probability statements, frequentist statistics is about evaluating probability statements.

So, speaking very loosely, Bayesianism is to science, inductive logic, and Aristotelianism as frequentism is to math, deductive logic, and Platonism. That is, Bayesianism is synthesis; frequentism is analysis.

Comment author: 27 February 2010 01:42:35PM 1 point [-]

Interesting! That makes a lot of sense to me, because I had already made connections between science and Aristotelianism, pure math and Platonism.