
PhilGoetz comments on What is Bayesianism? - Less Wrong

81 Post author: Kaj_Sotala 26 February 2010 07:43AM


Comments (211)


Comment author: PhilGoetz 27 February 2010 05:47:54AM * 2 points

What does that mean for frequentist statistical inference? Well, it's forbidden to assign probabilities to anything that is deterministic in your model of reality.

Wait - Bayesians can assign probabilities to things that are deterministic? What does that mean?

What would a Bayesian do instead of a T-test?

Comment author: wnoise 27 February 2010 10:34:45AM * 13 points

Wait - Bayesians can assign probabilities to things that are deterministic? What does that mean?

Absolutely!

The Bayesian philosophy is that probabilities are about states of knowledge. Probability is reasoning with incomplete information; it is not about whether an event is "deterministic", and probabilities still make sense in a completely deterministic universe. In a poker game, there are almost surely no quantum events influencing how the deck is shuffled. Classical mechanics, which is deterministic, suffices to predict the ordering of the cards. Even so, we have neither sufficient initial conditions (on all the particles in the dealer's body and brain, and any incoming signals) nor the computational power to calculate that ordering. In this case, we can still use probability theory to figure out probabilities of various hand combinations that we can use to guide our betting. Incorporating knowledge of what cards I've been dealt, and which cards (if any) are public, is straightforward. Incorporating players' actions and reactions is much harder, and not really well enough defined that there is a mathematically correct answer, but clearly we should use that knowledge in determining what types of hands we think it likely for our opponents to have. If we count as the dealer shuffles, and see he only shuffled three or four times, in principle we can (given a reasonable mathematical model of shuffling, such as the one Diaconis constructed to give the result that 7 shuffles are needed to randomize a deck) use the correlations left in there to give us even more clues about opponents' likely hands.
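To make the "straightforward" part concrete, here is a minimal sketch of how seeing my own cards changes the probability that one opponent holds an ace. It assumes a two-card hand (as in hold'em, which the comment does not specify) and treats the unseen cards as equally likely to be anywhere; the function name `prob_holds_ace` is made up for illustration.

```python
from math import comb

def prob_holds_ace(unseen_cards, unseen_aces, hand_size=2):
    """P(a hand drawn from the unseen cards contains at least one ace)."""
    no_ace_hands = comb(unseen_cards - unseen_aces, hand_size)
    all_hands = comb(unseen_cards, hand_size)
    return 1 - no_ace_hands / all_hands

# Before looking at my cards: 52 unseen cards, 4 unseen aces.
before = prob_holds_ace(52, 4)

# After seeing that I hold two aces: 50 unseen cards, 2 unseen aces.
after = prob_holds_ace(50, 2)

print(f"before: {before:.4f}, after: {after:.4f}")
```

The deck has not changed between the two calls; only my information has, and the probability moves accordingly.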

What would a Bayesian do instead of a T-test?

In most cases we'd step back, and ask what you were trying to do, such that a T-test seemed like a good idea.

For those unaware, a t-test is a way of calculating a kind of "likelihood" for the null hypothesis: strictly speaking a p-value, which measures how probable data at least as extreme as those observed would be, given that model. If the data are even moderately compatible, Frequentists say "we can't reject it". If they are terribly unlikely, the Frequentists say that it can be rejected -- that it's worth looking at another model.

From a Bayesian perspective, this is somewhat backwards -- we don't really care how likely the data is given this model, P(D|M) -- after all, we actually got the data. We effectively want to know how useful the model is, now that we know this data. Some simple consistency requirements and scaling constraints mean that this usefulness has to act just like a probability. So let's just call it the probability of the model, given the data: P(M|D). A small bit of algebra gives us that P(M|D) = P(D|M) * P(M)/P(D), where P(D) is the sum over all models i of P(D|M_i) P(M_i), and P(M_i) is some "prior probability" of each model -- how useful we think that model would be, even without any data collected (but, importantly, with some background knowledge).
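As a rough illustration of the formula above, here is a sketch that computes P(M|D) for two candidate models of a coin. The models ("fair" with p = 0.5, "biased" with p = 0.8), the equal priors, and the data (8 heads in 10 tosses) are all invented for the example.

```python
from math import comb

def binomial_likelihood(heads, tosses, p):
    """P(D|M): probability of `heads` heads in `tosses` tosses if P(heads) = p."""
    return comb(tosses, heads) * p**heads * (1 - p) ** (tosses - heads)

def posterior(priors, likelihoods):
    """Bayes's rule over a finite set of models: P(M_i|D) = P(D|M_i) P(M_i) / P(D)."""
    joint = {m: priors[m] * likelihoods[m] for m in priors}
    evidence = sum(joint.values())  # P(D), the sum over all models considered
    return {m: j / evidence for m, j in joint.items()}

# Hypothetical models and data, chosen only for illustration.
models = {"fair": 0.5, "biased": 0.8}
priors = {"fair": 0.5, "biased": 0.5}
heads, tosses = 8, 10

likelihoods = {m: binomial_likelihood(heads, tosses, p) for m, p in models.items()}
post = posterior(priors, likelihoods)
print(post)
```

Note that `evidence` is a sum over exactly the models in `priors`, so the answer is only meaningful relative to that set.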

In this framework, we don't have absolute objective levels of confidence in our theories. All that is absolute and objective is how the data should change our confidence in various theories. We can't just reject a theory if the data don't match well, unless we have a better alternative theory to which we can switch. In many cases these models can be continuously indexed, such that the index corresponds to a parameter in a unified model. Model comparison then becomes parameter estimation: we get a range of theories with probability densities instead of probabilities, or equivalently, one theory with a probability density on a parameter, and getting new data mechanically turns a crank to give us a new probability density on this parameter.
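The "crank" can be sketched with a grid approximation to the density on a single parameter, here the heads-probability of a coin. The uniform prior, the grid resolution, and the toss counts are assumptions made for the example, and `update` is a hypothetical helper, not anything from a library.

```python
def update(grid, density, heads, tails):
    """One turn of the crank: multiply the density by the likelihood, renormalize."""
    unnormalized = [d * t**heads * (1 - t) ** tails for t, d in zip(grid, density)]
    width = grid[1] - grid[0]
    total = sum(unnormalized) * width  # Riemann-sum approximation of the integral
    return [u / total for u in unnormalized]

n = 1001
grid = [i / (n - 1) for i in range(n)]  # candidate parameter values in [0, 1]
prior = [1.0] * n                       # uniform prior density (an assumption)

# Feed the data in two batches...
post1 = update(grid, prior, heads=3, tails=1)
post2 = update(grid, post1, heads=5, tails=1)

# ...or all at once; the resulting density is the same.
batch = update(grid, prior, heads=8, tails=2)

mode = grid[batch.index(max(batch))]
print("most probable parameter value:", mode)
```

Yesterday's posterior serves as today's prior, which is why the batched and sequential updates agree.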

There are a couple of unsatisfying bits here:
First, it really would be nice to say "this theory is ridiculous because it doesn't explain the data" without any reference to any other theory. But if we know it's the only theory in town, we don't have a choice. If it's not the only theory in town, then how useful it is can really only coherently be measured relative to how useful other theories are.
Second, we need to give "prior probabilities" to our various theories, and the math doesn't give any direct justifications for what these should be. However, as long as these aren't crazy, the incoming data will continuously update these so that the ones that seem more useful will get weighted as more useful, and the ones that aren't will get weighted as less useful. This of course means we need reasonable spaces of theories to work over, and we'll only pick a good model if we have a good model in this space of theories. If you eventually realize that "hey, all these models are crappy", there is no good way of expanding the set of models you're willing to consider, though a common way is to just "start over" with an expanded model space and reallocated prior probabilities. You can't just pretend that the first analysis was over some subset of this analysis, because the rescaling due to the P(D) term depends on the set of models you have. (Though you can handwave that you weren't actually calculating P(M_i|D), but P(M_i|D, {M}), the probability of each model given the data, assuming that it was one of these models.)
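A small sketch of why the rescaling depends on the model set: adding a third model changes every posterior probability, but leaves the ratio between the original two untouched. The likelihood values and priors below are made-up numbers, and the model names A, B, C are placeholders.

```python
def posterior(priors, likelihoods):
    """P(M_i | D, {M}): Bayes's rule restricted to the models listed in `priors`."""
    joint = {m: priors[m] * likelihoods[m] for m in priors}
    evidence = sum(joint.values())  # P(D) depends on which models are included
    return {m: j / evidence for m, j in joint.items()}

# Hypothetical values of P(D|M) for three models.
likelihoods = {"A": 0.02, "B": 0.01, "C": 0.10}

# First analysis: only A and B are on the table.
small = posterior({"A": 0.5, "B": 0.5}, likelihoods)

# "Starting over" with an expanded space and reallocated priors.
large = posterior({"A": 0.25, "B": 0.25, "C": 0.5}, likelihoods)

print("P(A|D) with {A, B}:   ", small["A"])
print("P(A|D) with {A, B, C}:", large["A"])
print("A:B ratio in each:", small["A"] / small["B"], large["A"] / large["B"])
```

The absolute numbers are conditional on the model set; the relative ranking of A against B is not.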

A sometimes useful shortcut is, rather than working directly with the probabilities (and hence needing the rescaling), to work with the likelihoods (or more tractably, their logs). The difference of the log likelihoods of two different theories for some data is a reasonable measure of how much that data should affect their relative ranking. But any given likelihood by itself hasn't much meaning -- only comparison with the rest in a set tells you anything useful.
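Here is one way the log-likelihood shortcut might look, assuming independent coin tosses and two invented theories (p = 0.8 versus p = 0.5). No P(D) term appears anywhere, and evidence from separate batches of data simply adds.

```python
from math import log

def log_likelihood(data, p):
    """Log of P(D|M) for independent tosses; `data` is a list of 1 (heads) / 0 (tails)."""
    return sum(log(p) if x else log(1 - p) for x in data)

def log_likelihood_difference(data, p_a, p_b):
    """How much `data` shifts the ranking of theory A relative to theory B."""
    return log_likelihood(data, p_a) - log_likelihood(data, p_b)

# Hypothetical data: 8 heads and 2 tails, split into two batches.
first = [1, 1, 0, 1, 1]
second = [1, 0, 1, 1, 1]

diff_first = log_likelihood_difference(first, 0.8, 0.5)
diff_second = log_likelihood_difference(second, 0.8, 0.5)
diff_all = log_likelihood_difference(first + second, 0.8, 0.5)

print(diff_first, diff_second, diff_all)
```

A positive difference favors the first theory; the individual log likelihoods on their own say little.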

Comment author: Cyan 27 February 2010 01:35:12PM 1 point

Very nice! I'd only replace "useful" with "plausible". (Sure, it's hard to define plausibility, but usefulness is not really the right concept.)

Comment author: wnoise 27 February 2010 07:19:00PM * 3 points

"Usefulness" certainly isn't the orthodox Bayesian phrasing. I call myself a Bayesian because I recognize that Bayes's Rule is the right thing to use in these situations. Whether or not the probabilities assigned to hypotheses "actually are" probabilities (whatever that means), they should obey the same mathematical rules of calculation as probabilities.

But precisely because only the manipulation rules matter, I'm not sure it is worth emphasizing that "to be a good Bayesian" you must accord these probabilities the same status as other probabilities. A hardcore Frequentist is not going to be comfortable doing that. Heck, I'm not sure I'm comfortable doing that. Data and event probabilities are things that can eventually be "resolved" to true or false, by looking after the fact. Probability as plausibility makes sense for these things.

But for hypotheses and models, I ask myself "plausibility of what? Being true?" Almost certainly, the "real" model (when that even makes sense) isn't in our space of models. For example, a common, almost necessary, assumption is exchangeability: that any given permutation of the data is equally likely -- effectively that all data points are drawn from the same distribution. Data often doesn't behave like that, instead having a time drift. Coins being tossed develop wear, cards being shuffled and dealt get bent.

I really do prefer to think of some models being more or less useful. Of course, following this path shades into decision theory: we might want to assign priors according to how "tractable" the models are, in both specification and computation. Stupid models that just specify what the data will be take lots of specification, so should have lower initial probabilities. Models that take longer to compute data probabilities should similarly have a probability penalty, not simply because they're implausible, but because we don't want to use them unless the data force us to.

Comment author: Douglas_Knight 27 February 2010 11:25:03PM 5 points

...shades into decision theory...Models that take longer to compute data probabilities should similarly have a probability penalty, not simply because they're implausible, but because we don't want to use them unless the data force us to.

Whoa! That sounds dangerous! Why not keep the beliefs and costs separate and only apply this penalty at the decision theory stage?

Comment author: wnoise 27 February 2010 11:32:36PM * 1 point

Well, I said shaded into the lines of decision theory...

Yes, it absolutely is dangerous, and thinking about it more I agree it should not be done this way. Probability penalties do not scale correctly with the data collected: they're essentially just a fixed offset. Modified utility of using a particular method really is different. If a method is unusable, we shouldn't use it, and methods that trade off accuracy for manageability should be decided at that level, once we can judge the accuracy -- not earlier.
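The "fixed offset" point can be sketched in log-odds form: log posterior odds = log prior odds + log likelihood ratio. A prior penalty is a constant term, while the likelihood term grows with the amount of data. The 100:1 penalty, the two coin models, and the 80% heads rate are assumptions chosen for illustration.

```python
from math import log

def log_likelihood_ratio(heads, tails, p_a, p_b):
    """Log of P(D|A) / P(D|B) for independent coin tosses."""
    return heads * log(p_a / p_b) + tails * log((1 - p_a) / (1 - p_b))

# Hypothetical penalty: model A is disfavored 100:1 before seeing data.
log_prior_odds = log(1 / 100)

results = {}
for n in (10, 100, 1000):
    heads = n * 8 // 10  # assume 80% of tosses come up heads
    llr = log_likelihood_ratio(heads, n - heads, p_a=0.8, p_b=0.5)
    results[n] = log_prior_odds + llr  # log posterior odds of A over B
    print(n, round(llr, 2), round(results[n], 2))
```

The penalty matters at small n and is swamped at large n, which is not how a cost of using the model ought to behave.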

EDIT: I suppose I was hoping for a valid way of justifying the fact that we throw out models that are too hard to use or analyze -- they never make it into our set of hypotheses in the first place. It's amazing how often conjugate priors "just happen" to be chosen...

Comment author: wedrifid 27 February 2010 11:12:08PM 2 points

Models that take longer to compute data probabilities should similarly have a probability penalty, not simply because they're implausible, but because we don't want to use them unless the data force us to.

I am much more comfortable leaving probability as it is but using a different term for usefulness.

Comment author: Cyan 27 February 2010 08:20:08PM 2 points

But for hypotheses and models, I ask myself "plausibility of what? Being true?"

Plausibility of being true given the prior information. Just as Aristotelian logic gives valid arguments (but not necessarily sound ones), Bayes's theorem gives valid but not necessarily sound plausibility assessments.

following this path shades into decision theory

That's pretty much why I wanted to make the distinction between plausibility and usefulness. One of the things I like about the Cox-Jaynes approach is that it cleanly splits inference and decision-making apart.

Comment author: wnoise 27 February 2010 09:12:06PM * 1 point

Plausibility of being true given the prior information.

Okay, sure we can go back to the Bayesian mantra of "all probabilities are conditional probabilities". But our prior information effectively includes the statement that one of our models is the "true one". And that's never the actual case, so our arguments are never sound in this sense, because we are forced to work from prior information that isn't true. This isn't a huge problem, but it in some sense undermines the motivation for finding these probabilities and treating them seriously -- they're conditional probabilities being applied in a case where we know that what is being conditioned on is false. What is the grounding to our actual situation? I like to take the stance that in practice this is still useful -- as an approximation procedure -- sorting through models that are approximately right.

Comment author: Cyan 27 February 2010 10:53:11PM * 2 points

And that's never the actual case, so our arguments are never sound in this sense, because we are forced to work from prior information that isn't true.

One does generally resort to non-Bayesian model checking methods. Andrew Gelman likes to include such checks under the rubric of "Bayesian data analysis"; he calls the computing of posterior probabilities and densities "Bayesian inference", a preceding subcomponent of Bayesian data analysis. This makes for sensible statistical practice, but the underpinnings aren't strong. One might consider it an attempt to approximate the Solomonoff prior.

Comment author: wnoise 28 February 2010 07:31:41AM 0 points

Yes, in practice people resort to less motivated methods that work well.

I'd really like to see some principled answer that has the same feel as Bayesianism though. As it stands, I have no problem using Bayesian methods for parameter estimation. This is natural because we really are getting pdf(parameters | data, model). But for model selection and evaluation (i.e. non-parametric Bayes) I always feel that I need an "escape hatch" to include new models that the Bayes formalism simply doesn't have any place for.

Comment author: Cyan 28 February 2010 02:56:58PM 0 points

I feel the same way.