# othercriteria comments on Teaching Bayesianism - Less Wrong

3 08 June 2012 08:18PM

Comment author: 08 June 2012 09:30:01PM *  8 points [-]

It's very straightforward in the frequentist interpretation: half the people pick the normal die, and one in 8 of those rolls a 3, so 1/16 of the original people roll a 3 off the normal die, while 1/2 roll a 3 off the trick die, for a total of 9/16 rolling a 3. Of the 9/16 who roll a 3, 1/16 have the normal die, so the probability is (1/16)/(9/16) = 1/9. Trivial stuff that people should be able to reinvent if they skip or forget it. 4th or 5th grade math at most. Too bad there's no good training at an early enough age.
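The counting argument above can be checked with exact fractions. This is only a minimal sketch: the setup (a 50/50 choice between the two dice, a normal die that shows 3 one time in 8, a trick die that always shows 3) is inferred from the numbers in the comment, and the variable names are illustrative.

```python
from fractions import Fraction

# Assumed setup, inferred from the numbers in the comment:
# each person picks the normal or the trick die with equal probability.
p_normal = Fraction(1, 2)
p_trick = Fraction(1, 2)
p3_given_normal = Fraction(1, 8)  # normal die shows 3 one time in 8
p3_given_trick = Fraction(1, 1)   # trick die always shows 3

# Fraction of all people who roll a 3 with each die.
joint_normal = p_normal * p3_given_normal
joint_trick = p_trick * p3_given_trick

# Everyone who rolls a 3, and the share of them holding the normal die.
p3 = joint_normal + joint_trick
p_normal_given_3 = joint_normal / p3

print(p3, p_normal_given_3)
```

Counting people this way and applying Bayes' rule are the same computation; the code just makes the bookkeeping explicit.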

Train people to think straight and the Bayes will pop up; train people to do Bayes and they'll think wrong with Bayes.

edit: actually, what's up with this local trope of "Bayesianism" as opposed to "Frequentism"? The math abstracts out the philosophical detail of whether probability is a degree of belief or the product of convergence of long-term trials.

Comment author: 09 June 2012 11:20:09PM 1 point [-]

> actually, what's up with this local trope of "Bayesianism" as opposed to "Frequentism"? The math abstracts out the philosophical detail of whether probability is a degree of belief or the product of convergence of long-term trials.

Where do you get the idea that it's a local trope? Knowledgeable and well-respected people in the field consider these foundational issues important, e.g., Brad Efron and Andrew Gelman.

You can make an argument that the philosophical details wash out as long as you're operating on a fully specified probability space. In that sense, probability is just a sort of syntactic manipulation. But once you start thinking about statistics, where the events and probabilities have some semantic/denotative connection with the real world, you need to care about where the probability space you're working with comes from.

Comment author: 09 June 2012 11:34:31PM *  2 points [-]

Okay, here's a problem for you. Not a neat probability problem. A rectangular die has sides of length 1 cm, 1.1 cm, and 1.2 cm; it is made of 316 stainless steel; the edges and corners are rounded to a radius of 1 mm; it is dropped onto a 10 cm thick steel plate made of the same type of steel, and bounces several times. What would you do to find the probabilities of landing on each side?

Clearly there is no disagreement that (1) agents may represent their uncertainty with probabilities, and (2) a physical system such as a die works like a hash function of the initial state, such that for a perfect die very nearly exactly 1/6 of the initial state space gets transformed into each number, and after several bounces the points in the state space that are transformed to different numbers are separated by less than attoradians of initial angle and attoradians per second of initial angular velocity. The effect of small deviations from a symmetrical shape could be estimated from physical considerations. The outcomes of any game can be found by starting from physics and counting over the states that are consistent with the observations that took place; that is likewise not controversial.

Nobody respected disagrees that there exists such a property of physical systems that incorporate chaos (that act as a hash function, essentially); nobody respected disagrees that you can also have degrees of belief that shouldn't be Dutch-bookable; and a bunch of sloppy philosophers who don't really understand either are very confused, going on about Bayesianism this, Bayesianism that, "Good Bayesian", "spoke fluent Bayesian", I kid you not. The latter sort of stuff seems to be a local-ish trope.

edit: to summarize, we probably just need two different words, one for the property of chaotic physical systems (or hash functions or the like), and the other for degrees of belief, which only have to obey certain properties among themselves to avoid a Dutch book or the like. The argument over which should be called 'probability' is pretty silly. Anything with dice in it falls straight into the chaotic physical systems category.

Comment author: 10 June 2012 12:53:14AM *  3 points [-]

Ignoring, temporarily, everything but the first paragraph, there are two ways I might proceed.

Acting as a frequentist, I would suppose that die rolls could be modeled as independent identically distributed draws from a multinomial distribution with fixed but unknown parameters. (The independence assumption, and to a lesser degree the identically distributed assumption, could also be verified, although this gets a bit tricky.) I would roll the die some fixed number of times (possibly determined according to an a priori calculation of statistical power) and take the MLE as a point estimate of the unknown parameters. I would report this estimate as the probability of the die landing on the various sides. I might also report a 95% confidence region for the estimate, which is not to be interpreted as containing the true probabilities with 95% probability (any particular region either does or does not contain them, with certainty; the 95% describes how often the procedure covers the truth over repeated experiments).
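A rough sketch of the frequentist recipe, under stated assumptions: the side counts are hypothetical, and the per-side normal-approximation (Wald) intervals are a simple stand-in for the joint confidence region mentioned above, not the same object.

```python
import math

def multinomial_mle(counts):
    """MLE of the side probabilities: the observed relative frequencies."""
    n = sum(counts)
    return [c / n for c in counts]

def wald_interval(count, n, z=1.96):
    """Approximate 95% interval for one side's probability (normal approximation)."""
    p_hat = count / n
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Hypothetical data: side counts from 120 rolls (not taken from the thread).
counts = [18, 22, 15, 25, 20, 20]
n = sum(counts)

mle = multinomial_mle(counts)
intervals = [wald_interval(c, n) for c in counts]

for side, (p_hat, (low, high)) in enumerate(zip(mle, intervals), start=1):
    print(f"side {side}: {p_hat:.3f}  [{low:.3f}, {high:.3f}]")
```

The number of rolls is fixed before looking at the data, which is what the coverage statement for the intervals relies on.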

Acting as a Bayesian, I would assume the same data model, but I would also place a prior distribution on the unknown parameter. A natural prior in this case is the Dirichlet distribution, which is conjugate to the multinomial distribution. I would also use the same data collection approach, although the Bayesian formulation makes it easy to work with the special case of observing a single roll. Given the model likelihood and the prior distribution, Bayes' law tells me the new posterior distribution to which I should update to represent my uncertainty over the unknown parameter. I would continue to roll the die and update until the posterior distribution is sufficiently concentrated according to some reasonable stopping criterion. I would then report the posterior mean (or maybe the MAP estimate) as the probability of the die landing on the various sides. I would also report a 95% credible region for the estimate, which I would give a 95% credence to containing the truth (although under questioning, I would probably be evasive/unclear about exactly what that means). I would also need to communicate a justification for my prior distribution and ideally evidence that the inference is not overly sensitive to it. I ought to just report the posterior distribution itself, but people tend to find it easier to base decisions on point estimates.
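The single-roll update described above is short to write down because of conjugacy: observing a side adds one to the corresponding Dirichlet parameter. The sketch below assumes a uniform Dirichlet(1, ..., 1) prior and a hypothetical sequence of rolls; both are illustrative choices, not something specified in the comment.

```python
def update(alpha, side):
    """Dirichlet parameters after observing one roll of `side` (0-based index)."""
    updated = list(alpha)
    updated[side] += 1.0
    return updated

def posterior_mean(alpha):
    """Mean of a Dirichlet(alpha) distribution."""
    total = sum(alpha)
    return [a / total for a in alpha]

# Assumption: a uniform Dirichlet(1, ..., 1) prior over the six side probabilities.
alpha = [1.0] * 6

# Hypothetical rolls (0-based side indices), processed one at a time.
rolls = [0, 2, 2, 5, 1, 2]
for side in rolls:
    alpha = update(alpha, side)

print(alpha)                  # posterior Dirichlet parameters
print(posterior_mean(alpha))  # point estimate of the side probabilities
```

Updating one roll at a time gives the same posterior as adding all the counts at once, so a stopping rule can be checked after every roll.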

There are obvious similarities between these two inferential approaches, but they are answering slightly different questions using vastly different methods.

Comment author: 10 June 2012 08:16:17AM *  0 points [-]

Suppose you are denied experimentation and denied an extremely powerful computer (e.g., you can only do <100 simulated trials but want reasonable accuracy), or need high accuracy in limited time. I was more interested in what you would do when you have to solve something like this analytically, finding the probabilities for the 3 distinct sides.

The point here is that you want to go for physically justified stuff, and anything not physically justified that you are doing anywhere is the same in principle as wilfully putting cognitive bias into your calculations, and is just plain wrong; no philosophical stuff here, you'll end up losing games vs. someone who solves it better. Maybe you guys need an "Overcoming Bayes" blog.

Comment author: 10 June 2012 03:23:23PM 1 point [-]

> The point here is that you want to go for physically justified stuff, and anything not physically justified that you are doing anywhere is the same in principle as wilfully putting cognitive bias into your calculations, and is just plain wrong; no philosophical stuff here, you'll end up losing games vs. someone who solves it better.

Statisticians, by and large, don't lose sleep over this problem. Even in your not-quite-fair die problem, the calculations involved are really hard. It wasn't made explicit in my comment, but I wasn't even assuming that opposite sides have equal probability, because some subtle error in the setup could break the symmetry. In the Bayesian case, I considered mentioning a mixture model that would take advantage of the symmetry if the data supported it. In KDD Cup types of problems, nobody is worried that a domain expert will show up with a winning solution that doesn't even need to see the training data (why would it if it were maximally physically justified?).

> putting cognitive bias into your calculations, and is just plain wrong; no philosophical stuff here, you'll end up losing games vs. someone who solves it better. Maybe you guys need an "Overcoming Bayes" blog.

Bayesians have made peace with bias. In fact, decision rules that are both Bayes and unbiased have zero risk, which is a nice way of saying that they don't exist in non-trivial situations. Noorbaloochi and Meeden (1983) have to go through definitional contortions to establish a positive connection between being Bayes and unbiased.

Bias is what lets you get good inferential performance in small-sample regimes. If I observe side counts (2, 0, 1, 3, 2, 2), I'd be okay with my estimator inferring equal side probabilities, because that will be closer to the truth than the unbiased estimator which guesses (0.2, 0.0, 0.1, 0.3, 0.2, 0.2); ten rolls is not enough data to tell me that I should never see a "2". On the other hand, with side counts (200, 0, 100, 300, 200, 200), something closer to the unbiased estimator seems like a good idea. As long as the estimator is asymptotically unbiased, you can even still have consistency.
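To make the small-sample versus large-sample contrast concrete, here is a minimal sketch comparing the unbiased estimator (relative frequencies) with one deliberately biased estimator. The additive pseudo-count of 5 per side is an assumption chosen for illustration; it is one of many ways to shrink toward equal probabilities.

```python
def mle(counts):
    """Unbiased estimator: observed relative frequencies."""
    n = sum(counts)
    return [c / n for c in counts]

def smoothed(counts, pseudo=5.0):
    """Biased estimator: add `pseudo` imaginary rolls to every side (assumed value)."""
    n = sum(counts)
    k = len(counts)
    return [(c + pseudo) / (n + k * pseudo) for c in counts]

def max_gap_from_uniform(probs):
    """Largest distance of any estimated probability from 1/k."""
    uniform = 1.0 / len(probs)
    return max(abs(p - uniform) for p in probs)

small = [2, 0, 1, 3, 2, 2]              # ten rolls
large = [200, 0, 100, 300, 200, 200]    # one thousand rolls

for name, counts in (("small", small), ("large", large)):
    print(name, "unbiased:", [round(p, 3) for p in mle(counts)])
    print(name, "smoothed:", [round(p, 3) for p in smoothed(counts)])
```

With ten rolls the smoothed estimate stays close to equal side probabilities and never assigns probability zero; with a thousand rolls the pseudo-counts are swamped and the two estimators nearly agree, which is the asymptotically unbiased behavior described above.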

Unlike cognitive bias, we have control over our statistical bias and we should not be squeamish about using it to learn about the parts of the world that are hard to model with complete accuracy to the extent that we wouldn't need statistics anyways.

Comment author: 10 June 2012 08:14:08PM *  -2 points [-]

The point of the not-quite-fair die example was to demonstrate where 'probabilities' come from. The fair die, after several bounces, maps the initial state space into the final side-up states in a particular way, so that 1/6 of even a very tiny part (hypervolume) of the initial state space maps to each side-up final state. The not totally fair die deviates somewhat from that. Any problem involving dice can be solved from first principles, all the way from this, through selection of the parts of the initial state space that are compatible with observation, to the answer.

With regard to statisticians not losing sleep over that, there are a zillion examples in practice where you have to deal with e.g. electric current, or temperature, or illumination, or any other fundamentally statistical property, and you have limited computational power. A lot of my work is doing this for illumination; I have to compute illumination at a huge number of points on the screen (and no, you can't brute-force it even if you had 1000x the computing power, not to mention that when there's 1000x the power you'll have tighter constraints on error and time). I don't really care if some people don't find anything wrong with doing a wrong thing "because we won't be beaten in practice", when I am earning some of my money by beating those folks in practice. So it's better for me that some folks just don't understand that you shouldn't get to choose some arbitrary numbers. Yes, in various really fuzzy problems, you can do whatever you subjectively please. But to see this as fundamental is quite seriously silly.

There are many methods for finding the resulting distribution; one particular method involves sampling the initial state more regularly than at random (e.g. a grid with jittering), so that the error shrinks much faster than 1/sqrt(N); it can in principle be used for die simulation, and is used in practice in similar problems that are less messy (molecular dynamics comes to mind). I generally find that nowadays a lot of very important insights are within the more applied fields; the knowledge has not yet propagated into this meta-ish land of arguing mostly over terminology and not having to be maximally correct against the gold standard of reality.
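The jittered-grid idea can be illustrated on a toy problem. The sketch below is not a die simulation: it assumes a smooth one-dimensional stand-in integrand, f(x) = x^2 on [0, 1], and compares plain random sampling with one jittered sample per grid cell. For an integrand like this the jittered error is expected to fall roughly like N^(-3/2) rather than N^(-1/2); how much of that advantage survives in a chaotic, high-dimensional system is a separate question.

```python
import math
import random

def f(x):
    """Stand-in integrand on [0, 1]; its exact integral is 1/3."""
    return x * x

def random_estimate(n, rng):
    """Plain Monte Carlo: n independent uniform samples."""
    return sum(f(rng.random()) for _ in range(n)) / n

def jittered_estimate(n, rng):
    """Jittered grid: one uniform sample inside each of n equal cells."""
    return sum(f((i + rng.random()) / n) for i in range(n)) / n

def rmse(estimator, n, trials, rng, truth=1.0 / 3.0):
    """Root-mean-square error of an estimator over repeated trials."""
    squared = [(estimator(n, rng) - truth) ** 2 for _ in range(trials)]
    return math.sqrt(sum(squared) / trials)

rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 100, 1000):
    plain = rmse(random_estimate, n, 200, rng)
    jittered = rmse(jittered_estimate, n, 200, rng)
    print(f"N={n:5d}  random RMSE={plain:.2e}  jittered RMSE={jittered:.2e}")
```

Both estimators use the same number of function evaluations; the only difference is where the sample points are placed.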

Comment author: 10 June 2012 10:58:54PM 2 points [-]

> Any problem involving dice can be solved from first principles, all the way from this, through selection of the parts of the initial state space that are compatible with observation, to the answer.

You're sketching out a methodology for solving forward problems (given model, determine observations), which is fine but it's not what motivates statisticians. Statisticians are generally concerned with the backward/inverse problem (given observations, determine model).

In reality, we're not presented with complete and accurate technical specifications for the die/table/thrower system we encounter. All we get to see is the sequence of sides that landed on top. If we're playing a game that uses the die, it's of interest to know how this sequence will continue into the future.

One general approach to figuring this out might involve inferring technical specifications. Maybe if we're really clever, we can figure out what grade of steel the die is made of just from the observed side counts. Less ambitiously, we might try to recover the relative side lengths and rounding radius. With all this information, we can then simulate forward to estimate the sequence of future throws. The number of parameters involved here may number in the tens or hundreds, or into the millions if we want to capture all the physiological details of a human thrower. It's also not quite clear whether a system like this would even converge to any stationary long-term behavior from which limiting relative frequencies could be calculated.

Another approach is to ignore all the detail, assume independent identically distributed tosses, and just try to learn the five parameters (P(side 1), ..., P(side 5); P(side 6) = 1 - P(side 1) - ... - P(side 5)). Forward simulation in this case is just repeated sampling from the learned distribution.
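A minimal sketch of the iid approach, assuming hypothetical observed counts and using relative frequencies as the learned parameters (any other estimator could be substituted). The function names are illustrative.

```python
import random

def fit_iid_model(counts):
    """Learn P(side 1..5) from counts; P(side 6) is whatever probability is left."""
    n = sum(counts)
    free = [c / n for c in counts[:-1]]   # the five free parameters
    last = max(0.0, 1.0 - sum(free))      # guard against tiny negative rounding error
    return free + [last]

def simulate(probs, n_rolls, rng):
    """Forward simulation: repeated independent draws from the learned distribution."""
    sides = list(range(1, len(probs) + 1))
    return rng.choices(sides, weights=probs, k=n_rolls)

# Hypothetical observed side counts (not taken from the thread).
observed = [18, 22, 15, 25, 20, 20]
probs = fit_iid_model(observed)

rng = random.Random(0)  # fixed seed so the run is repeatable
future = simulate(probs, 100, rng)
print(probs)
print(future[:20])
```

Nothing about steel grade, side lengths, or the thrower enters this model; everything the data can say about future rolls is carried by the five numbers.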

Moreover, let's suppose that (effective) independence emerges from the technical specification model. Then we have a huge identifiability problem; all those hundreds of parameters are just providing a redundant parameterization of the iid model. We can't hope to learn all of the parameters from the data we get to observe.

I guess as long as you want to stick to forward problems, you can invoke Occam and deny that probability even exists. But don't assume that your understanding carries over to inverse problems. Probability is a useful technical tool there, and applying it to real problems requires translation/operationalization. Two different frameworks for this are frequentism and Bayesianism.

> I don't really care if some people don't find anything wrong with doing a wrong thing "because we won't be beaten in practice", when I am earning some of my money by beating those folks in practice.

If you want to put your money where your mouth is, I have a proposal. Take a die of your choosing, or manufacture one according to your own specifications; it doesn't have to be remotely fair. Also supply a plate onto which it can be tossed if you desire. Do whatever measurements you want on them. Then convey them to a mutually-accepted third party. The third party rolls the die 200 times, according to instructions you publicly post, and then publicly posts the first half of the sequence of rolls and a hash of the second half of the sequence. We both predict the side counts in the second half of the sequence and post the predictions publicly. The third party reveals the second half of the sequence (which can be checked against the hash) and whoever was closer to the true side counts (in squared error distance) wins. The loser pays the winner some mutually-accepted amount, plus or minus half the die/plate shipping expenses as appropriate to split that cost.
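The commit-and-reveal step of this proposal, plus the scoring rule, could look roughly like the sketch below. Several details are assumptions added for illustration: SHA-256 as the hash, a random salt mixed into the commitment, simulated rolls standing in for the third party's physical rolls, and the two example predictions.

```python
import hashlib
import random
import secrets

def commit(rolls, salt):
    """SHA-256 commitment to a sequence of rolls (each an int from 1 to 6)."""
    return hashlib.sha256(salt + bytes(rolls)).hexdigest()

def verify(rolls, salt, digest):
    """Check a revealed sequence and salt against the published digest."""
    return commit(rolls, salt) == digest

def side_counts(rolls):
    """Number of times each side 1..6 appears."""
    return [rolls.count(side) for side in range(1, 7)]

def squared_error(predicted, actual):
    """Squared-error distance between predicted and actual side counts."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))

# Stand-in for the third party's 200 physical rolls (simulated here).
rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(200)]
first_half, second_half = rolls[:100], rolls[100:]

# The third party publishes first_half and the digest, keeping the salt private.
salt = secrets.token_bytes(16)
digest = commit(second_half, salt)

# Hypothetical predictions of the second-half side counts.
prediction_a = [100 / 6] * 6             # "assume a fair die"
prediction_b = side_counts(first_half)   # "second half looks like the first half"

# Reveal: anyone can check the sequence against the digest, then score.
actual = side_counts(second_half)
print(verify(second_half, salt, digest))
print(squared_error(prediction_a, actual), squared_error(prediction_b, actual))
```

The salt is revealed together with the second half; without it, the digest alone gives the predictors nothing to work with.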

Comment author: 11 June 2012 06:13:42AM *  0 points [-]

I am an applied mathematician who actually does work on finding the values of probabilistic quantities in better computing time than straightforward numerical experimentation. Probability is not just statistics.

Insofar as what you think Bayesians do deviates from what I know has to be done, you have a wrong idea of what Bayesians do (or, giving you the benefit of the doubt at the expense of others, are referring to some "Bayesians" who are plain wrong), or something like that, but the discussion is too fuzzy for me to tell which. (Ditto for frequentists.)

The point of frequentism is seeing probability as frequency in an infinite number of trials. The point of my die example is to demonstrate that physically the probability plainly comes in as a frequency, via a function from initial phase space to final phase space that maps, for a fair die, 1/6 of the initial phase space to each final side-up state, this being an objective property of the system that has to be adequately captured by whatever methods you are using. And I do not give the slightest damn if you don't know that in practice (not for dice, but for many other systems) you have to find probabilities bottom-up from e.g. the laws of physics. If you are given a steel die to physically experiment with, there again are much better (faster) ways to find out the probabilities than just tossing it (do you even understand that your errors converge as 1/sqrt(N), or how important an issue that is in practice?!). Of course I won't bother making an example for you with an actual die; the point is the principle, and I've done such solutions before with things that unfortunately don't make great examples.

edit: also, on science, the reason we do 'probability of data given model' is that science follows a strategy of committing to only rarely (with a certain probability) throwing out a valid model. 'Probability of model given the data' is not well defined, unless you count stuff like 'Solomonoff induction as a prior', where it is defined but not computable (and is mathematically homologous to assigning probability of 1 to the 'we live inside a Turing machine' model). The experimental physicists publish probability of data given model; people can then combine that with their priors if they want.

Comment author: 13 June 2012 08:51:52PM 0 points [-]

> If you are given a steel die to physically experiment with, there again are much better (faster) ways to find out the probabilities than just tossing it (do you even understand that your errors converge as 1/sqrt(N), or how important an issue that is in practice?!).

The world often isn't nice enough to give us the steel die. Figuratively, the steel die may be inside someone's skull, thousands of years in the past, millions of light-years away, or you may have five slightly different dice and really want to learn about the properties of all dice.

I do understand the O(N^(-1/2)) convergence of errors. I spend a lot of time working on problems where even consistency isn't guaranteed (i.e., nonparametric problems where the "number of parameters" grows in some sense with the amount of data) and finding estimators with such convergence properties would be great there.

> 'Probability of model given the data' is not well defined, unless you count stuff like 'Solomonoff induction as a prior', where it is defined but not computable (and is mathematically homologous to assigning probability of 1 to the 'we live inside a Turing machine' model).

It's perfectly well-defined. It's just subjective in a way that makes you (and a great number of informed, capable, and thoughtful statisticians) apparently very uneasy. There's some theory that gives pretty general conditions under which Bayesian procedures converge to the true answer, regardless of the choice of prior, given enough data. You probably wouldn't be happy with rates of convergence for these methods, because they tend to be slower and harder to obtain than for, e.g., maximum likelihood estimation of iid normally-distributed data.

> The experimental physicists publish probability of data given model; people can then combine that with their priors if they want.

They might well do this. For a frequentist, this is a natural step in establishing confidence intervals and such, after estimating the quantity of interest by choosing the model that maximizes the probability of the data. This choice may not look like "Standard Model versus something else" but it probably looks like "semi-empirical model of the system with parameter 1 = X" where X can range over some reasonable interval.

> unless you count stuff like 'Solomonoff induction as a prior'

I don't see what role Solomonoff induction plays in a discussion of frequentism versus Bayesianism. I never mentioned it, I don't know enough about it to use it, and I agree with you that it shows up on LW more as a mantra than as an actual tool.