# private_messaging comments on Teaching Bayesianism - Less Wrong Discussion

3 08 June 2012 08:18PM

Comment author: 10 June 2012 08:14:08PM *  -2 points [-]

The point of the not-quite-fair die example was to demonstrate where 'probabilities' come from. A fair die, after several bounces, maps the initial state space onto the final side-up states in a particular way: 1/6 of even a very tiny part (hypervolume) of the initial state space maps to each side-up final state. A die that is not totally fair is somewhat biased away from that. Any problem involving a die can be solved from first principles, all the way from this through selection of the parts of the initial state that are compatible with observation, to the answer.
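A toy illustration of that mapping, not a physical simulation: the `toy_die` function and its `stretch` constant are invented for this sketch. It is a deterministic map that strongly stretches the initial state and reads a face off the fractional part, so that even a very narrow patch of initial states sends about 1/6 of its volume to each face.

```python
import random


def toy_die(x, stretch=1_000_003.0):
    """Deterministic toy 'die' (an assumption, not real dynamics):
    stretch the initial state, keep the fractional part, read off a face 1..6."""
    return int(6 * ((x * stretch) % 1.0)) + 1


def face_fractions(lo, hi, n=60_000, seed=0):
    """Fraction of the initial-state patch [lo, hi] that lands on each face."""
    rng = random.Random(seed)
    counts = [0] * 6
    for _ in range(n):
        counts[toy_die(rng.uniform(lo, hi)) - 1] += 1
    return [c / n for c in counts]


if __name__ == "__main__":
    # A patch of width 1e-4 still spreads almost evenly over all six faces.
    print(face_fractions(0.3, 0.3001))
```

A biased die would correspond to unequal face boundaries in the final step; the rest of the argument is unchanged.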

With regard to statisticians not losing sleep over that: there are a zillion examples in practice where you have to deal with, e.g., electric current, or temperature, or illumination, or any other fundamentally statistical property, and you have limited computational power. A lot of my work involves doing this for illumination; I have to compute illumination at a huge number of points on the screen (and no, you can't brute-force it even if you had 1000x the computing power, not to mention that when there's 1000x the power you'll have tighter constraints on error and time). I don't really care if some people don't find anything wrong with doing a wrong thing "because we won't be beaten in practice", when I am earning some of my money by beating those folks in practice. So it is better for me that some folks just don't understand that you shouldn't get to choose some arbitrary numbers. Yes, in various really fuzzy problems, you can do whatever you subjectively please. But to see this as fundamental is quite seriously silly.

There are many methods for finding the resulting distribution. One particular method samples the initial state more regularly than at random (e.g. a grid with jittering), so that the error improves much faster than 1/sqrt(N); it can in principle be used for die simulation, and is used in practice in similar problems that are less messy (molecular dynamics comes to mind). I generally find that nowadays a lot of very important insights are within the more applied fields; the knowledge has not yet propagated into this meta-ish land of arguing mostly over terminology and not having to be maximally correct against the gold standard of reality.
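A minimal sketch of the jittered-grid idea, under simplifying assumptions: the integrand `f` is a smooth one-dimensional stand-in (not a die or a lighting problem), and the function names are made up for this example. It compares the RMS error of plain random sampling with one jittered sample per grid cell, using the same number of samples.

```python
import math
import random


def f(x):
    """Stand-in integrand; its exact integral over [0, 1] is 2/pi."""
    return math.sin(math.pi * x)


def plain_mc(n, rng):
    """Average of f at n independent uniform points."""
    return sum(f(rng.random()) for _ in range(n)) / n


def jittered(n, rng):
    """Average of f at one random point inside each of n equal grid cells."""
    return sum(f((i + rng.random()) / n) for i in range(n)) / n


def rms_error(estimator, n, trials=200, seed=1):
    """Root-mean-square error of an estimator over repeated runs."""
    rng = random.Random(seed)
    exact = 2 / math.pi
    sq = sum((estimator(n, rng) - exact) ** 2 for _ in range(trials))
    return math.sqrt(sq / trials)


if __name__ == "__main__":
    for n in (100, 400, 1600):
        print(n, rms_error(plain_mc, n), rms_error(jittered, n))
```

For a smooth one-dimensional integrand like this one, the jittered error shrinks roughly as N^(-3/2) rather than N^(-1/2); the gain is smaller in higher dimensions and for less smooth integrands.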

Comment author: 10 June 2012 10:58:54PM 2 points [-]

Any problem involving a die can be solved from first principles, all the way from this through selection of the parts of the initial state that are compatible with observation, to the answer.

You're sketching out a methodology for solving forward problems (given model, determine observations), which is fine but it's not what motivates statisticians. Statisticians are generally concerned with the backward/inverse problem (given observations, determine model).

In reality, we're not presented with complete and accurate technical specifications for the die/table/thrower system we encounter. All we get to see is the sequence of sides that landed on top. If we're playing a game that uses the die, it's of interest to know how this sequence will continue into the future.

One general approach to figuring this out might involve inferring technical specifications. Maybe if we're really clever, we can figure out what grade of steel the die is made of just from the observed side counts. Less ambitiously, we might try to recover the relative side lengths and rounding radius. With all this information, we can then simulate forward to estimate the sequence of future throws. The parameters involved here may number in the tens or hundreds, or in the millions if we want to capture all the physiological details of a human thrower. It's also not quite clear whether a system like this would even converge to any stationary long-term behavior from which limiting relative frequencies could be calculated.

Another approach is to ignore all the detail, assume independent identically distributed tosses, and just try to learn the five parameters (P(side 1), ..., P(side 5); P(side 6) = 1 - P(side 1) - ... - P(side 5)). Forward simulation in this case is just repeated sampling from the learned distribution.
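The iid approach can be sketched in a few lines. This is only an illustration: `fit_iid` and `simulate` are hypothetical names, and the optional `pseudo` count (a small smoothing term so unseen sides do not get probability zero) is an assumption added here, not part of the description above.

```python
import random
from collections import Counter


def fit_iid(rolls, pseudo=0.0):
    """Estimate P(side 1..6) from observed rolls by relative frequency.
    `pseudo` adds the same pseudo-count to every side (assumed smoothing)."""
    counts = Counter(rolls)
    total = len(rolls) + 6 * pseudo
    return [(counts[s] + pseudo) / total for s in range(1, 7)]


def simulate(probs, n, rng):
    """Forward simulation: n independent draws from the learned distribution."""
    return rng.choices(range(1, 7), weights=probs, k=n)


if __name__ == "__main__":
    observed = [1, 1, 2, 3, 4, 5, 6, 6]
    probs = fit_iid(observed)
    print(probs)
    print(simulate(probs, 10, random.Random(0)))
```

Only five of the six estimated probabilities are free, since they must sum to 1.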

Moreover, let's suppose that (effective) independence emerges from the technical specification model. Then we have a huge identifiability problem; all those hundreds of parameters are just providing a redundant parameterization of the iid model. We can't hope to learn all of the parameters from the data we get to observe.

I guess as long as you want to stick to forward problems, you can invoke Occam and deny that probability even exists. But don't assume that your understanding carries over to inverse problems. Probability is a useful technical tool there, and applying it to real problems requires translation/operationalization. Two different frameworks for this are frequentism and Bayesianism.

I don't really care if some people don't find anything wrong with doing a wrong thing "because we won't be beaten in practice", when I am earning some of my money by beating those folks in practice.

If you want to put your money where your mouth is, I have a proposal. Take a die of your choosing, or manufacture one according to your own specifications; it doesn't have to be remotely fair. Also supply a plate onto which it can be tossed if you desire. Do whatever measurements you want on them. Then convey them to a mutually-accepted third party. The third party rolls the die 200 times, according to instructions you publicly post, and then publicly posts the first half of the sequence of rolls and a hash of the second half of the sequence. We both predict the side counts in the second half of the sequence and post the predictions publicly. The third party reveals the second half of the sequence (which can be checked against the hash) and whoever was closer to the true side counts (in squared error distance) wins. The loser pays the winner some mutually-accepted amount, plus or minus half the die/plate shipping expenses as appropriate to split that cost.
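The commit-and-reveal step and the scoring rule in this proposal could be sketched as follows. The choice of SHA-256, the text encoding of the rolls, and the `salt` argument are assumptions made for this example; the proposal only says "a hash".

```python
import hashlib


def commit(rolls, salt=""):
    """Hash the hidden half of the sequence so it can be checked later.
    SHA-256 and the salt are illustrative choices, not part of the proposal."""
    payload = salt + ":" + ",".join(str(r) for r in rolls)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify(rolls, digest, salt=""):
    """Check a revealed sequence against the previously published hash."""
    return commit(rolls, salt) == digest


def side_counts(rolls):
    """Number of times each side 1..6 came up."""
    return [rolls.count(s) for s in range(1, 7)]


def squared_error(predicted_counts, rolls):
    """Squared distance between predicted and actual side counts."""
    actual = side_counts(rolls)
    return sum((p - a) ** 2 for p, a in zip(predicted_counts, actual))


if __name__ == "__main__":
    hidden = [1, 2, 3, 4, 5, 6, 6, 6]
    digest = commit(hidden, salt="example")
    print(verify(hidden, digest, salt="example"))
    print(squared_error([2, 1, 1, 1, 1, 2], hidden))
```

Whoever has the smaller `squared_error` on the revealed half wins.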

Comment author: 11 June 2012 06:13:42AM *  0 points [-]

I am an applied mathematician who actually does work on finding the values of probabilistic quantities in better computing time than straightforward numerical experimentation. Probability is not just statistics.

Insofar as what you think Bayesians do deviates from what I know has to be done, you have a wrong idea of what Bayesians do (or, giving you the benefit of the doubt at the expense of others, you are referring to some "Bayesians" who are plain wrong), or something like that, but the discussion is too fuzzy for me to tell which. (Ditto for frequentists.)

The point of frequentism is seeing probability as frequency in an infinite number of trials. The point of my die example is to demonstrate that physically the probability plainly comes in as a frequency, via a function from initial phase space to final phase space that maps, for a fair die, 1/6 of initial phase space to each final side-up state, this being the objective property of the system that has to be adequately captured by whatever methods you are using. And I do not give the slightest damn if you don't know that in practice (not for dice, but for many other systems) you have to find probabilities bottom-up from, e.g., the laws of physics. If you are given a steel die to physically experiment with, there are again much better (faster) ways to find the probabilities than just tossing it (do you even understand that your errors converge as 1/sqrt(N), or how important an issue that is in practice?!). Of course I won't bother making an example for you with an actual die; the point is the principle, and I've done such solutions before with things that unfortunately don't make great examples.

edit: also, on science, the reason we do 'probability of data given model' is that science follows a strategy of committing to throw out a valid model only rarely (with a certain probability). 'Probability of model given the data' is not well defined, unless you count stuff like 'Solomonoff induction as a prior', where it is defined but not computable (and is mathematically homologous to assigning probability of 1 to the 'we live inside Turing machine' model). The experimental physicists publish probability of data given model; people can then combine that with their priors if they want.

Comment author: 13 June 2012 08:51:52PM 0 points [-]

If you are given a steel die to physically experiment with, there are again much better (faster) ways to find the probabilities than just tossing it (do you even understand that your errors converge as 1/sqrt(N), or how important an issue that is in practice?!).

The world often isn't nice enough to give us the steel die. Figuratively, the steel die may be inside someone's skull, thousands of years in the past, millions of light-years away, or you may have five slightly different dice and really want to learn about the properties of all dice.

I do understand the O(N^(-1/2)) convergence of errors. I spend a lot of time working on problems where even consistency isn't guaranteed (i.e., nonparametric problems where the "number of parameters" grows in some sense with the amount of data) and finding estimators with such convergence properties would be great there.

'Probability of model given the data' is not well defined, unless you count stuff like 'Solomonoff induction as a prior', where it is defined but not computable (and is mathematically homologous to assigning probability of 1 to the 'we live inside Turing machine' model).

It's perfectly well-defined. It's just subjective in a way that makes you (and a great number of informed, capable, and thoughtful statisticians) apparently very uneasy. There's some theory that gives pretty general conditions under which Bayesian procedures converge to the true answer, in spite of choice of prior, given enough data. You probably wouldn't be happy with rates of convergence for these methods, because they tend to be slower and harder to obtain than for, e.g., maximum likelihood estimation from iid normally-distributed data.
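A small sketch of the prior washing out, in the easiest possible setting rather than the general theory referred to above: an iid die with a conjugate Dirichlet prior, where the posterior mean for each side is (count + prior weight) / (total count + total prior weight). The true probabilities, the two priors, and the function names are all made up for illustration.

```python
import random


def posterior_mean(counts, prior):
    """Posterior mean of side probabilities under a Dirichlet prior."""
    total = sum(counts) + sum(prior)
    return [(c + a) / total for c, a in zip(counts, prior)]


def prior_gap(n, seed=0):
    """Largest disagreement between two priors after n simulated rolls."""
    rng = random.Random(seed)
    true_probs = [0.10, 0.15, 0.15, 0.20, 0.20, 0.20]  # assumed biased die
    rolls = rng.choices(range(6), weights=true_probs, k=n)
    counts = [rolls.count(s) for s in range(6)]
    flat = posterior_mean(counts, [1, 1, 1, 1, 1, 1])     # uniform prior
    skewed = posterior_mean(counts, [20, 1, 1, 1, 1, 1])  # strongly favors side 1
    return max(abs(a - b) for a, b in zip(flat, skewed))


if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        print(n, prior_gap(n))
```

The two posteriors start far apart and agree more and more closely as data accumulate; how fast this happens in harder, nonparametric problems is exactly the difficult part.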

The experimental physicists publish probability of data given model; people can then combine that with their priors if they want.

They might well do this. For a frequentist, this is a natural step in establishing confidence intervals and such, once the quantity of interest has been estimated by choosing the model that maximizes the probability of the data. This choice may not look like "Standard Model versus something else", but it probably looks like "semi-empirical model of the system with parameter 1 = X", where X can range over some reasonable interval.

unless you count stuff like 'Solomonoff induction as a prior'

I don't see what role Solomonoff induction plays in a discussion of frequentism versus Bayesianism. I never mentioned it, I don't know enough about it to use it, and I agree with you that it shows up on LW more as a mantra than as an actual tool.

Comment author: 14 June 2012 07:32:06AM *  0 points [-]

The world often isn't nice enough to give us the steel die.

The point is that the probability with a die comes in as a frequency (the fraction of initial phase space). Yes, sometimes nature doesn't give you the die; that does not invalidate the fact that probability exists as an objective property of a physical process, as per frequentism (related to how the process maps initial phase space to final phase space); methods employing subjectivity have to try to conform to this objective property as closely as possible (e.g. by trying to know more about how the system works). Bayesianism is not opposed to this, unless we are to speak of some terribly broken Bayesianism.

'Probability of model given the data' is not well defined,

It's perfectly well-defined.

Nope. Only the change to the probability of a model given the data is well defined. The probability itself isn't. You can pick an arbitrary starting point.
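One way to write the distinction being drawn here is Bayes' rule in odds form for two candidate models (the symbols M_1, M_2 and D are introduced only for this note). The data determine the likelihood ratio, i.e. the factor by which the odds change, while the prior odds are an input that the data alone do not fix:

```latex
\frac{P(M_1 \mid D)}{P(M_2 \mid D)}
  = \frac{P(D \mid M_1)}{P(D \mid M_2)} \times \frac{P(M_1)}{P(M_2)}
```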

There's some theory that gives pretty general conditions under which Bayesian procedures converge to the true answer,

The notion of 'true answer' is frequentist....

edit: Recall that the original argument was about the trope here of Bayesianism being opposed to frequentism, etc. The point with Solomonoff induction is that once you declare something like this a source of priors, all the math you'll be doing should be completely identical to frequentist math (where the frequencies are within Turing machines fed random tape, and the math is done as in my top-level post for the die), just as long as you don't simply screw your math up. The point of the die example was that no Bayesian worth their salt objects to there being a property of a chaotic process, namely what fraction of initial phase space gets mapped where, because there really is this property.