It's very straightforward in the frequentist interpretation: half the people pick the normal die, one in 8 of them rolls a 3, so 1/16 of the original people roll a 3 off the normal die, while 1/2 roll a 3 off the trick die, for a total of 9/16 rolling a 3. 1/16 with the normal die out of the 9/16 that roll a 3, and there's your probability: 1/9. Trivial stuff people should be able to reinvent if they skip or forget it. 4th or 5th grade math at most. Too bad there's no good training at an early enough age.
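For anyone who wants to poke at the numbers, here's a minimal sketch of the same arithmetic, assuming the half-and-half jar described above (variable names are mine):

```python
# Minimal sketch of the arithmetic above, assuming a jar holding equal numbers
# of ordinary 8-sided dice and trick dice that always land on 3.

prior_normal = 0.5          # half the people pick the normal die
prior_trick = 0.5           # the other half pick the trick die
p_three_given_normal = 1/8  # one in 8 rolls of the normal die shows 3
p_three_given_trick = 1.0   # the trick die always shows 3

# Joint probabilities: fraction of all people who roll a 3 with each die.
joint_normal = prior_normal * p_three_given_normal  # 1/16
joint_trick = prior_trick * p_three_given_trick     # 8/16

# Total fraction rolling a 3, then the posterior for the normal die.
p_three = joint_normal + joint_trick                # 9/16
posterior_normal = joint_normal / p_three           # (1/16) / (9/16) = 1/9

print(posterior_normal)  # 0.111... = 1/9
```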
Train people to think straight and Bayes will pop up on its own; train people to do Bayes and they'll think wrong with Bayes.
edit: actually, what's up with this local trope of "Bayesianism" as opposed to "Frequentism"? The math abstracts away the philosophical detail of whether probability is a degree of belief or the limit of long-run trials.
There are practically relevant considerations that emerge from the philosophical distinction between Bayesianism and frequentism. If you have an epistemic conception of probability, then it makes sense to talk about the probability distribution of a theoretical parameter, such as the mean of some variable in a population. If you're a frequentist, though, this usually does not make sense. The variable itself has a relative frequency associated with each of its values, but it makes no sense to talk of the relative frequency of the mean of the variable. So on a frequentist conception of probability, you wouldn't assign a probability distribution over the theoretical parameter. The parameter is not treated as a random variable.
The upshot is that frequentists focus on likelihood functions, which a Bayesian would describe as giving the probability of observed data conditional on the theoretical parameter. Bayesians, on the other hand, look for posterior probability distributions, the distribution of the theoretical parameter conditional on observed data. Bayesians report the epistemic probability that the parameter value is in a certain interval, given the data. For frequentists, the interval is the random variable, and the unknown parameter is treated as fixed. So the probabilities they report are the relative frequency (in an ensemble of experiments) with which the confidence interval constructed using observed data will contain the theoretical parameter.
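In symbols (a standard textbook formulation, not taken from the comment itself), the contrast looks like this:

```latex
% Likelihood: probability of the observed data x, viewed as a function of the parameter.
L(\theta; x) = p(x \mid \theta)

% Bayesian posterior: a distribution over the parameter given the data; requires a prior p(\theta).
p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta')\, p(\theta')\, d\theta'}

% Frequentist coverage: \theta is fixed, the interval endpoints l(X), u(X) are random.
\Pr_{\theta}\bigl( l(X) \le \theta \le u(X) \bigr) = 0.95
```

The Bayesian reports the second line conditioned on the observed x; the frequentist reports the third line as a property of the procedure over repeated experiments.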
So the philosophical difference has important methodological consequences in statistics. These are explored at length in many advanced textbooks in theoretical statistics, such as this one. The distinction isn't just a local trope; it's an important foundational distinction in statistics.
If you have an epistemic conception of probability, then it makes sense to talk about the probability distribution of a theoretical parameter, such as the mean of some variable in a population.
Please go on with an example of how it is practically relevant, such that frequentism fails.
Bayesianism with Solomonoff induction as a prior is identical to frequentism over Turing machines anyway (or at least it should be; if you make mistakes it won't be).
As for the local trope, it seems to be a complete misunderstanding of books such as the one you linked.
Please go on with an example of how it is practically relevant, such that frequentism fails.
For an actual scientific example of Bayesian and frequentist methods yielding different results when applied to the same problem, see Wagenmakers et al.'s criticisms [PDF] of Bem's precognition experiments.
Here's a toy example that (according to Bayesians, at least) illustrates a defect of frequentist methodology. You draw two random values from a uniform distribution with unknown mean m and known width 1. Let these values be v1 and v2, with v1 < v2. If you did this experiment repeatedly, then you would expect that 50% of the time, the interval (v1, v2) would include the population mean m. So according to the frequentist, this is the 50% confidence interval.
Suppose that on a particular run of the experiment, you get v1 = 0.1 and v2 = 1.0. For this particular data, the Bayesian would say that given our model, there is a 100% chance of the mean lying in the interval (v1, v2). The consistent frequentist, however, cannot say this. She can't talk about the probability of the mean lying in the interval, she can only talk about the relative frequency with which the interval (considered as a random variable) will contain the mean, and this remains 50%. So she will say that the interval (0.1, 1.0) is a 50% confidence interval. The Bayesian charge is that by refusing to conditionalize on the actual data available to her, the frequentist has missed important information: specifically, that the mean of the distribution is definitely between 0.1 and 1.0.
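A quick simulation sketch of this toy example (my own illustration; the true mean 0.55 is an arbitrary choice): overall the interval (v1, v2) covers m about 50% of the time, but conditional on the two draws being more than 0.5 apart, as with 0.1 and 1.0, it contains m every time.

```python
# Simulation sketch of the toy example: two draws from Uniform(m - 0.5, m + 0.5).
# The interval (v1, v2) covers m 50% of the time overall, but whenever the
# draws are more than 0.5 apart (as with 0.1 and 1.0), coverage is certain.
import random

m = 0.55          # true mean; any value works, chosen here for illustration
trials = 100_000
covered = 0
covered_wide = wide = 0

for _ in range(trials):
    a = random.uniform(m - 0.5, m + 0.5)
    b = random.uniform(m - 0.5, m + 0.5)
    v1, v2 = min(a, b), max(a, b)
    if v1 < m < v2:
        covered += 1
    if v2 - v1 > 0.5:          # wide-spread case, like (0.1, 1.0)
        wide += 1
        if v1 < m < v2:
            covered_wide += 1

print(covered / trials)        # ~0.5: the frequentist 50% coverage
print(covered_wide / wide)     # 1.0: conditional on a wide spread, m is always inside
```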
actually, what's up with this local trope of "Bayesianism" as opposed to "Frequentism"? The math abstracts away the philosophical detail of whether probability is a degree of belief or the limit of long-run trials.
Where do you get the idea that it's a local trope? Knowledgeable and well-respected people in the field consider these foundational issues important, e.g., Brad Efron and Andrew Gelman.
You can make an argument that the philosophical details wash out as long as you're operating on a fully specified probability space. In that sense, probability is just sort of syntactic manipulation. But once you start thinking about statistics, where the events and probabilities have some semantic/denotative connection with the real world, you need to care about where the probability space you're working with comes from.
Okay, here's a problem for you. Not a neat probability problem. A rectangular die has sides of length 1 cm, 1.1 cm, and 1.2 cm; it is made of 316 stainless steel; the edges and corners are rounded to a radius of 1 mm; it is dropped onto a 10 cm thick steel plate made of the same type of steel and bounces several times. What would you do to find the probabilities of it landing on each of its sides?
Clearly there is no disagreement that 1: agents may represent their uncertainty with probabilities, and 2: physical systems such as dice work like a hash function of the initial state, so that for a perfect die very nearly exactly 1/6 of the initial state space gets transformed into each number, and with several bounces the points in the state space that get transformed into different numbers are separated by less than attoradians of initial angle and attoradians per second of initial angular velocity. The effect of small deviations from a symmetrical shape can be estimated from physical considerations. The outcomes of any games can be found by starting from physics and counting over the states that are consistent with the observations that took place; that is likewise not controversial.
Nobody respected disagrees that there exists such a property of physical systems that incorporate chaos (they act as a hash function, essentially); nobody respected disagrees that you can also have degrees of belief that shouldn't be Dutch-bookable; and a bunch of sloppy philosophers who don't really understand either are very confused, going on about Bayesianism this, Bayesianism that, "good Bayesian", "spoke fluent Bayesian", I kid you not. The latter sort of stuff seems to be a local-ish trope.
edit: to summarize, we probably just need two different words, one for the property of chaotic physical systems (or hash functions or the like), and another for degrees of belief, which only have to obey certain properties among themselves to avoid a Dutch book or the like. The argument over which should be called 'probability' is pretty silly. Anything with dice in it falls straight into the chaotic-physical-systems category.
Ignoring, temporarily, everything but the first paragraph, there are two ways I might proceed.
Acting as a frequentist, I would suppose that die rolls could be modeled as independent identically distributed draws from a multinomial distribution with fixed but unknown parameters. (The independence, and to a lesser degree the identically distributed, assumption could also be verified, although this gets a bit tricky.) I would roll the die some fixed number of times (possibly determined according to an a priori calculation of statistical power) and take the MLE as a point estimate of the unknown parameters. I would report this parameter as the probability of the die landing on the various sides. I might also report a 95% confidence region for the estimate, which is not to be interpreted as having a 95% chance of containing the true probabilities (this particular region either does or does not, with certainty).
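A rough sketch of that frequentist recipe (the counts are made up, and the per-side intervals use a simple normal approximation rather than a joint confidence region):

```python
# Rough sketch of the frequentist recipe above; the observed counts are made up.
import numpy as np

counts = np.array([18, 15, 20, 14, 17, 16])   # hypothetical tallies from 100 rolls
n = counts.sum()

# MLE for multinomial probabilities is just the observed proportions.
p_hat = counts / n
print(p_hat)

# Approximate 95% confidence intervals for each side (normal approximation);
# interpreted as coverage over repeated experiments, not as a probability
# that the true value lies inside this particular interval.
se = np.sqrt(p_hat * (1 - p_hat) / n)
for i, (p, s) in enumerate(zip(p_hat, se), start=1):
    print(f"side {i}: {p:.3f} +/- {1.96 * s:.3f}")
```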
Acting as a Bayesian, I would assume the same data model, but I would also place a prior distribution on the unknown parameter. A natural prior in this case is the Dirichlet distribution, which is conjugate to the multinomial distribution. I would also use the same data collection approach, although the Bayesian formulation makes it easy to work with the special case of observing a single roll. Given the model likelihood and the prior distribution, Bayes' law tells me the new posterior distribution to which I should update to represent my uncertainty over the unknown parameter. I would continue to roll the die and update until the posterior distribution is sufficiently concentrated according to some reasonable stopping criterion. I would then report the posterior mean (or maybe the MAP estimate) as the probability of the die landing on the various sides. I would also report a 95% credible region for the estimate, which I would give a 95% credence to containing the truth (although under questioning, I would probably be evasive/unclear about exactly what that means). I would also need to communicate a justification for my prior distribution and ideally evidence that the inference is not overly sensitive to it. I ought to just report the posterior distribution itself, but people tend to find it easier to base decisions on point estimates.
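And a matching sketch of the Bayesian version, assuming a symmetric Dirichlet(1, ..., 1) prior and the same made-up counts; the conjugate update is just adding the counts to the prior parameters:

```python
# Sketch of the Bayesian version: Dirichlet prior, multinomial likelihood.
# The prior and counts here are assumptions for illustration only.
import numpy as np
from scipy.stats import beta

alpha_prior = np.ones(6)                       # symmetric Dirichlet(1,...,1) prior
counts = np.array([18, 15, 20, 14, 17, 16])    # same hypothetical tallies as above

# Conjugacy: the posterior is Dirichlet(alpha_prior + counts).
alpha_post = alpha_prior + counts
total = alpha_post.sum()

# Posterior mean as a point estimate of the side probabilities.
post_mean = alpha_post / total
print(post_mean)

# 95% credible interval for each side via the marginal Beta distributions.
for i, a in enumerate(alpha_post, start=1):
    lo, hi = beta.ppf([0.025, 0.975], a, total - a)
    print(f"side {i}: mean {a / total:.3f}, 95% credible interval ({lo:.3f}, {hi:.3f})")
```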
There are obvious similarities between these two inferential approaches, but they are answering slightly different questions using vastly different methods.
Suppose you are denied experimentation and denied an extremely powerful computer (e.g. you can only do <100 simulated trials but want reasonable accuracy), or you need high accuracy in limited time. I was more interested in what you would do if you had to solve something like this analytically, finding the probabilities for the 3 distinct sides.
The point here is that you want to go for physically justified stuff, and anything not physically justified that you are doing anywhere is the same in principle as wilfully putting cognitive bias into your calculations, and is just plain wrong, no philosophical stuff here, you'll end up losing games vs someone who solves it better. Maybe you guys need "Overcoming Bayes" blog.
The point here is that you want to go for physically justified stuff, and anything not physically justified that you are doing anywhere is the same in principle as wilfully putting cognitive bias into your calculations, and is just plain wrong, no philosophical stuff here, you'll end up losing games vs someone who solves it better.
Statisticians, by and large, don't lose sleep over this problem. Even in your not-quite-fair die problem, the calculations involved are really hard. It wasn't made explicit in my comment, but I wasn't even assuming that opposite sides have equal probability, because some subtle error in the setup could break the symmetry. In the Bayesian case, I considered mentioning a mixture model that would take advantage of the symmetry if the data supported it. In KDD Cup types of problems, nobody is worried that a domain expert will show up with a winning solution that doesn't even need to see the training data (why would it, if it were maximally physically justified?).
putting cognitive bias into your calculations, and is just plain wrong, no philosophical stuff here, you'll end up losing games vs someone who solves it better. Maybe you guys need "Overcoming Bayes" blog.
Bayesians have made peace with bias. In fact, decision rules that are both Bayes and unbiased have zero risk, which is a nice way of saying that they don't exist in non-trivial situations. Noorbaloochi and Meeden (1983) have to go through definitional contortions to establish a positive connection between being Bayes and unbiased.
Bias is what lets you get good inferential performance in small-sample regimes. If I observe side counts (2, 0, 1, 3, 2, 2), I'd be okay with my estimator inferring equal side probabilities, because that will be closer to the truth than the unbiased estimator which guesses (0.2, 0.0, 0.1, 0.3, 0.2, 0.2); ten rolls is not enough data to tell me that I should never see a "2". On the other hand, with side counts (200, 0, 100, 300, 200, 200), something closer to the unbiased estimator seems like a good idea. As long as the estimator is asymptotically unbiased, you can even still have consistency.
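Here is a small sketch of that shrinkage, using the counts from the comment and an assumed uniform Dirichlet(1, ..., 1) prior (a larger prior weight would shrink harder):

```python
# Shrinkage illustration using the counts above and an assumed Dirichlet(1,...,1) prior.
def mle(counts):
    n = sum(counts)
    return [c / n for c in counts]

def posterior_mean(counts, alpha=1.0):
    # Dirichlet(alpha,...,alpha) prior; the posterior mean shrinks toward uniform.
    n = sum(counts)
    k = len(counts)
    return [(c + alpha) / (n + k * alpha) for c in counts]

small = [2, 0, 1, 3, 2, 2]                 # ten rolls
large = [200, 0, 100, 300, 200, 200]       # a thousand rolls

print(mle(small))             # [0.2, 0.0, 0.1, 0.3, 0.2, 0.2] -- unbiased, noisy
print(posterior_mean(small))  # pulled toward 1/6 each (larger alpha pulls harder)
print(mle(large))             # [0.2, 0.0, 0.1, 0.3, 0.2, 0.2]
print(posterior_mean(large))  # barely shrunk: the data dominate the prior
```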
Unlike cognitive bias, we have control over our statistical bias, and we should not be squeamish about using it to learn about the parts of the world that are too hard to model with the complete accuracy that would make statistics unnecessary.
The point of the not-quite-fair die example was to demonstrate where 'probabilities' come from. The fair die, after several bounces, maps the initial state space onto the final side-up states in a particular way, so that 1/6th of even a very tiny part (hypervolume) of the initial state space maps to each side-up final state. The not-totally-fair die is somewhat biased away from that. Any problem involving dice can be solved from first principles all the way from this, through selection of the parts of the initial state space that are compatible with observation, to the answer.
With regard to statisticians not losing sleep over this, there are a zillion examples in practice where you have to deal with e.g. electric current, or temperature, or illumination, or any other fundamentally statistical property, and you have limited computational power. A lot of my work is doing this for illumination; I have to compute illumination at a huge number of points on the screen (and no, you can't brute-force it even if you had 1000x the computing power, not to mention that when there's 1000x the power you'll have tighter constraints on error and time). I don't really care if some people don't find anything wrong with doing a wrong thing "because we won't be beaten in practice", when I am earning some of my money by beating those folks in practice. So much the better for me that some folks just don't understand that you shouldn't get to choose some arbitrary numbers. Yes, in various really fuzzy problems, you can do whatever you subjectively please. But to see this as fundamental, that's quite seriously silly.
There are many methods for finding the resulting distribution; one particular method involves sampling the initial state more regularly than at random (e.g. a grid with jittering), so that you get an error that improves much faster than 1/sqrt(N); it can in principle be used for die simulation, and is used in practice on similar problems that are less messy (molecular dynamics comes to mind). I generally find that nowadays a lot of very important insights live within the more applied fields; the knowledge has not yet propagated into this meta-ish land of arguing mostly over terminology and not having to be maximally correct against the gold standard of reality.
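A toy sketch of the jittered-grid idea on a smooth one-dimensional integral (my own illustration, not the die or illumination problem): the stratified estimator's error falls roughly like N^(-3/2) for a smooth integrand, versus N^(-1/2) for plain Monte Carlo.

```python
# Toy sketch of jittered (stratified) sampling vs. plain Monte Carlo on a
# smooth 1-D integral; not the die or illumination problem, just the principle.
import random
import math

def f(x):
    return math.sin(math.pi * x)     # true integral over [0, 1] is 2/pi

true_value = 2 / math.pi

def plain_mc(n):
    return sum(f(random.random()) for _ in range(n)) / n

def jittered(n):
    # One uniformly jittered sample per grid cell of width 1/n.
    return sum(f((i + random.random()) / n) for i in range(n)) / n

def rms_error(estimator, n, reps=200):
    return math.sqrt(sum((estimator(n) - true_value) ** 2 for _ in range(reps)) / reps)

for n in (16, 64, 256):
    print(n, rms_error(plain_mc, n), rms_error(jittered, n))
# Plain MC error falls like n^(-1/2); the jittered estimator falls like n^(-3/2)
# for a smooth integrand, so it pulls ahead rapidly.
```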
Any problem involving dice can be solved from first principles all the way from this, through selection of the parts of the initial state space that are compatible with observation, to the answer.
You're sketching out a methodology for solving forward problems (given model, determine observations), which is fine but it's not what motivates statisticians. Statisticians are generally concerned with the backward/inverse problem (given observations, determine model).
In reality, we're not presented with complete and accurate technical specifications for the die/table/thrower system we encounter. All we get to see is the sequence of sides that landed on top. If we're playing a game that uses the die, it's of interest to know how this sequence will continue into the future.
One general approach to figuring this out might involve inferring technical specifications. Maybe if we're really clever, we can figure out what grade of steel the die is made of just from the observed side counts. Less ambitiously, we might try to recover the relative side lengths and rounding radius. With all this information, we can then simulate forward to estimate the sequence of future throws. The number of parameters involved here may run into the tens or hundreds, or into the millions if we want to capture all the physiological details of a human thrower. It's also not quite clear whether a system like this would even converge to any stationary long-term behavior from which limiting relative frequencies could be calculated.
Another approach is to ignore all the detail, assume independent identically distributed tosses, and just try to learn the five parameters (P(side 1), ..., P(side 5); P(side 6) = 1 - P(side 1) - ... - P(side 5)). Forward simulation in this case is just repeated sampling from the learned distribution.
Moreover, let's suppose that (effective) independence emerges from the technical specification model. Then we have a huge identifiability problem; all those hundreds of parameters are just providing a redundant parameterization of the iid model. We can't hope to learn all of the parameters from the data we get to observe.
I guess as long as you want to stick to forward problems, you can invoke Occam and deny that probability even exists. But don't assume that your understanding carries over to inverse problems. Probability is a useful technical tool there, and applying it to real problems requires translation/operationalization. Two different frameworks for this are frequentism and Bayesianism.
I don't really care if some people don't find anything wrong with doing a wrong thing "because we won't be beaten in practice", when I am earning some of my money by beating those folks in practice.
If you want to put your money where your mouth is, I have a proposal. Take a die of your choosing, or manufacture one according to your own specifications; it doesn't have to be remotely fair. Also supply a plate onto which it can be tossed if you desire. Do whatever measurements you want on them. Then convey them to a mutually-accepted third party. The third party rolls the die 200 times, according to instructions you publicly post, and then publicly posts the first half of the sequence of rolls and a hash of the second half of the sequence. We both predict the side counts in the second half of the sequence and post the predictions publicly. The third party reveals the second half of the sequence (which can be checked against the hash) and whoever was closer to the true side counts (in squared error distance) wins. The loser pays the winner some mutually-accepted amount, plus or minus half the die/plate shipping expenses as appropriate to split that cost.
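For concreteness, a minimal sketch of the commit-and-reveal and scoring steps described above (the roll sequence and the predictions below are placeholders, not real data):

```python
# Minimal sketch of the commit-and-reveal and scoring steps described above.
# The roll sequence and predictions are made-up placeholders.
import hashlib

second_half = "631624314551263255143624135226314551423621453551"

# The third party publishes this hash alongside the first half of the rolls...
commitment = hashlib.sha256(second_half.encode()).hexdigest()

# ...then later reveals second_half, and anyone can check it matches the hash.
assert hashlib.sha256(second_half.encode()).hexdigest() == commitment

# Scoring: squared error between predicted and actual side counts.
actual = [second_half.count(str(side)) for side in range(1, 7)]
prediction_a = [8, 8, 8, 8, 8, 8]      # placeholder predictions
prediction_b = [10, 7, 9, 8, 7, 7]

def squared_error(pred, truth):
    return sum((p - t) ** 2 for p, t in zip(pred, truth))

print(squared_error(prediction_a, actual), squared_error(prediction_b, actual))
```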
I am an applied mathematician who actually does work on finding the values of probabilistic quantities in better computing time than straightforward numerical experimentation. Probability is not just statistics.
Insofar as what you think Bayesians do deviates from what I know has to be done, you have a wrong idea of what Bayesians do (or, giving you the benefit of the doubt at the expense of others, you are referring to some "Bayesians" who are plain wrong), or something like that, but the discussion is too fuzzy for me to tell which. (Ditto for frequentists.)
The point of frequentism is seeing probability as frequency in an infinite number of trials. The point of my die example is to demonstrate that physically the probability plainly comes in as a frequency, via a function from initial phase space to final phase space that maps, for a fair die, 1/6 of the initial phase space to each final side-up state, this being an objective property of the system that has to be adequately captured by whatever methods you are using. And I do not give the slightest damn if you don't know that in practice, not for dice but for many other systems, you have to find probabilities bottom-up from e.g. the laws of physics. If you are given a steel die to physically experiment with, there again are far better (faster) ways to find out the probabilities than just tossing it (do you even understand that your errors converge as 1/sqrt(N), or how important an issue that is in practice?!). Of course I won't bother making an example with an actual die for you; the point is the principle, and I've done such solutions before with things that unfortunately don't make great examples.
edit: also, on science: the reason we do 'probability of data given model' is because science follows a strategy of committing to only rarely (with a certain probability) throwing out a valid model. 'Probability of model given the data' is not well defined, unless you count stuff like 'Solomonoff induction as a prior', where it is defined but not computable (and is mathematically homologous to assigning probability 1 to the 'we live inside a Turing machine' model). The experimental physicists publish probability of data given model; people can then combine that with their priors if they want.
If you are given a steel die to physically experiment with, there again are far better (faster) ways to find out the probabilities than just tossing it (do you even understand that your errors converge as 1/sqrt(N), or how important an issue that is in practice?!).
The world often isn't nice enough to give us the steel die. Figuratively, the steel die may be inside someone's skull, thousands of years in the past, millions of light-years away, or you may have five slightly different dice and really want to learn about the properties of all dice.
I do understand the O(N^(-1/2)) convergence of errors. I spend a lot of time working on problems where even consistency isn't guaranteed (i.e., nonparametric problems where the "number of parameters" grows in some sense with the amount of data) and finding estimators with such convergence properties would be great there.
'Probability of model given the data' is not well defined, unless you count stuff like 'Solomonoff induction as a prior', where it is defined but not computable (and is mathematically homologous to assigning probability 1 to the 'we live inside a Turing machine' model).
It's perfectly well-defined. It's just subjective in a way that makes you (and a great number of informed, capable, and thoughtful statisticians) apparently very uneasy. There's some theory that gives pretty general conditions under which Bayesian procedures converge to the true answer, in spite of choice of prior, given enough data. You probably wouldn't be happy with rates of convergence for these methods, because they tend to be slower and harder to obtain than for, e.g., MLE estimation of iid normally-distributed data.
The experimental physicists publish probability of data given model; people can then combine that with their priors if they want.
They might well do this. As a frequentist, this is a natural step in establishing confidence intervals and such, after they have estimated the quantity of interest by choosing the model that maximizes the probability of the data. This choice may not look like "Standard Model versus something else" but it probably looks like "semi-empirical model of the system with parameter 1 = X" where X can range over some reasonable interval.
unless you count stuff like 'Solomonoff induction as a prior'
I don't see what role Solomonoff induction plays in a discussion of frequentism versus Bayesianism. I never mentioned it, I don't know enough about it to use it, and I agree with you that it shows up on LW more as a mantra than as an actual tool.
The world often isn't nice enough to give us the steel die.
The point is that with a die the probability comes in as a frequency (the fraction of the initial phase space). Yes, sometimes nature doesn't give you the die; that does not invalidate the fact that probability exists as an objective property of a physical process, as per frequentism (related to how the process maps initial phase space to final phase space); the methods employing subjectivity have to try to conform to this objective property as closely as possible (e.g. by trying to know more about how the system works). Bayesianism is not opposed to this, unless we are to speak of some terribly broken Bayesianism.
'Probability of model given the data' is not well defined,
It's perfectly well-defined.
Nope. Only the change to the probability of the model given the data is well defined. The probability itself isn't. You can pick an arbitrary starting point.
There's some theory that gives pretty general conditions under which Bayesian procedures converge to the true answer,
The notion of 'true answer' is frequentist....
edit: Recall that the original argument was about the trope here of Bayesianism being opposed to frequentism, etc. The point with Solomonoff induction is that once you declare something like it a source of priors, all the math you'll be doing should be completely identical to frequentist math (with the frequencies taken over Turing machines fed random tape, and the math done as in my top-level post for the die), as long as you don't simply screw your math up. The point of the die example was that no Bayesian worth their salt denies that a chaotic process has the property of mapping some fraction of the initial phase space to each outcome, because there really is such a property.
actually, what's up with this local trope of "Bayesianism" as opposed to "Frequentism"?
Eliezer got this straight from Jaynes.
And he wrote a sizable post about the conflict: Frequentist Statistics are Frequently Subjective
The process of throwing away the actual experimental result, and substituting a class of possible results which contains the actual one - that is, deliberately losing some of your information - introduces a dose of real subjectivity.
Edit: didn't mean to retract this, hit the button by accident.
Another example of him having poor knowledge and going on, confused and irrelevant, for pages. LW is very effective at throwing away anyone who has a clue by referencing highly loved incorrectness.
LW is very effective at throwing away anyone who has a clue by referencing highly loved incorrectness.
In the name of Cryonics, Bayesianism, MWI, FAI, FOOMing, physical realism and whatever other ideas that lesswrong folks endorse but you have a problem with I banish you!
Did it work?
I'm not sure how much Bayes is getting through here - people are primed to associate "3" with the trick die, so you can get the right answer via availability or similarity heuristics. I'm trying to think of a way to reformulate this without the priming while keeping it close-to-simple - maybe s/"told you that the number that landed was a 3"/"told you that the number that landed was odd"
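For what it's worth, here's how the numbers would move under that reformulation, assuming (as in the top comment's arithmetic) a jar that is half ordinary 8-sided dice and half trick dice that always show 3:

```python
# Sketch of how the numbers change if the reported evidence is "odd" rather than "3"
# (assuming the same jar: half ordinary 8-sided dice, half trick all-3's dice).
def posterior_normal(p_evidence_given_normal, p_evidence_given_trick, prior=0.5):
    joint_normal = prior * p_evidence_given_normal
    joint_trick = (1 - prior) * p_evidence_given_trick
    return joint_normal / (joint_normal + joint_trick)

print(posterior_normal(1/8, 1.0))  # evidence "a 3 was rolled": 1/9
print(posterior_normal(1/2, 1.0))  # evidence "an odd number was rolled": 1/3
```

The weaker evidence still favors the trick die, just less strongly, so the update can't be reproduced by mere association with "3".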
I've had a bit of success with getting people to understand Bayesianism at parties and such
Oooh, I know this one. If the blonde is a 9 and you have a 30% chance with her and her friends are all 7s and you have a 40% chance with them, but that chance reduces to 15% if you have already been rejected by the blonde, which girl should you approach?
It may be useful to actually type out how you use the above thought experiment to explain Bayes. That would make it more useful for those of us still confused or unsure about what Bayes means (hey, I'm a newbie, be nice), and it would help people critique the example in how it teaches the theorem.
For example, why is it better to ask, "If a 3 is rolled, is it more likely to be the 8-sided die or not?" than to ask, "If a random die is rolled, is it more likely to be a 3 or not?"
For example, why is it better to ask, "If a 3 is rolled, is it more likely to be the 8-sided die or not?" than to ask, "If a random die is rolled, is it more likely to be a 3 or not?"
Good question. The second question is "just a probability" question. The first question asks you to condition on evidence ("If the randomly chosen die is rolled and comes up 3") and infer "backward" to what this tells you about the die. That's why Bayesian reasoning applies.
The reasoning goes like this: before I roll the die, the two kinds of dice are equally likely.
Then I roll the die and see a three. The conditional probability of this if the die is eight-sided is 1/8. The conditional probability of this if the die has only 3's is 1.
The Bayesian update is to multiply out the probability of observing the evidence in the two cases: 1/2 × 1/8 = 1/16 for the eight-sided die, and 1/2 × 1 = 1/2 for the all-3's die.
And then renormalize: (1/16) / (1/16 + 1/2) = 1/9 for the eight-sided die, and (1/2) / (1/16 + 1/2) = 8/9 for the all-3's die.
JQuinton mentioned that he uses this to argue about falsifiability. I'd like to hear that explained more. I think the example is meant to show that a hypothesis that "can explain anything" (the 8 sided die), should lose probability if we obtain evidence that is "better explained" by the more specific hypothesis (the 3's only die).
JQuinton mentioned that he uses this to argue about falsifiability. I'd like to hear that explained more. I think the example is meant to show that a hypothesis that "can explain anything" (the 8 sided die), should lose probability if we obtain evidence that is "better explained" by the more specific hypothesis (the 3's only die).
Yes, that's correct. The thing I was trying to illustrate is that some hypotheses are more falsifiable than others. A hypothesis that can explain too much data (e.g. a 1,000-sided die) would lose probability to a more restricted hypothesis like a 6-sided die if the numbers 1-6 are rolled. The complement to that is that if the numbers 7-1,000 are rolled, this refutes the idea that the 6-sided die was rolled. Accounting for too much data and falsifiability are two sides of the same coin; explaining too much data tends toward unfalsifiability.
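A small sketch of that point, with assumed hypotheses of a fair 6-sided die versus a fair 1,000-sided die at equal prior odds:

```python
# Sketch of the "explains too much" point: a fair 6-sided die vs. a fair
# 1,000-sided die, equal priors, observing n rolls that all land in 1-6.
def posterior_six_sided(n_rolls_in_1_to_6):
    like_six = (1 / 6) ** n_rolls_in_1_to_6
    like_thousand = (1 / 1000) ** n_rolls_in_1_to_6
    return like_six / (like_six + like_thousand)

for n in (1, 2, 5):
    print(n, posterior_six_sided(n))
# Each roll in 1-6 multiplies the odds for the 6-sided die by 1000/6 ≈ 167,
# while a single roll of 7-1000 would refute the 6-sided hypothesis outright.
```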
If I understand the point you're trying to make, you might try an example with curve fitting. If some data in a scatterplot is well explained by a line plus noise, then that's a better explanation than trying to draw ever more complicated curves that go through all the data exactly. Of course, identifying the very best model that has a few wiggles and less unexplained scatter is actually pretty tricky [cf. AIC/BIC/cross-validation/splines].
I wonder if doing the Monty Hall / Monty Fall / Monty Crawl would be a better way to teach probabilistic reasoning. (You'd probably want to start off with the fall, then move to the crawl, then move to the hall, if you want people to get them right.)
I've had a bit of success with getting people to understand Bayesianism at parties and such, and I'm posting this thought experiment that I came up with to see if it can be improved or if an entirely different thought experiment would be grasped more intuitively in that context:
I originally came up with this idea to explain falsifiability and the problem of a hypothesis that explains too much contradictory data, which is why I didn't go with, say, the example in the better article on Bayesianism (i.e. any other number besides a 3 being rolled refutes the possibility that the trick die was picked). So eventually I increase the number of sides the die has (like a hypothetical 50-sided die), the types of dice in the jar (100-sided, 6-sided, trick die), and the distribution of dice in the jar (90% of the dice are 200-sided but a 3 is rolled, etc.). Again, I've been discussing this at parties where alcohol is flowing and cognition is impaired, yet people understand it, so I figure if it works there then it can be understood intuitively by many people.