How did they come up with the likelihood distribution? Maybe they sampled 100 products from each machine and for each sample counted the number of faulty products. Maybe they sampled 1.000.000 products from each machine...
We don't know which sample size is used: the likelihood distribution doesn't reveal this.
Implicitly, they have an infinite sample size, because the distribution on P(B|A_1) is infinitely precise. Suppose we also wanted to learn P(B|A_1) from the history of the factory: then we might model it as having a fixed rate of defective outputs, and the probability we assign to particular defect rates is a beta distribution. We might start off with a Jeffreys prior and then update as we see the tool produce defective or normal products, eventually ending up with, say, a beta(5.5,95.5) for tool A_1.
Exercise for the reader: given that hyperparameter distribution for P(B|A_1) (and similar ones for A_2 and A_3), do we need the full hyperparameter distribution for all three tools to determine the probability that a known defective output came off of A_1, or can we get the same answer using only a handful of moments from each distribution?
Hayrff guvf vf gevpxvre guna vg frrzf, whfg gur svefg zbzrag bs rnpu qvfgevohgvba fubhyq qb. (Sbe guvf ernfba V qvfnterr gung gur Jvxv negvpyr vzcyvpvgyl nffhzrf vasvavgr fnzcyr fvmr. Gur pbaqvgvbany cebonovyvgvrf hfrq va gur pnyphyngvba ner gur svefg zbzragf (= pbafgnagf) bs gur erfcrpgvir cnenzrgre qvfgevohgvbaf, abg gur cnenzrgref gurzfryirf (= enaqbz inevnoyrf).)
whfg gur svefg zbzrag bs rnpu qvfgevohgvba fubhyq qb.
Yep.
(Sbe guvf ernfba V qvfnterr gung gur Jvxv negvpyr vzcyvpvgyl nffhzrf vasvavgr fnzcyr fvmr. Gur pbaqvgvbany cebonovyvgvrf hfrq va gur pnyphyngvba ner gur svefg zbzragf (= pbafgnagf) bs gur erfcrpgvir cnenzrgre qvfgevohgvbaf, abg gur cnenzrgref gurzfryirf (= enaqbz inevnoyrf).)
V zbfgyl nterr jvgu guvf. V nterr gung lbh bayl arrq gur svefg zbzrag bs lbhe cbfgrevbe gb pnyphyngr jung gurl nfx sbe, ohg V guvax gung gurz cebivqvat gur ulcrecnenzrgre nf n fvatyr qngncbvag vf vzcyvpvgyl pynvzvat na vasvavgr cerpvfvba naq guhf na vasvavgr fnzcyr fvmr (be n pregnva haqreylvat zbqry), va gur fnzr jnl gung zl orgn qvfgevohgvba zbqry gung qbrfa'g vapyhqr gvzr vf vzcyvpvgyl pynvzvat gung gur znpuvar vf rdhnyyl yvxryl gb cebqhpr qrsrpgvir zngrevny ng nyy gvzrf qhevat vgf bcrengvba. Hayrff nffhzcgvbaf / zbqry fvzcyvsvpngvbaf yvxr gung ner rkcyvpvgyl npxabjyrqtrq, vg znxrf frafr gb pnyy gurz vzcyvpvg.
That beta distribution will have more built in uncertainty if based on a sample size of 100 rather than a sample size of 1.000.000, but that's the only difference (right?). In the Bayesian update they still have the same weight. Isn't this unfair to the large sample size likelihood distribution? Shouldn't it have more weight in the Bayesian update?
Would a solution be to make a Bayesian update for each individual observation of faulty/not-faulty product from machine x? Curiously this would seem to move the problem from a mathematical analysis to a brute force computational task (unless all that Bayesian updating can be neatly modelled)
(Note: I use the American radix point, except in quotes, where I preserve loldrup's.)
That beta distribution will have more built in uncertainty if based on a sample size of 100 rather than a sample size of 1.000.000, but that's the only difference (right?).
Remember that the posterior is the combination of the prior and the likelihood, weighted by the precision of each. The beta(1,1) prior (the famous 'uniform' prior) gives us the estimate that 50% of the material a machine outputs is going to be defective. If the true rate is 5%, and we somehow get the mode sample each time, the posterior will be closer to the truth in the (50,001, 950,001) case than in the (6,96) case. If we had the prior belief that, say, 2.4% of the material a machine outputs is defective, and decided our belief was strong enough to justify a (24,976) prior (which has a much higher precision than the (1,1) distribution), you'll notice that 1M datapoints does much more to correct our faulty prior than 100 datapoints. (In the case where we get a perverse sample, of course, the stronger prior is more resistant.)
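To make those numbers concrete, here is a minimal sketch that recomputes the posterior means for both priors and both sample sizes. It assumes, as above, that every sample contains exactly 5% defective items; the function name `posterior_mean` is only illustrative.

```python
def posterior_mean(prior_a, prior_b, defects, normals):
    """Mean of the beta posterior after adding the observed counts to the prior."""
    a = prior_a + defects
    b = prior_b + normals
    return a / (a + b)


# Assumption: each sample contains exactly 5% defective items (the "mode sample").
priors = {"uniform beta(1,1)": (1, 1), "strong beta(24,976)": (24, 976)}

for label, (pa, pb) in priors.items():
    for n in (100, 1_000_000):
        defects = n * 5 // 100          # integer arithmetic, no rounding issues
        mean = posterior_mean(pa, pb, defects, n - defects)
        print(f"{label:>20}, n={n:>9}: posterior mean = {mean:.5f}")
```

With 100 datapoints the strong prior still dominates (the posterior mean stays near 2.6%), while with 1M datapoints both priors end up very close to 5%.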
Would a solution be to make a Bayesian update for each individual observation of faulty/not-faulty products from machine x? Curiously this would seem to move the problem from a mathematical analysis to a brute force computational task (unless all that Bayesian updating can be neatly modelled)
You may be interested in conjugate priors. If I started off with a beta prior (defined by two parameters, alpha and beta), and I observe an event with a Bernoulli likelihood (a product is faulty or not faulty), then I can immediately calculate the posterior distribution by just adjusting the hyperparameters. If my priors are not conjugate to my likelihood, then I have to do a bunch of integrations to get my new posterior, and this is often done by brute force computation.
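As a rough illustration of why conjugacy removes the brute-force worry, the sketch below updates a beta belief one product at a time and compares the result with a single batch update. The inspection record (5 faulty, 95 fine) is hypothetical, and the Jeffreys prior beta(0.5, 0.5) is just one possible starting point.

```python
def update(alpha, beta, faulty):
    """One conjugate update of a beta(alpha, beta) belief by a single Bernoulli observation."""
    return (alpha + 1, beta) if faulty else (alpha, beta + 1)


# Hypothetical inspection record: 5 faulty products, 95 good ones.
observations = [True] * 5 + [False] * 95

# Update one observation at a time, starting from the Jeffreys prior beta(0.5, 0.5).
alpha, beta = 0.5, 0.5
for faulty in observations:
    alpha, beta = update(alpha, beta, faulty)

# Batch update: just add the counts to the prior hyperparameters.
n_faulty = sum(observations)
batch = (0.5 + n_faulty, 0.5 + len(observations) - n_faulty)

print((alpha, beta), batch)
```

Both routes give beta(5.5, 95.5), so updating per observation costs nothing more than counting.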
I see how this will work for a continuous distribution like the beta distribution. Visually, the effect of a high number of samples will be that the curve is more sharply concentrated around its most probable value. Outlying values become improbable more quickly as we move away from the peak.
But then this must mean that the discrete, "perfect", "infinite-sample" likelihood distribution used in the Wikipedia example must have a very high influence on the posterior, almost marginalising the effect of the prior. Do I reason correctly here?
And does this "infinite-sample" likelihood distribution really have such a strong effect in the Wikipedia example? (I don't know how to judge this)
I suspect we should separate the two points under discussion: first, there is the rate of defective material that a machine spits out, and second, there is the question of how much knowing that material is defective tells us about which machine processed it.
satt's comment handles the second point; when we are trying to estimate which machine produced a single defective product, the sample size of products is, by necessity, one. (Because we've implicitly assumed that the defectivity of products is independent, sampling more of them isn't really any more interesting than sampling one of them.)
But in order to do that calculation, we need some information about how much defective product each machine produces. As it turns out, we only need the first moment (i.e. mean) of that estimate; higher moments (like the variance) don't show up in the calculation. (Is it clear to you how to verify that statement?) So a 5% chance that I'm absolutely certain of and a 5% chance that comes from a guess lead to the same final output.
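One way to check the first-moment claim is to write the marginalisation out. The derivation below is a sketch under assumptions not stated in the thread: \theta_i denotes the unknown defect rate of machine A_i, p(\theta_i) is our belief about it (for example a beta distribution), and that belief does not depend on which machine the item was drawn from.

```latex
% Assumption: \theta_i is the unknown defect rate of machine A_i,
% with belief density p(\theta_i) that is independent of the machine drawn.
\begin{align*}
P(B \mid A_i) &= \int_0^1 P(B \mid A_i, \theta_i)\, p(\theta_i)\, d\theta_i
               = \int_0^1 \theta_i\, p(\theta_i)\, d\theta_i
               = \mathbb{E}[\theta_i], \\
P(A_1 \mid B) &= \frac{P(A_1)\, \mathbb{E}[\theta_1]}
                      {\sum_{j=1}^{3} P(A_j)\, \mathbb{E}[\theta_j]}.
\end{align*}
```

Only the means appear. For a beta(5.5, 95.5) belief, for instance, the term entering the formula would be 5.5/101, roughly 0.054, whatever the variance. This holds for a single inspected item; if several items from the same machine were inspected, higher moments of \theta_i would enter.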
And does this "infinite-sample" likelihood distribution really have such a strong effect in the Wikipedia example? (I don't know how to judge this)
For many probabilistic calculations, it's helpful to do a sensitivity analysis. That is, we jiggle around the inputs we gave (like the percentage of the total output that each machine produces, or the defectivity rate of each machine, and so on) to determine how strongly they influence the outcome of the procedure. If we were just guessing with the 5% number, but we discover that dropping it to 4% makes a huge difference, then maybe we should go back and refine our estimate to be sure that it's 5% instead of 4%. If the number is roughly the same, then our estimate is probably good enough.
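A minimal sketch of such a sensitivity check for this problem follows. The output shares of 20%, 30% and 50% are taken from the Wikipedia example and are not stated in this thread, so treat them as an assumption; `posterior` is an illustrative helper name.

```python
def posterior(shares, defect_rates):
    """P(machine | defective) for each machine, via Bayes' theorem."""
    joint = [s * r for s, r in zip(shares, defect_rates)]
    total = sum(joint)                      # P(defective)
    return [j / total for j in joint]


# Assumption: output shares of the three machines, as in the Wikipedia example.
SHARES = [0.2, 0.3, 0.5]

base = posterior(SHARES, [0.05, 0.03, 0.01])
nudged = posterior(SHARES, [0.04, 0.03, 0.01])   # first machine: 5% -> 4%

for name, b, n in zip(("A1", "A2", "A3"), base, nudged):
    print(f"{name}: {b:.3f} -> {n:.3f} (change {n - b:+.3f})")
```

Under these assumptions, dropping the first machine's rate from 5% to 4% moves its posterior from about 0.417 to about 0.364. Whether a shift of that size matters depends on what the answer is used for.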
If only the mean of the likelihood distribution is involved, not the variance, then the sample size used when creating the likelihood distribution truly has no influence on the Bayesian update.
Then the next question is: is it a problem? If I understand you correctly then your answer is: "not really, because [...]".
Then it's only the "[...]" part I don't get.
You ask me if it's clear to me why only the mean of the likelihood distribution is involved in the Bayesian update. Well, it isn't currently, but I'll read the article "Continuous Bayes" and see if it then becomes clearer to me:
http://www.sidhantgodiwala.com/blog/2015/03/14/continuous-bayes/
How did they come up with the likelihood distribution?
The likelihood distribution is a mathematical restatement of the earlier sentence "The fraction of defective items produced is this: for the first machine, 5%; for the second machine, 3%; for the third machine, 1%". In other words, a (uniformly) randomly chosen item produced by the first machine has a 5% chance of being defective, so P(B|A1) = 0.05, and mutatis mutandis for the other two machines.
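For reference, the whole update can be sketched in a few lines of exact arithmetic. The defect rates are the ones quoted above; the output shares (20%, 30%, 50%) come from the Wikipedia example rather than from this thread, so they are an assumption here.

```python
from fractions import Fraction

# Assumption: share of total output per machine, as in the Wikipedia example.
prior = {"A1": Fraction(20, 100), "A2": Fraction(30, 100), "A3": Fraction(50, 100)}

# P(B | A_i): defect rates quoted in the article (5%, 3%, 1%).
likelihood = {"A1": Fraction(5, 100), "A2": Fraction(3, 100), "A3": Fraction(1, 100)}

# P(B): overall probability that a randomly chosen item is defective.
evidence = sum(prior[m] * likelihood[m] for m in prior)

# P(A_i | B): probability that a defective item came from each machine.
posterior = {m: prior[m] * likelihood[m] / evidence for m in prior}

print("P(B) =", evidence)
for machine, p in posterior.items():
    print(f"P({machine} | B) = {p}")
```

With these inputs the third machine gets 5/24, even though it produces half of all items.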
Maybe they sampled 100 products from each machine and for each sample counted the number of faulty products. Maybe they sampled 1.000.000 products from each machine...
The sample size comes in at "If an item is chosen at random from the total output and is found to be defective" — "an item", hence N = 1.
We don't know which sample size is used: the likelihood distribution doesn't reveal this.
This information is encoded in the likelihood, but that's not explicitly noted so it may not be obvious. Had more than one item been chosen at random from the output, the likelihood would be different (and the hypothesis being tested, "what is the probability that it was produced by the third machine?", would have to be changed too to make sense with the new N).
In the introductory example in the Wikipedia article on Bayes' theorem, they start out with a prior distribution for P(machine_ID | faulty_product)* and then update this using a likelihood distribution P(faulty_product | machine_ID) to acquire a posterior distribution for P(machine_ID | faulty_product).
How did they come up with the likelihood distribution? Maybe they sampled 100 products from each machine and for each sample counted the number of faulty products. Maybe they sampled 1.000.000 products from each machine...
We don't know which sample size is used: the likelihood distribution doesn't reveal this. Thus this matter doesn't influence the weight of the Bayesian update. But shouldn't it do so? Uncertain likelihood distributions should have a small influence and vice versa. How do I make the Bayesian update reflect this?
I read the links provided by somervta in the 'Error margins' discussion from yesterday, but I'm not skillful enough to adapt them to this example.
* technically they just make the prior distribution a clone of the distribution P(machine_ID) but I like to keep the identity across the Bayesian update so I gave the prior and the posterior distribution the same form: P(machine_ID | faulty_product).