(Note: I use the American radix point, except in quotes, where I preserve loldrup's.)
That beta distribution will have more built in uncertainty if based on a sample size of 100 rather than a sample size of 1.000.000, but that's the only difference (right?).
Remember that the posterior is the combination of the prior and the likelihood, weighted by the precision of each. The beta(1,1) prior (the famous 'uniform' prior) gives us the estimate that 50% of the material a machine outputs is going to be defective. If the true rate is 5%, and we somehow get the mode sample each time, the posterior will be closer to the truth in the (50,001, 950,001) case than in the (6,96) case. If we had the prior belief that, say, 2.4% of the material a machine outputs is defective, and decided our belief was strong enough to justify a (24,976) prior (which has a much higher precision than the (1,1) distribution), you'll notice that 1M datapoints does much more to correct our faulty prior than 100 datapoints. (In the case where we get a perverse sample, of course, the stronger prior is more resistant.)
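To make the arithmetic above concrete, here is a minimal sketch that computes the four posteriors discussed: the uniform (1,1) prior and the stronger (24,976) prior, each updated with 100 and with 1,000,000 observations at a 5% defect rate. The helper names `posterior` and `mean` and the two dictionaries are my own illustrative choices, and the counts assume we get exactly the modal sample.

```python
def posterior(prior, faulty, good):
    """Beta(a, b) prior plus observed counts gives Beta(a + faulty, b + good)."""
    a, b = prior
    return a + faulty, b + good


def mean(params):
    """Mean of a Beta(a, b) distribution."""
    a, b = params
    return a / (a + b)


# Modal samples at a true defect rate of 5% (assumed for illustration).
COUNTS = {100: (5, 95), 1_000_000: (50_000, 950_000)}
PRIORS = {"uniform (1, 1)": (1, 1), "strong (24, 976)": (24, 976)}

for name, prior in PRIORS.items():
    for n, (faulty, good) in COUNTS.items():
        post = posterior(prior, faulty, good)
        print(f"{name:>16}, n={n:>9,}: posterior {post}, mean {mean(post):.4%}")
```

With the uniform prior the posterior mean is about 5.88% after 100 observations and about 5.00% after 1M. With the (24,976) prior it is about 2.64% after 100 observations, still close to the faulty prior, and about 5.00% after 1M.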
Would a solution be to make a Bayesian update for each individual observation of faulty/not-faulty products from machine x? Curiously this would seem to move the problem from a mathematical analysis to a brute force computational task (unless all that Bayesian updating can be neatly modelled)
You may be interested in conjugate priors. If I start off with a beta prior (defined by two parameters, alpha and beta), and I observe an event with a Bernoulli likelihood (a product is faulty or not faulty), then I can immediately calculate the posterior distribution by just adjusting the hyperparameters: a faulty product takes beta(alpha, beta) to beta(alpha+1, beta), and a non-faulty one takes it to beta(alpha, beta+1). If my priors are not conjugate to my likelihood, then I have to do a bunch of integrations to get my new posterior, and this is often done by brute-force computation.
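As a rough illustration of why the per-observation updating is not a brute-force task in the conjugate case, the sketch below updates a beta prior one observation at a time and checks that the result matches a single update using the total counts. The function names, the simulated 5% defect rate, the seed, and the sample size of 1,000 are all assumptions made for the example.

```python
import random


def update_one(a, b, is_faulty):
    """Update Beta(a, b) with a single Bernoulli observation."""
    return (a + 1, b) if is_faulty else (a, b + 1)


def update_batch(a, b, observations):
    """Update Beta(a, b) with all observations at once, using only the counts."""
    faulty = sum(observations)
    return a + faulty, b + len(observations) - faulty


# Simulated inspection results (True means faulty); the 5% rate is an assumption.
random.seed(0)
observations = [random.random() < 0.05 for _ in range(1000)]

a, b = 1, 1  # uniform prior
for obs in observations:
    a, b = update_one(a, b, obs)

# Updating one product at a time lands on the same posterior as one batch update.
assert (a, b) == update_batch(1, 1, observations)
print(f"posterior: beta({a}, {b}), mean {a / (a + b):.4f}")
```

The order of the observations does not matter here; only the number of faulty and non-faulty products does, which is why the whole sequence of updates collapses into two additions.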
I see how this will work for a continuous distribution like the beta distribution. Visually the effect of a high number of samples will be that the curve is more sharply centered on the most probable part of the curve. The outlier cases are more quickly becoming improbable as we move outwards.
But then this must mean that the discrete, "perfect", "infinite-sample" likelihood distribution used in the Wikipedia example must have a very high influence on the posterior, almost marginalising the effect of the prior. Do I reason correctly here...
In the introductory example in the Wikipedia article on Bayes' theorem, they start out with a prior distribution for P(machine_ID | faulty_product)* and then update this using a likelihood distribution P(faulty_product | machine_ID) to acquire a posterior distribution for P(machine_ID | faulty_product).
How did they come up with the likelihood distribution? Maybe they sampled 100 products from each machine and for each sample counted the number of faulty products. Maybe they sampled 1.000.000 products from each machine...
We don't know which sample size was used: the likelihood distribution doesn't reveal this. Thus this matter doesn't influence the weight of the Bayesian update. But shouldn't it do so? Uncertain likelihood distributions should have a small influence and vice versa. How do I make the Bayesian update reflect this?
I read the links provided by somervta in the 'Error margins' discussion from yesterday, but I'm not skillful enough to adapt them to this example.
* technically they just make the prior distribution a clone of the distribution P(machine_ID) but I like to keep the identity across the Bayesian update so I gave the prior and the posterior distribution the same form: P(machine_ID | faulty_product).