> How did they come up with the likelihood distribution? Maybe they sampled 100 products from each machine and for each sample counted the number of faulty products. Maybe they sampled 1,000,000 products from each machine...
> We don't know which sample size was used: the likelihood distribution doesn't reveal this.
Implicitly, they assume an infinite sample size, because the distribution on P(B|A_1) is infinitely precise. Suppose we also wanted to learn P(B|A_1) from the history of the factory: then we might model the tool as having a fixed rate of defective outputs, and the probability we assign to particular defect rates is a beta distribution. We might start off with a Jeffreys prior and then update as we see the tool produce defective or normal products, eventually ending up with, say, a beta(5.5, 95.5) for tool A_1 (after observing 5 defective and 95 normal products).
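A minimal sketch of that conjugate beta-Bernoulli update in Python (the 5/95 counts are illustrative assumptions, not data from the thread):

```python
# Sketch: learning the defect rate of tool A_1 as a beta-distributed parameter.
# The counts below (5 defective, 95 normal) are illustrative assumptions.
from scipy.stats import beta

a, b = 0.5, 0.5              # Jeffreys prior for a Bernoulli rate: Beta(0.5, 0.5)
defective, normal = 5, 95    # assumed observation history for tool A_1

# Conjugate update: defects increment a, normal products increment b.
a, b = a + defective, b + normal   # -> Beta(5.5, 95.5)

posterior = beta(a, b)
print(posterior.mean())            # first moment of the defect rate, ~0.054
print(posterior.interval(0.95))    # 95% credible interval for the rate
```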
Exercise for the reader: given that hyperparameter distribution for P(B|A_1) (and similar ones for A_2 and A_3), do we need the full hyperparameter distribution for all three tools to determine the probability that a known defective output came off of A_1, or can we get the same answer using only a handful of moments from each distribution?
Rot13'd to avoid spoiling the exercise: Hayrff guvf vf gevpxvre guna vg frrzf, whfg gur svefg zbzrag bs rnpu qvfgevohgvba fubhyq qb. (Sbe guvf ernfba V qvfnterr gung gur Jvxv negvpyr vzcyvpvgyl nffhzrf vasvavgr fnzcyr fvmr. Gur pbaqvgvbany cebonovyvgvrf hfrq va gur pnyphyngvba ner gur svefg zbzragf (= pbafgnagf) bs gur erfcrpgvir cnenzrgre qvfgevohgvbaf, abg gur cnenzrgref gurzfryirf (= enaqbz inevnoyrf).)
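For readers who have decoded the spoiler, here is a quick Monte Carlo check of the claim; the beta hyperparameters and machine priors below are made-up assumptions, and the comments spoil the exercise:

```python
# Spoiler check: compare P(A_1 | defective) computed by integrating over the
# full hyperparameter distributions with the answer from first moments alone.
# All beta parameters and machine priors here are made-up assumptions.
import numpy as np

rng = np.random.default_rng(0)
prior = np.array([0.2, 0.3, 0.5])             # P(A_1), P(A_2), P(A_3)
ab = [(5.5, 95.5), (3.5, 97.5), (1.5, 99.5)]  # beta hyperparameters per tool

# Full treatment: sample the defect rates theta_j from their beta
# distributions and average numerator and denominator separately.
theta = np.column_stack([rng.beta(a, b, size=1_000_000) for a, b in ab])
joint = prior * theta                         # P(A_j) * theta_j per sample
full = joint[:, 0].mean() / joint.sum(axis=1).mean()

# Moment shortcut: plug in the first moments E[theta_j] = a / (a + b).
means = np.array([a / (a + b) for a, b in ab])
shortcut = prior[0] * means[0] / (prior * means).sum()

print(full, shortcut)   # the two agree up to Monte Carlo error
```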
In the introductory example in the Wikipedia article on Bayes' theorem, they start out with a prior distribution P(machine_ID | faulty_product)* and then update it using a likelihood distribution P(faulty_product | machine_ID) to obtain a posterior distribution for P(machine_ID | faulty_product).
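A minimal sketch of that discrete update (the output shares and defect rates below are stand-in numbers, not quoted from the article):

```python
# Sketch of the Wikipedia-style update: which machine made a faulty product?
# The shares and defect rates below are stand-in numbers.
machines = ["A_1", "A_2", "A_3"]
prior = [0.2, 0.3, 0.5]          # P(machine_ID): each machine's share of output
likelihood = [0.05, 0.03, 0.01]  # P(faulty_product | machine_ID)

# Bayes' theorem: posterior is proportional to prior * likelihood,
# normalized over all machines.
joint = [p * l for p, l in zip(prior, likelihood)]
evidence = sum(joint)            # P(faulty_product)
posterior = [j / evidence for j in joint]

for m, p in zip(machines, posterior):
    print(f"P({m} | faulty_product) = {p:.3f}")
```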
How did they come up with the likelihood distribution? Maybe they sampled 100 products from each machine and for each sample counted the number of faulty products. Maybe they sampled 1,000,000 products from each machine...
We don't know which sample size was used: the likelihood distribution doesn't reveal this. Thus the sample size doesn't influence the weight of the Bayesian update. But shouldn't it? An update based on an uncertain likelihood distribution should carry little weight, and vice versa. How do I make the Bayesian update reflect this?
I read the links provided by somervta in the 'Error margins' discussion from yesterday, but I'm not skillful enough to adapt them to this example.
* Technically they just make the prior distribution a clone of the distribution P(machine_ID), but I like to keep the identity across the Bayesian update, so I gave the prior and the posterior distribution the same form: P(machine_ID | faulty_product).