For the first time ever I’ve had reason to use formal Bayesian statistics in my work. I feel this is a cause for streamers and confetti.
However, I’ve got a bit stuck on analysing my confidence levels and I thought what better place than lesswrong to check that I’m making sense. I’m not sure this is strictly what lesswrong is for, but the site says that this is my own personal blog so I guess it’s ok?! I can always be down-voted to hell if not!
***
I’m trying to calculate the estimated life before failure of a particular component.
We’ve done a number of tests and the results show a larger-than-expected variance, with some components not failing even after an extended lifetime. I’m trying to analyse the results to see which probabilistic failure distribution best fits the available data. I have three candidate distributions (Weibull, Log-normal & Birnbaum-Saunders), each of which has a shape parameter and a scale parameter.
For each distribution I created a grid which samples the possible values of these parameters. I’ve given the parameters a log-uniform prior by giving each sampled parameter pair a uniform prior but sampling the parameters geometrically (i.e. each sampled value of a parameter is a fixed multiple of the previous value). I’ve tried other priors and the results seem fairly robust to the choice of prior.
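To make that concrete, the grid looks roughly like the Python sketch below (the spreadsheet does the equivalent); the parameter ranges and grid size here are placeholders rather than the real values.

```python
import numpy as np

# Geometric (log-uniform) sampling of the two parameters; ranges and
# resolution are illustrative placeholders only.
n_shape, n_scale = 50, 50
shapes = np.geomspace(0.5, 5.0, n_shape)             # shape parameter
scales = np.geomspace(1_000.0, 100_000.0, n_scale)   # scale parameter (hours)

# Uniform prior over the sampled (shape, scale) pairs
prior = np.full((n_shape, n_scale), 1.0 / (n_shape * n_scale))
```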
For components which have failed, P(E|H) is the value of the probability density function at the number of hours before failure.
For components which reach a certain age and do not fail, P(E|H) is 1 minus the cumulative distribution function at that number of hours (i.e. the probability of surviving at least that long).
This is implemented in a spreadsheet with a tab for each test result. Each tab updates the prior probability into a posterior probability, which then serves as the prior for the next tab. The result is normalised to give a total probability of 1.
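In Python terms, a rough sketch of what the spreadsheet is doing for one of the three families (Weibull here); taking the product over all tests in one go gives the same posterior as updating tab by tab:

```python
import numpy as np
from scipy.stats import weibull_min

def grid_posterior(shapes, scales, prior, failures, censored):
    """failures: hours to failure for failed parts; censored: hours survived
    by parts that have not yet failed. Returns a normalised posterior grid."""
    post = prior.copy()
    for i, shape in enumerate(shapes):
        for j, scale in enumerate(scales):
            like = np.prod(weibull_min.pdf(failures, shape, scale=scale))
            like *= np.prod(weibull_min.sf(censored, shape, scale=scale))  # sf = 1 - CDF
            post[i, j] *= like
    return post / post.sum()  # normalise so the grid sums to 1
```

(For a large number of tests it would be safer to sum log-likelihoods to avoid underflow, but the logic is the same.)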
Initially I calculate the expected life of the worst component in 1,000,000. For this, I just evaluate the inverse cumulative distribution function at p = 0.000001 for each of the candidate probability distributions.
The results of this calculation are multiplied by the final probabilities of each distribution being the correct one, and I sum over the entire hypothesis space to give the expected life of the worst component in a million.
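For a single family, that calculation looks roughly like this sketch (illustrative names only); the full figure repeats it for each of the three families and weights by how likely each family is:

```python
import numpy as np
from scipy.stats import weibull_min

def expected_worst_in_million(shapes, scales, posterior, p=1e-6):
    """Posterior-weighted inverse CDF at p = 1e-6 over one family's grid."""
    total = 0.0
    for i, shape in enumerate(shapes):
        for j, scale in enumerate(scales):
            total += posterior[i, j] * weibull_min.ppf(p, shape, scale=scale)
    return total
```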
So my first question – is all of the above sound or have I made some silly mistake in my logic?
***
The part that I’m less confident about is how to analyse my 95% confidence level of the life of the worst component in 1,000,000.
The obvious approach is to calculate my expected value for the worst component in 20,000,000: for any given million components that I select, there is a 5% chance that the selection contains the worst in 20,000,000. This treats 95% as my confidence derived from the overall weighted model.
Alternatively, I can treat the 95% as referring to my confidence in which probability distribution is the correct one. In this case, after normalising the probabilities so that the sum over all hypotheses is 1, I delete the least likely hypotheses one at a time until the sum of the remaining hypotheses is < 0.95. The last hypothesis deleted marks the top end of my 95% confidence level.
Now if I calculate the expected life of the worst component in 1,000,000 for that individual hypothesis, I think I can argue that this also represents my 95% confidence level of the worst component in 1,000,000, just arrived at in a different way.
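In sketch form, the two readings look like this (assuming, purely for brevity, that every hypothesis is a Weibull with its own shape, scale and posterior probability; in reality the hypothesis space also spans the other two families):

```python
from scipy.stats import weibull_min

# hypotheses: list of (posterior_probability, shape, scale) over the whole grid

def conf_overall_model(hypotheses, n=20_000_000):
    # Reading 1: expected life of the worst component in 20,000,000,
    # averaged over the full weighted model.
    return sum(p * weibull_min.ppf(1.0 / n, shape, scale=scale)
               for p, shape, scale in hypotheses)

def conf_which_distribution(hypotheses, mass=0.95, n=1_000_000):
    # Reading 2: delete the least likely hypotheses until the remaining
    # probability mass drops below 0.95; the last hypothesis deleted marks
    # the boundary, and its worst-in-a-million is the reported figure.
    ordered = sorted(hypotheses, key=lambda h: h[0])  # least likely first
    remaining = sum(p for p, _, _ in ordered)
    for p, shape, scale in ordered:
        remaining -= p
        if remaining < mass:
            return weibull_min.ppf(1.0 / n, shape, scale=scale)
```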
Is either of these better than the other? Is there an alternative definition of confidence level which I should be using?
The confidence-in-which-distribution version gives much more conservative answers, particularly when the number of tests is small; the confidence-from-overall-model version is much more forgiving of having fewer tests. Even after only a couple of tests the latter gives a 95% confidence level relatively close to the expected value, whereas the confidence-in-which-distribution version stays further from the expected value until a larger number of tests has been performed.
The more conservative behaviour seems more realistic to me, but I don’t have a proper argument to actually justify a decision either way.
Any help warmly welcomed.
I'll walk through how I'd analyze this problem; let me know if I haven't answered your questions by the end.
First, problem structure. You have three unknowns, which you want to estimate from the data: a shape parameter, a scale parameter, and an indicator of which model is correct.
"Which model is correct" is probably easiest to start with. I'm not completely sure that I've followed what your spreadsheet is doing, but if I understand correctly, then that's probably an overly-complicated way to tackle the problem. You want P(model|data), which will be determined via Bayes' Rule by P(data|model) and your prior probability for each model being correct. The prior is unlikely to matter much unless your dataset is tiny, so P(data|model) is the important part. That's an integral:
$$P(\text{data} \mid \text{model}) = \int_{\mu,\sigma} \prod_i P(x_i \mid \mu, \sigma, \text{model}) \, dP(\mu, \sigma)$$
In your case, you're approximating that integral with a grid over μ and σ (dP(μ,σ) is shorthand for ρ(μ,σ)dμdσ here). Rather than whatever you're doing with the tab-by-tab updates, you can probably just take the product of P(x_i|μ,σ,model) over the whole dataset, where x_i is the lifetime of the ith component (using the survival function 1 − CDF in place of the density for components that haven't failed), then sum over the grid. (If you were dealing with data streaming in online, then you would need the sequential-update approach.)
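For concreteness, here's a rough sketch of that grid approximation using scipy's weibull_min, lognorm and fatiguelife (Birnbaum-Saunders) families. The data, grid ranges and equal model priors below are placeholders, and I've kept the survival-function terms for the unfailed components, as in your spreadsheet:

```python
import numpy as np
from scipy.stats import weibull_min, lognorm, fatiguelife  # fatiguelife = Birnbaum-Saunders

# Placeholder data and grid, purely for illustration
failures = np.array([1200.0, 3400.0, 5100.0, 800.0])   # hours to failure
censored = np.array([6000.0, 6000.0])                   # hours survived without failure
shapes = np.geomspace(0.5, 5.0, 50)
scales = np.geomspace(500.0, 50_000.0, 50)

def marginal_likelihood(dist, failures, censored, shapes, scales):
    """Grid approximation to P(data|model): average the likelihood of the data
    over a uniform prior on the sampled (shape, scale) pairs."""
    weight = 1.0 / (len(shapes) * len(scales))
    total = 0.0
    for shape in shapes:
        for scale in scales:
            like = np.prod(dist.pdf(failures, shape, scale=scale))
            like *= np.prod(dist.sf(censored, shape, scale=scale))  # censored parts
            total += weight * like
    return total

# P(model|data) is proportional to P(data|model) * P(model); equal model priors here
models = {"Weibull": weibull_min, "Log-normal": lognorm, "Birnbaum-Saunders": fatiguelife}
evidence = {name: marginal_likelihood(d, failures, censored, shapes, scales)
            for name, d in models.items()}
total = sum(evidence.values())
for name, e in evidence.items():
    print(f"P({name}|data) ~ {e / total:.3f}")
```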
That takes care of the model part. Once that's done, you'll probably find that one model is like a gazillion times more likely than the other two, and you can just throw out the other two.
On to the 95% CI for the worst part in a million.
The distribution you're interested in here is P(x_worst|model,data). x_worst is an order statistic: the minimum of a million draws. Its distribution follows from the CDF F(x) for any old point x: the survival function of the minimum is (1 − F(x)) raised to the power of 1,000,000; read up on order statistics to see exactly what expressions to use. So if we wanted to do this analytically, we'd first compute P(x|model,data) via Bayes' Rule:
$$P(x \mid \text{model}, \text{data}) = \frac{P(\text{data} \cup \{x\} \mid \text{model})}{P(\text{data} \mid \text{model})}$$
... where both pieces on the right would involve our integral from earlier. Basically, you imagine adding one more point x to the dataset and see what that would do to P(data|model). If we had a closed-form expression for that distribution, we could raise its survival function to the millionth power to get a closed-form expression for the worst-of-a-million order statistic, and from there we'd get a 95% CI in the usual way.
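Just to show the mechanics of that last step, here's what it looks like if the parameters were pinned down exactly (placeholder Weibull parameters; this is the simplification I come back to below):

```python
from scipy.stats import weibull_min

N = 1_000_000
dist = weibull_min(1.8, scale=20_000)   # placeholder shape and scale (hours)

def worst_of_n_quantile(q, dist, n):
    # P(x_worst > t) = (1 - F(t))**n, so the q-quantile of the minimum is the
    # (1 - (1 - q)**(1/n))-quantile of a single component's lifetime.
    return dist.ppf(1.0 - (1.0 - q) ** (1.0 / n))

lo, hi = worst_of_n_quantile(0.025, dist, N), worst_of_n_quantile(0.975, dist, N)
print(f"95% interval for the worst of {N:,}: {lo:.1f} to {hi:.1f} hours")
```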
In practice, that's probably difficult, so let's talk about how to approximate it numerically.
First, the order statistic part. As long as we can sample from the posterior distribution P(x|model,data), that part's easy: generate a million samples of x, take the worst, and you have a sample of x_worst. Repeat that process a bunch of times to compute the 95% CI. (This is not the same as the worst component in 20M, but it's not any harder to code up.)
Next, the posterior distribution for x. This is going to be driven by two pieces: uncertainty in the parameters μ and σ, and random noise from the distribution itself. If the dataset is large enough, then the uncertainty in μ and σ will be small, so the distribution itself will be the dominant term. In that case, we can just find the best-fit (i.e. maximum a posteriori) estimates of μ and σ, and then declare that P(x|model,data) is approximately the standard distribution (Weibull, log-normal, etc.) with those exact parameters. Presumably we can sample from any of those distributions with known parameters, so we go do the order statistic part and we're done.
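A brute-force version of that, with made-up MAP estimates plugged in (shrink N or the repetition count for a quick test run):

```python
import numpy as np
from scipy.stats import weibull_min

rng = np.random.default_rng(0)
shape_map, scale_map = 1.8, 20_000   # placeholder MAP estimates (hours)
N, reps = 1_000_000, 1_000

# Each repetition: simulate a million lifetimes and keep the worst (minimum)
worst = np.array([
    weibull_min.rvs(shape_map, scale=scale_map, size=N, random_state=rng).min()
    for _ in range(reps)
])
lo, hi = np.percentile(worst, [2.5, 97.5])
print(f"95% CI for the worst of {N:,}: {lo:.1f} to {hi:.1f} hours")
```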
If the uncertainty in μ and σ is not small enough to ignore, then the problem gets more complicated - we'll need to sample from the posterior distribution P(μ,σ|model,data). At that point we're in the land of Laplace approximation and MCMC and all that jazz; I'm not going to walk through it here, because this comment is already really long.
So one last thing to wrap it up. I wrote all that out because it's a great example problem of how to Bayes, but there's still a big problem at the model level: the lifetime of the millionth-worst component is probably driven by qualitatively different processes than the vast majority of other components. If some weird thing happens one time in 10,000 and causes problems in the components, then a best-fit model of the whole dataset probably won't pick it up at all. Nice-looking distributions like Weibull or log-normal just aren't good at modelling two qualitatively different behaviors mixed into the same dataset. There's probably some standard formal way of dealing with this kind of thing - I hear that "Rare Event Modelling" is a thing, although I know nothing about it - but the fundamental problem is just getting any relevant data at all. If we only have a hundred thousand data points, and we think that the millionth-worst is driven by qualitatively different processes, then we have zero data on the millionth-worst, full stop. On the other hand, if we have a billion data points, then we can just throw out all but the worst few thousand and analyse only those.
Ok, that sounds right.