For the first time ever I’ve had reason to use formal Bayesian statistics in my work. I feel this is a cause for streamers and confetti.
However, I’ve got a bit stuck on analysing my confidence levels and I thought what better place than LessWrong to check that I’m making sense. I’m not sure this is strictly what LessWrong is for, but the site says that this is my own personal blog so I guess it’s ok?! I can always be down-voted to hell if not!
***
I’m trying to calculate estimated life before failure of a particular component.
We’ve done a number of tests and the results show a larger-than-expected variance, with some components not failing even after an extended lifetime. I’m trying to analyse the results to see which probabilistic failure distribution best suits the available data. I have three different distributions (Weibull, Log-normal & Birnbaum-Saunders), each of which has a shape parameter and a scale parameter.
For each distribution I created a grid which samples the possible values of these parameters. I’ve given the parameters a log-uniform prior by giving each sampled parameter pair a uniform prior while sampling the parameter values geometrically (i.e. each sampled value is a fixed multiple of the previous one). I’ve tried other priors and the results seem fairly robust to the choice of prior.
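For concreteness, here is a rough Python sketch of the kind of grid and prior I mean (the spreadsheet does the equivalent; the parameter ranges and grid sizes below are placeholders rather than my real values):

```
import numpy as np
from scipy import stats

# The three candidate distributions (Birnbaum-Saunders is scipy's "fatiguelife").
families = {"weibull": stats.weibull_min,
            "lognormal": stats.lognorm,
            "birnbaum_saunders": stats.fatiguelife}

# Geometrically spaced grids: each sampled value is a fixed multiple of the previous one.
shapes = np.geomspace(0.2, 5.0, 40)          # shape parameter grid (placeholder range)
scales = np.geomspace(100.0, 100_000.0, 60)  # scale parameter grid in hours (placeholder range)

# Uniform prior over every (family, shape, scale) triple,
# which is log-uniform in the parameters themselves.
n_hypotheses = len(families) * len(shapes) * len(scales)
prior = np.full((len(families), len(shapes), len(scales)), 1.0 / n_hypotheses)
```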
For components which have failed, P(E|H) is the probability density function evaluated at the number of hours to failure.
For components which reach a certain age without failing, P(E|H) is 1 minus the cumulative distribution function (i.e. the survival function) evaluated at that number of hours.
This is implemented in a spreadsheet with a tab for each test result. Each tab updates the prior probability into a posterior probability, which is then used as the prior for the next tab. The result is normalised to give a total probability of 1.
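In code, the likelihood and the tab-by-tab update look roughly like this (a sketch only; test_results is assumed to be a list of (hours, failed) pairs):

```
def likelihood(dist, shape, scale, hours, failed):
    """P(E|H): the pdf at the failure time if the component failed,
    or the survival function (1 - cdf) at its current age if it has not."""
    if failed:
        return dist.pdf(hours, shape, scale=scale)
    return dist.sf(hours, shape, scale=scale)

# One "tab" per test result: multiply by the likelihood, then renormalise.
posterior = prior.copy()
for hours, failed in test_results:
    for i, dist in enumerate(families.values()):
        for j, shape in enumerate(shapes):
            for k, scale in enumerate(scales):
                posterior[i, j, k] *= likelihood(dist, shape, scale, hours, failed)
    posterior /= posterior.sum()  # total probability back to 1
```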
Initially I calculate the expected life of the worst component in 1,000,000. For this, I just use the inverse cumulative probability function with p=0.000001 and calculate this for all of the potential probability distributions.
The result of this calculation for each hypothesis is multiplied by that hypothesis's final posterior probability of being the correct one. I then sum over the entire hypothesis space to give the expected life of the worst component in a million.
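In code, this is just a posterior-weighted sum of each hypothesis's one-in-a-million quantile (continuing the sketch above):

```
# Expected life of the worst component in 1,000,000:
# each hypothesis's inverse CDF at p = 1e-6, weighted by its posterior probability.
p_worst = 1e-6
expected_worst_in_1m = sum(
    posterior[i, j, k] * dist.ppf(p_worst, shape, scale=scale)
    for i, dist in enumerate(families.values())
    for j, shape in enumerate(shapes)
    for k, scale in enumerate(scales))
```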
So my first question – is all of the above sound or have I made some silly mistake in my logic?
***
The part that I’m less confident about is how to analyse my 95% confidence level of the life of the worst component in 1,000,000.
The obvious approach is to calculate my expected value for the worst component in 20,000,000: for any given million components that I select, there is a 5% chance that it contains the worst in 20,000,000. This treats 95% as my confidence from the overall weighted model.
Alternatively, I can treat the 95% as referring to my confidence in which probability distribution is the correct one. In this case, after I normalise my probabilities so that the sum over all of the hypotheses is 1, I start deleting the least likely hypotheses and keep going until the sum of the remaining hypotheses is less than 0.95. The last hypothesis to be deleted is the top end of my 95% confidence level.
Now if I calculate the expected life of the worst component in 1,000,000 for that individual model, I think I can argue that this also represents my 95% confidence level for the worst component in 1,000,000, just arrived at in a different way.
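In code, the two versions look something like this (continuing the sketch above):

```
# Version 1: confidence from the overall weighted model --
# the expected life of the worst component in 20,000,000.
worst_in_20m = sum(
    posterior[i, j, k] * dist.ppf(1.0 / 20_000_000, shape, scale=scale)
    for i, dist in enumerate(families.values())
    for j, shape in enumerate(shapes)
    for k, scale in enumerate(scales))

# Version 2: confidence in which distribution is correct --
# delete the least likely hypotheses until the survivors sum to less than 0.95,
# then take the worst-in-a-million of the last hypothesis deleted.
flat = [(posterior[i, j, k], dist, shape, scale)
        for i, dist in enumerate(families.values())
        for j, shape in enumerate(shapes)
        for k, scale in enumerate(scales)]
flat.sort(key=lambda h: h[0])  # least likely first
remaining = 1.0
for prob, dist, shape, scale in flat:
    remaining -= prob
    if remaining < 0.95:
        worst_in_1m_95 = dist.ppf(1e-6, shape, scale=scale)
        break
```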
Is either of these better than the other? Is there an alternative definition of confidence level which I should be using?
The confidence-in-which-distribution version gives much more conservative answers, particularly when the number of tests is small; the confidence-from-overall-model is much more forgiving of having fewer tests. Even after only a couple of tests the latter gives a 95% confidence level relatively close to the expected value, whereas the confidence-in-which-distribution version remains further away from the expected value until a larger number of tests is performed.
The more conservative behaviour seems more realistic to me, but I don’t have a proper argument to actually justify a decision either way.
Any help warmly welcomed.
Birnbaum-Saunders is an interesting one. For the purposes of fatigue analysis, the assumptions which bring about the three models are:
Weibull - numerous failure modes (of similar failure speed) racing to see which causes the component to fail first
Log-normal - damage per cycle is proportional to current damage
Birnbaum-Saunders - damage per cycle is normally distributed and independent of previous damage
My engineering gut says that this component is probably somewhere between Log-normal and Birnbaum-Saunders (I think proportionality will decay as damage increases) which is maybe why I don't have a clear winner yet.
***
I think I understand now where my original reasoning was incorrect when I was calculating the expected worst in a million. I was just calculating the worst in a million for each model and taking a weighted average of the answers, which meant that bad values from the outlying candidate pdfs were massively suppressed.
I've done some sampling of the worst in a million by repeatedly drawing two random numbers between 0 and 1. I use the first to select a μ,σ combination according to the posterior probability of each pair, and the second as a p-value; I then put those values through the inverse CDF to get an x.
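Roughly, the sampler looks like this (reusing the grid and normalised posterior from the earlier sketch; the number of samples is arbitrary):

```
rng = np.random.default_rng(0)

probs = posterior.ravel()
params = [(dist, shape, scale)
          for dist in families.values()
          for shape in shapes
          for scale in scales]

samples = []
for _ in range(100_000):
    # First random number: pick a (distribution, mu, sigma) hypothesis
    # with probability equal to its posterior.
    dist, shape, scale = params[rng.choice(len(params), p=probs)]
    # Second random number: a uniform p-value pushed through that
    # hypothesis's inverse CDF to give one simulated x.
    samples.append(dist.ppf(rng.random(), shape, scale=scale))
```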
Is this an ok sampling method? I'm not sure if I'm missing something or should be using MCMC. I definitely need to read up on this stuff!
The worst in a million is currently dominated by those occasions where the probability distribution is an outlier. In those cases the p-value doesn't need to be particularly extreme to achieve low x.
I think my initial estimates were based either mainly on uncertainty in p or mainly on uncertainty in μ,σ. The sampling method allows me to account for uncertainty in both which definitely makes more sense. The model seems to react sensibly when I add potential new data so I think I can assess much better now how many data points I require.