Calibrating your probability estimates of world events: Russia vs Ukraine, 6 months later.

Shmi

Some of the comments on the link by James_Miller exactly six months ago provided very specific estimates of how the events might turn out:

James_Miller:

The odds of Russian intervening militarily = 40%.
The odds of the Russians losing the conventional battle (perhaps because of NATO intervention) conditional on them entering = 30%.
The odds of the Russians resorting to nuclear weapons conditional on them losing the conventional battle = 20%.

Me:

"Russians intervening militarily" could be anything from posturing to weapon shipments to a surgical strike to a Czechoslovakia-style tank-roll or Afghanistan invasion. My guess is that the odds of the latter are below 5%.

A bet between James_Miller and solipsist:

I will bet you $20 U.S. (mine) vs $100 (yours) that Russian tanks will be involved in combat in the Ukraine within 60 days. So in 60 days I will pay you $20 if I lose the bet, but you pay me $100 if I win.
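To see what those stakes imply, here is a minimal sketch of the break-even probability for a bet that risks $20 to win $100. The function name `break_even_probability` is made up for illustration, and the calculation assumes risk neutrality and ignores counterparty risk.

```python
from fractions import Fraction


def break_even_probability(stake: Fraction, payout: Fraction) -> Fraction:
    """Win probability at which risking `stake` to win `payout` has zero expected value.

    Solves p * payout - (1 - p) * stake = 0 for p.
    """
    return stake / (stake + payout)


# Stakes from the quoted bet: $20 at risk against $100.
p = break_even_probability(Fraction(20), Fraction(100))
print(p, f"{float(p):.3f}")
```

Offering this bet is favorable only if you put the probability of tanks in combat within 60 days above the break-even value; accepting it is favorable only if you put it below.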

While it is hard to do any meaningful calibration based on a single event, there must be lessons to learn from it. Given that Russian armored columns are reported to be capturing key Ukrainian towns today, the first part of James_Miller's prediction has come true, even if it took 3 times longer than the 60 days he bet on.

Note that even the most pessimistic person in that conversation (James) was probably too optimistic. My estimate of 5% appears way too low in retrospect, and I would probably bump it to 50% for a similar event in the future.

Now, given that the first prediction came true, how would one reevaluate the odds of the two further escalations he listed? I still feel that there is no way there will be a "conventional battle" between Russia and NATO, but having just been proven wrong makes me doubt my assumptions. If anything, maybe I should give more weight to what James_Miller (or at least Dan Carlin) has to say on the issue. And if I had any skin in the game, I would probably be even more cautious.
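As a rough sketch of that reevaluation, the three quoted numbers can be chained. This assumes the two conditional estimates are left unchanged once the first event is taken as having happened, and that the quoted path is the only route to each escalation; both are simplifying assumptions, not something stated in the original comment. The helper `chain` is a hypothetical name.

```python
from fractions import Fraction

# Estimates quoted from James_Miller's comment.
p_intervene = Fraction(40, 100)
p_lose_given_intervene = Fraction(30, 100)
p_nuclear_given_lose = Fraction(20, 100)


def chain(*conditionals: Fraction) -> Fraction:
    """Multiply a sequence of conditional probabilities (chain rule)."""
    result = Fraction(1)
    for p in conditionals:
        result *= p
    return result


# Six months ago, before knowing whether Russia would intervene.
prior_lose = chain(p_intervene, p_lose_given_intervene)
prior_nuclear = chain(p_intervene, p_lose_given_intervene, p_nuclear_given_lose)

# Now, treating the intervention as having happened (its probability becomes 1).
post_lose = chain(p_lose_given_intervene)
post_nuclear = chain(p_lose_given_intervene, p_nuclear_given_lose)

for label, value in [
    ("P(lose), prior", prior_lose),
    ("P(nuclear), prior", prior_nuclear),
    ("P(lose), given intervention", post_lose),
    ("P(nuclear), given intervention", post_nuclear),
]:
    print(f"{label}: {float(value):.3f}")
```

Under these assumptions, conditioning on the intervention multiplies both downstream probabilities by 1/0.4 = 2.5. Whether the conditional estimates themselves should also move is exactly the open question.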


Maybe a simple example will help. Suppose I have an urn with 100 balls in it. Each ball is either red, yellow or blue. There are, let's say, five different hypotheses about the distribution of colors in the urn - H1, H2, H3, H4 and H5 -- and we're interested in figuring out which hypothesis is correct. The experiment we're conducting is drawing a single ball from the urn and noting its color. I get a new urn after each individual experiment.

There are obviously three possible outcomes for this experiment, and the frequentist will associate a confidence interval with each outcome. The confidence interval for each outcome will be some set of hypotheses (so, for instance, the confidence interval for "yellow" might be {H2, H4}). These intervals are constructed so that, as the experiment is repeated, in the long run the obtained confidence interval will contain the correct hypothesis at least X% of the time (where X is decided by the experimenter). So, for instance, if I use 95% confidence intervals, then in at least 95% of the experiments I conduct the correct hypothesis will be included in the confidence interval associated with the outcome I obtain.

In other words, if I say, after each experiment, "The correct hypothesis is one of these", and point at the confidence interval I obtained in that experiment, then I will be right 95% of the time. The other 5% of the time I may be wrong, perhaps even obviously wrong.

As a contrived example, suppose each urn I am given contains exactly 5 red balls out of its 100. Also suppose the confidence interval I associate with "red" is the empty set, and the confidence interval I associate with both "yellow" and "blue" is the set containing all five hypotheses (H1 through H5). Now as I repeat the experiment over and over again, 95% of the time I will get either a yellow or a blue ball, and I will point at the set containing all hypotheses and say "The correct hypothesis is one of these", and I will be trivially, obviously right. On the other hand, 5% of the time I will get a red ball, and I will point at the empty set and say "The correct hypothesis is one of these", and I will be trivially, obviously wrong. But since the red ball only shows up 5% of the time, I will still end up being right 95% of the time. This means that the empty set is actually a kosher 95% confidence interval for the outcome "red", even though I know the empty set cannot possibly include the correct hypothesis.
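The coverage claim in this contrived example can be checked exactly. The sketch below assumes an urn with 5 red balls, as in the text; the 50/45 split between yellow and blue is made up, since only the red count matters here. The names `CONFIDENCE_SET` and `coverage` are hypothetical.

```python
from fractions import Fraction

HYPOTHESES = frozenset({"H1", "H2", "H3", "H4", "H5"})

# The confidence procedure from the text: the empty set for "red",
# all five hypotheses for "yellow" and "blue".
CONFIDENCE_SET = {
    "red": frozenset(),
    "yellow": HYPOTHESES,
    "blue": HYPOTHESES,
}

# Assumed urn: 5 red balls as in the text; the yellow/blue split is illustrative.
urn = {"red": 5, "yellow": 50, "blue": 45}


def coverage(urn: dict, true_hypothesis: str) -> Fraction:
    """Long-run fraction of draws whose confidence set contains the true hypothesis."""
    total = sum(urn.values())
    return sum(
        Fraction(count, total)
        for color, count in urn.items()
        if true_hypothesis in CONFIDENCE_SET[color]
    )


# Coverage is the same whichever hypothesis happens to be the true one.
for h in sorted(HYPOTHESES):
    print(h, coverage(urn, h))
```

The procedure reaches 95% coverage for every hypothesis, even though the set it reports after a red draw can never contain the truth.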

The Bayesian doesn't like this. She wants intervals that make sense in every particular case. She wants to be able to look at the list of hypotheses in a 95% interval and say "There's a 95% chance that the correct hypothesis is one of these". Confidence intervals cannot guarantee this. As we have seen, the empty set can be a legitimate 95% confidence interval, and it's obvious that the chance of the correct hypothesis being part of the empty set is not 95%. This is why the Bayesian uses credible intervals.

Unlike confidence intervals, with a 95% credible interval you get a list at which you can point and say "There's a 95% chance that one of these is the correct hypothesis". And this claim will make sense in every particular instance. Moreover, if your priors are correct (whatever that means), then it is guaranteed that there is a 95% chance that the correct hypothesis is in your 95% credible interval.
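For contrast, here is a minimal sketch of a 95% credible set over five discrete hypotheses. The urn compositions in `URNS` and the uniform prior are invented purely for illustration; nothing in the discussion above specifies them. The set is built greedily from the most probable hypotheses, which is one common convention rather than the only one.

```python
from fractions import Fraction

# Hypothetical urn compositions (counts out of 100); assumptions for illustration.
URNS = {
    "H1": {"red": 90, "yellow": 5, "blue": 5},
    "H2": {"red": 5, "yellow": 90, "blue": 5},
    "H3": {"red": 5, "yellow": 5, "blue": 90},
    "H4": {"red": 40, "yellow": 30, "blue": 30},
    "H5": {"red": 1, "yellow": 49, "blue": 50},
}

# Assumed uniform prior over the hypotheses.
PRIOR = {h: Fraction(1, len(URNS)) for h in URNS}


def posterior(color: str) -> dict:
    """Bayes' rule: posterior over hypotheses after drawing one ball of `color`."""
    joint = {h: PRIOR[h] * Fraction(URNS[h][color], 100) for h in URNS}
    evidence = sum(joint.values())
    return {h: weight / evidence for h, weight in joint.items()}


def credible_set(color: str, level: Fraction = Fraction(95, 100)):
    """Smallest greedy set of hypotheses whose posterior mass reaches `level`."""
    post = posterior(color)
    chosen, mass = [], Fraction(0)
    # Most probable first; ties are broken by hypothesis name.
    for h in sorted(post, key=lambda name: (-post[name], name)):
        chosen.append(h)
        mass += post[h]
        if mass >= level:
            break
    return chosen, mass


for color in ("red", "yellow", "blue"):
    hypotheses, mass = credible_set(color)
    print(color, hypotheses, f"{float(mass):.3f}")
```

Because the set is built from posterior mass, it can never be empty: whatever is drawn, the reported set carries at least 95% of the posterior probability under the assumed prior.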

Upvoted -- thanks for a long, even if not fully even-handed, reply (also, explaining CIs using a discrete set of hypotheses is perhaps not the most transparent approach). I will try to give an example with a continuous-valued parameter.

Say we want to estimate the mean height of LW posters. Ignoring the issue of sock puppets for the moment, we could pick LW usernames out of a hat, show up at the house of the person with that username, and measure their height. Say we do that for 100 LW users we picked randomly, and take an average, call it X1. The 100 users are a "...
