What does your accuracy tell you about your confidence interval?

HonoreDB

Yvain's 2011 Less Wrong Census/Survey is still ongoing throughout November, 2011. If you haven't taken it, please do before reading on, or at least write down your answers to the calibration questions so they won't get skewed by the following discussion.

The survey includes these questions:

Calibration Year: Without checking a source, in what year do you estimate [redacted event happened]?

Calibration Answer: Without checking a source, estimate the probability that the answer you just gave is within 15 years either way of the correct answer.

In the comments, several people including myself wondered what our level of accuracy in the first question said about the calibration of our answer to the second question. If your guess for the first question was really close to correct, but your probability for the second question was low, were you underconfident? If you were far off, but your probability was high, were you overconfident?

We could test our calibration by simply answering a lot of these pairs of questions, then applying a proper scoring rule. But that seems like throwing out information. Surely we could calibrate faster if we're allowed to use our accuracy as evidence?
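For concreteness, here is a rough sketch of the "lots of questions plus a proper scoring rule" approach. I'm assuming the Brier score as the scoring rule (the post doesn't pick one), and the function name and the sample answers are made up for illustration.

```python
def brier_score(forecasts):
    """Mean squared gap between stated probability and outcome.

    forecasts: list of (p, hit) pairs, where p is the probability you gave
    that your guess was within the interval, and hit is whether it was.
    Lower is better; always answering 0.5 scores 0.25.
    """
    total = 0.0
    for p, hit in forecasts:
        outcome = 1.0 if hit else 0.0
        total += (p - outcome) ** 2
    return total / len(forecasts)

# Hypothetical answers to five calibration question pairs.
example = [(0.85, True), (0.60, False), (0.90, True), (0.50, True), (0.70, False)]
print(brier_score(example))
```

Note that this only uses whether each guess landed inside the interval, not how far off it was, which is the information the post is worried about throwing away.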

I suspect there are people on here with the tools to work this out trivially. Here's my try at it:

Suppose you state a p-confidence interval of ±a around your guess x of the true value X. Then you find that, actually, |X - x| = b. What does this say about your confidence interval?

As a first approximation, we can represent your confidence interval as a claim that the answer is uniformly randomly placed within an interval of total width a/p, and that you have guessed uniformly within the same interval. If this is the case, the distance between your guess and the answer follows a triangular distribution, and your guess should on average be (1/3)(a/p) off. It should be off by an amount in the range (1/3 ± 3/16)(a/p) half the time. Measured in units of a/p, the error should be less than (1/3)(3 - sqrt(6)), or about .18, 1/3 of the time, and greater than 1 - 1/sqrt(3), or about .42, 1/3 of the time.
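These numbers can be checked with a short sketch. It assumes the error, measured in units of a/p, is the distance between two independent uniform points on a unit interval, which has density 2(1 - t) and CDF 2t - t^2 on [0, 1]. The function names are mine, and the simulation is only a sanity check on the mean.

```python
import math
import random

def triangular_cdf(t):
    """CDF of |U1 - U2| for independent U1, U2 ~ Uniform(0, 1)."""
    return 2 * t - t * t

def triangular_quantile(q):
    """Inverse CDF: solve 1 - (1 - t)^2 = q for t in [0, 1]."""
    return 1 - math.sqrt(1 - q)

# Tertile boundaries: the error falls below `lower` a third of the time
# and above `upper` a third of the time.
lower = triangular_quantile(1 / 3)   # equals (3 - sqrt(6)) / 3
upper = triangular_quantile(2 / 3)   # equals 1 - 1 / sqrt(3)

# Probability mass in the range 1/3 +/- 3/16.
middle_mass = triangular_cdf(1 / 3 + 3 / 16) - triangular_cdf(1 / 3 - 3 / 16)

def simulate_mean(n=100_000, seed=0):
    """Monte Carlo estimate of the mean distance between two uniform points."""
    rng = random.Random(seed)
    return sum(abs(rng.random() - rng.random()) for _ in range(n)) / n

print(lower, upper, middle_mass, simulate_mean())
```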

So, here's a rule of thumb for evaluating your confidence intervals based on how close you're getting to the actual answer. Again, a is the radius of your interval, and p is the probability you assigned that the answer is in that interval.

1. Determine how far you were off, divide by a, and multiply by p.

2. If your result is less than .18 more than a third of the time, you're being underconfident. If your result is greater than .42 more than a third of the time, you're being overconfident.

In my case, I was 2 years off, and estimated a probability of .85 that I was within 15 years. So my result is 2/15 * .85 = .11333... That's less than the lower threshold. If I find this happening more than 1/3 of the time, I'm being underconfident.
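The two steps can be sketched in code as follows. The names `scaled_error` and `assess` are hypothetical. Since the rule doesn't say what to conclude when both conditions hold at once, the sketch reports each flag separately rather than a single verdict.

```python
# Thresholds from the rule of thumb (rounded tertiles of the triangular model).
LOWER = 0.18
UPPER = 0.42

def scaled_error(b, a, p):
    """Step 1: how far off you were (b), divided by the radius a, times p."""
    return b / a * p

def assess(results):
    """Step 2: compare the share of low and high results against one third."""
    n = len(results)
    frac_low = sum(r < LOWER for r in results) / n
    frac_high = sum(r > UPPER for r in results) / n
    return {
        "frac_low": frac_low,
        "frac_high": frac_high,
        "underconfident": frac_low > 1 / 3,
        "overconfident": frac_high > 1 / 3,
    }

# The example from the post: 2 years off, 85% confident of being within 15 years.
my_result = scaled_error(b=2, a=15, p=0.85)
print(my_result, my_result < LOWER)
```

A single result like this says very little on its own; `assess` is only meaningful on a list of results from many question pairs.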

Can anybody suggest a better system?

The way this is typically done is by eliciting more than two numbers to build the distribution out of. For example, I might ask you for a date so early that you think there's only a 5% chance it happened before that date, then a date so late that you think there's only a 5% chance it happened after that date, then try to figure out the tertiles or quartiles.

Notice that I worked from the outside in: when people try to come up with a central estimate and then imagine variance around that central estimate, as in Yvain's elicitation, they do significantly worse than if guided by a well-designed process. (You can see an example of an expert elicitation process here.)

Once you've done this, you've got more detailed bins, and you can evaluate the bin populations. ("Hm, I only have 10% in my lower tertile; I ought to adjust my estimates downwards.")
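As a rough illustration of evaluating bin populations, here is a sketch that assumes each question comes with your elicited tertile boundaries (two dates) plus the true answer. The function name and the sample data are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

def bin_populations(questions):
    """Fraction of true answers landing in each elicited bin.

    questions: list of (boundaries, truth), where boundaries is an ascending
    sequence of elicited quantile values (two values for tertiles) and truth
    is the correct answer. Bin 0 is below the first boundary, and so on.
    Assumes every question uses the same number of boundaries.
    """
    counts = Counter(bisect_right(boundaries, truth) for boundaries, truth in questions)
    n_bins = len(questions[0][0]) + 1
    return [counts[i] / len(questions) for i in range(n_bins)]

# Hypothetical tertile boundaries and true years for four questions.
sample = [
    ((1900, 1950), 1890),
    ((1900, 1950), 1920),
    ((1900, 1950), 1960),
    ((1800, 1850), 1700),
]
print(bin_populations(sample))
```

If your tertiles were well calibrated, each fraction should approach 1/3 as the number of questions grows; with only a handful of questions the fractions are very noisy.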

People often fit distributions based on elicited values, but they'll talk a lot with the experts about shape, to make sure it fits the expert's beliefs. (They tend to use things a lot more sophisticated than uniforms, generally chosen so that Bayesian updates are convenient.) I don't think I've seen much of that in the domain of calibration, though.

[edit] You could use that fitting procedure to produce a more precise estimate of your p, and then use that in your proper scoring rule to determine your score in negentropy, and so this could be useful for calibration. While I think this could increase precision in your calibration measurement, I don't know if it would actually improve the accuracy of your calibration measurement. When doing statistics, it's hard to make up for lack of data through use of clever techniques.

Thanks for that link, and for pointing out the technique which seems like a good hack. (In the nice sense of the word.)
