Vaniver comments on Open Thread, Jul. 27 - Aug 02, 2015 - Less Wrong Discussion
If we don't agree about what it is, it will be very difficult to agree how to evaluate it!
Surely it makes sense to use averages to determine the probability of being correct for any given confidence level. If I've grouped together 8 predictions and labeled them "80%", and 4 of them are correct and 4 of them are incorrect, it seems sensible to describe my correctness at my "80%" confidence level as 50%.
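For concreteness, this grouping can be sketched as follows (a minimal illustration; the function name is mine, not a standard API):

```python
# Hypothetical sketch: per-confidence-level correctness.
# Each prediction is a (stated_confidence, was_correct) pair.
from collections import defaultdict

def correctness_by_confidence(predictions):
    """Group predictions by stated confidence and return each group's hit rate."""
    buckets = defaultdict(list)
    for confidence, correct in predictions:
        buckets[confidence].append(correct)
    return {c: sum(v) / len(v) for c, v in sorted(buckets.items())}

# Eight predictions labeled "80%", four right and four wrong:
preds = [(0.8, True)] * 4 + [(0.8, False)] * 4
print(correctness_by_confidence(preds))  # {0.8: 0.5}
```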
If one wants to measure my correctness across multiple confidence levels, it's unclear what aggregation procedure to use, which is why many papers on calibration present the entire graph, along with error bars on each point to make clear how unlikely any particular correctness value is: getting 100% correct at the "80%" level isn't that meaningful if I only used "80%" twice!
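To illustrate the error-bar point, here is a hypothetical sketch (the function name is mine) of how likely a perfectly calibrated forecaster is to go 2-for-2 at the "80%" level by chance alone:

```python
from math import comb

def binom_prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# A perfectly calibrated forecaster gets both of two "80%" predictions
# right about 64% of the time, so a 100% hit rate on n=2 says little:
print(binom_prob_at_least(2, 2, 0.8))  # about 0.64
```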
You may find the Wikipedia page on scoring rules interesting. My impression is that it is difficult to distinguish between skill (an expert's ability to correlate their answer with the ground truth) and calibration (an expert's ability to correlate their reported probability with their actual correctness) with a single point estimate,* but something like the slope that Unnamed discusses here is a solid attempt.
*That is, assuming the expert knows what rule you're using and is incentivized to score well, you also want the rule to be proper: one under which the expert maximizes their expected reward by reporting their true estimate of the probability.
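The properness of the Brier score can be checked numerically with a small sketch (illustrative names, assuming a binary event with true probability 0.7):

```python
def brier(report, outcome):
    """Brier loss for one binary forecast (lower is better)."""
    return (report - outcome) ** 2

def expected_brier(report, true_p):
    """Expected Brier loss when the event occurs with probability true_p."""
    return true_p * brier(report, 1) + (1 - true_p) * brier(report, 0)

true_p = 0.7
scores = {r / 10: expected_brier(r / 10, true_p) for r in range(11)}
best = min(scores, key=scores.get)
print(best)  # 0.7 -- honest reporting minimizes expected loss
```

Algebraically, the expected loss is (r - p)^2 + p(1 - p), which is uniquely minimized at r = p, so the rule is strictly proper.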
Yes, that is precisely the issue for me here. Essentially, you have to specify a loss function and then aggregate it. It's unclear what kind will work best here and what that "best" even means.
Yes, thank you, that's useful.
Notably, Philip Tetlock uses Brier scoring in his Expert Political Judgment project.
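For reference, the mean Brier score over a set of binary forecasts is straightforward to compute (the forecasts below are made-up numbers, not Tetlock's data):

```python
# Mean Brier score over binary forecasts: average squared gap between
# the reported probability and the 0/1 outcome. 0 is perfect, 1 is worst.
forecasts = [(0.9, 1), (0.7, 1), (0.4, 0), (0.8, 0)]  # (reported prob, outcome)
mean_brier = sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)
print(mean_brier)  # 0.225
```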