Vaniver comments on Open Thread, Jul. 27 - Aug 02, 2015 - Less Wrong Discussion
If we don't agree about what it is, it will be very difficult to agree how to evaluate it!
Surely it makes sense to use averages to determine the probability of being correct for any given confidence level. If I've grouped together 8 predictions and labeled them "80%", and 4 of them are correct and 4 of them are incorrect, it seems sensible to describe my correctness at my "80%" confidence level as 50%.
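For concreteness, this grouping can be sketched as follows (a minimal illustration; the function name is mine, not a standard API):

```python
# Hypothetical sketch: per-confidence-level correctness.
# Each prediction is a (stated_confidence, was_correct) pair.
from collections import defaultdict

def correctness_by_confidence(predictions):
    """Group predictions by stated confidence and return each group's hit rate."""
    buckets = defaultdict(list)
    for confidence, correct in predictions:
        buckets[confidence].append(correct)
    return {c: sum(v) / len(v) for c, v in sorted(buckets.items())}

# Eight predictions labeled "80%", four right and four wrong:
preds = [(0.8, True)] * 4 + [(0.8, False)] * 4
print(correctness_by_confidence(preds))  # {0.8: 0.5}
```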
If one wants to measure my correctness across multiple confidence levels, it's unclear what aggregation procedure to use, which is why many papers on calibration present the entire graph, along with error bars on each point to make clear how unlikely any particular correctness value is: getting 100% correct at the "80%" level isn't that meaningful if I only used "80%" twice!
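To illustrate the error-bar point, here is a hypothetical sketch (the function name is mine) of how likely a perfectly calibrated forecaster is to go 2-for-2 at the "80%" level by chance alone:

```python
from math import comb

def binom_prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# A perfectly calibrated forecaster gets both of two "80%" predictions
# right about 64% of the time, so a 100% hit rate on n=2 says little:
print(binom_prob_at_least(2, 2, 0.8))  # about 0.64
```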
You may find the Wikipedia page on scoring rules interesting. My impression is that it is difficult to distinguish between skill (an expert's ability to correlate their answer with the ground truth) and calibration (an expert's ability to correlate their reported probability with their actual correctness) with a single point estimate,* but something like the slope that Unnamed discusses here is a solid attempt.
*That is, assuming the expert knows what rule you're using and is incentivized to score well, you also want the rule to be proper: one under which the expert maximizes their expected reward by reporting their true estimate of the probability.
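The properness of the Brier score can be checked numerically with a small sketch (illustrative names, assuming a binary event with true probability 0.7):

```python
def brier(report, outcome):
    """Brier loss for one binary forecast (lower is better)."""
    return (report - outcome) ** 2

def expected_brier(report, true_p):
    """Expected Brier loss when the event occurs with probability true_p."""
    return true_p * brier(report, 1) + (1 - true_p) * brier(report, 0)

true_p = 0.7
scores = {r / 10: expected_brier(r / 10, true_p) for r in range(11)}
best = min(scores, key=scores.get)
print(best)  # 0.7 -- honest reporting minimizes expected loss
```

Algebraically, the expected loss is (r - p)^2 + p(1 - p), which is uniquely minimized at r = p, so the rule is strictly proper.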
Yes, that is precisely the issue for me here. Essentially, you have to specify a loss function and then aggregate it. It's unclear what kind will work best here and what that "best" even means.
Yes, thank you, that's useful.
Notably, Philip Tetlock uses Brier scoring in his Expert Political Judgment project.
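For reference, the mean Brier score over a set of binary forecasts is straightforward to compute (the forecasts below are made-up numbers, not Tetlock's data):

```python
# Mean Brier score over binary forecasts: average squared gap between
# the reported probability and the 0/1 outcome. 0 is perfect, 1 is worst.
forecasts = [(0.9, 1), (0.7, 1), (0.4, 0), (0.8, 0)]  # (reported prob, outcome)
mean_brier = sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)
print(mean_brier)  # 0.225
```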