Vaniver comments on Open Thread, Jul. 27 - Aug 02, 2015 - Less Wrong Discussion

5 points | Post author: MrMind 27 July 2015 07:16AM

Comment author: Vaniver 28 July 2015 11:41:53PM *  0 points

> My understanding of these terms is that the test-giver, knowing Alice, can forecast which questions she'll mostly be able to answer correctly (those are the easy ones) and which questions she'll mostly be unable to answer correctly (those are the hard ones).

I agree that if Yvain had predicted what percentage of survey-takers would get each question correct before the survey was released, that would be useful as a measure of the questions' difficulty and an interesting analysis. That was not done in this case.

> That makes no sense to me as being an obviously stupid thing to do, but it may be that the original post argued exactly against this kind of stupidity.

The labeling is not obviously stupid--what questions the LW community has a high probability of getting right is a fact about the LW community, not about Yvain's impression of the LW community. Using that label in a calibration analysis does suffer from the issue D_Malik raised, which is why I think Unnamed's analysis is more insightful than Yvain's and why those critiques are valid.

> However, it still seems to me that the example with coins is misleading and that the given example of "perfect calibration" is anything but.

It is according to what calibration means in the context of probabilities. As Unnamed points out, if you are unhappy that we are assigning a property of correct mappings ('calibration') to a narrow mapping ("80%"->80%) instead of a broad mapping ("50%"->50%, "60%"->60%, etc.), it's valid to be skeptical that the calibration will generalize--but that doesn't mean the assessment is uncalibrated.

Comment author: Lumifer 29 July 2015 12:09:29AM 0 points

> It is according to what calibration means in the context of probabilities.

Your link actually doesn't provide any information about how to evaluate or estimate someone's calibration, which is what we are talking about.

> if you are unhappy that we are assigning a property of correct mappings ('calibration') to a narrow mapping

It's not quite that. I'm not happy with this use of averages. I'll need to think more about it, but off the top of my head, I'd look at the average absolute difference between the answer (which is 0 or 1) and the confidence expressed, or maybe the square root of the sum of squares... But don't quote me on that; I'm just thinking aloud here.
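
In code, the two aggregates floated here might look like the following rough sketch, over hypothetical (confidence, outcome) pairs; note that the root-mean-square variant is the square root of what the scoring-rule literature calls the Brier score.

```python
import math

# Hypothetical (confidence, outcome) pairs, outcome being 0 or 1.
predictions = [(0.8, 1), (0.8, 0), (0.6, 1), (0.9, 1)]

# Average absolute difference between the outcome and the stated confidence.
mae = sum(abs(o - c) for c, o in predictions) / len(predictions)

# Root of the mean squared difference (the square root of the Brier score).
rmse = math.sqrt(sum((o - c) ** 2 for c, o in predictions) / len(predictions))

print(mae, rmse)
```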

Comment author: Vaniver 29 July 2015 01:27:26AM *  1 point

> Your link actually doesn't provide any information about how to evaluate or estimate someone's calibration, which is what we are talking about.

If we don't agree about what it is, it will be very difficult to agree on how to evaluate it!

> It's not quite that. I'm not happy with this use of averages.

Surely it makes sense to use averages to determine the probability of being correct for any given confidence level. If I've grouped together 8 predictions and labeled them "80%", and 4 of them are correct and 4 of them are incorrect, it seems sensible to describe my correctness at my "80%" confidence level as 50%.
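
Concretely, that bucket-averaging might look like the following minimal sketch (hypothetical data matching the example above):

```python
from collections import defaultdict

# Hypothetical (stated confidence, correct?) pairs: 8 predictions
# labeled "80%", of which 4 turned out correct.
predictions = [("80%", True)] * 4 + [("80%", False)] * 4

buckets = defaultdict(list)
for label, correct in predictions:
    buckets[label].append(correct)

for label, outcomes in buckets.items():
    accuracy = sum(outcomes) / len(outcomes)
    print(label, accuracy)  # "80%" 0.5, i.e. 50% correct at the "80%" level
```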

If one wants to measure my correctness across multiple confidence levels, then what aggregation procedure to use is unclear, which is why many papers on calibration will present the entire graph (along with individualized error bars to make clear how unlikely any particular correctness value is--getting 100% correct at the "80%" level isn't that meaningful if I only used "80%" twice!).
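
To put a number on that caveat, a quick binomial check (assumed figures): if someone really is 80% accurate, a perfect record over two "80%" predictions is unremarkable, while a perfect record over twenty would be striking.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(prob_at_least(2, 2, 0.8))    # 0.64   -- 2/2 correct: unsurprising
print(prob_at_least(20, 20, 0.8))  # ~0.0115 -- 20/20 correct: notable
```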

> I'll need to think more about it, but off the top of my head, I'd look at the average absolute difference between the answer (which is 0 or 1) and the confidence expressed, or maybe the square root of the sum of squares... But don't quote me on that; I'm just thinking aloud here.

You may find the Wikipedia page on scoring rules interesting. My impression is that it is difficult to distinguish between skill (an expert's ability to correlate their answer with the ground truth) and calibration (an expert's ability to correlate their reported probability with their actual correctness) with a single point estimate,* but something like the slope that Unnamed discusses here is a solid attempt.

*That is, assuming that the expert knows what rule you're using and is incentivized by a high score, you also want the rule to be proper: one where the expert maximizes their expected reward by reporting their true estimate of the probability.
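
A numerical sketch of that footnote (assumed numbers): under a quadratic, Brier-style penalty, an expert whose true probability is 0.7 minimizes their expected penalty by reporting 0.7, while under an absolute-error penalty the best report is an exaggerated 1.0, which is why absolute error is not proper.

```python
# Proper vs. improper scoring, with an assumed true probability p = 0.7.
p = 0.7

def expected_brier(q):
    # Expected squared-error penalty for reporting q: the outcome is 1
    # with probability p (penalty (1-q)^2) and 0 otherwise (penalty q^2).
    return p * (1 - q) ** 2 + (1 - p) * q ** 2

def expected_abs(q):
    # Expected absolute-error penalty for reporting q.
    return p * (1 - q) + (1 - p) * q

reports = [i / 100 for i in range(101)]
print(min(reports, key=expected_brier))  # 0.7: honesty is optimal (proper)
print(min(reports, key=expected_abs))    # 1.0: exaggeration is optimal (improper)
```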

Comment author: Lumifer 29 July 2015 02:26:03AM 0 points

> If one wants to measure my correctness across multiple confidence levels, then what aggregation procedure to use is unclear

Yes, that is precisely the issue for me here. Essentially, you have to specify a loss function and then aggregate it. It's unclear what kind will work best here and what that "best" even means.

> You may find the Wikipedia page on scoring rules interesting.

Yes, thank you, that's useful.

Notably, Philip Tetlock uses Brier scoring in his Expert Political Judgment project.
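
For reference, the Brier score for binary forecasts is just the mean squared difference between the stated probability and the 0/1 outcome (a sketch with hypothetical forecasts):

```python
# Brier score for binary forecasts: lower is better. A constant 0.5
# forecast scores 0.25; a perfect forecaster scores 0.0.
forecasts = [(0.8, 1), (0.7, 0), (0.9, 1), (0.6, 1)]  # hypothetical (probability, outcome)

brier = sum((q - o) ** 2 for q, o in forecasts) / len(forecasts)
print(brier)  # 0.175
```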