Vaniver comments on Open Thread, Jul. 27 - Aug 02, 2015 - Less Wrong Discussion

5 points | Post author: MrMind 27 July 2015 07:16AM

Comment author: Vaniver 28 July 2015 11:41:53PM *  0 points

> My understanding of these terms is that the test-giver, knowing Alice, can forecast which questions she'll mostly be able to answer correctly (those are the easy ones) and which questions she'll mostly be unable to answer correctly (those are the hard ones).

I agree that if Yvain had predicted what percentage of survey-takers would get each question correct before the survey was released, that would be useful as a measure of the questions' difficulty and an interesting analysis. That was not done in this case.

> That makes no sense to me as being an obviously stupid thing to do, but it may be that the original post argued exactly against this kind of stupidity.

The labeling is not obviously stupid--what questions the LW community has a high probability of getting right is a fact about the LW community, not about Yvain's impression of the LW community. Using that label in a calibration analysis does suffer from the issue D_Malik raised, which is why I think Unnamed's analysis is more insightful than Yvain's and why those critiques are valid.

> However, it still seems to me that the example with coins is misleading and that the given example of "perfect calibration" is anything but.

It is according to what calibration means in the context of probabilities. As Unnamed points out, if you are unhappy that we are assigning a property of correct mappings ('calibration') to a narrow mapping ("80%"->80%) instead of a broad mapping ("50%"->50%, "60%"->60%, etc.), it's valid to be skeptical that the calibration will generalize--but that doesn't mean the assessment is uncalibrated.

Comment author: Lumifer 29 July 2015 12:09:29AM 0 points

> It is according to what calibration means in the context of probabilities.

Your link actually doesn't provide any information about how to evaluate or estimate someone's calibration, which is what we are talking about.

> if you are unhappy that we are assigning a property of correct mappings ('calibration') to a narrow mapping

It's not quite that. I'm not happy with this use of averages. I'll need to think more about it, but off the top of my head, I'd look at the average absolute difference between the answer (which is 0 or 1) and the confidence expressed, or maybe the square root of the sum of squares... But don't quote me on that; I'm just thinking aloud here.
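
In code, the two aggregates floated here might look like the following rough sketch, over hypothetical (confidence, outcome) pairs; note that the root-mean-square variant is the square root of what the scoring-rule literature calls the Brier score.

```python
import math

# Hypothetical (confidence, outcome) pairs, outcome being 0 or 1.
predictions = [(0.8, 1), (0.8, 0), (0.6, 1), (0.9, 1)]

# Average absolute difference between the outcome and the stated confidence.
mae = sum(abs(o - c) for c, o in predictions) / len(predictions)

# Root of the mean squared difference (the square root of the Brier score).
rmse = math.sqrt(sum((o - c) ** 2 for c, o in predictions) / len(predictions))

print(mae, rmse)
```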

Comment author: Vaniver 29 July 2015 01:27:26AM *  1 point

> Your link actually doesn't provide any information about how to evaluate or estimate someone's calibration, which is what we are talking about.

If we don't agree about what it is, it will be very difficult to agree on how to evaluate it!

> It's not quite that. I'm not happy with this use of averages.

Surely it makes sense to use averages to determine the probability of being correct for any given confidence level. If I've grouped together 8 predictions and labeled them "80%", and 4 of them are correct and 4 of them are incorrect, it seems sensible to describe my correctness at my "80%" confidence level as 50%.
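
Concretely, that bucket-averaging might look like the following minimal sketch (hypothetical data matching the example above):

```python
from collections import defaultdict

# Hypothetical (stated confidence, correct?) pairs: 8 predictions
# labeled "80%", of which 4 turned out correct.
predictions = [("80%", True)] * 4 + [("80%", False)] * 4

buckets = defaultdict(list)
for label, correct in predictions:
    buckets[label].append(correct)

for label, outcomes in buckets.items():
    accuracy = sum(outcomes) / len(outcomes)
    print(label, accuracy)  # "80%" 0.5, i.e. 50% correct at the "80%" level
```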

If one wants to measure my correctness across multiple confidence levels, then what aggregation procedure to use is unclear, which is why many papers on calibration will present the entire graph (along with individualized error bars to make clear how unlikely any particular correctness value is--getting 100% correct at the "80%" level isn't that meaningful if I only used "80%" twice!).
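
To put a number on that caveat, a quick binomial check (assumed figures): if someone really is 80% accurate, a perfect record over two "80%" predictions is unremarkable, while a perfect record over twenty would be striking.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(prob_at_least(2, 2, 0.8))    # 0.64   -- 2/2 correct: unsurprising
print(prob_at_least(20, 20, 0.8))  # ~0.0115 -- 20/20 correct: notable
```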

> I'll need to think more about it, but off the top of my head, I'd look at the average absolute difference between the answer (which is 0 or 1) and the confidence expressed, or maybe the square root of the sum of squares... But don't quote me on that; I'm just thinking aloud here.

You may find the Wikipedia page on scoring rules interesting. My impression is that it is difficult to distinguish between skill (an expert's ability to correlate their answer with the ground truth) and calibration (an expert's ability to correlate their reported probability with their actual correctness) with a single point estimate,* but something like the slope that Unnamed discusses here is a solid attempt.

*That is, assuming that the expert knows what rule you're using and is incentivized by a high score, you also want the rule to be proper: one where the expert maximizes their expected reward by reporting their true estimate of the probability.
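
A numerical sketch of that footnote (assumed numbers): under a quadratic, Brier-style penalty, an expert whose true probability is 0.7 minimizes their expected penalty by reporting 0.7, while under an absolute-error penalty the best report is an exaggerated 1.0, which is why absolute error is not proper.

```python
# Proper vs. improper scoring, with an assumed true probability p = 0.7.
p = 0.7

def expected_brier(q):
    # Expected squared-error penalty for reporting q: the outcome is 1
    # with probability p (penalty (1-q)^2) and 0 otherwise (penalty q^2).
    return p * (1 - q) ** 2 + (1 - p) * q ** 2

def expected_abs(q):
    # Expected absolute-error penalty for reporting q.
    return p * (1 - q) + (1 - p) * q

reports = [i / 100 for i in range(101)]
print(min(reports, key=expected_brier))  # 0.7: honesty is optimal (proper)
print(min(reports, key=expected_abs))    # 1.0: exaggeration is optimal (improper)
```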

Comment author: Lumifer 29 July 2015 02:26:03AM 0 points

> If one wants to measure my correctness across multiple confidence levels, then what aggregation procedure to use is unclear

Yes, that is precisely the issue for me here. Essentially, you have to specify a loss function and then aggregate it. It's unclear what kind will work best here and what that "best" even means.

> You may find the Wikipedia page on scoring rules interesting.

Yes, thank you, that's useful.

Notably, Philip Tetlock uses Brier scoring in his Expert Political Judgment project.
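
For reference, the Brier score for binary forecasts is just the mean squared difference between the stated probability and the 0/1 outcome (a sketch with hypothetical forecasts):

```python
# Brier score for binary forecasts: lower is better. A constant 0.5
# forecast scores 0.25; a perfect forecaster scores 0.0.
forecasts = [(0.8, 1), (0.7, 0), (0.9, 1), (0.6, 1)]  # hypothetical (probability, outcome)

brier = sum((q - o) ** 2 for q, o in forecasts) / len(forecasts)
print(brier)  # 0.175
```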