D_Malik's scenario illustrates that it doesn't make sense to partition the questions based on observed difficulty and then measure calibration, because this will induce a selection effect. The correct procedure is to partition the questions based on expected difficulty and then measure calibration.
For example, I say "heads" every time for the coin, with 80% confidence. That says to you that I think all flips are equally hard to predict prospectively. But if you were to compare my track record for heads and tails separately--that is, look at the situation retrospectively--then you would think that I was simultaneously underconfident and overconfident.
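To make the selection effect concrete, here is a minimal sketch of that single-coin case. It assumes the coin really does land heads 80% of the time, and it uses expected counts (800 heads in 1000 flips) instead of random draws; the names flips and hit_rate are just for illustration.

```python
# Hypothetical data: 1000 flips of a coin assumed to land heads 80% of the time.
# Expected counts are used instead of random draws so the numbers are exact.
flips = ["H"] * 800 + ["T"] * 200

guess = "H"          # I say "heads" every time
confidence = 0.80    # ...with 80% stated confidence


def hit_rate(outcomes):
    """Fraction of the given outcomes on which the fixed guess was right."""
    return sum(o == guess for o in outcomes) / len(outcomes)


# Prospective partition: every flip looked equally hard beforehand,
# so all of them belong to the same 80% bin.
prospective = hit_rate(flips)

# Retrospective partition: split by what actually happened.
heads_only = hit_rate([o for o in flips if o == "H"])
tails_only = hit_rate([o for o in flips if o == "T"])

print(f"stated confidence:        {confidence:.2f}")
print(f"hit rate, all flips:      {prospective:.2f}")
print(f"hit rate, heads flips:    {heads_only:.2f}")
print(f"hit rate, tails flips:    {tails_only:.2f}")
```

Grouped prospectively, the hit rate matches the stated confidence. Grouped by outcome, the same forecasts look underconfident on the heads flips and overconfident on the tails flips, even though nothing about the forecasts changed.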
To make it clearer what it should look like normally, suppose there are two coins, red and blue. The red coin lands heads 80% of the time and the blue coin lands heads 70% of the time, and we alternate between flipping the red coin and the blue coin.
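If I always answer heads, with 80% confidence when it's red and 70% when it's blue, I will be as calibrated as someone who always answers heads with 75% confidence, but I will have more skill. Retrospectively, though, someone who partitions the flips by outcome could claim that each of us is simultaneously underconfident (on the flips that came up heads) and overconfident (on the flips that came up tails).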
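A rough sketch of the two-coin comparison, under a few assumptions: expected counts stand in for actual flips (the alternation order doesn't affect either measure), and "skill" is scored with the Brier score, which is my choice of measure and not something specified above. The function names are illustrative only.

```python
from collections import defaultdict

# Hypothetical data: 1000 flips of each coin at the expected head counts.
# 1 means heads, 0 means tails.
flips = (
    [("red", 1)] * 800 + [("red", 0)] * 200
    + [("blue", 1)] * 700 + [("blue", 0)] * 300
)


def split_forecaster(coin):
    """Always says heads: 80% confidence on red, 70% on blue."""
    return 0.80 if coin == "red" else 0.70


def flat_forecaster(coin):
    """Always says heads with 75% confidence, ignoring the coin's colour."""
    return 0.75


def calibration_table(forecaster):
    """Map each stated confidence to the observed frequency of heads in that bin."""
    bins = defaultdict(list)
    for coin, heads in flips:
        bins[forecaster(coin)].append(heads)
    return {p: sum(hits) / len(hits) for p, hits in bins.items()}


def brier_score(forecaster):
    """Mean squared error between stated probability and outcome (lower is better)."""
    return sum((forecaster(coin) - heads) ** 2 for coin, heads in flips) / len(flips)


for name, f in [("split", split_forecaster), ("flat", flat_forecaster)]:
    print(name, calibration_table(f), round(brier_score(f), 4))
```

Both forecasters come out calibrated, since every confidence bin matches its observed frequency, but the forecaster who distinguishes the coins gets the lower Brier score. That is one way to cash out "as calibrated, but more skilled".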
D_Malik's scenario illustrates that it doesn't make sense to partition the questions based on observed difficulty and then measure calibration, because this will induce a selection effect. The correct procedure is to partition the questions based on expected difficulty and then measure calibration.
Yes, I agree with that. However, it still seems to me that the example with coins is misleading and that the given example of "perfect calibration" is anything but. Let me try to explain.
Since we're talking about calibration, let's not use coin flips but...
If it's worth saying, but not worth its own post (even in Discussion), then it goes here.
Notes for future OT posters:
1. Please add the 'open_thread' tag.
2. Check if there is an active Open Thread before posting a new one. (Immediately before; refresh the list-of-threads page before posting.)
3. Open Threads should be posted in Discussion, and not Main.
4. Open Threads should start on Monday, and end on Sunday.