Well, let's walk through the scenario.
Alice is given 100 calibration questions. She knows that some of them are easy and some are hard. She doesn't know how many are easy and how many are hard.
Alice goes through the 100 questions and at the end -- as I understand D_Malik's scenario -- she says, "I have no idea whether any particular question is hard or easy, but I think that 80 of these hundred questions are easy. I just don't know which ones." And, under the assumption that 80 questions were indeed easy, this is supposed to represent perfect calibration.
That makes no sense to me at all.
D_Malik's scenario illustrates that it doesn't make sense to partition the questions based on observed difficulty and then measure calibration, because this will induce a selection effect. The correct procedure is to partition the questions based on expected difficulty and then measure calibration.
For example, I say "heads" every time for the coin, with 80% confidence. That says to you that I think all flips are equally hard to predict prospectively. But if you were to compare my track record for heads and tails separately--that is, look at the situ...
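To make the selection effect concrete, here is a rough simulation sketch of the coin example. It assumes a coin that lands heads 80% of the time (that bias, the function name `simulate`, and its parameters are my own illustrative choices, not something specified above). The predictor says "heads" every time with 80% confidence, so prospectively every flip sits in the same 80% bucket.

```python
import random


def simulate(n_flips=100_000, p_heads=0.8, seed=0):
    """Always predict heads at 80% confidence; score the predictions two ways.

    Assumption for illustration: the coin is biased so that P(heads) = p_heads.
    """
    rng = random.Random(seed)
    flips = ["H" if rng.random() < p_heads else "T" for _ in range(n_flips)]

    # The prediction is "heads" on every flip, so it is correct exactly
    # when the flip came up heads.
    correct = [flip == "H" for flip in flips]

    # Partition by EXPECTED difficulty: every flip was given the same 80%
    # confidence, so there is one bucket, and we compare it to 0.8.
    overall = sum(correct) / n_flips

    # Partition by OBSERVED outcome: split after the fact into heads and tails.
    heads_hits = [c for c, f in zip(correct, flips) if f == "H"]
    tails_hits = [c for c, f in zip(correct, flips) if f == "T"]

    return {
        "overall": overall,
        "on_heads": sum(heads_hits) / len(heads_hits),
        "on_tails": sum(tails_hits) / len(tails_hits),
    }


result = simulate()
print(result)
```

Scored against the single prospective bucket, the hit rate comes out close to the stated 80%. Split by what actually happened, the same predictions look 100% right on heads and 0% right on tails, which says nothing about calibration and everything about how the flips were sorted.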
If it's worth saying, but not worth its own post (even in Discussion), then it goes here.
Notes for future OT posters:
1. Please add the 'open_thread' tag.
2. Check if there is an active Open Thread before posting a new one. (Immediately before; refresh the list-of-threads page before posting.)
3. Open Threads should be posted in Discussion, and not Main.
4. Open Threads should start on Monday, and end on Sunday.