Looking for information on scoring calibration

Scott Alexander

There are lots of scoring rules for probability assessments. Log scoring is popular here, and squared error also works.

But if I understand these correctly, they are combined measurements of both domain-ability and calibration. For example, if several people took a test on which they had to estimate their confidence in their answers to certain true or false questions about history, then well-calibrated people would have a low squared error, but so would people who know a lot about history.

So (I think) someone who always said 70% confidence and got 70% of the questions right would get a higher score than someone who always said 60% confidence and got 60% of the questions right, even though they are both equally well calibrated.

The only pure calibration estimates I've ever seen are calibration curves in the form of a set of ordered pairs, or those limited to a specific point on the cuve (eg "if ey says ey's 90% sure, ey's only right 60% of the time"). There should be a way to take the area under (or over) the curve to get a single value representing total calibration, but I'm not familiar with the method or whether it's been done before. Is there an accepted way to get single-number calibration scores separate from domain knowledge?

There are lots of scoring rules for probability assessments. Log scoring is popular here, and squared error also works.

That's actually intentional. I think that if someone is right 90% of the time in some subjects but only right 60% of the time in others, they are better calibrated if they give the appropriate estimate for each subject than if they just give 75% for everything.

12

Looking for information on scoring calibration

12

12

12

Looking for information on scoring calibration

12

12