I ran into the same problem a while back and became frustrated that there wasn't an elegant answer. It should at least be possible to unambiguously spot under- and over-confidence, but even this is not clear.
I guess we need to define exactly what we're trying to measure and then treat it as an estimation problem, where each response is a data point stochastically generated from some hidden "calibration" parameter. But this is rife with mind-projection-fallacy pitfalls, because the respondents' reasoning processes and probability assignments need to be treated as objective parts of the world (which they are, but it's hard to keep it all from becoming confused).
There are lots of scoring rules for probability assessments. Log scoring is popular here, and squared error also works.
But if I understand these correctly, they are combined measurements of both domain-ability and calibration. For example, if several people took a test on which they had to estimate their confidence in their answers to certain true or false questions about history, then well-calibrated people would have a low squared error, but so would people who know a lot about history.
So (I think) someone who always said 70% confidence and got 70% of the questions right would get a better score (a lower squared error) than someone who always said 60% confidence and got 60% of the questions right, even though they are both equally well calibrated.
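A quick way to check that claim is to compute both scores for the two respondents. This is only a sketch: the ten-question test and the function names `brier_score` and `log_score` are made up for illustration, and each respondent is assumed to state the same confidence on every question.

```python
import math

def brier_score(confidence, outcomes):
    """Mean squared error between stated confidence and 0/1 outcomes (lower is better)."""
    return sum((confidence - o) ** 2 for o in outcomes) / len(outcomes)

def log_score(confidence, outcomes):
    """Mean log of the probability assigned to what actually happened (higher is better)."""
    return sum(math.log(confidence if o else 1 - confidence) for o in outcomes) / len(outcomes)

# Hypothetical respondents on a 10-question test: 1 = right, 0 = wrong.
outcomes_70 = [1] * 7 + [0] * 3   # always says 70%, gets 70% right
outcomes_60 = [1] * 6 + [0] * 4   # always says 60%, gets 60% right

brier_70 = brier_score(0.7, outcomes_70)   # about 0.21
brier_60 = brier_score(0.6, outcomes_60)   # about 0.24
log_70 = log_score(0.7, outcomes_70)
log_60 = log_score(0.6, outcomes_60)

print(f"Brier: 70% -> {brier_70:.3f}, 60% -> {brier_60:.3f}")
print(f"Log:   70% -> {log_70:.3f}, 60% -> {log_60:.3f}")
```

Under these assumptions the 70% respondent comes out ahead on both rules despite being no better calibrated, which is the mixing of domain knowledge and calibration described above.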
The only pure calibration estimates I've ever seen are calibration curves in the form of a set of ordered pairs, or those limited to a specific point on the curve (e.g. "if ey says ey's 90% sure, ey's only right 60% of the time"). There should be a way to take the area under (or over) the curve to get a single value representing total calibration, but I'm not familiar with the method or whether it's been done before. Is there an accepted way to get single-number calibration scores separate from domain knowledge?
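One candidate, offered as a sketch rather than a settled answer: the squared-error score can be split into reliability, resolution, and uncertainty terms (the Murphy decomposition), and the reliability term depends only on the gap between stated confidence and observed frequency. The code below builds the calibration curve as ordered pairs and collapses it to that single number. The function names and sample data are hypothetical, and responses are grouped by exact stated confidence, which assumes respondents pick from a small set of values; free-form probabilities would need binning first.

```python
from collections import defaultdict

def calibration_curve(responses):
    """responses: list of (stated_confidence, was_correct) pairs.
    Returns sorted (stated_confidence, observed_frequency, count) triples."""
    buckets = defaultdict(list)
    for conf, correct in responses:
        buckets[conf].append(correct)
    return sorted((conf, sum(hits) / len(hits), len(hits))
                  for conf, hits in buckets.items())

def calibration_error(responses):
    """Count-weighted mean squared gap between stated confidence and
    observed frequency (the reliability term). 0 means perfectly calibrated."""
    curve = calibration_curve(responses)
    total = sum(n for _, _, n in curve)
    return sum(n * (conf - freq) ** 2 for conf, freq, n in curve) / total

def overconfidence(responses):
    """Count-weighted mean signed gap: positive = overconfident, negative = underconfident."""
    curve = calibration_curve(responses)
    total = sum(n for _, _, n in curve)
    return sum(n * (conf - freq) for conf, freq, n in curve) / total

# Hypothetical respondents (1 = right, 0 = wrong).
always_70 = [(0.7, 1)] * 7 + [(0.7, 0)] * 3   # says 70%, right 70% of the time
always_60 = [(0.6, 1)] * 6 + [(0.6, 0)] * 4   # says 60%, right 60% of the time
says_90 = [(0.9, 1)] * 6 + [(0.9, 0)] * 4     # says 90%, right only 60% of the time

for name, data in [("always_70", always_70), ("always_60", always_60), ("says_90", says_90)]:
    print(name, calibration_curve(data), calibration_error(data), overconfidence(data))
```

With this made-up data, the 70% and 60% respondents both get a calibration error of zero, while the "90% sure, right 60% of the time" respondent gets about 0.09 with a positive signed gap. A caveat: a number like this can be driven to zero by always stating the overall base rate, so it says nothing about how informative the answers are.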