There are lots of scoring rules for probability assessments. Log scoring is popular here, and squared error also works.
But if I understand these correctly, they are combined measurements of both domain ability and calibration. For example, if several people took a test where they had to state their confidence in their answers to true-or-false questions about history, then well-calibrated people would have a low squared error, but so would people who simply know a lot about history.
So (I think) someone who always said 70% confidence and got 70% of the questions right would get a better score than someone who always said 60% confidence and got 60% of the questions right, even though they are both equally well calibrated.
The only pure calibration estimates I've ever seen are calibration curves in the form of a set of ordered pairs, or those limited to a specific point on the curve (e.g. "if ey says ey's 90% sure, ey's only right 60% of the time"). There should be a way to take the area under (or over) the curve to get a single value representing total calibration, but I'm not familiar with the method or whether it's been done before. Is there an accepted way to get single-number calibration scores separate from domain knowledge?
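To make the kind of score I'm imagining concrete, here's a rough sketch (the binning scheme and names are arbitrary choices on my part, not an established method) of averaging the gap between stated confidence and observed accuracy across the whole curve:

```python
# Sketch: collapse a calibration curve into one number, separate from accuracy.
# Bin the stated confidences, then average |stated confidence - observed
# accuracy| over bins, weighted by how often each bin was used.

def calibration_error(confidences, correct, n_bins=10):
    """confidences: stated P(answer is right); correct: 0/1 outcomes."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, correct):
        i = min(int(p * n_bins), n_bins - 1)
        bins[i].append((p, y))
    n = len(confidences)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(p for p, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        err += (len(b) / n) * abs(avg_conf - accuracy)
    return err
```

Under this score, a perfectly calibrated 60%-confident guesser and a perfectly calibrated 70%-confident guesser both come out near zero, unlike under squared error.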
tl;dr: miscalibration means mentally interpreting the log-likelihood of data as being more or less than its actual log-likelihood; to infer it you need to assume/infer the Bayesian calculation that's being made/approximated. It's easiest with distributions over finite sets (i.e. T/F or multiple-choice questions). Also, likelihood should be called evidence.
I wonder why I didn't respond to this when it was fresh. Anyway, I was running into this same difficulty last summer when attempting to write software to give friendly outputs (like "calibration") to a bunch of people playing the Aumann game with trivia questions.
My understanding was that evidence needs to be measured on the log scale (as the difference between the log-odds of the prior and the log-odds of the posterior), and miscalibration is when your mental conversion from a gut feeling of evidence to the actual evidence has a multiplicative error in it. (We can pronounce this as: "the true evidence is some multiplicative factor (called the calibration parameter) times the felt evidence".) This still seems like a reasonable model, though of course different kinds of evidence are likely to have different error magnitudes, and different questions are likely to elicit different kinds of evidence, so if you have lots of data you can probably do better by building a model that estimates your calibration for particular kinds of questions.
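As a toy sketch of this model (all names are my own invention, and I'm assuming noiseless felt evidence for simplicity): evidence is the change in log-odds, and the calibration parameter can be recovered by least squares through the origin.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def stated_posterior(prior, true_evidence, c):
    # Model: felt evidence = true evidence / c, so the stated posterior
    # overshoots when c < 1 (overconfident) and undershoots when c > 1.
    felt = true_evidence / c
    return 1 / (1 + math.exp(-(logit(prior) + felt)))

def fit_calibration(priors, posteriors, true_evidences):
    # Least squares through the origin for: true evidence = c * felt evidence.
    felt = [logit(q) - logit(p) for p, q in zip(priors, posteriors)]
    num = sum(f * t for f, t in zip(felt, true_evidences))
    den = sum(f * f for f in felt)
    return num / den
```

With no noise the fit recovers c exactly; with real data you'd want the error terms discussed below.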
But even sticking to the constant-calibration model, it's still not possible to estimate your calibration from your stated confidence intervals alone, because for that we need some idea of your internal prior (your "prior" prior, before you've taken the felt evidence into account), which is hard to get any decent sense of. You can work off of iffy assumptions, such as assuming that your prior for percentage answers in a trivia game has some simple form (e.g. a Beta distribution) fitted to the set of all the percentage answers from that game. The Aumann game had an advantage in this respect: rather than comparing your probability distribution before and after thinking about the question, it makes it possible to compare the distribution before and after hearing other people's arguments and evidence; if everyone always speaks in terms of standard probability distributions, it's not too hard to infer calibration there.
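For what it's worth, that iffy Beta assumption could be implemented as crudely as a method-of-moments fit to the pooled answers — a sketch, with the function name my own:

```python
def fit_beta_moments(xs):
    """Method-of-moments Beta(a, b) fit to pooled answers in (0, 1)."""
    n = len(xs)
    m = sum(xs) / n                          # sample mean
    v = sum((x - m) ** 2 for x in xs) / n    # sample variance
    # For Beta(a, b): mean = a/(a+b), var = mean*(1-mean)/(a+b+1).
    common = m * (1 - m) / v - 1             # estimate of a + b
    return m * common, (1 - m) * common
```

This only pins down the prior's shape, of course; the questionable part is assuming the game's answer pool is a fair stand-in for your internal prior at all.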
Further "funny" issues can arise when you get down to work; for instance, if your prior was a Student-t with df n1 and scale s1, and your posterior was a Student-t with df n2 and scale s2 > s1, then your calibration cannot be more than 1/(1-s1^2/s2^2) without having your posterior explode. It's tempting to say the lesson is that things break if you're becoming asymptotically less certain, which makes some intuitive sense: if your distributions are actually mixtures of finitely many different hypotheses whose weights you're updating Bayesianly, then you will never become asymptotically less certain; in particular the Student-t scenario I described can't happen. However, this is not a satisfactory conclusion, because the Normal scenario (where you increase your variance by upweighting a hypothesis that gives higher variance) can easily happen.
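To spell out where a bound of that shape comes from, here's the Normal version of the computation (I'm reading "calibration c" as: the recalibrated posterior is proportional to prior^(1-c) times posterior^c, which is what multiplying the log-evidence by c amounts to):

```latex
% Recalibration as a geometric mixture of prior and posterior:
\[ \mathrm{recal}_c(x) \;\propto\; \pi(x)^{1-c}\, p(x)^{c}. \]
% With Normal prior $\pi = \mathcal{N}(0, s_1^2)$ and posterior $p = \mathcal{N}(0, s_2^2)$:
\[ \mathrm{recal}_c(x) \;\propto\; \exp\!\left(-\frac{x^2}{2}\left[\frac{1-c}{s_1^2} + \frac{c}{s_2^2}\right]\right), \]
% which is normalizable iff the bracketed precision is positive; for $s_2 > s_1$,
\[ \frac{1-c}{s_1^2} + \frac{c}{s_2^2} > 0
   \;\iff\; c < \frac{1}{1 - s_1^2/s_2^2}. \]
```

The Student-t case is analogous but goes through the polynomial tails, so the degrees of freedom enter the breaking condition too.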
A different resolution to the above is that the model of evidence=calibration*felt evidence is wrong, and needs an error term or two; that can give a workable result, or at least not catch fire and die.
Another thought: if your mental process is like the one two paragraphs up, where you're working with a mixture of several fixed (e.g. normal) hypotheses, and the calibration concept is applied to how you update the weights of the hypotheses, then the change in the mixture distribution (i.e. the marginal) will not follow anything like the calibration model.
So the concept is pretty tricky unless you carefully choose problems where you can reasonably model the mental inference, and in particular try to avoid "mixture-of-hypotheses"-type scenarios (unless you know in advance precisely what the hypotheses imply, which is unusual unless you construct the questions that way... but then I can't think of why you'd ask about the mixture instead of about the probabilities of the hypotheses themselves).
You might be okay when looking at typical multiple-choice questions; certainly you won't run into the issues with broken posteriors and invalid calibrations. Another advantage is that "the" prior (i.e. uniform) is uncontroversial, though whether the prior to use for computing calibration should be "the" prior is not obvious; but if you don't have before-and-after results from people then I guess it's the best you can do.
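As a sketch of what that looks like for T/F questions with the uniform prior (my own toy implementation, not an established routine; it fits the multiplicative calibration parameter by a grid-search MLE):

```python
import math

def recalibrated(p, c):
    # Uniform prior on T/F means logit(prior) = 0, so the recalibrated
    # probability is just sigmoid(c * logit(p)).
    return 1 / (1 + math.exp(-c * math.log(p / (1 - p))))

def fit_calibration_mle(stated, correct, grid=None):
    """Grid-search MLE for the multiplicative calibration parameter c."""
    if grid is None:
        grid = [i / 100 for i in range(1, 301)]  # c in (0, 3]
    def loglik(c):
        total = 0.0
        for p, y in zip(stated, correct):
            q = recalibrated(p, c)
            total += math.log(q if y else 1 - q)
        return total
    return max(grid, key=loglik)
```

Someone who says 80% and is right 80% of the time comes out at c = 1; someone whose stated confidences carry no information at all gets pushed toward c = 0 (the grid's lower edge).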
I just noticed that what's usually called the "likelihood" I was calling "evidence" here. This has probably been suggested by someone before, but: I've never liked the term "likelihood", and this is the best replacement for it that I know of.