There are lots of scoring rules for probability assessments. Log scoring is popular here, and squared error (the Brier score) also works.

But if I understand these correctly, they are combined measurements of both domain-ability and calibration. For example, if several people took a test on which they had to estimate their confidence in their answers to certain true or false questions about history, then well-calibrated people would have a low squared error, but so would people who know a lot about history.

So (I think) someone who always said 70% confidence and got 70% of the questions right would get a higher score than someone who always said 60% confidence and got 60% of the questions right, even though they are both equally well calibrated.
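
A quick numeric check of this, under both rules mentioned above (just my own sketch, assuming a forecaster who states the same confidence p on every question and is right a fraction p of the time):

```python
import math

def expected_log_score(p):
    # Expected log score per question for someone who states confidence p
    # and is right a fraction p of the time (i.e. perfectly calibrated at p).
    return p * math.log(p) + (1 - p) * math.log(1 - p)

def expected_squared_error(p):
    # Expected squared error (Brier score) per question, same assumption.
    return p * (1 - p) ** 2 + (1 - p) * p ** 2

for p in (0.6, 0.7):
    print(p, expected_log_score(p), expected_squared_error(p))
# The 70% forecaster gets the higher (less negative) log score and the lower
# squared error, even though both forecasters are perfectly calibrated.
```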

The only pure calibration estimates I've ever seen are calibration curves in the form of a set of ordered pairs, or those limited to a specific point on the curve (e.g. "if ey says ey's 90% sure, ey's only right 60% of the time"). There should be a way to take the area under (or over) the curve to get a single value representing total calibration, but I'm not familiar with the method or whether it's been done before. Is there an accepted way to get single-number calibration scores separate from domain knowledge?
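
The closest I can get on my own is something like the following sketch: bin predictions by stated confidence and take the frequency-weighted gap between observed frequency and stated confidence (the binning and weighting choices here are arbitrary, and I don't know whether this matches any accepted method):

```python
import numpy as np

def calibration_gap(confidences, outcomes, n_bins=10):
    """Weighted average distance between stated confidence and observed
    frequency, binned by confidence. 0 means the curve sits on the diagonal."""
    confidences = np.asarray(confidences, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences >= lo) & (confidences < hi)
        if hi == 1.0:
            in_bin |= (confidences == 1.0)   # include 100% in the last bin
        if not in_bin.any():
            continue
        gap = abs(outcomes[in_bin].mean() - confidences[in_bin].mean())
        total += (in_bin.sum() / len(confidences)) * gap
    return total
```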


But if I understand these correctly, they are combined measurements of both domain-ability and calibration.

You understand correctly, though I would say "accuracy" rather than "domain-ability".

So (I think) someone who always said 70% confidence and got 70% of the questions right would get a higher score than someone who always said 60% confidence and got 60% of the questions right, even though they are both equally well calibrated.

This is also correct. A problem with trying to isolate calibration is that on the true/false test, the subject could always assign 50% probability to both true and false and be right 50% of the time, achieving perfect calibration. A subject whose only goal was a good calibration score would do this. More generally, multiple-choice questions can be answered with maxent probability distributions, achieving the same result. Open-ended questions are harder to game, but it is also harder to figure out what probability was assigned to the correct answer when computing the score.
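
A toy demonstration of that gaming strategy (a sketch; the random answers below stand in for any true/false test):

```python
import random

random.seed(0)
n = 10_000
# A subject who knows nothing guesses each true/false answer at random
# and reports 50% confidence every time.
correct = [random.random() < 0.5 for _ in range(n)]
observed_accuracy = sum(correct) / n
print(f"stated confidence: 0.50, observed accuracy: {observed_accuracy:.3f}")
# Observed accuracy is ~0.50, matching the stated confidence, so a pure
# calibration score rates this know-nothing strategy as perfectly calibrated.
```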

One approach I considered is asking for confidence intervals that have a given (test-giver-specified) probability of containing the correct numerical answer; however, this is also gameable, by using a mix of the always-correct interval from negative infinity to positive infinity and the always-incorrect empty interval to achieve the target success rate.

Though I don't think it is much of a problem that scoring rules represent a mix of calibration and accuracy, as it is this mix that determines a person's ability to report useful probabilities.

A problem with trying to isolate calibration is that on the true/false test, the subject could always assign 50% probability to both true and false and be right 50% of the time, achieving perfect calibration.

Interestingly, this problem can be avoided by taking the domain of possible answers to be the natural numbers, or n-dimensional Euclidean space, etc., over which no uniform distribution is possible, and then asking your test subject to specify a probability distribution over the whole space. This is potentially impractical, though, and I'm not certain it can't be gamed in other ways.

I do not think I am quite addressing your question. Specifically, I don't think there has been a wide enough discussion about calibration for there to be a single widely accepted method.

However, what I would like to point out is that a single-number calibration necessarily discards information, and there is no one true way to decide which information to discard.

A gets binary questions right 98% of the time, but expects to get them correct 99% of the time. B gets binary questions right 51% of the time, but expects to get them correct 52% of the time.

In some cases, A and B must be treated as equally calibrated (Zut Allais! is relevant). In some cases, B can be considered much better calibrated, and in almost all cases we don't care either way, because B's predictions are almost never useful, whereas A's almost always are.

Even this is a dramatic simplification, painfully restricting our information about the situation. Perhaps A never has false positives; or maybe B never has false positives! This is extremely relevant to many questions, but can't be represented in any single-number metric.

No matter what your purpose, domain knowledge matters, and I suspect that calibration does not carry over well from one domain to another. Finding out that you know little history, but are well calibrated to how poorly you know it, will not help you evaluate how reliable your predictions in your primary field are.

Binary questions are usually already horribly under-sampled. We can ask binary questions about history, but it probably matters in the real world whether your answer was 2172 or 1879 if the correct answer was 1880. Ideally, we could provide a probability distribution for the entire range of incorrectness, but in practice, I think the best measure is to report the false positive and false negative rate of an agent on a set of questions along with their own estimates for their performance on those questions. I realize this is four times as many numbers as you want, but you can then condense them however you like, and I really think that the 4-tuple is more than four times more useful than any single-number measure!
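
One possible reading of that 4-tuple as code (the record format, and the way I derive the "anticipated" rates from stated confidences, are my own guesses; a "false positive" here just means answering "true" when the answer was false):

```python
def performance_report(records):
    """records: list of (answered_true, confidence_in_answer, actually_true) tuples.
    Returns (observed FP rate, observed FN rate,
             anticipated FP rate, anticipated FN rate)."""
    said_true = [(conf, truth) for ans, conf, truth in records if ans]
    said_false = [(conf, truth) for ans, conf, truth in records if not ans]

    def rates(group, was_wrong):
        if not group:
            return float("nan"), float("nan")
        observed = sum(was_wrong(truth) for _, truth in group) / len(group)
        # The subject's own anticipated error rate, read off their confidences.
        anticipated = sum(1 - conf for conf, _ in group) / len(group)
        return observed, anticipated

    fp_obs, fp_ant = rates(said_true, lambda truth: not truth)
    fn_obs, fn_ant = rates(said_false, lambda truth: truth)
    return fp_obs, fn_obs, fp_ant, fn_ant
```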

Do you have a more specific purpose in mind? I'm curious what spurred your question.

Do you have a more specific purpose in mind? I'm curious what spurred your question.

A prof doing an experiment gave me a bunch of data from calibration tests with demographic identifiers, and I'd like to be able to analyze it to say things like "Old people have better calibration than young people" or "Training in finance improves your calibration".

Oh, excellent. I do love data. What is the format (what is the maximum amount of information you have about each individual)?

Given that you already have the data (and you probably have reason to suspect that individuals were not trying to game the test?), I suspect the best way is to graph both accuracy and anticipated accuracy against the chosen demographic, and then, for all your readers who want numbers, compute either the ratio or the difference of those two and publish the PMCC of that against the demographic (it's Frequentist, but it's also standard practice, and I've had papers rejected that don't follow it...).
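
A sketch of that last computation (the demographic, the numbers, and the choice of "difference" over "ratio" are all just placeholders):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-subject data: a demographic variable (say, age), observed
# accuracy on the test, and anticipated accuracy (mean stated confidence).
age = np.array([22, 35, 41, 58, 63, 70])
accuracy = np.array([0.55, 0.60, 0.62, 0.66, 0.64, 0.68])
anticipated = np.array([0.70, 0.68, 0.66, 0.67, 0.65, 0.66])

overconfidence = anticipated - accuracy       # the "difference" option
r, p_value = pearsonr(age, overconfidence)    # PMCC against the demographic
print(f"r = {r:.2f}, p = {p_value:.3f}")
```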


...PMCC...

I'm not sure what the Pacific Mennonite Children's Choir has to do with it... oh wait, nevermind.

Leaving them with two separate metrics would allow you to make interesting statements like "financial training increased accuracy, but it also decreased calibration. Subjects overestimated their ability."

tl;dr: miscalibration means mentally interpreting the log-likelihood of data as being more or less than its actual log-likelihood; to infer it you need to assume/infer the Bayesian calculation that's being made/approximated. Easiest with distributions over finite sets (e.g. T/F or multiple-choice questions). Also, likelihood should be called evidence.

I wonder why I didn't respond to this when it was fresh. Anyway, I was running into this same difficulty last summer when attempting to write software to give friendly outputs (like "calibration") to a bunch of people playing the Aumann game with trivia questions.

My understanding was that evidence needs to be measured on the log scale (as the difference between prior and posterior log-odds), and miscalibration is when your mental conversion from gut feeling of evidence to the actual evidence has a multiplicative error in it. (We can pronounce this as: "the true evidence is some multiplicative factor (called the calibration parameter) times the felt evidence".) This still seems like a reasonable model, though of course different kinds of evidence are likely to have different error magnitudes, and different questions are likely to get different kinds of evidence, so if you have lots of data you can probably do better by building a model that estimates your calibration for particular questions.
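
In symbols (my rendering, with H the hypothesis in question and D the data): take the subject's reported log-odds shift as the felt evidence,

$$\text{felt evidence} \;=\; \log\frac{P_{\text{reported}}(H \mid D)}{P_{\text{reported}}(\lnot H \mid D)} \;-\; \log\frac{P_{\text{reported}}(H)}{P_{\text{reported}}(\lnot H)},$$

and the model says

$$\text{true evidence} \;=\; c \times \text{felt evidence},$$

where c is the calibration parameter: c = 1 means the gut feeling matches the data's actual weight, and c < 1 means the gut feeling overstates how much the data should move you.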

But sticking to the constant-calibration model, it's still not possible to estimate your calibration from your given confidence intervals, because for that we need an idea of what your internal prior (your "prior" prior, before you've taken into account the felt evidence) is, which is hard to get any decent sense of. You can work off of iffy assumptions, such as assuming that your prior for percentage answers from a trivia game is fitted to the set of all the percentage answers from this trivia game and has some simple form (e.g. Beta). The Aumann game gave an advantage in this respect, because rather than comparing your probability distribution before and after thinking about the question, it makes it possible to compare the distribution before and after hearing other people's arguments and evidence; if you always speak in terms of standard probability distributions, it's not too hard to infer your calibration there.

Further "funny" issues can arise when you get down to work; for instance, if your prior was a Student-t with df n1 and scale s1 and your posterior was a Student-t with df n2 and scale s2, then your calibration cannot be more than 1/(1 - s1^2/s2^2) without having your posterior explode. It's tempting to say the lesson is that things break if you're becoming asymptotically less certain, which makes some intuitive sense: if your distributions are actually mixtures of finitely many different hypotheses that you're Bayesianly updating the weights of, then you will never become asymptotically less certain; in particular the Student-t scenario I described can't happen. However, this is not a satisfactory conclusion, because the Normal scenario (where you increase your variance by upweighting a hypothesis that gives higher variance) can easily happen.

A different resolution to the above is that the model of evidence=calibration*felt evidence is wrong, and needs an error term or two; that can give a workable result, or at least not catch fire and die.

Another thought: if your mental process is like the one two paragraphs up, where you're working with a mixture of several fixed (e.g. normal) hypotheses, and the calibration concept is applied to how you update the weights of the hypotheses, then the change in the mixture distribution (i.e. the marginal) will not follow anything like the calibration model.

So the concept is pretty tricky unless you carefully choose problems where you can reasonably model the mental inference, and in particular try to avoid "mixture-of-hypotheses"-type scenarios (unless you know in advance precisely what the hypotheses imply, which is unusual unless you construct the questions that way... but then I can't think of why you'd ask about the mixture instead of about the probabilities of the hypotheses themselves).

You might be okay when looking at typical multiple-choice questions; certainly you won't run into the issues with broken posteriors and invalid calibrations. Another advantage is that "the" prior (i.e. uniform) is uncontroversial, though whether the prior to use for computing calibration should be "the" prior is not obvious; but if you don't have before-and-after results from people then I guess it's the best you can do.

I just noticed that what's usually called the "likelihood" I was calling "evidence" here. This has probably been suggested by someone before, but: I've never liked the term "likelihood", and this is the best replacement for it that I know of.

I ran into the same problem a while back and became frustrated that there wasn't an elegant answer. It should at least be possible to unambiguously spot under- and over-confidence, but even this is not clear.

I guess we need to define exactly what we're trying to measure and then treat it as an estimation problem, where each response is a data point stochastically generated from some hidden "calibration" parameter. But this is rife with mind-projection-fallacy pitfalls because the respondents' reasoning processes and probability assignments need to be treated as objective parts of the world (which they are, but it's just hard to keep it all from becoming confused).


I wrote a post on a related topic that may or may not prove useful to you: Calibration for continuous quantities. You could extend the histogram method described therein with a score based on a frequentist test of model fit such as the Kolmogorov-Smirnov test.
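
Roughly, the idea could be coded up like this (a sketch under the assumption that the histogram method amounts to checking where each realised value fell in its predictive CDF; the Normal predictive distributions and the numbers are just placeholders):

```python
import numpy as np
from scipy.stats import kstest, norm

# Hypothetical setup: each prediction is a Normal(mu, sigma) distribution over
# a continuous quantity, and `observed` holds the values that actually occurred.
mu = np.array([10.0, 4.0, 7.5, 12.0, 6.0])
sigma = np.array([2.0, 1.0, 3.0, 2.5, 1.5])
observed = np.array([11.2, 3.1, 9.0, 12.4, 5.5])

# Where each outcome fell within its own predictive CDF.
pit = norm.cdf(observed, loc=mu, scale=sigma)

# For a calibrated forecaster these values should be uniform on [0, 1];
# the KS statistic (or its p-value) is then a single-number summary of misfit.
statistic, p_value = kstest(pit, "uniform")
print(statistic, p_value)
```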

It's my position that calibration is fundamentally a frequentist quantity -- barring events of epistemic probability zero, a Bayesian agent could only ever consider itself unlucky, not poorly calibrated, no matter how wrong it was.

Wouldn't an observed mismatch between assigned probability and observed frequency count as Bayesian evidence towards miscalibration?

I think part of the trouble is that it's very difficult to comment meaningfully on the calibration of a single estimate without background information.

For example, suppose Alice and Bob each make one prediction, with confidence of 90% and 80% respectively, and both turn out to be right. I'd be happy to say that it seems like Alice is so far the better predictor of the two (although I'd be prepared to revise this estimate with more data), but it's much harder for me to say who is better calibrated without some background information about what sort of evidence they were working from.

With that in mind, I don't think you're likely to find something as convenient as log scoring, though there are a couple of less mathematically elegant solutions that only work when you have a reasonably large set of predictions to test for calibration (I don't know if this is rigorous enough to help you). Both only work for binary true/false predictions but can probably be generalised to other uses.

You could check what proportion of the time they are right, calculate what their log score would have been if they had used this as their confidence for every prediction, and compare this to the score they actually got.

Another approach would be to examine what happens to their score when you multiply the log odds of every estimate by a constant. Multiplying by a constant greater than one will move estimates towards 0 and 1 and away from 50%, while a constant less than one will do the opposite. Find the constant which maximises their score: if it's significantly less than 1 they're overconfident, if it's significantly more than 1 they're underconfident, and if it's roughly equal to 1 they're well calibrated.
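
Rough sketches of both checks (assuming you have each prediction's stated confidence and whether it was right; the clipping and the optimiser choice are mine):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import expit

def log_score(confidences, outcomes):
    """Total log score for binary predictions: log(p) when right, log(1-p) when wrong."""
    p = np.clip(np.asarray(confidences, float), 1e-6, 1 - 1e-6)
    y = np.asarray(outcomes, float)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def flat_confidence_comparison(confidences, outcomes):
    """First check: actual log score vs. the score of stating the overall
    proportion correct as a flat confidence on every prediction."""
    y = np.asarray(outcomes, float)
    a = np.clip(y.mean(), 1e-6, 1 - 1e-6)
    return log_score(confidences, outcomes), log_score(np.full(len(y), a), y)

def best_recalibration_constant(confidences, outcomes):
    """Second check: the constant c maximising the log score after scaling every
    prediction's log odds by c (c < 1 suggests overconfidence, c > 1
    underconfidence, c near 1 good calibration)."""
    p = np.clip(np.asarray(confidences, float), 1e-6, 1 - 1e-6)
    y = np.asarray(outcomes, float)
    logodds = np.log(p / (1 - p))

    def neg_score(c):
        q = np.clip(expit(c * logodds), 1e-12, 1 - 1e-12)  # rescaled predictions
        return -np.sum(y * np.log(q) + (1 - y) * np.log(1 - q))

    return minimize_scalar(neg_score, bounds=(0.01, 10.0), method="bounded").x
```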

You could check what proportion of the time they are right, calculate what their log score would have been if they had used this as their confidence for every prediction, and compare this to the score they actually got.

Someone who is perfectly calibrated and doesn't always give the same confidence will have a better log score than someone who gives the same series of guesses all using the mean accuracy as confidence. So the latter can't be used as a gold standard.

That's actually intentional. I think that if someone is right 90% of the time in some subjects but only right 60% of the time in others, they are better calibrated if they give the appropriate estimate for each subject than if they just give 75% for everything.