There's a whole subfield on "scoring rules", which try to measure people's calibration and resolution more exactly.
There are scoring rules that incorporate priors, scoring rules that incorporate the information value to the question asker, and scoring rules that are sensitive to distance (if you're close to the answer, you get more points). There's a class of "strictly proper" scoring rules that incentivize people to report their true probability. I did a deep dive into scoring rules when writing the Verity whitepaper. Here are some of the more interesting/useful research articles on scoring rules:
Order-Sensitivity and Equivariance of Scoring Functions - PDF - arxiv.org: https://www.evernote.com/l/AAhfW6RTrudA9oTFtd-vY7lRj0QlGTNp4bI/
Tailored Scoring Rules for Probabilities: https://www.evernote.com/l/AAhVczys0ddF3qbfGk_s4KLweJm0kUloG7k/
Scoring Rules, Generalized Entropy, and Utility Maximization: https://www.evernote.com/l/AAh2qdmMLUxA97YjWXhwQLnm0Ro72RuJvcc/
The Wisdom of Competitive Crowds: https://www.evernote.com/l/AAhPz9MMSOJMcK5wrr8mQGNQtSOvEeKbdzc/
A formula for incorporating weights into scoring rules: https://www.evernote.com/l/AAgWghOuiUtIe76PQsXwFSPKxGv-VkzH7l8/
Sensitivity to Distance and Baseline Distributions in Forecast Evaluation: https://www.evernote.com/l/AAg7aZg9BjRDLYQ2vpGow-qqN9Q5XY-hvqE/
One thing you might look at is the Brier Score, particularly the 3-component decomposition.
Score = Reliability - Resolution + Uncertainty
The nice thing about this decomposition is that it gives you more information than a single score. The uncertainty term is a sort of 'difficulty' score: it doesn't take the predictions into account and is minimized when the same outcome occurs every time.
The resolution tells you how much information each prediction gives. For an event that occurs half of the time, you could predict a probability of 0.5 for everything, but if you knew more about what was going on you might be able to predict a 1 or a 0. That is a much stronger statement, so the resolution term gives you credit for it.
Reliability is then much like the scoring metric you describe. It is minimized (which is good, since it's a loss score) when all of the events you predict with 0.2 occur 20% of the time; that is, when your predictions match the observed frequencies.
All of this happens at arbitrary precision: it's just operations on real vectors, so the only limit is your floating-point precision.
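To make the decomposition concrete, here is a minimal numpy sketch (the function and variable names are my own, not from any particular library). Grouping by each distinct forecast value makes the identity Score = Reliability - Resolution + Uncertainty hold exactly, with no analyst-chosen bins:

```python
# Minimal sketch of the 3-component Brier decomposition, grouping forecasts
# by their exact probability value rather than by analyst-chosen bins.
import numpy as np

def brier_decomposition(p, o):
    """p: forecast probabilities; o: outcomes (0 or 1).
    Returns (reliability, resolution, uncertainty); the Brier score is
    reliability - resolution + uncertainty."""
    p = np.asarray(p, dtype=float)
    o = np.asarray(o, dtype=float)
    n = len(p)
    base_rate = o.mean()
    uncertainty = base_rate * (1.0 - base_rate)

    reliability = 0.0
    resolution = 0.0
    for prob in np.unique(p):        # one "category" per distinct forecast value
        mask = p == prob
        n_k = mask.sum()
        o_k = o[mask].mean()         # observed frequency for this forecast value
        reliability += n_k * (prob - o_k) ** 2
        resolution += n_k * (o_k - base_rate) ** 2
    reliability /= n
    resolution /= n
    return reliability, resolution, uncertainty

# Sanity check: the components recombine to the mean squared error.
rng = np.random.default_rng(0)
p = rng.choice([0.2, 0.5, 0.8], size=1000)
o = (rng.random(1000) < p).astype(float)
rel, res, unc = brier_decomposition(p, o)
assert np.isclose(rel - res + unc, np.mean((p - o) ** 2))
```

Grouping by exact forecast value sidesteps the bucket-width question raised below, at the cost of noisier per-value frequencies when few forecasts share the same value.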
Doesn't
"n_k, the number of forecasts with the same probability category"
indicate that this is using histogram buckets? I'm trying to say that I'm looking for methods that avoid grouping probabilities into an arbitrary (analyst-chosen) number of categories. For instance, in the (possibly straw) histogram method that I discussed in the question, if a predictor makes a lot of 0.97 bets and no corresponding 0.93 bets, their [0.9, 1] category will be called slightly pessimistic about its predictions even if those forecasts came true exactly 0.97 of the time. I wouldn't describe anything in that genre as exact, even if it is the best we have.
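To make that concern concrete, here is a toy check, assuming (as the question does) that the [0.9, 1] bucket is judged against its midpoint of 0.95:

```python
# Toy illustration of the binning artifact described above.
import numpy as np

rng = np.random.default_rng(1)
p = np.full(1000, 0.97)                      # many 0.97 bets, no 0.93 bets
o = (rng.random(1000) < 0.97).astype(float)  # ...and they come true ~97% of the time

observed = o.mean()
midpoint_gap = observed - 0.95   # bucket says ~0.95, reality says ~0.97
true_gap = observed - 0.97       # against the actual forecasts, roughly zero
print(f"vs bucket midpoint: {midpoint_gap:+.3f}   vs actual forecasts: {true_gap:+.3f}")
```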
Yes, you should draw calibration curves without binning.
The calibration curve lives in a plane where the x-axis is the predicted probability and the y-axis is the actual proportion of outcomes. Each prediction can be placed in this plane: a scored prediction is a pair of an outcome b, either 0 or 1, and the earlier prediction p. The value p belongs on the x-axis, and b acts like a value on the y-axis, so (p, b) makes sense as a point in the x-y plane. It is valuable to plot the scatterplot of this representation of the predictions. The calibration curve is a curve attempting to approximate this scatterplot. A technique for turning a scatterplot into the graph of a function is called a smoother, and every smoother yields a different notion of calibration curve. The most popular general-purpose smoother is loess, and it is also the most popular smoother for the specific task of drawing calibration curves without bins. Frank Harrell (2-28) suggests tweaking the algorithm, setting α=1.
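Here is a rough sketch of that approach, using statsmodels' lowess as a stand-in for loess; the synthetic forecaster and the frac argument (which plays a role similar to the span α mentioned above) are my own choices:

```python
# Bin-free calibration curve: scatter the (p, b) pairs and run a
# lowess smoother through them.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(2)
p = rng.uniform(0.05, 0.95, size=500)           # forecast probabilities
b = (rng.random(500) < p ** 1.2).astype(float)  # outcomes from a slightly miscalibrated forecaster

curve = lowess(b, p, frac=1.0, return_sorted=True)  # columns: p (sorted), smoothed frequency

plt.scatter(p, b, s=5, alpha=0.3, label="(p, b) pairs")
plt.plot(curve[:, 0], curve[:, 1], label="lowess calibration curve")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("predicted probability")
plt.ylabel("observed frequency")
plt.legend()
plt.show()
```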
Yes, there's math.
I asked this question a while ago at statistics.SE.
Tetlock uses the Brier score both in his earlier research and in the Good Judgment Project.
http://faculty.engr.utexas.edu/bickel/Papers/QSL_Comparison.pdf is also worth reading
It's important to note that accuracy and calibration are two different things. I'm mentioning this because the OP asks for calibration metrics, but several answers so far give accuracy metrics. Any proper scoring rule is a measure of accuracy as opposed to calibration.
It is possible to be very well-calibrated but very inaccurate; for example, you might know that it is going to be Monday 1/7th of the time, so you give a probability of 1/7th. Everyone else just knows what day it is. On a calibration graph, you would be perfectly lined up; when you say 1/7th, the thing happens 1/7th of the time.
It is also possible to have high accuracy and poor calibration. Perhaps you can guess coin flips when no one else can, but you are wary of your precognitive powers, which makes you underconfident. So you always place 60% probability on the event that actually happens (heads or tails). Your calibration graph is far out of line, but your accuracy is higher than anyone else's.
In terms of improving rationality, the interesting thing about calibration is that (as in the precog example) if you know you're poorly calibrated, you can boost your accuracy simply by improving your calibration. In some sense it is a free improvement: you don't need to know anything more about the domain; you get more accurate just by knowing more about yourself (by seeing a calibration chart and adjusting).
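As a toy illustration of that "free improvement", using the precog example above and a simple lookup-table recalibration (the setup and function names are mine):

```python
# Learn the empirical frequency behind each stated probability on past data,
# then restate future forecasts accordingly.
import numpy as np

def recalibration_map(p_past, o_past):
    """Map each distinct stated probability to its observed frequency."""
    return {prob: o_past[p_past == prob].mean() for prob in np.unique(p_past)}

def brier(p, o):
    return np.mean((np.asarray(p) - np.asarray(o)) ** 2)

# The "precog" always says 0.6 on the side of the coin that actually comes up.
p_past = np.full(200, 0.6)
o_past = np.ones(200)                      # ...and is right every time

remap = recalibration_map(p_past, o_past)  # {0.6: 1.0}

p_new = np.full(50, 0.6)
o_new = np.ones(50)
p_adjusted = np.array([remap[x] for x in p_new])

print(brier(p_new, o_new))       # 0.16  (accurate but badly calibrated)
print(brier(p_adjusted, o_new))  # 0.0   (same knowledge, recalibrated)
```

The adjusted forecasts use exactly the same domain knowledge; only the stated probabilities change.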
However, if you just try to be more calibrated without any concern for accuracy, you could be like the person who says 1/7th. So, just aiming to do well on a score of calibration is not a good idea. This could be part of the reason why calibration charts are presented instead of calibration scores. (Another reason being that calibration charts help you know how to adjust to increase calibration.)
That being said, a decomposition of a proper scoring rule into components including a measure of calibration, like Dark Denego gives, seems like the way to go.
On platforms like PredictionBook, a user's credibility is measured by looking at a histogram of all of their predictions. (Edit: after being told about the Brier score, I have now noticed that PredictionBook does in fact show a Brier score right under the histogram. Still, Brier scores aren't very good either: the only way to get a good Brier score is to bet on things you're certain of, which creates perverse incentives to avoid placing bets on uncertain events, and we already have enough of those in natural discourse.) For a perfectly well-calibrated agent, exactly 90% of their 0.9 predictions should have come true, exactly 20% of their 0.2 predictions should have come true, and so on. We can reduce a person's calibration to a number by, I suppose, taking the sum of squares of the differences between each histogram bar and its bin's middle point (I don't know if that's how they do it).
But whatever they do with the histogram, I think I'll find it very unsatisfying, because the results are going to depend on what spacing we used for the histogram segments, and that's arbitrary. We could have a histogram with ten segments or a histogram with fifty, and we'd get different scores. I feel like there must be some exact, continuous way of scoring a predictor's calibration, and that the math will probably be generally useful for other things. Is there a method?
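For reference, here is a sketch of the kind of histogram score described above. It is a guess at the method, not what PredictionBook actually computes, and changing n_bins changes the result, which is exactly the arbitrariness being objected to:

```python
# Histogram-bucket calibration score: compare each bin's observed frequency
# to the bin midpoint and sum the squared gaps.
import numpy as np

def histogram_calibration_score(p, o, n_bins=10):
    p, o = np.asarray(p, dtype=float), np.asarray(o, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    midpoints = (edges[:-1] + edges[1:]) / 2
    score = 0.0
    for lo_edge, hi_edge, mid in zip(edges[:-1], edges[1:], midpoints):
        # include 1.0 in the last bucket
        mask = (p >= lo_edge) & ((p < hi_edge) if hi_edge < 1.0 else (p <= hi_edge))
        if mask.any():
            score += (o[mask].mean() - mid) ** 2
    return score

rng = np.random.default_rng(3)
p = rng.uniform(0, 1, 2000)
o = (rng.random(2000) < p).astype(float)
print(histogram_calibration_score(p, o, n_bins=10))
print(histogram_calibration_score(p, o, n_bins=50))  # different bins, different score
```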