Log of my attempts so far:
Attempt #1: note that, for any probability p, you can compute the "number of predictions you made with probability less than p that came true". If you're perfectly calibrated, then this should be a random variable with:
mean = sum(q for q in prediction_probs if q<p)
variance = sum(q*(1-q) for q in prediction_probs if q<p)
Let's see what this looks like if we plot it as a function of p. Let's consider three people: one perfectly calibrated, one overconfident, and one underconfident.
Let's have each person make 1000 predictions with probabilities uniformly distributed in [0,1]; and then sample outcomes for each set of predictions and plot out their num-true-predictions-below functions. (The gray lines show the mean and first 3 stdev intervals for a perfectly calibrated predictor.)
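Here's a minimal sketch of that simulation; the specific bias functions for the three predictors are my own guesses, chosen only to make the picture concrete:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 1000
stated = np.sort(rng.uniform(0, 1, n))  # each person's stated probabilities, uniform in [0,1]

# Hypothetical bias functions mapping stated probability -> true frequency.
# (The exact shapes are assumptions, just for illustration.)
biases = {
    "calibrated":     lambda p: p,
    "overconfident":  lambda p: 0.5 + 0.8 * (p - 0.5),                     # says 90% when reality is ~82%
    "underconfident": lambda p: np.clip(0.5 + 1.2 * (p - 0.5), 0.0, 1.0),  # says 60% when reality is ~62%
}

for name, f in biases.items():
    outcomes = rng.uniform(0, 1, n) < f(stated)
    # number of predictions with stated probability < p that came true, as a function of p
    plt.plot(stated, np.cumsum(outcomes), label=name)

# mean and +/- 1, 2, 3 stdev bands for a perfectly calibrated predictor
mean = np.cumsum(stated)
std = np.sqrt(np.cumsum(stated * (1 - stated)))
for k in range(-3, 4):
    plt.plot(stated, mean + k * std, color="gray", linewidth=0.5)

plt.xlabel("p")
plt.ylabel("num true predictions with stated prob < p")
plt.legend()
plt.show()
```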
Hrrm. The y-axis range is too large to see the variation. Let's subtract off the mean.
And to get a feeling for how else this plot could have looked, let's run 100 more simulations for each of the three people:
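The mean-subtracted, many-simulations version would look something like this (shown here only for the calibrated predictor to keep it short; the other two would just swap in their bias functions from the sketch above):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
stated = np.sort(rng.uniform(0, 1, 1000))
mean = np.cumsum(stated)                     # expected num-true-below-p for a calibrated predictor

# 100 re-rolls of the outcomes for a perfectly calibrated predictor, with the mean subtracted off
for _ in range(100):
    outcomes = rng.uniform(0, 1, 1000) < stated
    plt.plot(stated, np.cumsum(outcomes) - mean, color="green", alpha=0.1)

# the same +/- 1, 2, 3 stdev bands, now centered on zero
std = np.sqrt(np.cumsum(stated * (1 - stated)))
for k in range(-3, 4):
    plt.plot(stated, k * std, color="gray", linewidth=0.5)

plt.xlabel("p")
plt.ylabel("num true below p, minus the calibrated mean")
plt.show()
```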
Okay, this is pretty good!
But it's not perfect: everything's too squished together on the left to see what's happening -- a predictor could be really screwing up their very-low-probability predictions and this graph would hide it. Possibly related to that squishing, I feel like the plot should be right-left symmetric, to reflect the symmetries of the predictors' biases. But it's not.
Attempt #2: the same thing, except instead of plotting
sum(1 for (q, came_true) in predictions if q < p and came_true)
we plot
sum((-log(q) if came_true else -log(1-q)) for (q, came_true) in predictions if q < p)
i.e. we measure the total "surprisal" for all your predictions with probability under p. (I'm very fond of surprisal; it has some very appealing information-theory-esque properties.)
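A sketch of that cumulative-surprisal curve, here just for a single calibrated predictor:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
stated = np.sort(rng.uniform(0.01, 0.99, 1000))    # avoid 0 and 1 so the logs stay finite
outcomes = rng.uniform(0, 1, 1000) < stated         # outcomes for a calibrated predictor

# surprisal of each prediction: -log of the probability assigned to what actually happened
surprisal = np.where(outcomes, -np.log(stated), -np.log(1 - stated))

# total surprisal over all predictions with stated probability < p, as a function of p
plt.plot(stated, np.cumsum(surprisal))
plt.xlabel("p")
plt.ylabel("total surprisal of predictions with stated prob < p")
plt.show()
```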
On the bright side, this plot has less overlap between the three predictors' typical sets of lines. And the red curves look... more symmetrical, kinda, like an odd function, if you squint. Same for the blue curves.
On the dark side, everything is still too squished together on the left. (I think this is a problem inherent to any "sum(... for q in prediction_probs if q<p)" function. I tried normalizing everything in terms of stdevs, but it ruined the symmetry and made everything kinda crazy on the left-hand side.)
There is the Brier score, or any other proper scoring rule. These have the advantage of being zero-degrees-of-freedom up to the choice of scoring rule, though they aren't information-preserving and aren't comparable across different sets of predictions. (Though neither is any analogue of a CDF.)
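For reference, the Brier score is just the mean squared error between the stated probabilities and the 0/1 outcomes:

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared difference between stated probabilities and 0/1 outcomes.
    A proper scoring rule: in expectation it's minimized by reporting honest probabilities."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return np.mean((probs - outcomes) ** 2)
```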
The problem is that a proper scoring rule measures the predictor's amount of knowledge about the questions as well as their calibration.
My model would be as follows. For a fixed source of questions, each person has a distribution describing how much they know about the questions: it says how likely it is that a given question is one they should say p on. Each person also has a calibration function f, such that when they should say p they instead say f(p). Then, by assigning priors over the spaces of these distributions and calibration functions and applying Bayes' rule, we get a...
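A rough sketch of what fitting such a calibration function could look like. The parametric family f(p) = sigmoid(a*logit(p) + b) and the maximum-likelihood shortcut (rather than full priors plus Bayes' rule) are my simplifications, and it ignores the knowledge-distribution part of the model:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logit

rng = np.random.default_rng(0)

# Synthetic overconfident predictor: stated probabilities are more extreme than the
# probabilities they "should" say, via the calibration function f(p) = sigmoid(1.5 * logit(p)).
true_p = rng.uniform(0.05, 0.95, 2000)
stated = expit(1.5 * logit(true_p))
outcomes = rng.uniform(0, 1, 2000) < true_p

def neg_log_likelihood(params, stated, outcomes):
    log_a, b = params
    a = np.exp(log_a)                                  # keep the slope positive
    implied_true = expit((logit(stated) - b) / a)      # invert f to recover the "should say" probs
    implied_true = np.clip(implied_true, 1e-12, 1 - 1e-12)
    return -np.sum(np.where(outcomes, np.log(implied_true), np.log(1 - implied_true)))

fit = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(stated, outcomes))
print(np.exp(fit.x[0]), fit.x[1])                      # should land near a ~ 1.5, b ~ 0
```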
Let p be your estimate of the probability of some event.
Let's define the penalty as -log(p) if the event happened, and -log(1-p) otherwise.
If q is the true probability of the event, the expected penalty equals -q*log(p) - (1-q)*log(1-p).
And the derivative of the expected penalty with respect to p equals -q/p + (1-q)/(1-p), which is zero exactly when p = q.
So, if you try to minimize your average penalty, you are motivated to give your best possible estimate. It's possible that you can use this to grade your own predictions; I don't know for sure.
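A quick numerical check of that claim, assuming the log penalty above:

```python
import numpy as np

q = 0.7                                    # true probability of the event
p = np.linspace(0.01, 0.99, 981)           # candidate estimates
expected_penalty = -q * np.log(p) - (1 - q) * np.log(1 - p)
print(p[np.argmin(expected_penalty)])      # ~0.7: the expected penalty is minimized at p = q
```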
When I see people grading their predictions, it's always by: (a) bucketing their predictions by probability (into a "46-55%" bucket, a "56-75%" bucket, ...), and then (b) plotting each bucket's nominal probability vs empirical frequency-of-correctness. See e.g. Scott Alexander here.
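For reference, a sketch of that standard approach; the bucket edges here are my own arbitrary choice, which is exactly the inelegance complained about below:

```python
import numpy as np

def bucketed_calibration(probs, outcomes, edges=(0.0, 0.45, 0.55, 0.75, 1.01)):
    """For each probability bucket, compare the mean stated probability with the
    empirical frequency of predictions in that bucket that came true."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    curve = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs < hi)
        if mask.any():
            curve.append((probs[mask].mean(), outcomes[mask].mean(), int(mask.sum())))
    return curve   # list of (nominal probability, empirical frequency, bucket size) triples
```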
This seems... fine... but the bucketing step has a certain inelegance to it: just as you can build many different-looking histograms for the same dataset, you can build many different-looking calibration curves for the same predictions, based on a semi-arbitrary choice of bucketing algorithm. Also, by bucketing datapoints together and then aggregating over the bucket, information is destroyed.
For histograms, there's an information-preserving, zero-degree-of-freedom alternative: the CDF. The CDF isn't perfect, but it at least has a different set of problems from histograms.
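For comparison, the empirical CDF involves no binning choices at all:

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.default_rng(0).normal(size=200)
x = np.sort(data)
y = np.arange(1, len(x) + 1) / len(x)   # empirical CDF: every datapoint kept, no bins to choose
plt.step(x, y, where="post")
plt.xlabel("value")
plt.ylabel("fraction of data <= value")
plt.show()
```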
Is there any similar tool for grading predictions?