What's wrong with traditional error bars or some equivalent thereof?
If you are forecasting boolean events, you can either calculate the expected dispersion analytically (if you're lazy) or just simulate it (if you're lazy in a different way). Plot it to taste (error bars, boxplots, watercolour plots, etc.) and then overlay actual results.
There is no reason to restrict ourselves to a simple set of lines or bars, is there?
What's wrong with traditional error bars or some equivalent thereof?
Do you mean just adding error bars (which indicate amounts of noise in each sample) to the traditional-style bar graph? If so, it doesn't affect most of the drawbacks and benefits that I mention in the post (e.g. that you restrict yourself to a small set of predefined confidence levels, and that you get discontinuities in the result when your data changes slightly).
which indicate amounts of noise in each sample
Nope.
For each confidence level there is a distribution of actual outcomes that you'd expect. You can calculate (or simulate) it for any confidence level so you are not restricted to a small predefined set. This is basically your forecast: you are saying "I have made n predictions allocated to the X confidence bucket and the distribution of successes/failures should look like this". Note: it's a distribution, not a scalar. There is also no noise involved.
You plot these distributions in any form you like and then you overlay your actual number of successes and failures (or their ratio if you want to keep things simple).
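To make this concrete, here's a minimal sketch of what I mean (the function name is mine, and the 60% bucket figures are just taken from Scott's results quoted later in the thread): simulate the distribution of success counts a perfectly calibrated forecaster would get in one bucket, then see where the actual count falls.

```python
# Minimal sketch: for one confidence bucket, simulate the distribution of
# success counts you'd expect under perfect calibration, then compare the
# actual count against it.
import numpy as np

def expected_success_distribution(p, n_predictions, n_sims=10_000, rng=None):
    """Simulate how many successes a perfectly calibrated forecaster
    would get out of n_predictions made at confidence p."""
    rng = rng or np.random.default_rng(0)
    return rng.binomial(n_predictions, p, size=n_sims)

# Example: 21 predictions at 60% confidence, 12 of which actually came true.
sims = expected_success_distribution(p=0.6, n_predictions=21)
actual = 12
print(np.percentile(sims, [5, 50, 95]))   # rough expected range of success counts
print((sims <= actual).mean())            # empirical quantile of the actual count
```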
For each confidence level there is a distribution of actual outcomes that you'd expect
X confidence bucket and the distribution
Yes, sure, but if you do this for each confidence level separately, then you have already given up on most of the improvements that are possible with my proposed method.
The most serious problem with the traditional method is not nicely showing the distribution in each bucket, but the fact that there are separate buckets at all.
I am probably not explaining myself clearly.
Your buckets can be as granular as you want. For the sake of argument, let's set them at 1% -- so that you have a 50% bucket, a 51% bucket, a 52% bucket, etc. Each bucket yields the same kind of binomial distribution, just with a slightly different success probability, so neighbouring distributions are close. If you plot it, it's going to be nice and smooth, with no discontinuities.
Some, or maybe even most, of your buckets will be empty. That's fine, we'll interpolate values into them because one of our prior assumptions is smoothness -- we don't expect the calibration for the 65% bucket to be radically different from the calibration for the 66% bucket. We'll probably have a tuning parameter to specify how smooth we would like things to be.
So what's happening with the non-empty buckets? For each of them we can calculate the error which we can define as, say, the difference between the actual and the expected ratio of successes to predictions (it can be zero). This gives you a set of points which, if plotted, will be quite spiky since a lot of buckets might have a single observation in them. At this point you recall our assumption of smoothness and run some sort of a kernel smoother on your error points (note that they should be weighted by the number of observations in a bucket). If you just want a visual representation, you plot that smoothed line and you're done.
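A rough sketch of that smoothing step, assuming a Gaussian kernel with the bandwidth playing the role of the tuning parameter mentioned above (the kernel choice and all names are mine, not a fixed part of the proposal; the example numbers are Scott's buckets quoted later in the thread):

```python
# Per-bucket errors (actual minus expected success ratio), weighted by bucket
# size, run through a simple Gaussian kernel smoother (Nadaraya-Watson style).
import numpy as np

def smoothed_calibration_error(probs, successes, counts, bandwidth=0.05):
    """probs: confidence level of each non-empty bucket
    successes, counts: observed successes and total predictions per bucket
    Returns a grid of probabilities and the smoothed error curve."""
    probs = np.asarray(probs, float)
    errors = np.asarray(successes, float) / counts - probs   # actual - expected ratio
    weights = np.asarray(counts, float)
    grid = np.linspace(0.5, 1.0, 101)
    # kernel weights times bucket-size weights
    k = np.exp(-0.5 * ((grid[:, None] - probs[None, :]) / bandwidth) ** 2)
    w = k * weights[None, :]
    return grid, (w * errors[None, :]).sum(axis=1) / w.sum(axis=1)

# e.g. Scott's buckets (quoted further down the thread):
grid, err = smoothed_calibration_error(
    probs=[0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99],
    successes=[8, 12, 13, 13, 16, 9, 3],
    counts=[13, 21, 16, 16, 17, 9, 3])
```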
If you want some numerical aggregate of this, a simple number would be the average error (note that we kept the sign of the error, so errors in different directions will cancel each other out). To make it a bit more meaningful, you can run a simulation of the perfect forecaster: run, say, 1000 iterations of the same set of predictions, where each prediction in the X% bucket comes true X% of the time and fails (100-X)% of the time. Calculate the average error in each iteration, form an empirical distribution out of these average errors, and see where your actual average error falls.
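And a sketch of that perfect-forecaster baseline (again, names are mine; I've weighted buckets by their prediction counts when averaging, as in the previous paragraph, and the example inputs are hypothetical):

```python
# Simulate many runs of the same set of predictions, assuming each X%
# prediction comes true exactly X% of the time, and collect the average
# (signed, count-weighted) error of each run.
import numpy as np

def perfect_forecaster_errors(probs, counts, n_iter=1000, rng=None):
    """probs, counts: confidence level and number of predictions per bucket.
    Returns n_iter simulated average errors of a perfectly calibrated forecaster."""
    rng = rng or np.random.default_rng(0)
    probs = np.asarray(probs, float)
    counts = np.asarray(counts)
    sims = rng.binomial(counts, probs, size=(n_iter, len(counts)))
    per_bucket_error = sims / counts - probs          # actual ratio minus expected
    return (per_bucket_error * counts).sum(axis=1) / counts.sum()

errors = perfect_forecaster_errors([0.6, 0.7, 0.9], [21, 16, 17])
my_error = 0.02   # hypothetical average error from my own predictions
print((errors <= my_error).mean())   # where my error falls in the simulated distribution
```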
If you want not a single aggregate, but estimates for some specific buckets (say, for 50%, 70%, and 90%), look at your kernel smoother line and read the value off it.
I'm not a fan of the traditional method -- I am particularly unenthusiastic about the way it depends on allowing only a limited number of specific probability estimates -- but I could do with a little more information and/or persuasion before being convinced that this proposal is Doing It Right.
If I have one of your graphs, how do I (1) quantify (if I want to) how well/badly I'm doing and (2) figure out what I need to change by how much?
Consider the graph you took from Slate Star Codex (incidentally, you have a typo -- it says "Start" in your post at present). If I'm Scott looking at that graph, I infer that maybe I should trust myself a little more when I feel 70% confident of something, and that maybe I'm not distinguishing clearly between 70% and 80%; and that when I feel like I have just barely enough evidence for something to mention it as a 50% "prediction", I probably actually have a little bit more. And, overall, I see that across the board I'm getting my probabilities reasonably close, and should probably feel fairly good about that.
(Note just in case there should be the slightest doubt: I am not in fact Scott.)
On the other hand, if I'm Scott looking at this
which is, if I've done it right, the result of applying your approach to his calibration data ... well, I'm not sure what to make of it. By eye and without thinking much, it looks as if it gets steadily worse at higher probabilities (which I really don't think is a good description of the facts); since it's a cumulative plot, perhaps I should be looking at changes in the gap sizes, in which case it correctly suggests that 0.5 is bad, 0.6 isn't, 0.7 is, 0.8 isn't ... but it gives the impression that what happens at 0.9-0.99 is much worse than what happens at lower probabilities, and I really don't buy that. And (to me) it doesn't give much indication of how good or bad things are overall.
Do you have some advice on how to read one of your graphs? What should Scott make of the graph shown above? Do you think the available evidence really does indicate a serious problem around 0.9-0.99?
I also wonder if there's mileage in trying to include some sort of error bars, though I'm not sure how principled a way there is to do that. For instance, we might say "well, for all we know the next question of each type might have gone either way" and plot corresponding curves with 1 added to all the counts:
The way the ranges overlap at the right-hand side seems to me to back up what I was saying above about the data not really indicating a serious problem for probabilities near 1.
First of all, thanks for your insightful reply. Let me know if you feel that I haven't done it justice with my counter-reply below.
I'll start by pointing out that you got the graph slightly wrong: you included some 50% predictions in one curve and some in the other. Instead, include all of them in both or in none (actually, it would be even better to force-split them into perfect halves). I neglected to point this out in my description, but obviously 50% predictions aren't any more "failed" than "successful" [edit: fixed now].
Here's my version:
It's still pretty similar to what you made, and your other concerns remain valid.
As you correctly noted, the version of the graph I'm proposing is cumulative. So at any point on the horizontal axis, the amount of divergence between the two lines is telling you how badly you are doing with your predictions up to that level of confidence.
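For concreteness, here is my best guess at how two such cumulative curves could be built -- this is a reconstruction from the discussion, not necessarily the exact construction in the post, and all names are mine: sort predictions by confidence and, at each level, compare the cumulative number of successes a perfectly calibrated forecaster would expect with the cumulative number actually observed.

```python
# Guessed reconstruction of the cumulative comparison (details may differ
# from the post): expected vs actual cumulative successes, by confidence.
import numpy as np

def cumulative_curves(stated_probs, outcomes):
    """stated_probs: confidence of each prediction (50% predictions pre-split
    half/half, as discussed above); outcomes: 1 if the prediction came true."""
    p = np.asarray(stated_probs, float)
    y = np.asarray(outcomes, float)
    order = np.argsort(p)
    p, y = p[order], y[order]
    expected = np.cumsum(p)      # what a perfectly calibrated forecaster expects
    actual = np.cumsum(y)        # what actually happened
    return p, expected, actual   # gap = expected - actual, in units of predictions
```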
Looking at changes in gap sizes has more of a noise problem. So with a lot of data -- sure, you can look at the graph and say "the gap seems to be growing fastest around 70%, so I'm underconfident around that range". But this is pretty weak.
If instead I look at the graph above and say: "the gap between the lines grows to around 20 expected predictions by 90%, so I can be pretty certain that I'm underconfident, and it seems to be happening somewhere in the 65-90% bracket"... then this is based on many more data points, and I also have a quantifiable piece of information: I'm missing around 20 additional hypothetical failed predictions to be perfectly calibrated.
Also, from my graph, the 70% range does not appear very special. It looks more like the whole area around 70%-90% data points has a slight problem, and I would conclude that I don't have enough information to say whether this is happening specifically at the 70% confidence level. So the overall result is somewhat different from the "traditional" method.
As for the values above 90%, I'd simply ignore them -- there are too few of them for the result to be significant. Your idea with brackets or error bars might help to visualize that in the high ranges you'd need much more data to get significant results. I am of course happy for people to add them whenever they think they are helpful -- or to simply cut the graph off at some reasonable value like 0.9 if they don't have much data.
[Note: maybe someone could alert Scott to this discussion?]
I'm not sure I'm convinced by what you say about 50% predictions. Rather than making an explicit counterargument, I will appeal to one of the desirable features you mentioned in your blog post, namely continuity. If your policy is to treat all 50% predictions as being half "right" and half "wrong" (or any other special treatment of this sort) then you introduce a discontinuity as the probability attached to those predictions changes from 0.5 to, say 0.50001.
the 70% range does not appear very special. It looks more like the whole area around 70% to 90% data points has a slight problem
Interesting observation. I'm torn between saying "no, 70% really is special and your graph misleads by drawing attention to the separation between the lines rather than how fast it's increasing" and saying "yes, you're right, 70% isn't special, and the traditional plot misleads by focusing attention on single probabilities". I think adjudicating between the two comes down to how "smoothly" it's reasonable to expect calibration errors to vary: is it really plausible that Scott has some sort of weird miscalibration blind spot at 70%, or not? If we had an agreed answer to that question, actually quantifying our expectations of smoothness, then we could use it to draw a smooth estimated-calibration curve, and it seems to me that that would actually be a better solution to the problem.
Relatedly: your plot doesn't (I think) distinguish very clearly between more predictions at a given level of miscalibration, and more miscalibration there. Hmm, perhaps if we used a logarithmic scale for the vertical axis or something it would help with that, but I think that would make it more confusing.
I'm not sure I'm convinced by what you say about 50% predictions. Rather than making an explicit counterargument, I will appeal to one of the desirable features you mentioned in your blog post, namely continuity. If your policy is to treat all 50% predictions as being half "right" and half "wrong" (or any other special treatment of this sort) then you introduce a discontinuity as the probability attached to those predictions changes from 0.5 to, say 0.50001.
This is true, but it is also true that the discontinuity is always present somewhere around 0.5 regardless of which method you use (you'll see it at some point when changing from 0.49999 to 0.50001).
To really get rid of this problem, you'd need a more clever trick: e.g. you could draw a single unified curve on the full interval from 0 to 1, and then draw another version of it that is its point reflection (using the (0.5, 0.5) point as center).
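To make the geometry explicit (a trivial sketch, nothing more): reflecting through the point (0.5, 0.5) sends (x, y) to (1-x, 1-y), so the reflected version of a curve y = f(x) on [0, 1] is y = 1 - f(1-x). The curve below is just a stand-in, not the actual calibration curve.

```python
# Point reflection through (0.5, 0.5): (x, y) -> (1 - x, 1 - y).
import numpy as np

x = np.linspace(0.0, 1.0, 201)
f = lambda t: t ** 2          # stand-in for whatever unified curve gets drawn
original = f(x)
reflected = 1 - f(1 - x)      # the same curve, reflected through the centre
```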
is it really plausible that Scott has some sort of weird miscalibration blind spot at 70%, or not?
Yes, this appears to be the crux here. My intuitive prior is against this "single blind spot" theory, but I don't have any evidence beyond Occam's razor and what I tend to observe in statistics of my personal prediction results.
Relatedly: your plot doesn't (I think) distinguish very clearly between more predictions at a given level of miscalibration, and more miscalibration there.
I'm not sure what exactly you think it doesn't distinguish. The same proportional difference, but with more predictions, is in fact more evidence for miscalibration (which is what my graph shows).
is in fact more evidence for miscalibration
Yes, but it's not evidence for more miscalibration, and I think "how miscalibrated?" is usually at least as important a question as "how sure are we of how miscalibrated?".
Sure. So "how miscalibrated" is simply the proportional difference between values of the two curves. I.e. if you adjust the scales of graphs to make them the same size, it's simply how far they appear to be visually.
adjust the scales of graphs to make them the same size
Note that if you have substantially different numbers of predictions at different confidence levels, you will need to do this adjustment within a single graph. That was the point of my remark about maybe using a logarithmic scale on the y-axis. But I still think that would be confusing.
it is also true that the discontinuity is always present somewhere around 0.5 regardless of which method you use
No, you don't always have a discontinuity. You have to throw out predictions at 0.5, but this could be a consequence of a treatment that is continuous as a function of p. You could simply weight predictions and say that those close to 0.5 count less. I don't know if that is reasonable for your approach, but similar things are forced upon us. For example, if you want to know whether you are overconfident at 0.5+ε you need 1/ε predictions. It is not just that calibration is impossible to discern at 0.5, but it is also difficult to discern near 0.5.
Yes, thank you, I was speaking about a narrower set of options (the ones we were considering).
I don't currently have an elegant idea about how to do the weighting (but I suspect that, to fit in nicely, it would most likely be done by subtraction, not multiplication).
If this is a valid method of scoring predictions, some mathematician or statistician should have already published it. We should try to find this publication rather than inventing methods from scratch.
If this is a valid method of scoring predictions, some mathematician or statistician should have already published it.
Wow, such learned helplessness! Do not try to invent anything new since if it's worthwhile it would have been invented before you... X-0
It's unlikely that an amateur could figure out something that trained mathematicians have been unable to figure out, especially in cases where the idea has potential failure modes that are not easily visible. It's like trying to roll your own cryptography, something else that amateurs keep doing and failing at.
One thing that stories of backyard inventors have in common is that the field in which they were doing backyard invention was so new that the low hanging fruits hadn't all been harvested. You aren't Thomas Edison, or even Steve Wozniak; mathematics and statistics are well-studied fields.
I continue to be surprised by the allegiance to the high-priesthood view of science on LW -- and that in a place founded by a guy who was trained in... nothing. I probably should update :-/
In any case, contemplate this: "People who say it cannot be done should not interrupt those who are doing it."
and that in a place founded by a guy who was trained in... nothing
I don't consider Eliezer to be an authority on anything other than maybe what he meant to write in his fanfics. I'm as skeptical of his work on AI risk as I am on this.
I don't consider Eliezer to be an authority
Maybe the right question isn't who is or is not an authority, but rather who writes/creates interesting and useful things?
You were suggesting that it was inconsistent to defer to experts on science, but use a site founded by someone who isn't an expert. I replied that I don't consider the site founder someone to defer to.
You were suggesting that it was inconsistent to defer to experts on science, but use a site founded by someone who isn't an expert.
No, I wasn't.
I pointed out that I find deferring to experts much less useful than you do, and that I am surprised to find high-priesthood attitudes persisting on LW. My surprise is not an accusation of inconsistency.
That only answers half the objection. Being a mathematician means that it is possible for you to solve such problems (if you are trained in the proper area of mathematics anyway--"mathematics" covers a lot of ground), but the low hanging fruit should still be gone. I'd expect that if the solution was simple enough that it could fit in a blogpost, some other person who is also a trained mathematician would have already solved it.
I think you're approaching this in the wrong frame of reference.
No one is trying to discover new mathematical truths here. The action of constructing a particular metric (for evaluating calibration) is akin to applied engineering -- you need to design something fit for a specific purpose while making a set of trade-offs in the process. You are not going to tell some guy in Taiwan designing a new motherboard that he's silly and should just go read the academic literature and do what it tells him to do, are you?
I endorse this (while remarking that both Lumifer and I have -- independently, so far as I know -- suggested in this discussion that a better approach may be simply to turn the observed prediction results into some sort of smoothed/interpolated curve and plot that rather than the usual bar chart).
Let me make a more concrete suggestion.
Step 0 (need be done only once, but so far as I know never has been): Get a number of experimental subjects with highly varied personalities, intelligence, statistical sophistication, etc. Get them to make a lot of predictions with fine-grained confidence levels. Use this to estimate how much calibration error actually varies with confidence level; this effectively gives you a prior distribution on calibration functions.
Step 1 (given actual calibration data): You're now trying to predict a single calibration function. Each prediction-result has a corresponding likelihood: if something happened that you gave probability p to, the likelihood is simply f(p) where f is the function you're trying to estimate; if not, the likelihood is 1-f(p). So you're trying to maximize the sum of log f(p) over successful predictions + the sum of log [1-f(p)] over unsuccessful predictions. So now find the posterior-maximizing calibration function. (You could e.g. pick some space of functions large enough to have good approximations to all plausible calibration functions, and optimize over a parameterization of that space.) You can figure out how confident you should be about the calibration function by sampling from the posterior distribution and looking at the resulting distribution of values at any given point. If what you have is lots of prediction results at each of some number of confidence levels, then a normal approximation applies and you're basically doing Gaussian process regression or kriging, which quite cheaply gives you not only a smooth curve but error estimates everywhere; in this case you don't need an explicit representation of the space of (approximate) permissible calibration functions.
[EDITED: I wrote 1-log where I meant log 1- and have now fixed this.]
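A bare-bones illustration of Step 1, using a deliberately simple parametric family of calibration functions, f(p) = sigmoid(a + b*logit(p)); that family is my choice for the sketch, not part of the proposal (which allows any sufficiently rich function space, or Gaussian process regression). With no Step-0 prior this is plain maximum likelihood rather than a true posterior maximum, and all names are mine.

```python
# Fit a calibration function f by maximizing
# sum log f(p) over successes + sum log(1 - f(p)) over failures.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logit

def fit_calibration(stated_probs, outcomes):
    """stated_probs: probability attached to each prediction (strictly between 0 and 1)
    outcomes: 1 if the prediction came true, 0 otherwise."""
    p = np.clip(np.asarray(stated_probs, float), 1e-6, 1 - 1e-6)
    y = np.asarray(outcomes, float)

    def neg_log_likelihood(params):
        a, b = params
        f = expit(a + b * logit(p))           # estimated true frequency
        f = np.clip(f, 1e-9, 1 - 1e-9)
        return -(y * np.log(f) + (1 - y) * np.log(1 - f)).sum()

    a, b = minimize(neg_log_likelihood, x0=[0.0, 1.0]).x
    return lambda q: expit(a + b * logit(np.asarray(q, float)))

# calibration = fit_calibration([0.6, 0.6, 0.7, ...], [1, 0, 1, ...])
# calibration(0.7)  -> estimated frequency with which "70%" predictions come true
```

(Note that a = 0, b = 1 recovers f(p) = p, i.e. perfect calibration, so the fitted parameters directly measure how far off you are.)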
If it's done, then someone has to do it first. This sort of calibration measurement isn't (so far as I know) a thing traditionally emphasized in statistics, so it wouldn't be super-surprising if the LW community were where the first solution came from.
(But, as mentioned in other comments on this post, I am not in fact convinced that SiH's approach is the Right Thing. For what it's worth, I am also a trained mathematician.)
I had a look at the literature on calibration and it seems worse than the work done in this thread. Most of the research on scoring rules has been done by educators, psychiatrists, and social scientists. Meanwhile there are several trained mathematicians floating around LW.
Also, I'm not sure if anyone else realises that this is an important problem. To care about it you have to care about human biases and Bayesian probability. On LW these are viewed as just two sides of the rationality coin, but in the outside world people don't really study them at the same time.
I have looked on Google Scholar. I could find several proposed measures of calibration. But none are very good; they're all worse than the things proposed in this thread.
A good calibration measure might be the ratio between your score (using the logarithmic scoring rule, say) and the score of your predictions transformed by whichever monotonic function makes your score greatest (I suppose this function should also be symmetric about 50%). To put that in words: "What proportion of your score cannot possibly be explained by miscalibration rather than ignorance?"
For example Scott's predictions were
Of 50% predictions, I got 8 right and 5 wrong, for a score of 62%
Of 60% predictions, I got 12 right and 9 wrong, for a score of 57%
Of 70% predictions, I got 13 right and 3 wrong, for a score of 81%
Of 80% predictions, I got 13 right and 3 wrong, for a score of 81%
Of 90% predictions, I got 16 right and 1 wrong, for a score of 94%
For 95% predictions, I got 9 right and 0 wrong, for a score of 100%
For 99% predictions, I got 3 right and 0 wrong, for a score of 100%
So he wishes that instead of 60% he had said 57%, instead of 70% and 80% he had said 81%, instead of 90% he had said 94% and instead of 95% and 99% he had said 100%. This would have improved his log score from -0.462 per question to -0.448 per question. So his calibration is -0.448/-0.462 = 97.2%, which is presumably very good.
EDIT: I'd say now that it might be better to take the difference rather than the ratio. Otherwise you'll look better calibrated on difficult problems just because your score will be worse overall.
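A quick numerical check of the ratio version described above, using exact observed frequencies rather than the rounded percentages (the EDIT's difference variant would just subtract the two per-question scores instead of dividing):

```python
# Recompute the calibration ratio from Scott's figures listed above.
import numpy as np

# (stated probability, right, wrong)
buckets = [(0.50, 8, 5), (0.60, 12, 9), (0.70, 13, 3), (0.80, 13, 3),
           (0.90, 16, 1), (0.95, 9, 0), (0.99, 3, 0)]
n = sum(r + w for _, r, w in buckets)

def mean_log_score(items):
    total = 0.0
    for p, right, wrong in items:
        total += right * np.log(p)
        if wrong:
            total += wrong * np.log(1 - p)
    return total / n

actual = mean_log_score(buckets)
# Replace each stated probability (except 50%, which stays fixed) with the
# observed frequency: 60% -> 12/21, 70%/80% -> 13/16, 90% -> 16/17, 95%/99% -> 1.
optimised = mean_log_score([(p if p == 0.5 else r / (r + w), r, w)
                            for p, r, w in buckets])
print(actual, optimised, optimised / actual)   # roughly -0.462, -0.448 and 0.97
```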
Your thinking seems to be going in a good direction, and I like the idea generally. However....
So his calibration is -0.448/-0.462 = 97.2%, which is presumably very good.
BAM! The operation of taking a ratio of logarithms here... well... you are actually just changing the base of a logarithm: https://en.wikipedia.org/wiki/List_of_logarithmic_identities#Changing_the_base . Give it some more thought please?
Are you saying that you think I'm doing something that isn't mathematically legitimate? I don't think I am.
I'm taking a sum of a bunch of logarithms and then dividing that by the sum of another bunch of logarithms. In the calculations above I used base e logarithms everywhere. Perhaps you are worried that the answer 97.2% depends on the base used. But if I had changed base to base x then the change of base formula says that I would have to divide the logarithms by log(x) (i.e. the base e logarithm of x). Since I would be dividing both the top and the bottom of my fraction by the same amount the answer would be unchanged.
Or perhaps you are worried that the final answer doesn't actually depend on the calibration? Let's pretend that for Scott's 90% predictions (of which he got 94% correct) he had actually said 85%. This would make his calibration worse, so let's see if my measure of his calibration goes down. His log score has worsened from -0.462 to -0.467. His "monotonically optimised" score remains the same (he now wishes he had said 94% instead of 85%) so it is still -0.448. Hence his calibration has decreased from -0.448/-0.462 = 97.2% to -0.448/-0.467 = 96.0%. This shows that his calibration has in fact got worse, so everything seems to be in working order.
EDIT: What the change-of-base formula does show is that my "calibration score" can also be thought of as the logarithm of the product of all the probabilities I assigned to the outcomes that did happen taken to the base of the product of the monotonically improved versions of the same probabilities. That seems to me to be a confusing way of looking at it, but it still seems mathematically legit.
Perhaps you are worried that the answer 97.2% depends on the base used.
Or perhaps you are worried that the final answer doesn't actually depend on the calibration?
No and no. I know all those things that you wrote.
That seems to me to be a confusing way of looking at it, but it still seems mathematically legit.
It's "legit" in the sense that the operation is well defined... but it's not doing the work you'd like it to do.
Your number is "locally" telling you which direction is towards more calibration, but is not meaningful outside of the particular configuration of predictions. And you already can guess that direction. What you need is to quantify something that is meaningful for different sets of predictions.
Example:
If I made 3 predictions at 70% confidence, and 2 failed and 1 was correct:
Mean log score: (ln(0.3)*2 + ln(0.7)) / 3 = -0.92154
If I said 33% confidence: (ln(0.67)*2 + ln(0.33)) / 3 = -0.63654
Your score is: 69%
If I made 3 predictions such that 1 failed and 1 was correct at 70%, and 1 failed at 60%:
Mean log score: (ln(0.3) + ln(0.7) + ln(0.4)) / 3 = -0.82565
If I said 50% instead of 70%, and 0% instead of 60%: (ln(0.5)*2 + ln(1)) / 3 = -0.462098
Your score is: 56%
Have you noticed that failing a prediction at 60% is clearly better than failing the same prediction at 70%?
However, your score is less in the former case.
Please forgive me if I sound patronizing. But inventing scoring rules is a tricky business, and it requires some really careful thinking.
Okay I understand what you're saying now.
It will take me a while to think about this in more detail, but for now I'll just note that I was demanding that we fix 50% at 50%, so 60% can't be adjusted to 0% but only down to 50%. So in the second case the score is log(0.5)*3/(log(0.3)+log(0.7)+log(0.4)) = 84.0% which is higher.
I think my measure should have some nice properties which justify it, but I'll take a while to think about what they are.
EDIT: I'd say now that it might be better to take the difference rather than the ratio. Otherwise you'll look better calibrated on difficult problems just because your score will be worse overall.
This is a distinct problem from my (1) since every field has things that are nearly 100% certain to happen.
I think that's because your #1 is actually two distinct problems :-). You mentioned two classes of prediction whose result would be uninformative about calibration elsewhere: easy near-certain predictions, and straightforward unambiguous probability questions. (Though ... I bet the d10 results would be a little bit informative, in cases where e.g. the die has rolled a few 1s recently. But only via illuminating basic statistical cluefulness.) I agree that there are some of the former in every field, but in practice people interested in their calibration don't tend to bother much with them. (They may make some 99% predictions, but those quite often turn out to be more like 80% predictions in reality.)
this approach gets the wrong sign [...] would count as two failed predictions when it is effectively just one.
It gets the right sign when the question is "how do we use these calibration numbers to predict future reliability?" (if someone demonstrates ability in guessing the next president, that confers some advantages in guessing the next supreme court judge) and the wrong sign when it's "how do we use these prediction results to estimate calibration?". I agree that it would be useful to have some way to identify when multiple predictions are genuinely uninformative because they're probing the exact same underlying event(s) rather than merely similar ones. In practice I suspect the best we can do is try to notice and adjust ad hoc, which of course brings problems of its own.
(There was a study a little while back that looked at a bunch of pundits and assessed their accuracy, and lo! it turned out that the lefties were more reliable than the righties. That conclusion suited my own politics just fine but I'm pretty sure it was at least half bullshit because a lot of the predictions were about election results and the like, the time of the study was a time when the left, or what passes for the left in the US context, was doing rather well, and pundits tend to be overoptimistic about political people on their side. If we did a similar study right now I bet it would lean the other way. Perhaps averaging the two might actually tell us something useful.)
I think #1 is better understood as follows: you can be differently calibrated (at a given confidence level) for different kinds of question. Most of us are very well calibrated for questions about rolling dice. We may be markedly better or worse -- and differently -- calibrated for questions about future scientific progress, about economic consequences of political policies, about other people's aesthetic preferences, etc.
That suggests a slightly different remedy for #1 from yours: group predictions according to subject matter. (More generally, if you're interested in knowing how well someone calibrated for predictions of a certain kind, we could weight all their predictions according to how closely related we think they are to that. Grouping corresponds to making all the weights 0 or 1.) This will also help with #2: if someone was 99% confident that Clinton would win the election then that mistake will weigh heavily in evaluating their calibration for political and polling questions. [EDITED to add: And much less heavily in evaluating their calibration for dice-rolling or progress in theoretical physics.]
Talk of the Clinton/Trump election brings up another issue not covered by typical calibration measures. Someone who was 99% confident Clinton would win clearly made a big mistake. But so, probably, did someone who was 99% confident that Trump would win. (Maybe not; perhaps there was some reason why actually he was almost certain to win. For instance, if the conspiracy theories about Russian hacking are right and someone was 99% confident because they had inside knowledge of it. If something like that turns out to be the case then we should imagine this replaced with a better example.)
Sometimes, but not always, when an outcome becomes known we also get a good idea what the "real" probability was, in some sense. For instance, the Clinton/Trump election was extremely close; unless there was some sort of foul play that probably indicates that something around 50% would have been a good prediction. When assessing calibration, we should probably treat it like (in this case) approximately 50% of a right answer and 50% of a wrong answer.
(Because it's always possible that something happened for a poorly-understood reason -- Russian hacking, divine intervention, whatever -- perhaps we should always fudge these estimates by pushing them towards the actual outcome. So, e.g., even if the detailed election results suggest that 50% was a good estimate for Clinton/Trump, maybe we should use 0.7/0.3 or something instead of 0.5/0.5.)