I don't understand what question you're trying to answer here. e.g., I could imagine questions of the form "If I see someone forecast 39.845%, should I treat this nearly the same as if they had forecast 40%?" Most mathematical ways of dealing with forecasts do (e.g. scoring rules, formulas for aggregating multiple people's forecasts, decision rules based on EV). But that doesn't seem like what you're interested in here.
Also not sure if/how these graphs differ from what they'd look like if they were based on forecasts which were perfectly calibrated with arbitrary precision.
Also, the description of Omar sounds impossible. If (e.g.) his 25% forecasts come true more often than his 5% forecasts, then his 15% forecasts must differ from at least one of those two (though perhaps you could have too little data to be able to tell).
methods based on rounding probabilities are hot flaming garbage
I think this depends a lot on what you're interested in, i.e. what scoring rules you use. Someone who runs the same analysis with Brier instead of log-scores might disagree.
More generally, I'm not convinced it makes sense to think of "precision" as a constant, let alone a universal one, since it depends on
I believe that these approaches are not good: For small datasets they produce large oscillations in the score, not smooth declines, and they improve the scores of worse-than-random forecast datasets.
I don't think it's very counterintuitive/undesirable for (what, in practice, is essentially) noise to make worse-than-random forecasts better. As a matter of fact, this also happens if you replace log-scores with Brier in your analysis with random noise instead of rounding.
Also, regarding oscillations: I don't think properties of "precision" obtained from small datasets are too important, for similar reasons why I usually don't pay a lot of attention to calibration plots obtained from a handful of forecasts.
As we increase the perturbation, the score falls ~monotonically (which I conjecture to always be true in the limit of infinitely many samples)
This conjecture is true and should easily generalise to more general 1-parameter families of centered, symmetric distributions admitting suitable couplings (e.g. additive N(0,\sigma^2) noise in log-odds space) using the fact that log(sigmoid(x+y))+log(sigmoid(x-y)) is decreasing in y for all log-odds x and all positive y (QED).
(NB: This fails when replacing log-scores with Brier.)
Rounding very strongly rounds everything to 50%, so with strong enough rounding every dataset has the same score.
I could make a similar argument for the noise-based version, if I chose to use Brier (or any other scoring rule S that depends only on |p-outcome| and converges to finite values as p tends towards 0 and 1): With sufficiently strong noise, every forecast becomes ≈0% and ≈100% with equal probability, so the expected score in the "large noise limit" converges to (S(0, outcome) + S(1, outcome))/2.
Hmmm, I’m still thinking about this. I’m kinda unconvinced that you even need an algorithm-heavy approach here. Let’s say that you want to apply logit, add some small amount of noise, apply logistic, then score. Consider the function on R^n defined as (score function) composed with (coordinate-wise logistic function). We care about the expected value of this function with respect to the probability measure induced by our noise. For very small noise, you can approximate this function by its power series expansion. For example, if we’re adding iid Gaussian noise, then look at the second order approximation. Then in the limit as the standard deviation of the noise goes to zero, the expected value of the change is some constant (something something Gaussian integral) times the Laplacian of our function on R^n times the square of the standard deviation. Thus the Laplacian is very related to this precision we care about (it basically determines it for small noise). For most reasonable scoring functions, the Laplacian should have a closed-form solution. I think that gets you out of having to simulate anything. Let me know if I messed anything up! Cheers!
Awesome post! I'm very ignorant of the precision-estimation literature so I'm going to be asking dumb questions here.
First of all, I feel like a precision function should take some kind of "acceptable loss" parameter. From what I gather, to specify the precision you need some threshold in your algorithm(s) for how much accuracy loss you're willing to tolerate.
More fundamentally, though, I'm trying to understand what exactly we want to measure. The list of desired properties of a precision function feel somewhat pulled out of thin air, and I'd feel more comfortable with a philosophical understanding of where these properties come from. So let's say we have a set of possible states/trajectories of the world, the world provides us with some evidence , and we're interested in for some event . Maybe reality has some fixed out there, but we're not privy to that, so we're forced to use some "hyperprior" (am I using that word right?) on probability measures over . After conditioning on , we get some probability distribution on , which participants in a prediction market will take the expected value of as their answer. The precision is trying to quantify something like the standard deviation of this probability distribution on values of , right?
P.S. This is entirely a skill issue on my part but I'm not sure what symbols you're using for precision function and perturbation function. Detexify was of no use. Feel free to enlighten me!
If my interpretation of precision function is correct then I guess my main concern is this: how are we reaching inside the minds of the predictors to see what their distribution on is? Like, imagine we have an urn with black and red marbles in it and we have a prediction market on the probability that a uniformly randomly chosen marble will be red. Let's say that two people participated in this prediction market: Alice and Bob. Alice estimated there to be a 0.3269230769 (or approximately 17/52) chance of the marble being red because she saw the marbles being put in and there were 17 red marbles and 52 marbles total. Bob estimated there to be a 0.3269230769 chance of the marble being red because he felt like it. Bob is clearly providing false precision while Alice is providing entirely justified precision. However, no matter which way the urn draw goes, the input tuple (0.3269230769, 0) or (0.3269230769, 1) will be the same for both participants and thus the precision returned by any precision function will be the same. This feels to me like a fundamental disconnect between what we want to measure and what we are measuring. Am I mistaken in my understanding? Thanks!
Epistemic Status
Maybe not just reinventing the wheel, but the whole bicycle.
Say we have a set of resolved forecasts and can display them on a calibration plot.
We can grade the forecasts according to some proper scoring rule, e.g. the Brier score or the logarithmic scoring rule, maybe even broken up by calibration and resolution.
But we can also ask the question: how fine-grained are the predictions of our forecaster? I.e., at which level of precision can we assume that the additional information is just noise?
Overprecise Omar
Take, for example, a hypothetical forecaster Omar who always gives their forecasts with 5 decimal digits of precision, such as forecasting a "24.566% probability of North Korea testing an ICBM in the year 2022", even though if we look at their calibration plot (of sufficiently many forecasts), we see that they are pretty much random in any given interval of length 0.1 (i.e., their forecast with 15% and a forecast of 5% can be expected to resolve to the same outcome with equal probability). This means that 4 of the 5 decimal digits of precision are likely just noise!
Omar would be behaving absurdly; misleading their audience into believing they had spent much more time on their forecasts than they actually had (or, more likely, into correctly leading the audience into believing that there was something epistemically sketchy going on). It is certainly useful to use probability resilience, and not imprecision, to communicate uncertainty, but there is an adequate & finite limit to precision.
I believe something similar is going on when people encounter others putting probabilities on claims: It appears like an attempt at claiming undue quantitativeness (quantitativity?) in their reasoning, and at making the listener fall prey to precision bias, as well as an implicit claim at scientific rigour. However, not all precision in judgmental forecasting is false precision: At some point, if we remove digits of precision, the forecasts will become worse in expectation.
But how might we confront our forecaster Omar from above? How might we estimate the level of degrees of precision after which their forecasts gave no more additional information?
Definitions
Ideally we'd want to find a number that tells us, for a given set of (resolved) forecasts, the precision that those predictions display: Any additional digits added to the probability beyond this precision would just be noise.
Let us call this number the precision ᚠ of a set of forecasts.
Let D=((f_1,o_1),…,(f_n,o_n))∈((0,1),{0,1})^n be a dataset of n forecasts f_i and resolutions o_i.
Then ᚠ is simply a function that takes in such a dataset of forecasts and produces a real number, ᚠ:((0,1),{0,1})^n→R, so for example ᚠ(D)=0.2 for the forecasts and outcomes D of some forecaster, or team of forecasters.
We call ᚠ the precision function.
Bits, Not Probabilities, for Precision
It is natural to assume that ᚠ returns a probability: after all, the input dataset has probabilities, and when talking about Omar's calibration plot I was explicitly calling out the loss of accuracy in probability intervals shorter than 0.1.
Furthermore, Friedman et al. 2018 also talk about precision in terms of probabilities (or rather "bins of probabilities"), we are all used to probabilities, probabilities are friends.
But this doesn't stand up to scrutiny: If we accept this, assuming we use probability "bins" or "buckets" of size 5%, then 99.99999% and 96% are as similar to each other as 51% and 54.99999%. But the readers familiar with the formulation of probability in log-odds form will surely balk at this equivalence: 99.99999% is a beast of a probability, an invocation only uttered in situations of extreme certainty, while 96%, 51% and 54.99999% (modulo false precision) are everyday probabilities, plebeian even.
And, in terms of precision, 54.99999% stands out like a sore thumb: while 99.99999% is supremely confident, it is not overprecise, since rounding up to 100% would be foolish; but with 54.99999%, there is no good reason we can't just round to 55%.
So precision should not be investigated on probabilities. Instead I claim that it should be calculated in log-odds space (which has an advantage over the odds form by being symmetric), where one moves in bits instead of probabilities. Since we want to make a statement about how much we can move the probabilities around until the proper scoring rule we apply starts giving worse results, it is only natural to express the precision in bits as well. (Bits can't be converted linearly into probabilities: moving from 50% to ~66.7% or from 80% to ~66.7% is one bit, but moving from ~99.954% to ~99.977% is also a change of one bit.) Instead, one uses the logit function to convert probabilities into log-odds, and the logistic function to return to probabilities.
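To make the conversion concrete, here is a minimal sketch, assuming base-2 logarithms so that one unit of log-odds is one bit. The function names `logit_bits`, `logistic_bits` and `shift` are illustrative and not taken from the post's code.

```python
import math

def logit_bits(p):
    """Log-odds of probability p, measured in bits (base-2 logarithm)."""
    return math.log2(p / (1 - p))

def logistic_bits(l):
    """Inverse of logit_bits: log-odds in bits back to a probability."""
    return 1 / (1 + 2 ** (-l))

def shift(p, bits):
    """Move probability p by the given number of bits in log-odds space."""
    return logistic_bits(logit_bits(p) + bits)

# One bit doubles the odds: 1:1 (50%) becomes 2:1.
print(shift(0.5, 1))
# The same one-bit step near certainty is a tiny change in probability.
print(logit_bits(0.99977) - logit_bits(0.99954))
```

The same one-bit step covers a wide probability interval near 50% and a very narrow one near 100%, which is why fixed-width probability bins treat extreme forecasts unfairly.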
Algorithms!
The assumption of expressing precision in bits naturally leads to two different algorithms.
These algorithms follow a common pattern:
Log-Odds Rounding
A technique explored by @tenthkrige is rounding in odds form: The probabilities are converted into odds form and there rounded to the nearest multiple of the perturbation parameter σ.
Here, the precision of the community predictions looks like it is somewhere between 0.5 and 1, and the precision of the Metaculus prediction somewhere between 0.25 and 0.5, but it's hard to tell without access to the measurements.
Log-odds rounding is pretty similar to odds-rounding.
Implementation
Once a probability p has been converted into log-odds form l_p, then rounding with a perturbation σ is simply l′_p=σ⋅round(l_p/σ).
Scoring the forecasts using the logarithmic scoring rule, one can then write this in a couple of lines of python code:
Full code for all algorithms is here.
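As a rough illustration of the rounding step and the scoring, here is a minimal sketch. It assumes natural-log log-odds and the mean logarithmic score, and the names `round_log_odds` and `rounded_score` are hypothetical rather than taken from the linked code.

```python
import math

def logit(p):
    """Probability to log-odds (natural logarithm)."""
    return math.log(p / (1 - p))

def logistic(l):
    """Log-odds (natural logarithm) back to probability."""
    return 1 / (1 + math.exp(-l))

def log_score(forecasts, outcomes):
    """Mean logarithmic score; higher is better, 0 would be perfect."""
    return sum(
        math.log(f if o == 1 else 1 - f) for f, o in zip(forecasts, outcomes)
    ) / len(forecasts)

def round_log_odds(forecasts, sigma):
    """Round each forecast to the nearest multiple of sigma in log-odds space.

    Python's round() resolves exact ties towards the even integer,
    which only matters for log-odds exactly halfway between multiples.
    """
    return [logistic(sigma * round(logit(f) / sigma)) for f in forecasts]

def rounded_score(forecasts, outcomes, sigma):
    """Score of the dataset after rounding with perturbation sigma."""
    return log_score(round_log_odds(forecasts, sigma), outcomes)

print(rounded_score([0.75, 0.6, 0.2], [1, 1, 0], 0.3))
```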
Testing with Toy Data
One can then check whether everything is working nicely by applying the method to a couple of small (n=4,4,4,100,100,2000) toy datasets.
The smallest three datasets, all of n=4, are d1 (red in the chart), d2 (blue in the chart) and d3 (green in the chart). They are
One notices that d2 is just a slightly less resolute & precise d1, and d3 is pretty bad.
Problems
Now this is not… great, and certainly quite different from the data by tenthkrige. I'm pretty sure this isn't a bug in my implementation or due to the switch from odds to log-odds, but a deeper problem with the method of rounding for perturbation.
There are two forces that make these charts so weird:
If you decrease precision by rounding, you can actually make a probability better by moving it closer to 0%/100%. Say you have one forecast with probability p=0.75 and the outcome o=1. Without rounding, this has a log-score of ~-0.29. Setting σ=0.3, rounding in log-odds space and transforming back gives p_r≈0.77, with an improved log-score of ~-0.26. When one sees the weird zig-zag pattern for log-score, I believe that this is the reason. This is likely only an issue in smaller datasets: In larger ones, these individual random improving/worsening effects cancel each other out, and one can see the underlying trend of worsening (as is already visible with the purple plot for n=2000, and to a lesser extent brown & orange). Still, I count this as a major problem with rounding-based methods. Friedman et al. 2018 note the same: "Coarsening probability estimates could actually improve predictive accuracy if this prevents foreign policy analysts from making their judgments too extreme".
Rounding very strongly rounds everything to 50%, so with strong enough rounding every dataset has the same score. This has some counter-intuitive implications: If you are worse than chance, perturbing your probabilities more and more leads you reliably to a better score (in the case of the log score, log(0.5)≈−0.69). One can also see this in the plot above: All but one of the datasets end up with approximately the same log score. This isn't a property that kills rounding, since we in theory only care about the point where the perturbed score starts diverging from the unperturbed score, but it is still undesirable.
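Both forces can be reproduced in a few lines. This is a sketch assuming natural logarithms throughout; the worse-than-chance dataset `bad_forecasts` is made up for illustration and is not one of the toy datasets from the post.

```python
import math

def round_in_log_odds(p, sigma):
    """Round probability p to the nearest multiple of sigma in log-odds space."""
    l = math.log(p / (1 - p))
    return 1 / (1 + math.exp(-sigma * round(l / sigma)))

def mean_log_score(forecasts, outcomes):
    """Mean logarithmic score; higher is better."""
    return sum(
        math.log(f if o == 1 else 1 - f) for f, o in zip(forecasts, outcomes)
    ) / len(forecasts)

# Force 1: rounding can push a correct forecast further towards 100%.
p_rounded = round_in_log_odds(0.75, 0.3)
before = math.log(0.75)       # score without rounding, outcome o=1
after = math.log(p_rounded)   # score after rounding, outcome o=1
print(p_rounded, before, after)

# Force 2: very coarse rounding sends every forecast to 50%,
# which rescues a (hypothetical) worse-than-chance dataset.
bad_forecasts = [0.9, 0.8, 0.1, 0.2]
bad_outcomes = [0, 0, 1, 1]
score_raw = mean_log_score(bad_forecasts, bad_outcomes)
score_coarse = mean_log_score(
    [round_in_log_odds(f, 10.0) for f in bad_forecasts], bad_outcomes
)
print(score_raw, score_coarse)
```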
Advantages
That said, the method has one big advantage: It is quite fast, running ~instantaneously for even big datasets on my laptop.
Noise in Log-Odds Space
But one can also examine a different paradigm: Applying noise to the predictions. In our framework, this is concretely operationalized by repeatedly (s times) projecting the probabilities into log-odds space, applying some noise with width σ to them, and then calculating the resulting probabilities and scoring them, finally taking the mean of the scores.
There are some free parameters here, especially around the exact type of noise to use (normal? beta? uniform?), as well as the number s of samples.
I have decided to use uniform noise, but for no special mathematical reason, and to adjust the number of samples by the size of the dataset (s≈1000 for small datasets, fewer for bigger datasets).
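The noising procedure might be sketched as follows. Assumptions that are mine rather than the post's: "width σ" is read as uniform noise on [−σ/2, σ/2], log-odds use the natural logarithm, and the name `noised_score` and its seeded default generator are illustrative.

```python
import math
import random

def noised_score(forecasts, outcomes, sigma, samples=500, rng=None):
    """Mean log score after adding uniform noise in log-odds space.

    Each of the `samples` repetitions perturbs every forecast independently
    with noise drawn uniformly from [-sigma/2, sigma/2] (an assumed reading
    of "width sigma"), then scores the perturbed forecasts.
    """
    rng = rng or random.Random(0)  # fixed seed by default, for reproducibility
    total = 0.0
    for _ in range(samples):
        for f, o in zip(forecasts, outcomes):
            l = math.log(f / (1 - f)) + rng.uniform(-sigma / 2, sigma / 2)
            p = 1 / (1 + math.exp(-l))
            total += math.log(p if o == 1 else 1 - p)
    return total / (samples * len(forecasts))

forecasts = [0.9, 0.8, 0.2, 0.1]
outcomes = [1, 1, 0, 0]
for sigma in (0.0, 1.0, 2.0, 4.0):
    print(sigma, noised_score(forecasts, outcomes, sigma))
```

Because every repetition loops over the whole dataset, the runtime grows with the product of the number of samples and the dataset size.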
This gives a much nicer looking plot, with s=500 samples:
The plots are falling ~monotonically, with a worse score for increasing noise, as expected. The score for d2 drops more quickly than the one for d1, maybe because d2 is less precise than d1? I'm not sure.
Advantages and Problems
Noising log-odds has a bunch of advantages: As we increase the perturbation, the score falls ~monotonically (which I conjecture to always be true in the limit of infinitely many samples), and doesn't converge to a specific value as rounding-based methods do.
This can be seen when comparing increasing perturbation with the n=2000 dataset:
The disadvantage lies in the runtime: Taking many samples makes the method slow, but a small number of samples is too noisy to reliably detect when the quality of forecasts starts dropping off. I think this is less of a problem with bigger datasets, but in the worst case I'd have to do a bunch of numpy-magic to optimize this further (or rewrite the code in a faster programming language, prototype in C here).
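For very small noise, the sampling could in principle be cross-checked against a second-order Taylor expansion, which needs no simulation. This is a sketch under assumptions not stated in the post: zero-mean noise with variance v is added independently to each log-odds value l_i = logit(f_i), logarithms are natural, and S is the mean logarithmic score. The first-order term vanishes because the noise has mean zero, and the second derivative of both log(logistic(l)) and log(1 − logistic(l)) with respect to l is −f(1−f), which gives:

```latex
\mathbb{E}\left[S(D')\right] - S(D) \approx -\frac{v}{2} \cdot \frac{1}{n} \sum_{i=1}^{n} f_i (1 - f_i)
```

For uniform noise on [−σ/2, σ/2] the variance is v = σ²/12, so under this approximation the expected score initially drops quadratically in σ, and fastest for forecasts near 50%.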
Finding Divergence
If the unperturbed dataset D has a score r=S(D), then we also need to find a value for σ with D′=ఐ(D,σ) so that r′=S(D′)≉r, with r′<r (in the case of scoring rules where lower scores are worse) or even that for all positive δ∈R and D′′=ఐ(D,σ+δ) it holds that r′′=S(D′′)<r′. That is, as we increase the perturbation more and more, the score becomes worse and worse.
The easiest way to do this is to iterate through values for σ, and once the difference between r and r′ is too big, return the corresponding σ. This method has disadvantages (the least-noteworthy difference between r and r′ is probably an arbitrary constant, simply iterating requires a bunch of compute and only finds an upper bound on a σ which is "too much"), but shines by virtue of being bog-simple and easy to implement. A more sophisticated method can use binary search to zero in on a σ that just crosses the threshold.
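The simple iteration could look like the sketch below. It assumes a caller-supplied function `score_at(sigma)` that returns the (possibly averaged) perturbed score; that callable, the name `find_divergence_linear`, and the default step and bound are all hypothetical.

```python
def find_divergence_linear(score_at, threshold, step=0.05, max_sigma=10.0):
    """Return the first sigma on a linear grid whose score drop exceeds threshold.

    score_at(sigma) is assumed to return the perturbed score (higher is better),
    with score_at(0.0) being the unperturbed score. Returns None if no sigma up
    to max_sigma lowers the score by more than threshold.
    """
    baseline = score_at(0.0)
    steps = int(max_sigma / step)
    for k in range(1, steps + 1):
        sigma = k * step  # multiply instead of accumulating to limit float drift
        if baseline - score_at(sigma) > threshold:
            return sigma
    return None

# A deterministic stand-in for a score curve, used only to exercise the search.
toy_curve = lambda sigma: -0.3 - 0.1 * sigma ** 2
print(find_divergence_linear(toy_curve, threshold=0.05))
```

The returned value is only an upper bound on the divergence point, accurate to within one step of the grid.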
Unfortunately, this method doesn't really give reliable results for small sample sizes:
It seems like smaller datasets need higher sample sizes to adequately assess the precision using noise-based methods.
But this does tell us that ᚠ(d1)≈1.2 bits, and ᚠ(d1)≈1.3 bits.
Binary Search
The code can be changed to be faster, using binary search:
We can take advantage of the fact that we're not dealing with the indices of arrays here, so one can just divide by two as desired.
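A bisection over real-valued σ might be sketched like this. It assumes the score drop grows monotonically with σ, which noisy sampled scores can violate; the callable `score_at`, the name `find_divergence_binary`, and the default bounds are hypothetical.

```python
def find_divergence_binary(score_at, threshold, lo=0.0, hi=8.0, tolerance=1e-3):
    """Bisect for the smallest sigma whose score drop exceeds threshold.

    Assumes the drop baseline - score_at(sigma) is non-decreasing in sigma.
    Returns None if even sigma=hi does not lower the score by more than
    threshold; otherwise returns a sigma within tolerance of the crossing.
    """
    baseline = score_at(0.0)
    if baseline - score_at(hi) <= threshold:
        return None
    while hi - lo > tolerance:
        mid = (lo + hi) / 2  # sigma is a real number, so plain halving works
        if baseline - score_at(mid) > threshold:
            hi = mid  # drop already too large: crossing is at or below mid
        else:
            lo = mid  # drop still acceptable: crossing is above mid
    return hi

# Same deterministic stand-in curve as a sanity check; crossing at sqrt(0.5).
toy_curve = lambda sigma: -0.3 - 0.1 * sigma ** 2
print(find_divergence_binary(toy_curve, threshold=0.05))
```

With sampled scores, a single unlucky comparison can send the bisection into the wrong half, so the result is less trustworthy than on a deterministic curve.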
Now we can check whether the two methods give the ~same results:
I suspected that this was simply a problem of small sample sizes, but increasing the sample sizes by 5× doesn't change the problem at all:
I'm kind of puzzled at this result, and not sure what to make of it. Maybe binary search biases the results downwards, while exponential search biases them upwards? We can check by changing the stepsize to linear:
This, prima facie, resolves our dilemma, and indicates that linear steps are better than exponential steps, and if speed is a problem, then binary search is better.
More Ideas
Rewriting the code to use noisy binary search, since the comparisons of scores are not reliable, might be a cool project.
But perhaps the entire idea of using a fixed value for divergence is flawed? The divergence point needs to depend on the perturbation parameter. I think I should instead assume there is a perfectly calibrated, infinitely big dataset D⋆. How much does its score drop if we perturb by σ? If D, perturbed, drops by half (or whatever) as much as the score of D⋆ then we have found our point.
A more sophisticated technique could try to estimate the elbow point of the declining curve of scores, but as far as I know there is no reliable method for doing so, nor is there a mathematical framework for this.
Precision of Forecasting Datasets
Equipped with some shaky heuristics about forecasting precision, one can now try to estimate the precision of different forecasting datasets (especially interesting should be comparing different platforms).
I will be doing so using the library iqisa.
Now we can use the algorithms for estimating precision on the dataset:
Apparently the markets were more precise than the survey forecasts. Interesting. Now let's see Metaculus' precision.
Harsh words from Metaculus here… the precision is higher here, to be clear: The smallest perturbation for which the score becomes noticeably worse is smaller.
Same with PredictionBook:
The difference is quite small but noticeable, and the purple and the green plot overlap a bunch: Metaculus and PredictionBook aren't so different (?):
The precision tracks the unperturbed score to a high degree. I'm not sure whether this is an airtight relation or just commonly the case, perhaps someone will look into this.
Usage
Given the precision of some forecaster or forecasting platform, one can much more easily perform sensitivity analysis (especially after correcting for miscalibration), as it allows one to create a distribution over what a probability could reasonably mean, given some resolved past data from the same source.
Knowing the precision of some source of forecasts can also be nice to stave off criticism of false precision: One can now answer "No, you're wrong: I'm being wholly correct at reporting my forecasts at 1.2 bits of precision!". Surely no incredulous stares with that.
Conclusion
When evaluating the precision of a set of forecasts, all sources I know agree that perturbing the forecasts in some way and then observing how the forecasts worsen with more perturbation is the correct way to go about evaluating the precision of forecasts.
However, the existing attempts are based on rounding: Moving probabilities, odds or log-odds that are close together to the same number.
I believe that these approaches are not good: For small datasets they produce large oscillations in the score, not smooth declines, and they improve the scores of worse-than-random forecast datasets.
Instead, a better way to estimate the precision is to apply noise to the forecasts and track how the score worsens. This has the advantage of producing monotonically declining scores.
Testing the method of noising forecasts on real-world datasets shows that they have similar precisions.
Appendix A: Conditions for a Precision Evaluation Function
Using a precision function ᚠ, perturbation function ఐ, proper scoring rule s and noise σ. These are very much me thinking out loud.
But what should be done about a calibration plot that looks like this?
There are two ways of arguing what, morally, the precision of the forecasts is:
Appendix B: Further Idea Sketches for Algorithms
n forecasts and their resolutions
output