I agree, of course, that a bad prediction can perform better than a good prediction by luck. That means if you were already sufficiently sure your prediction was good, you can continue to believe it was good after it performs badly. But your belief that the prediction was good then comes from your model of the sources of the competing predictions prior to observing the result (e.g. "PredictIt probably only predicted a higher Trump probability because Trump Trump Trump") instead of from the result itself. The result itself still reflects badly on your prediction. Your prediction may not have been worse, but it performed worse, and that is (perhaps insufficient) Bayesian evidence that it actually was worse. If Nate Silver is claiming something like "sure, our prediction of voter % performed badly compared to PredictIt's implicit prediction of voter %, but we already strongly believed it was good, and therefore still believe it was good, though with less confidence", then I'm fine with that. But that wasn't my impression.
edit:
Deviating from the naive view implicitly assumes that confidently predicting a narrow win was too hard to be plausible
I agree I'm making an assumption like "the difference in probability between a 6.5% average poll error and a 5.5% average poll error isn't huge", but I can't conceive of any reason to expect a sudden cliff there instead of a smooth bell curve.
One thing that falls out of this post's framework is that it makes sense to say that one prediction (and in extension, one probability) is better than another, but it doesn't make sense to talk about the correct probability – unless 'correct' is defined as the point of full information, in which case it is usually unattainable.
This is all well and good, but I feel you've just rephrased the problem rather than resolving it. Specifically, I think you're saying the better prediction is the one which contains more information. But how can we tell (in your framework) which model contained more information?
If I had to make my complaint more concrete, it would be to ask you to resolve which of "Story 1" and "Story 2" is more accurate? You seem to claim that we can tell which is better, but you don't seem to actually do so. (At least based on my reading).
I think there are some arguments but I deliberately didn't mention any to keep the post non-political. My claim in the post is only that an answer exists (and that you can't get it from just the outcome), not that it's easy to find. I.e., I was only trying to answer the philosophical question, not the practical question.
In which case, isn't there a much shorter argument. "Given an uncertain event there is some probability that event occurs and that probability depends on the information you have about the event". That doesn't seem very interesting to me?
To make this framework even less helpful (in the instance of markets) - we can't know what information they contain. (We can perhaps know a little more in the instance of the presidential markets because we can look at the margin and state markets - BUT they were inconsistent with the main market and also don't tell you what information they were using).
Edit 2020/12/13: I think a lot of people have misunderstood my intent with this post. I talk about the 2020 election because it seemed like the perfect example to illustrate the point, but it was still only an example. The goal of this post is to establish a philosophically sound ground truth for probabilities.
1. Motivation
It seems to me that:
In this post, I work out a framework in which there is a ground truth for probabilities, which justifies 1-3.
2. The Framework
Consider the following game: a fair coin is flipped 100 times, and we win if at most 52 of the flips come up heads (otherwise, we lose).
Using the binomial formula, we can compute that the probability of winning this game is around 0.69135. (In mathy notation, that's P(X≤52) with X∼B(100,1/2).)
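For concreteness, here is a minimal sketch of this computation in Python (using scipy's binomial CDF):

```python
from scipy.stats import binom

# Probability of at most 52 heads in 100 flips of a fair coin.
p_win = binom.cdf(52, 100, 0.5)
print(p_win)  # ≈ 0.69135
```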
Suppose we play this game, writing down each coin flip as it occurs, and put the progression into a chart:
Here, the red line denotes the number of heads we have after n flips, and the blue line denotes the 'baseline', which is the number of heads such that continuing at this pace would end up at precisely 52 heads after 100 flips. We end the game above the baseline, with 54 heads, which means that we lose.
Since we know exactly how this game works, it's possible to compute the current probability of winning at every point during the game.[1] Here is a chart of these probabilities, given the flips from the chart above:
Note that the y-axis shows the probability of winning the game after observing the n-th flip. Thus, there are precisely 101 values here, going from the probability after observing the 0-th flip (i.e., without having seen any flips), which is ≈0.69135, to the probability after observing the 100-th flip, which is 0. By comparing both graphs visually, you can verify that they fit together.
Each of these 101 y-values is a prediction for the outcome of the game. However, they're evidently not equally well-informed predictions: the ones further to the right are based on more information (more flips) than those further to the left, and this is true even for predictions that output the same probability. For example, we predict almost exactly 80% after both 13 and 51 flips, but the latter prediction has a lot more information to go on.
I call a graph like this an information chart. It tells us how the probability of a prediction changes as a function of {amount of input information}.
A separate aspect that also influences the quality of a prediction is calibration. In this case, all of the 101 predictions made by the blue curve are perfectly calibrated: if we ran the game a million times, took all 101 million predictions made by the one million blue curves, and put them all into bins, each bin would (very likely) have a proportion of true predictions that closely resembles its probability.[2]
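As a sketch of that procedure (not the code behind the post's charts, just an illustration), scaled down to 20,000 runs and 5%-wide bins, which is already enough to see the effect:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
runs, n_flips, target = 20_000, 100, 52
all_probs, all_won = [], []

for _ in range(runs):
    flips = rng.integers(0, 2, size=n_flips)            # 1 = heads
    heads = np.concatenate(([0], np.cumsum(flips)))     # heads count after 0..100 flips
    remaining = n_flips - np.arange(n_flips + 1)
    # Win probability after each flip: at most (52 - heads so far) heads among the remaining flips.
    all_probs.append(binom.cdf(target - heads, remaining, 0.5))
    all_won.append(np.full(n_flips + 1, heads[-1] <= target))

probs, won = np.concatenate(all_probs), np.concatenate(all_won)

# Sort the predictions into 5%-wide bins and compare stated vs. realized frequency.
for lo in np.arange(0.0, 1.0, 0.05):
    mask = (probs >= lo) & (probs < lo + 0.05)
    if mask.any():
        print(f"[{lo:.2f}, {lo + 0.05:.2f}): stated ≈ {probs[mask].mean():.3f}, "
              f"came true {won[mask].mean():.3f}")
```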
However, while the blue curve has perfect calibration, we can also model imperfect calibration in this framework. To do this, I've computed the probability of winning, provided that we underestimate (red) or overestimate (green) the number of flips that are still missing, relative to those we know.[3] The results look like this:
Looking at these three information charts together, we can see the following:
Notably, 'good' and 'bad' have nothing to do with 50%. Instead, they are determined by the prior, the probability you would assign without any information. In this case, the prior is 0.69135, so the red curve overshoots the blue curve whenever the blue curve is above 0.69135 and undershoots the blue curve whenever the blue curve is below 0.69135. (And the green curve does the opposite in both cases.)
Unlike the blue curve, the red curve's predictions are not perfectly calibrated. Their calibration is pretty good around the 70% bin (because the prior 0.69135 happens to be close to 70%), but poor everywhere else. Predictions in the 90% bin would come true less than 9 out of 10 times, and predictions in the 50% bin would come true more than 5 out of 10 times. (And the opposite is true for the green curve.)
In summary, the following concepts fall out of this framework:
(The attentive reader may also notice that "50% predictions can be meaningful" follows as an easy corollary from the above.)
My primary claim in this post is that information charts are the proper way to look at predictions – not just in cases where we know all of the factors, but always. It may be difficult or impossible to know what the information chart looks like in practice, but it exists and should be considered the philosophical ground truth. (Information isn't a one-dimensional spectrum, but I don't think that matters all that much; more on that later.[5]) Doing so seems to resolve most philosophical problems.
I don't think it is in principle possible to weigh calibration and information against each other. Good calibration removes bias; more information moves predictions further away from the prior. Depending on the use case, you may rather have a bunch of bold predictions that are slightly miscalibrated or a bunch of cautious predictions with perfect calibration. However, it does seem safe to say that:
3. A Use Case: The 2020 Election
Here is an example of where I think the framework helps resolve philosophical questions.
On the eve of the 2020 election, the 538 model built by Nate Silver predicted an 89% probability of Biden winning a majority in the Electoral College (with 10% for Trump and 1% for a tie). At the same time, a weighted average of prediction markets had Biden at around 63% to become the next president.[6] At this point, we know that
The first convenient assumption we will make here is that both predictions had perfect calibration. (This is arguably almost true.) Given that, the only remaining question is which prediction was made with more information.
To get my point across, it will be convenient to assume that we know what the information chart for this prediction looks like:
If you accept this, then there are two stories we can tell about 538 vs. Betting markets.[8]
Story 1: The Loyal Trump Voter
In the first story, no-one had foreseen the real reasons why polls were off in favor of Trump; they might just as well have been off in favor of Biden. Consequently, no-one had good reasons to assign Biden a <89% chance of winning, and the people who did so anyway would have rejected their reasons if they had better information/were more rational.
If the above is true, it means that everyone who bet on Biden, as some on LessWrong advised, got a good deal. However, there is also a different story.
Story 2: The Efficient Market
In the second story, the markets knew something that modelers didn't: a 2016-style polling error in the same direction was to be expected. They didn't know exactly how large it would be, but they knew it was large enough for 63% to be a better guess than 89% (and perhaps the odds implied by smart gamblers were even lower, and the price only came out at 63% because people who trusted 538 bought overpriced Biden shares). The outcome we did get (a ~0.7% win for Biden) was among the most likely outcomes as projected by the market.
Alas, betting on Biden was a transaction which, from the perspective of someone knowing what the market knew, had negative expected returns.
In reality, there was probably at least a bit of both worlds going on, and the information chart may not be accurate in the first place. Given that, either of the two scenarios above may or may not describe the primary mechanism by which pro-Trump money entered the markets. However, even if you reject them both, the only specific claim I'm making here is that the election could have been such that the probability changed non-monotonically with the amount of input information, i.e.:
If true, this makes the information chart a non-injective function, meaning that some probabilities, such as the 63% for Biden, are implied at several different positions on the chart. Because of this, we cannot infer the quality of the prediction from the stated probability alone.
And yes: real information is not one-dimensional. However, the principles still work in a 1000000-dimensional space:
Thus, to sum up this post, these are the claims I strongly believe to be true:
And, perhaps most controversially:
4. Appendix: Correct Probabilities and Scoring Functions
I've basically said what I wanted to say at this point – the fourth chapter is there to overexplain/make more arguments.
One thing that falls out of this post's framework is that it makes sense to say that one prediction (and in extension, one probability) is better than another, but it doesn't make sense to talk about the correct probability – unless 'correct' is defined as the point of full information, in which case it is usually unattainable.
This also means that there are several different ways to assign probabilities to the same set of statements such that all of the assignments are perfectly calibrated. For example, consider the following eight charts, which come from eight runs of the 100-coins game:
There are many ways of obtaining a set of perfectly calibrated predictions from these graphs. The easiest is to throw away all information and go with the prior every time (which is the starting point on every graph). This yields eight predictions that all claim a 0.69135 chance of winning.[10] Alternatively, we can cut off each chart after the halfway point:
This gives us a set of eight predictions that have different probabilities from the first set, despite predicting the same thing – and yet, they are also perfectly calibrated. Again, unless we consider the point of full information, there is no 'correct' probability, and the same chart may feature a wide range of perfectly calibrated predictions for the same outcome.
You probably know that there are scoring functions for predictions. Our framework suggests that the quality of predictions is a function of {amount of information} and {calibration}, which raises the question of which of the two things scoring functions measure. (We would hope that they measure both.) What we can show is that, for logarithmic scoring, (1) good calibration is rewarded, and (2) more information yields a better expected score.
The second of these properties implies that, the later we cut off our blue curves, the better a score the resulting set of predictions will obtain – in expectation.
Let's demonstrate both of these. First, the rule. Given a set of predictions p_1, …, p_n (with p_i ∈ [0,1]) and a set of outcomes y_1, …, y_n (with y_i ∈ {0,1}), logarithmic scoring assigns to this set the number

\sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]
Since the yi are either 1 or 0, the formula amounts to summing up [the logarithms of the probability mass assigned to the true outcome] across all our predictions. E.g., if I make five 80% predictions and four of them come true, I sum up 4log(0.8)+log(0.2).
Note that these terms are all negative, so the highest possible score is the one with the smallest absolute value. Note also that log(x) converges to −∞ as x goes to 0: this corresponds to the fact that, if you increase your confidence in a prediction but are wrong, your punishment grows indefinitely. Predicting 0% for something that comes true yields a score of −∞.
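As a small illustration, here is the rule in Python, together with the five-predictions example from above (the function name is just for illustration):

```python
import math

def log_score(preds, outcomes):
    """Sum of log(probability mass assigned to the true outcome) over all predictions."""
    return sum(math.log(p if y else 1 - p) for p, y in zip(preds, outcomes))

# Five 80% predictions, four of which come true:
print(log_score([0.8] * 5, [1, 1, 1, 1, 0]))  # 4*log(0.8) + log(0.2) ≈ -2.50
# (A 0% prediction that comes true would raise a math domain error here,
#  corresponding to the score of -infinity.)
```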
I think the argument about calibration is the less interesting part (I don't imagine anyone is surprised that logarithmic scoring rewards good calibration), so I've relegated it to a footnote.[11]
Let's look at information. Under the assumption of perfect calibration, we know that any prediction-we-have-assigned-probability-p will, indeed, come true with probability p. Thus, the expected score for such a prediction[12] is
L(p) := p·log(p) + (1−p)·log(1−p)
We can plot L for p∈[0,1]. It looks like this:
This shows us that, for any one prediction, a more confident verdict is preferable, provided calibration is perfect. That alone does not answer our question. If we increase our amount of information – if we take points further to the right on our blue curves – some predictions will have increased probability, others will have decreased probability. You can verify this property by looking at some of the eight charts I've pictured above. What we can say is that the changes balance out in expectation: if our current probability is p, then a better-informed prediction will land above or below p in a way that averages out to exactly p (conservation of expected evidence).
Thus, the question is whether moving away from p in both directions, such that the total probability mass remains constant, yields a higher score. In other words, we want that
2L(p) < L(p−ϵ) + L(p+ϵ) for all ϵ ∈ (0, min{p, 1−p})
Fortunately, this inequality is immediate from the fact that L is strictly convex (which can be seen from the graph pictured above).[13] Similar things are true for the Brier score.[14]
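A quick numeric sanity check of the inequality, for a few arbitrary values of p and ϵ:

```python
import math

def L(p):
    """Expected log score of a perfectly calibrated prediction with probability p."""
    return p * math.log(p) + (1 - p) * math.log(1 - p)

for p, eps in [(0.3, 0.1), (0.5, 0.25), (0.69135, 0.2)]:
    print(2 * L(p) < L(p - eps) + L(p + eps))  # True every time, since L is strictly convex
```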
[1] I.e., the probability of ending up with at most 52 heads in 100 flips, conditional on the flips we've already seen. For example, the first two flips have come up tails in our case, so after flip #2, we're hoping for at most 52 heads in the next 98 flips. The probability for this is P(X≤52) with X∼B(98,1/2), which is about 76%. Similarly, we've had 12 heads and 10 tails after 22 flips, so after flip #22, we're hoping for at most 40 heads in the next 78 flips. The probability for this is P(X≤40) with X∼B(78,1/2), which is about 63%. ↩︎
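(A quick check of these two numbers in Python, using scipy:)

```python
from scipy.stats import binom

print(binom.cdf(52, 98, 0.5))  # after flip #2: at most 52 heads in 98 flips, ≈ 0.76
print(binom.cdf(40, 78, 0.5))  # after flip #22: at most 40 heads in 78 flips, ≈ 0.63
```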
[2] To spell this out a bit more: we would run the game a million times and create a chart like the one I've shown for each game. Since each chart features points at 101 x-positions, we can consider these 101 million predictions about whether a game was won. We also know how to score these predictions since we know which games were won and which were lost. (For example, in our game, all predictions come out false since we lost the game.)
Then, we can take all predictions that assign a probability between 0.48 and 0.52 and put them into the '50%' bin. Ideally, around half of the predictions in this bin should come true – and this is, in fact, what will happen. As you make the bins smaller and increase the number of runs (go from a million to a billion etc.), the chance that the bins are wrong by at least ϵ converges to 0, for every value of ϵ∈R+.
All of the above is merely a formal way of saying that these predictions are perfectly calibrated. ↩︎
[3] To be precise, what I've done is to take the function f(x) = x − x²/100, which looks like this:
and use that to model how deluded each prediction is about the amount of information it has access to. I.e., after the 50-th flip, the red curve assumes it has really seen 50+f(50)=75 flips, and that only 25 are missing. The value of those 75 flips is extrapolated from the 50, so if exactly half of the real 50 have come up heads, it assumes that 37.5 of the 75 have come up heads. In this case, this would increase its confidence that the final count of heads will be 52 or lower.
In general, after having seen n flips, the red curve assumes it knows n+f(n) flips. Since f(0)=0 and f(100)=0, it starts and ends at the same point as the blue curve. Similarly, the green curve assumes it knows n−f(n) many flips.
This may not be the best way to model overconfidence since it leads to 100% and 0% predictions. Then again, real people do that, too. ↩︎
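(One way to operationalize this construction, as a sketch: the curves above were not necessarily computed exactly like this, and the normal approximation is just a convenient way to handle the non-integer extrapolated counts.)

```python
import math
from scipy.stats import norm

def f(x):
    return x - x**2 / 100

def deluded_win_prob(n, heads, sign=+1, target=52, total=100):
    """Win probability as judged by a curve that thinks it has seen n + sign*f(n) flips.

    sign=+1 gives the overconfident (red) curve, sign=-1 the underconfident (green) one.
    The extra or missing flips are assumed to show heads at the same rate as the real ones.
    """
    assumed_seen = n + sign * f(n)
    heads_rate = heads / n if n > 0 else 0.5
    assumed_heads = heads_rate * assumed_seen
    remaining = total - assumed_seen
    if remaining <= 0:
        return float(assumed_heads <= target)
    # Normal approximation to Binomial(remaining, 1/2), with continuity correction.
    return norm.cdf(target - assumed_heads + 0.5, loc=remaining / 2,
                    scale=math.sqrt(remaining / 4))

print(deluded_win_prob(0, 0))        # ≈ 0.69: same starting point as the blue curve
print(deluded_win_prob(50, 25, +1))  # ≈ 0.84: more confident than the blue curve's ≈ 0.76
```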
[4] It's worth pointing out that this makes calibration a property of single predictions, not an emergent property of sets of predictions. This seems like a good thing to me; if someone predicts a highly complex sequence of events with 50% probability, I generally don't feel that I require further predictions to judge its calibration. ↩︎
[5] Furthermore, 'information' isn't restricted to the literal input data but is meant to be a catch-all for everything that helps to predict something, including better statistical models or better intuitions. ↩︎
[6] The outcomes for these predictions may come apart if someone who didn't win the election becomes president. (Although BetFair supposedly has ‘projected winner’ as its criterion.) ↩︎
[7] That's Pennsylvania (~0.7% margin), Wisconsin (~0.7% margin), Georgia (~0.2% margin), and Arizona (~0.3% margin). ↩︎
[8] Note that there are a lot more (and better) markets than PredictIt; I'm just using it in the image because it has a nice logo. ↩︎
[9] To expand on this more: arguing that the market's prediction was better solely based on the implied margin seems to me to be logically incoherent:
[10] Incidentally, this set of eight predictions appears poorly calibrated: given their stated probability of 0.69135, we would expect about five and a half to come true (so 5 or 6 would be the good results), yet only 4 did. However, this is an artifact of our sample being small. Perfect calibration does not imply perfect-looking calibration on any fixed sample size; it only implies that the probability of apparent calibration being off by some fixed amount converges to zero as the sample size grows. ↩︎
[11] Consider a set of predictions from the same bin, i.e., predictions to which we have assigned the same probability p. Suppose their real frequency is p∗. We would now hope that the probability which maximizes our score is p∗. Since each prediction in this bin comes true with probability p∗, we have probability p∗ of receiving the score log(p) and probability 1−p∗ of receiving the score log(1−p). In other words, our expected score is
p∗·log(p) + (1−p∗)·log(1−p)

To find out which value of p maximizes this expected score, we take the derivative with respect to p:

p∗/p − (1−p∗)/(1−p)
This term is 0 iff p/(1−p) = p∗/(1−p∗). Since the function ϕ(x) = x/(1−x) is injective, we can apply ϕ⁻¹ to both sides and obtain p = p∗ as the unique solution. Thus, calibration is indeed rewarded. ↩︎
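(A quick numeric check, scanning a grid of candidate probabilities for p∗ = 0.7:)

```python
import numpy as np

p_star = 0.7
ps = np.linspace(0.01, 0.99, 99)
expected_score = p_star * np.log(ps) + (1 - p_star) * np.log(1 - ps)
print(ps[np.argmax(expected_score)])  # ≈ 0.7: the expected score peaks at p = p*
```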
[12] It suffices to consider a single prediction since the scoring function is additive across predictions. ↩︎
[13] Strict convexity says that, for all δ∈(0,1) and all x≠y, we have
L(δx+[1−δ]y)<δL(x)+[1−δ]L(y)
Set δ = 1/2, x = p−ϵ, and y = p+ϵ, then multiply the inequality by 2. ↩︎
[14] The Brier score measures negative squared distance to the outcomes, scaled by 1/n. I.e., in the notation we've used for logarithmic scoring, we assign the number

-\frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2
The two properties we've verified for logarithmic scoring hold for the Brier score as well. Assuming perfect calibration, the expected Brier score for a prediction with probability p is −p(1−p)² − (1−p)p². The corresponding graph looks like this:
Since this function is also strictly convex, the second property is immediate.
However, unlike logarithmic scoring, Brier score has bounded penalties. Predicting 0% for an outcome that occurs yields a score of −1 rather than −∞. ↩︎
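(The same kind of sketch for the Brier score, with function names chosen just for illustration:)

```python
import numpy as np

def brier_score(preds, outcomes):
    """Negative mean squared distance between predictions and the 0/1 outcomes."""
    preds, outcomes = np.asarray(preds, float), np.asarray(outcomes, float)
    return -np.mean((preds - outcomes) ** 2)

def expected_brier(p):
    """Expected Brier score of one perfectly calibrated prediction with probability p."""
    return -p * (1 - p) ** 2 - (1 - p) * p ** 2  # simplifies to -p*(1-p)

print(brier_score([0.0], [1]))                   # -1.0: the penalty is bounded
print(expected_brier(0.5), expected_brier(0.9))  # -0.25 vs. -0.09: confidence is rewarded
```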