I think it would be silly to resist to the idea that "X with probability P(X)" is equivalent to "~X with probability 1-P(X)". This statement is simply true.
However, it does not imply that prediction lists like this should include X and ~X as possible claims. To see this, let's consider person A who only lists "X, probability P", and person B who lists "X, probability P, and ~X, probability 1-P". Clearly these two are making the exact same claim about the future of the world. If we use an entropy rule to grade both of these people, we will find that no matter the outcome person B will have exactly twice the entropy (penalty) that person A has, so if we afterwards want to compare results of two people, only one of whom doubled up on the predictions, there is an easy way to do it (just double the penalty for those who didn't). So far so good: everything logically consistent, making the same claim about the world still easily lets you compare results aftewards. Nevertheless, there are two (related) things that need to be remarked, which is what I think all the controversy is over:
1) If, instead of the correct log weight rule, we use something stupid like a least-squares (or just eyeballing it per bracket), there is a significant difference between our people A and B above, precisely in their 50% predictions. For any probability assignment other than 50% the error rate at probability P and at 1-P are related and opposite, since getting a probability P prediction right (say, X), means getting a probability 1-P prediction wrong (~X). But for 50% these two get added up (with our stupid scoring rules) before being used to deduce calibration results. As a result we find that our doubler, player B, will always have exactly half of his 50% predictions right, which will score really well on stupid scoring rules (as an extreme example, to a naive scoring rule somebody who predicts 50% on every claim, regardless of logical constency, will seem to be perfectly calibrated).
2) Once we use a good scoring rule, i.e. the log rule, we can easily jump back and forth between people who double up on the claims and those who do not, as claimed/shown above.
In view of these two points I think that all of the magic is hidden in the scoring rule, not in the procedure used when recording the predictions. In other words, this doubling up does nothing useful. And since on calibration graphs people tend to think that getting half of your 50% predictions is really good, I say that the doubling version is actually slightly more misleading. The solution is clearly to use a proper scoring rule, and then you can do whatever you wish. But in reality it is best to not confuse your audience by artificially creating more dependencies between your predictions.
Subscribe to RSS Feed
= f037147d6e6c911a85753b9abdedda8d)
Thank you for your response.
You're welcome! And I'm sorry if I went a little overboard. I didn't mean it to sound confrontational.