Comment author: isionous 13 January 2016 04:50:45PM 0 points [-]

Thank you for your response.

Comment author: jbay 14 January 2016 07:22:26AM 0 points [-]

You're welcome! And I'm sorry if I went a little overboard. I didn't mean it to sound confrontational.

Comment author: TheMajor 12 January 2016 10:34:35PM *  0 points [-]

I think it would be silly to resist the idea that "X with probability P(X)" is equivalent to "~X with probability 1-P(X)". This statement is simply true.

However, it does not imply that prediction lists like this should include X and ~X as possible claims. To see this, let's consider person A who only lists "X, probability P", and person B who lists "X, probability P, and ~X, probability 1-P". Clearly these two are making the exact same claim about the future of the world. If we use an entropy rule to grade both of these people, we will find that no matter the outcome person B will have exactly twice the entropy (penalty) that person A has, so if we afterwards want to compare results of two people, only one of whom doubled up on the predictions, there is an easy way to do it (just double the penalty for those who didn't). So far so good: everything is logically consistent, and making the same claim about the world still lets you easily compare results afterwards. Nevertheless, there are two (related) things that need to be remarked, which is what I think all the controversy is over:

1) If, instead of the correct log weight rule, we use something stupid like a least-squares rule (or just eyeballing it per bracket), there is a significant difference between our people A and B above, precisely in their 50% predictions. For any probability assignment other than 50% the error rates at probability P and at 1-P are related and opposite, since getting a probability P prediction right (say, X) means getting a probability 1-P prediction wrong (~X). But for 50% these two get added up (with our stupid scoring rules) before being used to deduce calibration results. As a result we find that our doubler, player B, will always have exactly half of his 50% predictions right, which will score really well on stupid scoring rules (as an extreme example, to a naive scoring rule somebody who predicts 50% on every claim, regardless of logical consistency, will seem to be perfectly calibrated).

2) Once we use a good scoring rule, i.e. the log rule, we can easily jump back and forth between people who double up on the claims and those who do not, as claimed/shown above.
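
The factor of two between the two bookkeeping styles can be checked with a short Python sketch. It assumes the "entropy rule" means the summed negative log of the probability assigned to what actually happened; the helper names (`log_penalty`, `total_penalty`, `double_up`) and the example predictions are invented for illustration and are not taken from the post.

```python
import math

def log_penalty(p, happened):
    """Negative log of the probability assigned to the actual outcome."""
    return -math.log(p if happened else 1.0 - p)

def total_penalty(predictions):
    """Summed (not averaged) penalty over (probability, happened) pairs."""
    return sum(log_penalty(p, happened) for p, happened in predictions)

def double_up(predictions):
    """Add the complementary claim '~X with probability 1-p' for every claim."""
    return predictions + [(1.0 - p, not happened) for p, happened in predictions]

# Hypothetical record for person A: (stated probability, did X happen?)
person_a = [(0.7, True), (0.9, False), (0.5, True)]
# Person B states the same beliefs but also writes down every ~X.
person_b = double_up(person_a)

penalty_a = total_penalty(person_a)
penalty_b = total_penalty(person_b)
print(penalty_a, penalty_b)  # penalty_b is twice penalty_a, up to rounding
```

Each complementary claim earns the same penalty as the claim it mirrors, which is why the summed total doubles regardless of the outcomes.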

In view of these two points I think that all of the magic is hidden in the scoring rule, not in the procedure used when recording the predictions. In other words, this doubling up does nothing useful. And since on calibration graphs people tend to think that getting half of your 50% predictions right is really good, I say that the doubling version is actually slightly more misleading. The solution is clearly to use a proper scoring rule, and then you can do whatever you wish. But in reality it is best to not confuse your audience by artificially creating more dependencies between your predictions.

Comment author: jbay 13 January 2016 08:18:07AM 1 point [-]

X and ~X will always receive the same score by both the logarithmic and least-squares scoring rules that I described in my post, although I certainly agree that the logarithm is a better measure. If you dispute that point, please provide a numerical example.

Because of the 1/N factor outside the sum, doubling predictions does not affect your calibration score (as it shouldn't!). This factor is necessary or your score would only ever get successively worse the more predictions you make, regardless of how good they are. Thus, including X and ~X in the enumeration neither hurts nor helps your calibration score (regardless of whether using the log or the least-squares rule).
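
As a numerical check of the two paragraphs above, here is a minimal Python sketch. The post's exact formulas are not quoted in this thread, so the sketch assumes the usual forms: the log rule scores ln(p) for a claim that came true and ln(1-p) otherwise, the least-squares rule scores (p - outcome)^2, and both are averaged with a 1/N factor. The function names and the example predictions are hypothetical.

```python
import math

def log_score(p, happened):
    """Log rule: ln of the probability assigned to the actual outcome."""
    return math.log(p if happened else 1.0 - p)

def squared_error(p, happened):
    """Least-squares rule: squared distance between stated p and the 0/1 outcome."""
    return (p - (1.0 if happened else 0.0)) ** 2

def mean_score(rule, predictions):
    """Average score, i.e. the sum with a 1/N factor in front."""
    return sum(rule(p, h) for p, h in predictions) / len(predictions)

def double_up(predictions):
    """Enumerate '~X with probability 1-p' alongside every 'X with probability p'."""
    return predictions + [(1.0 - p, not h) for p, h in predictions]

# Hypothetical prediction record: (stated probability, did it happen?)
original = [(0.8, True), (0.6, False), (0.5, True), (0.95, True)]
doubled = double_up(original)

for rule in (log_score, squared_error):
    print(rule.__name__, mean_score(rule, original), mean_score(rule, doubled))
```

Under these assumed definitions, each ~X entry gets the same score as its X entry, so the sum doubles and the 1/N factor halves it again.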

I agree that eyeballing a calibration graph is no good either. That was precisely the point I made with the lottery ticket example in the main post, where the prediction score is lousy but the graph looks perfect.

I agree that there's no magic in the scoring rule. Doubling predictions is unnecessary for practical purposes; the reason I detail it here is to make a very important point about how calibration works in principle. This point needed to be made, in order to address the severe confusion that was apparent in the Slate Star Codex comment threads, because there was widespread disagreement about what exactly happens at 50%.

I think we both agree that there should be no controversy about this -- however, go ahead and read through the SSC thread to see how many absurd solutions were being proposed! That's what this post is responding to! What is made clear by enumerating both X and ~X in the bookkeeping of predictions -- a move to which there is no possible objection, because it is no different from the original prediction, nor does it affect a proper score in any way -- is that there is no reason to treat 50% as though it has special properties that are different from 50.01%, and there's certainly no reason to think that there is any significance to the choice between writing "X, with probability P" and "~X, with probability 1-P", even when P=50%.

If you still object to doubling the predictions, you can instead choose to take Scott's predictions and replace all X with ~X, and all P with 1-P. Do you agree that this new set should be just as representative of Scott's calibration as his original prediction set?

Comment author: isionous 12 January 2016 06:55:36PM 0 points [-]

Could you comment about how my strategy outlined above would not give useful information?

Comment author: jbay 13 January 2016 07:56:06AM *  1 point [-]

The calibration you get, by the way, will be better represented by the fact that if you assigned 50% to the candidate that lost, then you'll necessarily have assigned a very low probability to the candidate that won, and that will be the penalty that will tell you your calibration is wrong.

The problem is the definition of more specific. How do you define specific? The only consistent definition I can think of is that a proposition A is more specific than B if the prior probability of A is smaller than that of B. Do you have a way to consistently tell whether one phrasing of a proposition is more or less specific than another?

By that definition, if you have 10 candidates and no information to distinguish them, then the prior for any candidate to win is 10%. Then you can say "A: Candidate X will win" is more specific than "~A: Candidate X will not win", because P(A) = 10% and P(~A) = 90%.

Since the proposition "A with probability P" is the exact same claim as the proposition "~A with probability 1-P", there is no consistent definition of "specific" that will let one phrasing be more specific than the other when P = 50%.

"Candidate X will win the election" is only more specific than "Candidate X will not win the election" if you think that it's more likely that Candidate X will not win.

For example, by your standard, which of these claims feels more specific to you?

A: Trump will win the 2016 Republican nomination

B: One of either Scott Alexander or Eliezer Yudkowsky will win the 2016 Republican nomination

If you agree that "more specific" means "less probable", then B is a more specific claim than A, even though there are twice as many people to choose from in B.

Which of these phrasings is more specific?

C: The winner of the 2016 Republican nomination will be a current member of the Republican party (membership: 30.1 million)

~C: The winner of the 2016 Republican nomination will not be a current member of the Republican party (non-membership: 7.1 billion, or 289 million if you only count Americans).

The phrasing "C" certainly specifies a smaller number of people, but I think most people would agree that ~C is much less probable, since all of the top-polling candidates are party members. Which phrasing is more specific by your standard?

If you have 10 candidates, it might seem more specific to phrase a proposition as "Candidate X will win the election with probability 50%" than "Candidate X will not win the election with probability 50%". That intuition comes from the fact that an uninformed prior assigns them all 10% probability, so a claim that any individual one will win feels more specific in some way. But actually the specificity comes from the fact that if you claim 50% probability for one candidate when the uninformed prior was 10%, you must have access to some information about the candidates that allows you to be so confident. This will be properly captured by the log scoring rule; if you really do have such information, then you'll get a better score by claiming 50% probability for the one most likely to win rather than 10% for each.
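
To make the last point concrete, here is a small Python sketch of the expected log score in the 10-candidate case. It assumes the score for a full distribution is ln of the probability placed on the eventual winner, and that the "confident" forecaster spreads the remaining 50% evenly over the other nine candidates; both are illustrative assumptions, as are the variable names.

```python
import math

def expected_log_score(stated, true):
    """Expected log score: sum over candidates of true chance * ln(stated chance)."""
    return sum(t * math.log(s) for s, t in zip(stated, true))

N = 10
uniform = [1.0 / N] * N
# 50% on candidate 1, the other 50% split evenly over the remaining nine (assumed).
confident = [0.5] + [0.5 / (N - 1)] * (N - 1)

# Case 1: the forecaster really has information; the true chances match `confident`.
informed_confident = expected_log_score(confident, confident)
informed_uniform = expected_log_score(uniform, confident)

# Case 2: there is no such information; the true chances are the uniform prior.
uninformed_confident = expected_log_score(confident, uniform)
uninformed_uniform = expected_log_score(uniform, uniform)

print(informed_confident, informed_uniform)      # claiming 50% wins here
print(uninformed_confident, uninformed_uniform)  # claiming 50% loses here
```

With real information the 50% claim beats the flat 10% claim, and without it the 50% claim is penalized, with no need to decide which phrasing is "more specific".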

Ultimately, the way you get information about your calibration is by seeing how well your full probability distribution about the odds of each candidate performs against reality. One will win, nine will lose, and the larger the probability mass you put on the winner, the better you do. Calibration is about seeing how well your beliefs score against reality; if your score depends on which of two logically equivalent phrasings you choose to express the same beliefs, there is some fundamental inconsistency in your scoring rule.

Comment author: isionous 12 January 2016 02:52:49PM 1 point [-]

The method described in my post handles this situation perfectly well. All of your 50% predictions will (necessarily) come true 50% of the time, but you rack up a good calibration score if you do well on the rest of the predictions.

Seems like you're giving up trying to get useful information about yourself from the 50% confidence predictions. Do you agree?

Comment author: jbay 12 January 2016 05:46:49PM 0 points [-]

Yes, but only because I don't agree that there was any useful information that could have been obtained in the first place.

Comment author: TheMajor 05 January 2016 06:55:30PM 0 points [-]

The whole point of assigning 50% probability to a claim is that you literally have no idea whether or not it will happen. So of course including X or ~X in any statement is going to be arbitrary. That's what 50% means.

However, this is not solved by doubling up on your predictions, since now (by construction) your predictions are very dependent. I don't understand the controversy about Scott getting 0/3 on 50% predictions - it even happens to perfectly calibrated people 1/8 of the time, let alone real humans. If you have a long list of statements you are 50% certain about, you have literally no reason to put one side of an issue rather than the other side on your prediction list. If, however, afterwards it turns out that significantly less than half of your (arbitrarily chosen) sides turned out to be right, you probably aren't very good at recognising when you are 50% confident (to make this more clear, imagine Scott had gotten 0/100 instead of 0/3 on his 50% predictions).
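
The 1/8 figure, and the contrast with 0/100, can be checked with a binomial calculation. This is a minimal Python sketch that assumes the 50% predictions are independent; `prob_exactly` is a name made up for the example.

```python
from math import comb

def prob_exactly(k, n, p=0.5):
    """Chance that exactly k of n independent claims, each true with chance p, come true."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)

zero_of_three = prob_exactly(0, 3)
zero_of_hundred = prob_exactly(0, 100)

print(zero_of_three)    # 0.125, i.e. 1/8
print(zero_of_hundred)  # 2**-100, below 1e-30
```

A perfectly calibrated forecaster goes 0/3 one time in eight, so that result is weak evidence, whereas 0/100 would be essentially impossible by chance.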

Comment author: jbay 12 January 2016 08:07:03AM *  0 points [-]

I don't understand why there is so much resistance to the idea that stating "X with probability P(X)" also implies "~X with probability 1-P(X)". The point of assigning probabilities to a prediction is that it represents your state of belief. Both statements uniquely specify the same state of belief, so to treat them differently based on which one you wrote down is irrational. Once you accept that these are the same statement, the conclusion in my post is inevitable, the mirror symmetry of the calibration curve becomes obvious, and given that symmetry, all lines must pass through the point (0.5,0.5).
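
The mirror symmetry can be seen in a toy calibration curve. The Python sketch below uses integer percentages as buckets and an invented prediction record; `calibration_curve` and `with_complements` are hypothetical helper names, and the bucketing is just one simple way to draw such a curve.

```python
import math
from collections import defaultdict

def calibration_curve(predictions):
    """Map each stated percentage to the observed fraction of claims that came true."""
    buckets = defaultdict(list)
    for percent, happened in predictions:
        buckets[percent].append(happened)
    return {percent: sum(hits) / len(hits) for percent, hits in buckets.items()}

def with_complements(predictions):
    """Treat 'X at P%' as also stating '~X at (100-P)%'."""
    return predictions + [(100 - percent, not happened) for percent, happened in predictions]

# Invented record: (stated percent, did it happen?)
stated = [(70, True), (70, True), (70, False), (90, True),
          (50, False), (50, False), (50, True)]
curve = calibration_curve(with_complements(stated))

for percent in sorted(curve):
    print(percent, curve[percent])
```

The observed frequencies at P% and at (100-P)% always add up to 1, so the 50% bucket is forced to 0.5 even though only 1 of the 3 original 50% claims came true.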

Imagine the following conversation:

A: "I predict with 50% certainty that Trump will not win the nomination".

B: "So, you think there's a 50% chance that he will?"

A: "No, I didn't say that. I said there's a 50% chance that he won't."

B: "But you sort of did say it. You said the logically equivalent thing."

A: "I said the logically equivalent thing, yes, but I said one and I left the other unsaid."

B: "So if I believe there's only a 10% chance Trump will win, is there any doubt that I believe there's a 90% chance he won't?"

A: "Of course, nobody would disagree, if you said there's a 10% chance Trump will win, then you also must believe that there's a 90% chance that he won't. Unless you think there's some probability that he both will and will not win, which is absurd."

B: "So if my state of belief that there's a 10% chance of A necessarily implies I also believe a 90% chance of ~A, then what is the difference between stating one or the other?"

A: "Well, everyone agrees that makes sense for 90% and 10% confidence. It's only for 50% confidence that the rules are different and it matters which one you don't say."

B: "What about for 50.000001% and 49.999999%?"

A: "Of course, naturally, that's just like 90% and 10%."

B: "So what's magic about 50%?"

Comment author: isionous 11 January 2016 11:19:31PM *  2 points [-]

For the problem of how to phrase the 50% confidence predictions, it might be useful to use the more specific proposition (or one that has the smaller uninformative prior probability). For instance, if you have a race between 10 candidates, and you think candidate X has a 50% chance to win, you should phrase your prediction as "Candidate X will win the election" rather than "Candidate X will not win the election".

If you consistently phrase your 50% confidence predictions this way, then your prediction results tell you something useful. If your 50% confidence predictions come true only 10% of the time, then maybe your problem is overconfidence in your reasons to deviate from uninformed priors or overconfidence in concocting reasons to believe a certain hypothesis in the vast hypothesis space.

edit: The biggest weakness of this approach is what do you do when you're choosing between something like "this coin flip will come up heads" and "this coin flip will come up tails"? Or when it is unclear what an uninformed prior would even be.

Comment author: jbay 12 January 2016 07:52:38AM *  0 points [-]

"Candidate X will win the election with 50% probability" also implies the proposition "Candidate X will not win the election with 50% probability". If you propose one, you are automatically proposing both, and one will inevitably turn out true and the other false.

If you want to represent your full probability distribution over 10 candidates, you can still represent it as binary predictions. It will look something like this:

Candidate 1 will win the election: 50% probability

Candidate 2 will win the election: 10% probability

Candidate 3 will win the election: 10% probability

Candidate 4 will win the election: 10% probability

Candidate 5 will win the election: 10% probability

Candidate 6 will win the election: 2% probability

...

Candidate 1 will not win the election: 50% probability

Candidate 2 will not win the election: 90% probability

...

The method described in my post handles this situation perfectly well. All of your 50% predictions will (necessarily) come true 50% of the time, but you rack up a good calibration score if you do well on the rest of the predictions.

Comment author: entirelyuseless 06 January 2016 04:01:00PM 0 points [-]

Lotteries frequently have secondary prizes. I have never bought a lottery ticket, but once I won $50 with a lottery ticket that someone else bought for me (which I did not ask for). If you include results like this as a win, your probability of under one in a billion of such a win may be overconfident.

Comment author: jbay 07 January 2016 01:28:52AM 0 points [-]

Agreed.

Comment author: ChristianKl 04 January 2016 11:44:26AM 0 points [-]

Let's be naughty and look at it from the frequentist perspective. What does "50%: Jeb Bush is the top polling Republican candidate" mean?

It translates into 50% of the possible worlds have Jeb as the top polling Republican candidate. Does that mean the person making that statement says 50 out of 100 possible worlds have Jeb Bush as the top polling Republican candidate but not 49 or 51 possible worlds?

Maybe it should be read as "in between 45 and 54 possible worlds Jeb is top polling" in some contexts, but as "between 30 and 70 possible worlds" in other contexts.

Comment author: jbay 05 January 2016 07:41:05AM 1 point [-]

I'm really not so sure what a frequentist would think. How would they express "Jeb Bush will not be the top-polling Republican candidate" in the form of a repeated random experiment?

It seems to me more likely that a frequentist would object to applying probabilities to such a statement.

Comment author: Manfred 04 January 2016 08:33:56PM *  0 points [-]

Failing to assign the correct probability given your information is a failure both of accuracy and of calibration.

Suppose you take a test of many multiple choice questions (say, 5 choices), and for each question I elicit from you your probability of having the right answer. Accuracy is graded by your total score on the test. Calibration is graded by your log-score on the probabilities. Our lottery enthusiast might think they're 50% likely to have the right answer even when they are picking randomly - and because of this they will have a lower log score than someone who correctly thinks they have a 1/5 chance. These two people may have the same scores on the test, but they will have different scores on their ability to assign probabilities.
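
A rough numerical version of this example, as a Python sketch: it assumes both test-takers guess at random among 5 choices (so both are right 1/5 of the time) and that the log score per question is ln(p) when right and ln(1-p) when wrong. The function name is made up for the illustration.

```python
import math

def expected_log_score(stated, true_chance):
    """Expected log score per question when claiming `stated` but being right with `true_chance`."""
    return true_chance * math.log(stated) + (1.0 - true_chance) * math.log(1.0 - stated)

true_chance = 1 / 5  # picking at random among 5 choices

overconfident = expected_log_score(0.5, true_chance)  # claims 50% on every answer
calibrated = expected_log_score(0.2, true_chance)     # claims 1/5 on every answer

print(overconfident, calibrated)
```

Both people expect the same test score, since each gets about one question in five right, but the one who states 1/5 has the higher expected log score.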

Comment author: jbay 05 January 2016 07:23:12AM *  0 points [-]

I have updated my post to respond to your concerns, expanding on your lottery example in particular. Let me know if I've adequately addressed them.

Comment author: Manfred 04 January 2016 08:40:03AM *  0 points [-]

X and not-X should only be treated symmetrically when the predictor's information about them was symmetrical. I.e. rarely.

For example, suppose someone buys a lottery ticket each week and gives themselves a 50% chance of winning (after all, they could either win or not win, so that's 50/50). This person is known to not be well-calibrated.

Also, rather than squared error, it's probably best to use the log scoring rule (has decent information-theory motivation, does a better job at handling high and low probabilities).

Comment author: jbay 04 January 2016 04:56:30PM *  0 points [-]

I intend to update the article later to include log error. Thanks!

The lottery example though is a perfect reason to be careful about how much importance you place on calibration over accuracy.
