In row 8 of the table, P(D) should be replaced by P(~D).
A note about calibration of confidence
Background
In a recent Slate Star Codex Post (http://slatestarcodex.com/2016/01/02/2015-predictions-calibration-results/), Scott Alexander made a number of predictions and presented associated confidence levels, and then at the end of the year, scored his predictions in order to determine how well-calibrated he is. In the comments, however, there arose a controversy over how to deal with 50% confidence predictions. As an example, Scott has these predictions at 50% confidence, among his others:
| | Proposition | Scott's Prior | Result |
|---|---|---|---|
| A | Jeb Bush will be the top-polling Republican candidate | P(A) = 50% | A is False |
| B | Oil will end the year greater than $60 a barrel | P(B) = 50% | B is False |
| C | Scott will not get any new girlfriends | P(C) = 50% | C is False |
| D | At least one SSC post in the second half of 2015 will get > 100,000 hits | P(D) = 70% | D is False |
| E | Ebola will kill fewer people in the second half of 2015 than in the first half | P(E) = 95% | E is True |
Scott goes on to score himself as having made 0/3 correct predictions at the 50% confidence level, which looks like significant overconfidence. He addresses this by noting that 3 data points isn't much to go by, and the result could easily have looked well calibrated if any of those outcomes had turned out differently. His resulting calibration curve is this:
[Image: Scott's 2015 calibration curve]
However, the commenters had other objections about the anomaly at 50%. After all, P(A) = 50% implies P(~A) = 50%, so the prediction "I will not get any new girlfriends: 50% confidence" is logically equivalent to "I will get at least 1 new girlfriend: 50% confidence", except that one comes out true and the other false. The score therefore seems sensitive only to the particular phrasing chosen, independent of the outcome.
One commenter suggests that close to perfect calibration at 50% confidence can be achieved by choosing whether to represent propositions as positive or negative statements by flipping a fair coin. Another suggests replacing 50% confidence with 50.1% or some other number arbitrarily close to 50%, but not equal to it. Others suggest getting rid of the 50% confidence bin altogether.
Scott recognizes that predicting A and predicting ~A are logically equivalent, and that choosing one or the other is arbitrary. But by including only A in his data set rather than ~A, he creates a problem when P(A) = 50%: the equally arbitrary choice to phrase the prediction as ~A would have changed the calibration results, despite being the same prediction.
Symmetry
This conundrum illustrates an important point about these calibration exercises. By convention, Scott phrases all of his propositions as statements to which he assigns a probability greater than or equal to 50%, recognizing that he doesn't need to separately calibrate probabilities below 50%: the upper half of the calibration curve captures all the relevant information about his calibration.
This is because the calibration curve is symmetric about the 50% mark, as implied by the mathematical relation P(X) = 1 − P(~X), or equivalently P(~X) = 1 − P(X).
We can enforce that symmetry by recognizing that when we make the claim that proposition X has probability P(X), we are also simultaneously making the claim that proposition ~X has probability 1-P(X). So we add those to the list of predictions and do the bookkeeping on them too. Since we are making both claims, why not be clear about it in our bookkeeping?
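This bookkeeping is easy to mechanize. Here's a minimal sketch in Python (the `symmetrize` helper is my own, not from the post): each prediction is a (confidence, outcome) pair, and for every claim P(X) = p we also record the implied claim P(~X) = 1 − p.

```python
def symmetrize(predictions):
    """Return the prediction list augmented with each prediction's complement."""
    full = []
    for p, outcome in predictions:
        full.append((p, outcome))
        full.append((1.0 - p, not outcome))  # the implied claim about ~X
    return full

# Scott's five predictions, as (confidence, outcome) pairs:
scott = [(0.50, False), (0.50, False), (0.50, False), (0.70, False), (0.95, True)]
full = symmetrize(scott)

# The 50% bin now contains an equal number of true and false results,
# so that bin can never be out of calibration.
fifty = [outcome for p, outcome in full if p == 0.50]
print(sum(fifty) / len(fifty))  # -> 0.5
```

Every 50% prediction brings its complement into the same bin, which is why the observed frequency there is pinned to exactly one half.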
When we do this, we get the full calibration curve, and the confusion about what to do about 50% probability disappears. Scott’s list of predictions looks like this:
| | Proposition | Scott's Prior | Result |
|---|---|---|---|
| A | Jeb Bush will be the top-polling Republican candidate | P(A) = 50% | A is False |
| ~A | Jeb Bush will not be the top-polling Republican candidate | P(~A) = 50% | ~A is True |
| B | Oil will end the year greater than $60 a barrel | P(B) = 50% | B is False |
| ~B | Oil will not end the year greater than $60 a barrel | P(~B) = 50% | ~B is True |
| C | Scott will not get any new girlfriends | P(C) = 50% | C is False |
| ~C | Scott will get new girlfriend(s) | P(~C) = 50% | ~C is True |
| D | At least one SSC post in the second half of 2015 will get > 100,000 hits | P(D) = 70% | D is False |
| ~D | No SSC post in the second half of 2015 will get > 100,000 hits | P(~D) = 30% | ~D is True |
| E | Ebola will kill fewer people in the second half of 2015 than in the first half | P(E) = 95% | E is True |
| ~E | Ebola will kill as many or more people in the second half of 2015 than in the first half | P(~E) = 5% | ~E is False |
You will by now have noticed that there is always an even number of predictions, and that exactly half of them are true and half are false. In most cases, as with E and ~E, that means a 95%-likely prediction that is true is paired with a 5%-likely prediction that is false, which is what you would expect. A 50%-likely prediction, however, is always accompanied by another 50%-likely prediction, one of which is true and one of which is false. As a result, it is actually not possible to make a binary prediction at 50% confidence that is out of calibration.
The resulting calibration curve, applied to Scott’s predictions, looks like this:
[Image: the full, symmetric calibration curve for Scott's predictions]
Sensitivity
By the way, this graph doesn't tell the whole calibration story; as Scott noted, it's still sensitive to how many predictions were made in each bucket. We can add "error bars" showing what would have resulted if Scott had made one more prediction in each bucket, depending on whether that extra prediction came out true or false. The result is the following graph:
[Image: calibration curve with error bars]
Note that the error bars are zero at the 0.5 point. That's because, under the symmetric bookkeeping, an additional prediction in that bucket arrives together with its complement and so has no effect on the average. That point is fixed by the inherent symmetry.
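To make the error-bar construction concrete, here is a small sketch (the function name is my own, not from the post): with n predictions in a bucket of which k came out true, one extra prediction moves the observed frequency to either k/(n+1) or (k+1)/(n+1). In the 50% bucket, predictions only arrive in complementary pairs, so the frequency cannot move.

```python
def error_bar(k, n):
    """Range of observed frequencies after one more prediction in the bucket."""
    return (k / (n + 1), (k + 1) / (n + 1))

# An ordinary bucket: 7 of 10 predictions true.
lo, hi = error_bar(7, 10)
print(round(lo, 3), round(hi, 3))  # -> 0.636 0.727

# The 50% bucket: a new prediction brings its complement along, so a bucket
# with k true out of n = 2k moves to (k + 1)/(n + 2), which is still 0.5.
k, n = 5, 10
print((k + 1) / (n + 2))  # -> 0.5
```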
I believe that this kind of graph does a better job of showing someone’s true calibration. But it's not the whole story.
Ramifications for scoring calibration (updated)
Clearly, it is not possible to make a binary prediction with 50% confidence that is poorly calibrated. This shouldn't come as a surprise; a 50% prediction between two choices represents the correct prior for the case where you have no information that discriminates between X and ~X. However, that doesn't mean you can improve your ability to make correct predictions just by giving them all 50% confidence and claiming impeccable calibration! An easy way to "cheat" your way into apparently good calibration is to take a large number of predictions that you are highly (>99%) confident about, negate a fraction of them, and falsely record a lower confidence for those. If we're going to measure calibration, we therefore need a scoring method that encourages people to write down the probabilities they actually believe, claiming 50% confidence only when they genuinely have 50% confidence, rather than faking low confidence and ignoring their data.
A first guess would be to look at that graph and do the classic assessment of fit: sum of squared errors. We can sum the squared error of our predictions against the ideal linear calibration curve. If we did this, we would want to make sure we summed all the individual predictions, rather than the averages of the bins, so that the binning process itself doesn’t bias our score.
If we do this, then our overall prediction score can be summarized by one number:

S = (1/N) Σ_i (X_i − P(X_i))²

Here P(X_i) is the assigned confidence in the truth of X_i, and X_i is the i-th proposition, taking the value 1 if it is true and 0 if it is false. S is the prediction score, and lower is better. Note that because these are binary predictions, the sum of squared errors gives an optimal score if you assign the probabilities you actually believe (i.e., there is no way to "cheat" your way to a better score by reporting false confidence).
In this case, Scott's score is S = 0.139; much of this comes from the 0.4/0.6 bracket. The worst possible score is S = 1, and the best possible score is S = 0. Attempting to fake perfect calibration by claiming 50% confidence for every prediction, regardless of the information you actually have available, yields S = 0.25 and therefore isn't a particularly good strategy (at least, it won't make you look better calibrated than Scott).
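As a sketch of this scoring rule (my own minimal Python, with outcomes encoded as 1 for True and 0 for False):

```python
def brier(predictions):
    """Mean squared error between confidence and outcome (True -> 1, False -> 0)."""
    return sum((float(x) - p) ** 2 for p, x in predictions) / len(predictions)

# Perfect predictions score 0.
print(brier([(1.0, True), (0.0, False)]))  # -> 0.0

# Hedging everything at 50% scores 0.25 no matter what the outcomes are.
hedged = [(0.5, True)] * 10 + [(0.5, False)] * 10
print(brier(hedged))  # -> 0.25
```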
Several of the commenters pointed out that log scoring is another scoring rule that works better in the general case. Before posting this I ran the calculus to confirm that the least-squares error did encourage an optimal strategy of honest reporting of confidence, but I did have a feeling that it was an ad-hoc scoring rule and that there must be better ones out there.
The logarithmic scoring rule looks like this:

S = (1/N) Σ_i [X_i ln(P(X_i)) + (1 − X_i) ln(1 − P(X_i))]

Here again X_i is the i-th proposition, taking the value 1 if it is true and 0 if it is false. The base of the logarithm is arbitrary, so I've chosen base e as it makes it easier to take derivatives. This scoring method gives a negative number, and the closer to zero the better. The log scoring rule has the same honesty-encouraging properties as the sum of squared errors, plus the additional nice property that it penalizes wrong predictions of 100% or 0% confidence with an appropriate score of minus infinity. When you claim 100% confidence and are wrong, you are infinitely wrong. Don't claim 100% confidence!
In this case, Scott's score is calculated to be S=-0.42. For reference, the worst possible score would be minus-infinity, and claiming nothing but 50% confidence for every prediction results in a score of S=-0.69. This just goes to show that you can't win by cheating.
Example: Pretend underconfidence to fake good calibration
In an attempt to appear like I have better calibration than Scott Alexander, I am going to make the following predictions. For clarity I have included the inverse propositions in the list (as those are also predictions that I am making), but at the end of the list so you can see the point I am getting at a bit better.
| | Proposition | Quoted Prior | Result |
|---|---|---|---|
| A | I will not win the lottery on Monday | P(A) = 50% | A is True |
| B | I will not win the lottery on Tuesday | P(B) = 66% | B is True |
| C | I will not win the lottery on Wednesday | P(C) = 66% | C is True |
| D | I will win the lottery on Thursday | P(D) = 66% | D is False |
| E | I will not win the lottery on Friday | P(E) = 75% | E is True |
| F | I will not win the lottery on Saturday | P(F) = 75% | F is True |
| G | I will not win the lottery on Sunday | P(G) = 75% | G is True |
| H | I will win the lottery next Monday | P(H) = 75% | H is False |
| … | | | |
| ~A | I will win the lottery on Monday | P(~A) = 50% | ~A is False |
| ~B | I will win the lottery on Tuesday | P(~B) = 34% | ~B is False |
| ~C | I will win the lottery on Wednesday | P(~C) = 34% | ~C is False |
| … | | | |
Look carefully at this table. I've thrown in a particular mix of predictions that I will or will not win the lottery on certain days, in order to use my extreme certainty about the result to generate a particular mix of correct and incorrect predictions.
To make things even easier for me, I'm not even planning to buy any lottery tickets. Knowing this, an honest estimate of the odds of my winning the lottery is astronomically small. The odds of winning are about 1 in 14 million (for the Canadian 6/49 lottery), and I'd have to win by accident (one of my relatives buying me a ticket?). Not only that, but since the lottery is only held on Wednesday and Saturday, most of these scenarios are even more implausible: the lottery corporation would have to hold the draw by mistake.
I am confident I could make at least 1 billion similar statements of this exact nature and get them all right, so my true confidence must be upwards of (100% - 0.0000001%).
If I assemble 50 of these types of strategically-underconfident predictions (and their 50 opposites) and plot them on a graph, here’s what I get:
[Image: calibration curve for the strategically underconfident lottery predictions]
You can see that the problem with cheating doesn’t occur only at 50%. It can occur anywhere!
But here’s the trick: The log scoring algorithm rates me -0.37. If I had made the same 100 predictions all at my true confidence (99.9999999%), then my score would have been -0.000000001. A much better score! My attempt to cheat in order to make a pretty graph has only sabotaged my score.
By the way, what if I had gotten one of those wrong, and actually won the lottery one of those times without even buying a ticket? In that case my score is -0.41 (the wrong prediction had a probability of 1 in 10^9 which is about 1 in e^21, so it’s worth -21 points, but then that averages down to -0.41 due to the 49 correct predictions that are collectively worth a negligible fraction of a point).* Not terrible! The log scoring rule is pretty gentle about being very badly wrong sometimes, just as long as you aren’t infinitely wrong. However, if I had been a little less confident and said the chance of winning each time was only 1 in a million, rather than 1 in a billion, my score would have improved to -0.28, and if I had expressed only 98% confidence I would have scored -0.098, the best possible score for someone who is wrong one in every fifty times.
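Those averages are easy to check numerically. A quick sketch (my own code) for the case of 49 correct predictions at confidence c and one incorrect one:

```python
import math

def score_one_wrong(c, n=50):
    """Average log score: n - 1 predictions right at confidence c, one wrong."""
    return ((n - 1) * math.log(c) + math.log(1.0 - c)) / n

print(round(score_one_wrong(1 - 1e-9), 2))  # -> -0.41
print(round(score_one_wrong(1 - 1e-6), 2))  # -> -0.28
print(round(score_one_wrong(0.98), 3))      # -> -0.098
```

The single wrong prediction contributes ln(1e-9) ≈ −20.7, and dividing by 50 gives the quoted −0.41; the 49 correct predictions contribute almost nothing.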
This has another important ramification: if you're going to honestly test your calibration, you shouldn't pick the predictions you'll make. It is easy to improve your score by throwing in a couple of predictions that you are very certain about, like that you won't win the lottery, and by making few predictions that you are genuinely uncertain about. It is fairer to use a list of propositions generated by somebody else, and then pick your probabilities. Scott demonstrates his honesty by making public predictions about a mix of things he was genuinely uncertain about, but if he wanted to cheat his way to a better score in the future, he would avoid making any predictions in the 50% category that he wasn't forced to.
Input and comments are welcome! Let me know what you think!
* This result surprises me enough that I would appreciate if someone in the comments can double-check it on their own. What is the proper score for being right 49 times with 1-1 in a billion certainty, but wrong once?
Hi Mark,
Thanks for your well-considered post. Your departure will be a loss for the community, and sorry to see you go.
I also feel that some of the criticism you're posting here might be due to a misunderstanding, mainly regarding the validity of thought experiments, and of reasoning by analogy. I think both of these have a valid place in rational thought, and have generally been used appropriately in the material you're referring to. I'll make an attempt below to elaborate.
Reasoning by analogy, or, the outside view
What you call "reasoning by analogy" is well described in the sequence on the outside view. However, as you say,
The fundamental mistake here is that reasoning by analogy is not in itself a sufficient explanation for a natural phenomenon, because it says nothing about the context sensitivity or insensitivity of the original example and under what conditions it may or may not hold true in a different situation.
This is exactly the same criticism that Eliezer has of outside-view thinking, detailed in the sequences!
In outside view as a conversation halter:
Of course Robin Hanson has a different idea of what constitutes the reference class and so makes a rather different prediction - a problem I refer to as "reference class tennis"[...] But mostly I would simply decline to reason by analogy, preferring to drop back into causal reasoning in order to make weak, vague predictions.
You're very right that the uncertainty in the AI field is very high. I hope that work is being done to get a few data points and narrow down the uncertainty, but don't think that you're the first to object to an over-reliance on "reasoning by analogy". It's just that when faced with a new problem with no clear reference class, it's very hard to use the outside view, but unfortunately also hard to trust predictions from a model which has sensitive parameters with high uncertainties.
Thought experiments are a tool of deduction, not evidence
We get instead definitive conclusions drawn from thought experiments only.
This is similar to complaining about people arriving at definitive conclusions drawn from mathematical derivation only.
I want to stress that this is not a problem in most cases, especially not in physics. Physics is a field in which models are very general and held with high confidence, but often hard to use to handle complicated cases. We have a number of "laws" in physics that we have fairly high certainty of; nonetheless, the implications of these laws are not clear, and even if we believe them we may be unsure of whether certain phenomena are permitted by these laws or not. Of course we also do have to test our basic laws, which is why we have CERN and such, especially because we suspect they are incomplete (thanks in part to thought experiments!).
A thought experiment is not data, and you do not use conclusions from thought experiments to update your beliefs as though the thought experiment were producing data. Instead, you use thought experiments to update your knowledge of the predictions of the beliefs you already have. You can't just give an ordinary human the laws of physics written down on a piece of paper and expect them to immediately and fully understand the implications of the truth of those laws, or even to verify that the laws are not contradictory.
Thus, Einstein was able to use thought experiments very profitably to identify that the laws of classical mechanics (as formulated at the time) contradicted the laws of electrodynamics. No experimental evidence was needed; the thought experiment is a logical inference procedure that identifies one consequence of Maxwell's equations (that light travels at speed c in all reference frames) and shows it to be incompatible with Galilean relativity. A thought experiment, just like a mathematical proof by contradiction, can show that certain beliefs are mutually inconsistent and that one must be changed or discarded.
Thus, I take issue with this statement:
(thought experiments favored over real world experiments)
Thought experiments are not experiments at all, and cannot even be compared to experiments. They are a powerful tool for exploring theory, and should be compared to other tools of theory such as mathematics. Experiments are a powerful tool for checking your theory, but experiments alone are just data; they won't tell you what your theory predicted, or whether your theory is supported or refuted by the data. Theory is a powerful tool for exploring the spaces of mutually compatible beliefs, but without data you cannot tell whether a theory has relevance to reality or not.
It would make sense to protest that thought experiments are being used instead of math, which some think is a more powerful tool for logical inference. On the other hand, math fails at being accessible to a wide audience, while thought experiments are. But the important thing is that thought experiments are similar to math in their purpose. They are not at all like experiments; don't get their purposes confused!
Within Less Wrong, I have only ever seen thought experiments used for illustrating the consequences of beliefs, not for being taken as evidence. For example, the belief that "humans have self-sabotaging cognitive flaws, and a wide variation of talents" and the belief that "humans are about as intelligent as intelligent things can get" would appear to be mutually incompatible, but it's not entirely obvious and a valid space to explore with thought experiments.
The discussion itself is a good case study in complex communication. Look at the levels of indirection:
- A: What is true about growth, effort, ability, etc?
- B: What do people believe about A?
- C: What is true about people who hold the different beliefs in B?
- D: What does Dweck believe about C (and/or interventions to change B)?
- E: What does Scott believe about C (by way of discussing D, and also C, and B, and A)?
Yikes! Naturally, it's hard to keep these separate. From what I can tell, the conversation is mostly derailing because people didn't understand the differences between levels at all, or because they aren't taking pains to clarify what level they are currently talking about. So everyone gets that E is the "perspective" level, and that D is the contrasting perspective, but you have plenty of people confusing (at least in discussion) levels ABC, or A and BC, which makes progress on D and E impossible.
Upvoted because I think this is a really good point, which is almost totally missed in the surrounding discussion.
For example, it's interesting to see that a lot of the experiments were directly attempting to measure C: The researcher tries to persuade the child to believe something about A, and then measures their performance. But then that research gets translated in the lay press as demonstrating something about A!
I feel that if emr's post were put as a header to Scott's, the amount of confusion in the rebuttals would be reduced considerably.
Incidentally, I've observed a similarly common difficulty in understanding the distinction between derivative orders of a quantity, e.g. distinguishing between something "being large" and "growing fast". This seems less common among people trained in calculus, but even then, people often confuse them. I see it all the time in the press, and I wonder if there is a similar level-hopping neural circuit at work.
For example, there are three or four orders of differentiation that exist in common discussion of climate change, eg:
- A: Scientists recommend that atmospheric CO2 be kept below 350 ppm.
- B: Canada emits only about half a gigaton of CO2 per year, whereas China emits nearly twenty times that much.
- BB: Canada emits 15.7 tons of CO2 annually per capita, among the highest in the world, whereas China emits less than half of that amount per capita.
- C: China's emissions are among the fastest-growing in the world, up by nearly 500 million tonnes over last year. Canada decreased its emissions by 10 million tonnes over the same period.
- D: The growth in Canadian oil-industry emissions could slow if low prices force the industry to reduce expansion plans.
Et cetera...
Ostensibly what actually matters is A, which is dependent on the fourth integral of what is being discussed in D! People end up having a very hard time keeping these levels distinct, and much confusion and miscommunication ensues.
I wonder -- do you think students of calculus will be better at understanding the levels of indirection in either case?
I've had glasses for about a decade now, and my eyes have remained stable for over a year. The MSG recommendation had popped up a few times, so I'll try cutting it out.
Sadly, my migraines have evolved to some beast that constantly gnaws away at me, so I can't really take preventative measures. I would just love to have a break for a few days so I can get the edge in this struggle.
Thanks for replying; every data point helps.
That sounds beyond terrible. I really wish I could be of more help. I know exactly how awful it is to have a migraine for one hour, but I cannot fathom what it must be like to live with it perpetually.
Well, here is some general Less Wrong-style advice which I can try to offer. The first thing is that since you have been coping with this for so long, maybe you don't have a clear feeling for how much better life would be without this problem. If these migraines are as bad for you as I imagine they are, then I would recommend that you make curing yourself almost your first priority in life, as an instrumental goal for anything else that you care about.
I agree that it is worse than blindness. If I went blind, I would learn to cope and not invest all of my energies into restoring my vision. But if I were you, I would classify curing your migraines as a problem deserving an extraordinary effort as if your life itself were at stake ( http://lesswrong.com/lw/uo/make_an_extraordinary_effort/ ). That means going beyond the easy and obvious solutions that you have already tried (such as medication) and doing something out of the ordinary to succeed.
Treat this as mere speculation, since I'm not up to date on the migraine literature anymore, but as an example of an out-of-the-ordinary solution, you could try renting a different house for a month, moving to a different city, or even moving to a totally different country for a couple of weeks. The thinking is that if there is an environmental trigger, a shotgun approach that changes as many environmental variables as possible at once might find it. For example, if it turned out you have a sensitivity to something in your house, moving house for a while might work. If it turned out to be air pollution in your city, then moving to a cleaner environment might fix it. Unfortunately, unless the state of migraine knowledge has advanced a lot, I think the space of possible hypotheses is huge. So...
Basically, I'm suggesting that you might want to try something on the scale of a month-long trip to live with Buddhist monks in Nepal, or on a Kibbutz in Israel, or to a fishing village in Newfoundland, or something. Changing at once basically everything about your lifestyle from diet, exercise, environment, sleep schedule, electronic devices, and interpersonal interactions. It's not the kind of solution most people would try, especially since the daily responsibilities of life (work, family, money, etc) always seem to take priority, and nobody has the time to just go and leave for a month. Especially since you have a severe impairment which probably makes all the other things take even more time and effort. But that's the difference between making a desperate effort, and "trying to try" just to satisfy yourself that you've done as much as anyone else would do. If curing your migraines is your top priority in life, as I think it should be right now, then it's worth investing a year of your time.
Anyway, that's the only other thought I have. You should try the easy things first of course (starting with MSG), but before you give up make sure you understand how wide the space of possible solutions might be, and how many different lines of attack might exist that haven't even been thought of yet.
I suffered from severe migraines for most of my life, but they were much more frequent when I was younger -- about two or three times a week. During high school they decreased in frequency a lot, to maybe once a month or once every two months, and now that I am in my late 20s, I only get migraines once or twice a year. Unfortunately I can't give you much rationalist advice; although I discussed migraines with my doctor and we worked together to find a solution, to my understanding it's still not a scientifically well-understood problem. So all I can tell you is what I've found worked for me.
My symptoms were usually a severe throbbing pain around my head somewhere between my eyes and ears, light and sound sensitivity, nausea, and a strange kind of dizziness. Reorienting my head would affect the pain a lot, and so during a migraine I usually end up with my head tilted to one side.
I couldn't find any good way to rid myself of a migraine once it had started, but Advil worked well for dulling the pain, after about 15 minutes. I'd just lie in a dark room until it passed, which sometimes took up to an hour.
Preventative measures worked the best. I did learn some warning signs (well, feelings that are a bit hard to put into words) that a migraine might be impending, and after I noticed them I would try to remove myself to a quiet, low-stimulation environment, and make sure that I was well-hydrated and calm. That would often help avoid the migraine becoming severe. I found that migraines were always much, much worse when I was dehydrated, which as it happens is an easy condition not to notice.
I don't know exactly what has caused the large reduction in frequency of attacks for me, although I'm thankful for it. It could just be that my body grew and changed and now I don't get them as much. It also could be dietary changes -- I used to eat very different foods when I was a kid than I do now, including a lot of instant ramen noodles and other salty foods, and candy, whereas now I keep a much healthier diet. I cut out the ramen noodles at my doctor's suggestion that MSG might be a trigger, and that was definitely correlated with a major decrease in occurrence frequency, but it wasn't a controlled experiment and I changed a lot of other things too. I wouldn't be surprised if the dietary changes helped, but it's hard to be certain.
Good luck. Migraines are bad, bad, bad and I hope you can get rid of them.
*Edit: By the way, have you been to an optometrist lately? It might be that you are suffering from low-level eye strain and are in need of glasses.
The sentence after the Mere Exposure Effect is introduced does not quite parse. Might want to double check it.
A possible distinction between status and dominance: You are everybody's favourite sidekick. You don't dominate or control the group, nor do you want to, nor do you even voice any opinions about what the group should do. You find the idea of telling other people what to do to be unpleasant, and avoid doing so whenever you can. You would much rather be assigned complex tasks and then follow them through with diligence and pride. Everyone wants you in the group, they genuinely value your contribution, they care about your satisfaction with the project, and want you to be happy and well compensated.
By no means would I consider this role dominant, at least not in terms of controlling other people. (You might indeed be the decisive factor in the success of the group, or the least replaceable member). But it is certainly a high-status role; you are not deferred to but you are respected, and you are not treated as a replaceable cog. The president or boss knows your name, knows your family, and calls you first when something needs to be done.
I think many people aspire to this position and prefer it over a position of dominance.
A low-status person on this scale would be somebody ignored, disrespected, or treated as replaceable and irrelevant. You are unworthy of attention. When it is convenient others pretend you don't exist, and your needs, desires, and goals are ignored.
I think almost everyone desires high status by this measure. It is very different from dominance.
http://www.yudkowsky.net/rational/technical/
Thanks, that is indeed a better scoring rule. I did check first to see if the squared error was proper, and it was (since these are binary predictions), but the log rule makes much more sense. I will update the above later when I get home.