TL;DR: Prediction & calibration parties are an exciting way for your EA/rationality/LessWrong group to practice rationality skills and celebrate the new year.

On December 30th, Seattle Rationality had a prediction party. Around 15 people showed up, brought snacks, brewed coffee, and spent several hours making predictions for 2017, and generating confidence levels for those predictions.

This was heavily inspired by Scott Alexander’s yearly predictions. (2014 results, 2015 results, 2016 predictions.) Our move was to turn this into a communal activity, with a few alterations to meet our needs and make it work better in a group.

Procedure:

  • Each person individually writes a bunch of predictions for the upcoming year. They can be about global events, people’s personal lives, etc.
    • If you use Scott Alexander’s system, create 5+ predictions each for fixed confidence levels (50%, 60%, 70%, 80%, 90%, 95%, etc.)
    • If you want to generate Brier scores or logarithmic scores, just make 30+ predictions at whatever confidence levels you believe (see the scoring sketch after this list).
  • Write down confidence levels for each prediction.
  • Save your predictions and put them aside for 12 months.
  • Open up your predictions and see how everyone did.
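
If you go the scoring route, here is a minimal sketch of how you might record and score predictions at the end of the year (Python; the data layout and function names are illustrative assumptions, not part of the procedure itself):

```python
# Minimal sketch: score a list of resolved predictions (all names are illustrative).
import math

# Each entry: (statement, probability assigned to "it happens", did it happen?)
predictions = [
    ("I will switch jobs.", 0.30, False),
    ("I will read Thinking, Fast and Slow.", 0.80, True),
    ("Gay marriage will continue to be legal in Washington state.", 0.95, True),
]

def brier_score(preds):
    """Mean squared difference between stated probability and outcome (0 is perfect; lower is better)."""
    return sum((p - (1.0 if happened else 0.0)) ** 2 for _, p, happened in preds) / len(preds)

def log_score(preds):
    """Mean log of the probability you gave to what actually happened (closer to 0 is better)."""
    return sum(math.log(p if happened else 1.0 - p) for _, p, happened in preds) / len(preds)

print("Brier score:", round(brier_score(predictions), 3))
print("Log score:", round(log_score(predictions), 3))
```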

To make this work in a group, we recommend the following:

  • Don’t share your confidence levels. Avoid anchoring by simply not naming how likely or unlikely you think any prediction is.
  • Do share predictions. Generating 30+ predictions is difficult, and sharing ideas (without confidence levels) makes it way easier to come up with a bunch. We made a shared google doc, and everyone pasted some of their predictions into it.
  • Make predictions that, in a year, will verifiably have happened or not (i.e., not “the academic year will go well”, which is debatable, but “I will finish the year with a 3.5 GPA or above”).
  • It’s convenient to assume that, unless stated otherwise, predictions end by the next new year (i.e., “I will go to the Bay Area” means “I will go to the Bay Area at least once in 2017”). It’s also fine to make predictions with other end dates (“I will go to EA Global this summer”).
  • Make a bunch of predictions first without thinking too hard about how likely they are, then assign confidence levels. This post details why. You could also generate a group list of predictions and have everyone individually assign their own confidence levels.


This makes a good activity for rationality/EA groups for the following reasons:

  • Practicing rationality skills:
    • Making accurate predictions
    • Assigning confidence levels
  • Accessibility
    • It’s open to many different knowledge levels. Even if you don’t know a thing about geopolitics, you can still give predictions and confidence levels about media, sports, or your own life.
    • More free-form and less intimidating than using a prediction market. You do not have to know about the details of forecasting to try this.
  • Natural time and recurring activity
    • You could do this at any point during the year, but doing it at the start of the year seems appropriate for ringing in the new year.
    • In twelve months, you have an automatic new activity, which is coming back together and checking everybody’s predictions from last year. Then you make a new set of predictions for next year. (If this falls through for some reason, everyone can, of course, still check their predictions on their own.)
  • Fostering a friendly sense of competitiveness
    • Everyone wants to have the best calibration, or the lowest Brier score. Everyone wants to have the most accurate predictions!


Some examples of the predictions people used:

  • Any open challenges from the Good Judgment Project.
  • I will switch jobs.
  • I will make more than $1000 in a way other than my primary job or stocks.
  • I will exercise 3 or more times per week in October, November, and December.
  • I’ll get another tattoo.
  • Gay marriage will continue to be legal in Washington state.
  • Gay marriage will continue to be legal in all 50 states.
  • I will try Focusing at least once.
  • I will go to another continent.
  • CRISPR clinical trials will happen on humans in the US.
  • A country that didn’t previously have nuclear weapons will acquire them.
  • I will read Thinking, Fast and Slow.
  • I will go on at least 3 dates.

Also relevant:

  • 16 types of useful predictions
  • Brier scores and calibration graphs of ‘perfect’ vs. actual accuracy give you different information; Yvain writes about the differences between them. Several of us made predictions last year using the Scott Alexander method (bins at fixed probabilities), although this year everybody seems to have used continuous probabilities. Exactly how we’ll determine how well-calibrated we were is left to Seattle Rationality of 2018, but it will probably include Brier scores AND something to measure calibration. (A small sketch of both appears below.)
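
As a rough illustration of the “bins at fixed probabilities” approach, here is one way you might check calibration at the end of the year, alongside the Brier-style scoring sketched earlier (the bin handling and variable names are my own illustrative assumptions, not a fixed method):

```python
# Sketch: compare stated confidence with observed frequency in each bin (illustrative only).
from collections import defaultdict

# (probability assigned to "yes", did "yes" happen?)
resolved = [(0.6, True), (0.6, False), (0.7, True), (0.8, True),
            (0.9, True), (0.9, True), (0.95, True), (0.95, False)]

bins = defaultdict(list)
for p, happened in resolved:
    # Fold everything onto the >= 50% side so bins are comparable.
    confidence, correct = (p, happened) if p >= 0.5 else (1 - p, not happened)
    bins[round(confidence, 2)].append(correct)

for confidence in sorted(bins):
    outcomes = bins[confidence]
    observed = sum(outcomes) / len(outcomes)
    print(f"Said {confidence:.0%}, got {observed:.0%} right ({len(outcomes)} predictions)")
```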
Comments:

"If you use Scott Alexander’s system, create 5+ predictions each for fixed confidence levels (50%, 60%, 70%, 80%, 90%, 95%, etc.)"

Why? It seems to me more productive to try to find confidence levels for predictions than to find predictions for confidence levels.

Additionally, the book Superforecasting suggests that superforecasters are people who make more granular predictions than 50%, 60%, 70%, 80%, 90%, 95%, etc.

You can do calibration and accuracy. You can start with predictions of arbitrary granularity and then force them into whatever boxes you want.

For calibration, it isn't very useful to score events at 50%. Instead of making boxes of 50, 60, 70, 80, 90, 95, 99%, do something like 55, 70, 80, 90, 95, 99%. Taking an event that you "really" think is 50/50 and forcing yourself to choose a side to make it 45/55 is no worse than taking an event that you think is 45/55 and forcing it to be either 50 or 60%.

Also, the jump from 95 to 99 is pretty big. Better to add an intermediate category of 97 or 98. Or just replace 99 with 98.


I think 60, 80, 90, 95, 98 would be a good set of bins for beginners.

50% predictions can be useful if you are systematic about which option you count as "yes". e.g., "I estimate a 50% chance that I will finish writing my book this year" is a meaningful prediction. If I am subject to standard biases, then we would expect this to have less than a 50% chance of happening, so the outcomes of predictions like this provide a meaningful test of my prediction ability.

2 conventions you could use for 50% predictions: 1) pose the question such that "yes" means an event happened and "no" is the default, or 2) pose the question such that "yes" is your preferred outcome and "no" is the less desirable outcome.

Actually, it is probably better to pick one of these conventions and use it for all predictions (so you'd use the whole range from 0-100, rather than just the top half of 50-100). "70% chance I will finish my book" is meaningfully different than "70% chance I will not finish my book"; we are throwing away information about possible miscalibration by treating them both merely as 70% predictions.

Even better, you could pose the question however you like and also note when you make your prediction 1) which outcome (if either) is an event rather than the default and 2) which outcome (if either) you prefer. Then at the end of the year you could look at 3 graphs, one which looks at whether the outcome that you considered more likely occurred, one that looks at whether the (non-default) event occurred, and one which looks at whether your preferred outcome occurred.
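
One way you might keep track of those annotations (a sketch only; the field layout and example outcomes are hypothetical, though the three-way split is the one described above):

```python
# Sketch: annotate each prediction so you can split results three ways later (illustrative only).
records = [
    # statement, p(yes), yes_is_event (True/False/None), yes_is_preferred (True/False/None), outcome
    ("I will finish writing my book this year", 0.50, True, True, False),
    ("I don't attend any weddings this year", 0.70, False, False, True),
]

likely_side_right = [(o if p >= 0.5 else not o) for _, p, _, _, o in records]
event_happened = [(o if e else not o) for _, _, e, _, o in records if e is not None]
preferred_outcome = [(o if f else not o) for _, _, _, f, o in records if f is not None]

print(sum(likely_side_right), "of", len(likely_side_right), "came out on the side called more likely")
print(sum(event_happened), "of", len(event_happened), "non-default events happened")
print(sum(preferred_outcome), "of", len(preferred_outcome), "came out the preferred way")
```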

Sorry, I misread your comment originally. You were careful to say that you were talking about 3 different biases, while most people say that there is a right way to orient each question.

But you weren't careful to say that calibration — the measure of over- and under-confidence — is different from bias. There are four questions here. Introducing new questions that make sense at 50% is irrelevant to the fact that calibration doesn't make sense at 50%. If we are just doing calibration, some of our tests are wasted. If we add a test of a bias, that part of the calibration test is still wasted. If we force the bin away from 50%, then that improves the calibration test. Moreover, I don't think that it harms the test of bias.

Ideally, we would look at everything, but is it worth the effort? If we start with one thing, what is most important? I think that overconfidence is the biggest problem and one should start there. In some sense the annotations you suggest are not much more work, but in making the difference between doing and not doing, I think small increments matter.

(While most people are overconfident and calibration exercises are mainly about reducing overconfidence, the problem of 50% is actually a problem of underconfidence.)

Most questions don't have a preferred direction. Look at Scott's predictions. Which direction should you point each one?

Most people don't make enough predictions to get a statistically significant difference between the two sides of the scale. And even if they do, their bias to the extremes ("overconfidence") swamps the effect.

Just looking at the 50% questions, here is how I would score 1) if either direction is an event rather than the default and 2) if either direction is probably preferred by Scott:

  • US unemployment to be lower at end of year than beginning: 50% (neither direction is an event; Yes is preferred)
  • SpaceX successfully launches a reused rocket: 50% (Yes is an event; Yes is preferred)
  • California’s drought not officially declared over: 50% (No is an event; No is preferred)
  • At least one SSC post > 100,000 hits: 50% (Yes is an event; Yes is preferred)
  • UNSONG will get > 1,000,000 hits: 50% (Yes is an event; Yes is preferred)
  • UNSONG will not miss any updates: 50% (No is an event; Yes is preferred)
  • I will be involved in at least one published/accepted-to-publish research paper by the end of 2016: 50% (Yes is an event; Yes is preferred)
  • [Over] 10,000 Twitter followers by end of this year: 50% (Yes is an event; Yes is preferred)
  • I will not get any new girlfriends: 50% (No is an event; perhaps No is preferred)
  • I will score 95th percentile or above in next year’s PRITE: 50% (Yes is an event; Yes is preferred)
  • I will not have any inpatient rotations: 50% (No is an event; perhaps Yes is preferred)
  • I get at least one article published on a major site like Huffington Post or Vox or New Statesman or something: 50% (Yes is an event; Yes is preferred)
  • I don’t attend any weddings this year: 50% (No is an event; perhaps No is preferred)

Scott would know better than I do, and he also could have marked a subset that he actually cared about.

Including the "perhaps"es, I count that 7/12 happened in the preferred direction, and 5/11 of the events happened. With this small sample there's no sign of optimism bias, and he's also well-calibrated on whether a non-default event will happen. Obviously you'd want to do this with the full set of questions and not just the 50% ones to get a more meaningful sample size.

Regarding "US unemployment to be lower at end of year than beginning: 50%" being scored as "neither direction is an event": well, to be pedantic, if US unemployment were exactly the same at the end of the year as at the beginning, the prediction as Scott worded it would be false, so it could be argued that Yes is an event. (But the same would apply if he had written "higher" instead of "lower".)

I would imagine that at the 50% level, you can put down a prediction in the positive or negative phrasing, and since it'll be fixed at the beginning of the year (i.e., you won't be rephrasing it six months in), you should expect 50% of them to end up happening either way. Right?

(50% predictions are meaningless for calculating Brier scores, but seem valuable for general calibration levels. I suppose forcing them to 45/55% so that you can incorporate them in Brier scores / etc isn't a bad idea. I'm not much of a statistician. Is that what you were saying, Douglas_Knight?)

The 99%/97% thing is true in the sense that you're jumping from a 1% chance of being wrong to a 3% chance, three times as high, but it seems practically less necessary in that A) if you're making fewer than 30 predictions at that interval, you shouldn't expect any of them to be wrong, and B) I have a hard time mentally distinguishing 97% and 99% chances, and would expect other people to be similarly bad at it (unless they practiced or did some rigorous evaluation of the evidence). I'm not sure how much credence I should lend to this.
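
For concreteness (my own arithmetic, not part of the comment above): the expected number of misses among N predictions made at confidence p is N(1-p), so with 30 predictions per bin:

\[
30 \times (1 - 0.99) = 0.3, \qquad 30 \times (1 - 0.97) = 0.9, \qquad 30 \times (1 - 0.95) = 1.5 .
\]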

You seem to mix up calibration and Brier scores.

Your first paragraph is correct. That is calibration. That is why 50/50 items are not useful for calibration. If you get less than 90% of your 90% items correct, you are a normal overconfident person. If your 50/50 items are not 50% correct, something odd is going on, like you are abnormally biased by the way questions are phrased.

Brier scores allow any input. 50% is a useful prediction for Brier scores. If you say that the French incumbent has a 50% chance of winning the election, that doesn't affect your calibration, but it is bad for your Brier score.
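
To put numbers on that (my own worked example using the standard Brier formula): a 50% prediction scores 0.25 whichever way the election goes, while a 70% prediction scores 0.09 if it comes out the way you said and 0.49 if it doesn't.

\[
(0.5 - 1)^2 = (0.5 - 0)^2 = 0.25, \qquad (0.7 - 1)^2 = 0.09, \qquad (0.7 - 0)^2 = 0.49 .
\]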

Yes, I see - it seems like there are two ways to do this exercise.

1) Everybody writes their own predictions and arranges them into probability bins (either artificially after coming up with them, or just writing 5 at 60%, 5 at 70%, etc.) You then check your calibration with a graph like Scott Alexander's.

2) Everybody writes their estimations for the same set of predictions - maybe you generate 50 as a group, and everyone writes down their most likely outcome and how confident they are in it. You then check your Brier score.

Both of these seem useful for different things - in 2), it's a sort of raw measure of how good at making accurate guesses you are. Lower confidence levels make your score worse. In 1), you're looking at calibration across probabilities - there are always going to be things you're only 50% or 70% sure about, and making those intervals reflect reality is as important as things you're 95% certain on.

I will edit the original post (in a bit) to reflect this.

Right, the two measures are calibration and accuracy. But calibration is part of accuracy.

"Lower confidence levels make your score worse"

Only if you guessed right. If you guessed wrong, lower confidence makes your score better. Under a "proper" scoring rule like Brier, you get the best possible score by honestly describing your uncertainty. Thus calibration — whether your 70% really happens 70% of the time — is a component of Brier score. If you improve your calibration, your Brier score will improve.
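
A quick sketch of why that holds for the Brier score (standard algebra, not from the comment itself): if the true chance is p and you report q, your expected penalty is minimized exactly when you report q = p.

\[
\mathbb{E}[\text{Brier}] = p\,(1-q)^2 + (1-p)\,q^2, \qquad
\frac{d}{dq}\,\mathbb{E}[\text{Brier}] = 2(q - p) = 0 \iff q = p .
\]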

I think one should work on calibration before working on accuracy. It's mainly about knowing what 70% really means. Also, you can judge calibration on any set of questions, so you can tell that you are improving, while it is hard to compare Brier scores across questions; all you can do is compete with other people (or algorithms). Some questions are harder than others, and that means that you must get worse Brier scores on them. But that doesn't mean that you will not be calibrated on hard questions; it just means that you should be less confident.

"For calibration, it isn't very useful to score events at 50%"

Why?

People answering a diverse set of questions are always perfectly calibrated at 50%. But in your case, where you answer the same question over and over (will it pass the test?), you might be systematically overconfident, so it is useful.

If you're having a communal event anyway, maybe it makes sense to watch a video together (movie, TED talk, debate, whatever), pause it every few minutes and make predictions about what happens in the rest of the video. Focus on short-term predictions like "in the next two minutes..." so you get even quicker results.

You get a tight feedback loop and several rounds of it. That allows you to actually try different strategies and see how they fare.

I imagine it would also be quite fun.