Previously: Raising the waterline, see also: 1001 PredictionBook Nights (LW copy), Techniques for probability estimates
Low waterlines imply that it's relatively easy for a novice to outperform the competition. (In poker, as discussed in Nate Silver's book, the "fish" are those who can't master basic techniques such as folding when they have a poor hand, or calculating even roughly the expected value of a pot.) Does this apply to the domain of making predictions? It's early days, but it looks as if a smallish set of tools - a conscious status quo bias, respecting probability axioms when considering alternatives, considering reference classes, leaving yourself a line of retreat, detaching from sunk costs, and a few more - can at least place you in a good position.
A bit of backstory
Like perhaps many LessWrongers, my first encounter with the notion of calibrated confidence was "A Technical Explanation of Technical Explanation". My first serious stab at publicly expressing my own beliefs as quantified probabilities was the Amanda Knox case - an eye-opener, waking me up to how everyday opinions could correspond to degrees of certainty, and how these had consequences. By the following year, I was trying to improve my calibration for work-related purposes, and playing with various Web sites, like PredictionBook or Guessum (now defunct).
Then the Good Judgment Project was announced on Less Wrong. Like several of us, I applied, unexpectedly got in, and started taking forecasting more seriously. (I tend to apply myself somewhat better to learning when there is a competitive element - not an attitude I'm particularly proud of, but being aware of that is useful.)
The GJP is both a contest and an experimental study, in fact a group of related studies: several distinct groups of researchers (1,2,3,4) are being funded by IARPA to each run their own experimental program. Within each, a small or large number of participants have been recruited, allocated to different experimental conditions, and encouraged to compete with each other (or even, as far as I know, for some experimental conditions, collaborate with each other). The goal is to make predictions about "world events" - and if possible to get them more right, collectively, than we would individually.1
Tool 1: Favor the status quo
The first hint I got that my approach to forecasting needed more explicit thinking tools was a blog post by Paul Hewitt I came across late in the first season. My scores in that period (summer 2011 to spring 2012) had been decent but not fantastic; I ended up 5th on my team, which itself placed quite modestly in the contest.
Hewitt pointed out that in general, you could do better than most other forecasters by favoring the status quo outcome.2 This may not quite be on the same order of effectiveness as the poker advice to "err on the side of folding mediocre hands more often", but it makes a lot of sense, at least for the Good Judgment Project (and possibly for many of the questions we might worry about). Many of the GJP questions refer to possibilities that loom large in the media at a given time, that are highly available - in the sense of the availability heuristic. This results in a tendency to favor forecasts of change from status quo.
For instance, one of the Season 1 questions was "Will Marine LePen cease to be a candidate for President of France before 10 April 2012?" (also on PredictionBook). Just because the question is being asked doesn't mean that you should assign "yes" and "no" equal probabilities of 50%, or even close to 50%, any more than you should assign 50% to the proposition "I will win the lottery".
Rather, you might start from a relatively low prior probability that anyone who undertakes something as significant as a bid for national presidency would throw in the towel before the contest even starts. Then, try to find evidence that positively favors a change. In this particular case, there was such evidence - the National Front, of which she was the candidate, consistently reports difficulties rounding up the endorsements required to register a candidate legally. However, only once in the past (1981) had this resulted in their candidate being barred (admittedly a very small sample). It would have been a mistake to weigh that evidence excessively. (I got a good score on that question, compared to the team, but definitely owing to a "home ground advantage" as a French citizen rather than my superior forecasting skills.)
Tool 2: Flip the question around
The next technique I try to apply consistently is respecting the axioms of probability. If the probability of event A is 70%, then the probability of not-A is 30%.
This may strike everyone as obvious... it's not. In Season 2, several of my team-mates are on record as assigning a 75% probability to the proposition "The number of registered Syrian conflict refugees reported by the UNHCR will exceed 250,000 at any point before 1 April 2013".
That number was reached today, six months in advance of the deadline. This was clear as early as August. The trend in the past few months has been an increase of 1000 to 2000 a day, and the UNHCR have recently provided estimates that this number will eventually reach 700,000. The kicker is that this number is only the count of people who are fully processed by the UNHCR administration and officially in their database; there are tens of thousands more in the camps who only have "appointments to be registered".
I've been finding it hard to understand why my team-mates haven't been updating to, maybe not 100%, but at least 99%; and how one wouldn't see these as the only answers worth considering. At any point in the past few weeks, to state your probability as 85% or 91% (as some have quite recently) was to say, "There is still a one in ten chance that the Syrian conflict will suddenly stop and all these people will go home, maybe next week?"
This is kind of like saying "There is a one in ten chance Santa Claus will be the one distributing the presents this year." It feels like a huge "clack".
I can only speculate as to what's going on there. Queried for a probability, people are translating something like "Sure, A is happening" into a biggish number, and reporting that. They are totally failing to flip the question around and explicitly consider what it would take for not-A to happen. (Perhaps, too, people have been so strongly cautioned, by Tetlock and others, against being overconfident that they reflexively shy away from the extreme numbers.)
Just because you're expressing beliefs as percentages doesn't mean that you are automatically applying the axioms of probability. Just because you use "75%" as a shorthand for "I'm pretty sure" doesn't mean you are thinking probabilistically; you must train the skill of seeing that for some events, their complement "25%" also counts as "I'm pretty sure". The axioms are more important than the use of numbers - in fact for this sort of forecast "91%" strikes me as needlessly precise; increments of 5% are more than enough, away from the extremes.
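One way to practice flipping the question is to turn every stated probability into the claim it implies about not-A. The following is only a minimal sketch; the helper name `implied_complement` is made up for illustration and is not part of any GJP tooling.

```python
def implied_complement(p_yes):
    """Return (probability of not-A, the N in 'one chance in N' for not-A)."""
    p_no = 1.0 - p_yes
    return p_no, 1.0 / p_no


# The probabilities mentioned above, plus the 99% I would have expected.
for p in (0.75, 0.85, 0.91, 0.99):
    p_no, one_in = implied_complement(p)
    print(f"{p:.0%} on A means {p_no:.0%} on not-A, about one chance in {one_in:.0f}")
```

Reading the second half of each printed line aloud is the whole exercise: if "one chance in four that the refugee count stops short" sounds absurd, then so is 75%.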
Tool 3: Reference class forecasting
The order in which I'm discussing these "basics of forecasting" reflects not so much their importance, as the order in which I tend to run through them when encountering a new question. (This might not be the optimal order, or even very good - but that should matter little if the waterline is indeed low.)
Using reference classes was actually part of the "training package" of the GJP. From the linked post comes the warning that "deciding what's the proper reference class is not straightforward". And in fact, this tool only applies in some cases, not systematically. One of our recently closed questions was "Will any government force gain control of the Somali town of Kismayo before 1 November 2012?". Clearly, you could spend quite a while trying to figure out an appropriate reference class here. (In fact, this question also stands as a counter-example to the "Favor status quo" tool, and flipping the question around might not have been too useful either. All these tools require some discrimination.)
On the other hand, it came in rather handy in assessing the short-term question we got late September: "What change will occur in the FAO Food Price index during September 2012?" - with barely two weeks to go before the FAO was to post the updated index in early October. More generally, it's a useful tool when you're asked to make predictions regarding a numerical indicator, for which you can observe past data.
The FAO price data can be retrieved as a spreadsheet (.xls download). Our forecast question divided the outcomes into five: A) an increase of 3% or more, B) an increase of less than 3%, C) a decrease of less than 3%, D) a decrease of more than 3%, E) "no change" - meaning a change too small to alter the value rounded to the nearest integer.
It's not clear from the chart that there is any consistent seasonal variation. A change of 3% would have been about 6.4 points; since 8/2011 there had been four month-on-month changes of that magnitude, 3 decreases and 1 increase. Based on that reference class, the probability of a small change (B+C+E) came out to about 2/3. The probability for "no change" (E) came out to 1/12 - the August price was the same as the July price. The probability for an increase (A+B) was roughly the same as for a decrease (C+D). My first-cut forecast allocated the probability mass as follows: 15/30/30/15/10.
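The arithmetic behind that first cut can be sketched in a few lines. The counts come from the paragraph above; the window length of 12 monthly changes is an assumption, inferred from the 1/12 figure, and the rounding to 15/30/30/15/10 was done by hand rather than by any formula.

```python
from fractions import Fraction

# Counts read off the FAO series as described in the text.
# ASSUMPTION: the reference class is 12 month-on-month changes.
n_changes = 12
large_changes = 4   # 3 decreases + 1 increase of roughly 3% or more
unchanged = 1       # August equal to July

p_large = Fraction(large_changes, n_changes)    # outcomes A + D
p_small = 1 - p_large                           # outcomes B + C + E
p_no_change = Fraction(unchanged, n_changes)    # outcome E

# Hand-rounded first-cut forecast, in percentage points.
first_cut = {"A": 15, "B": 30, "C": 30, "D": 15, "E": 10}

print("P(large change) =", p_large)
print("P(small change) =", p_small)
print("P(no change)    =", p_no_change)
print("first cut sums to", sum(first_cut.values()))
print("small-change mass in first cut:", first_cut["B"] + first_cut["C"] + first_cut["E"])
```

Note that the first cut treats increases and decreases symmetrically (45 points each), even though three of the four large changes in the window were decreases; with so few observations I didn't trust that asymmetry.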
However, I figured I did need to apply a correction, based on reports of a drought in the US that could lead to some food shortages. I took 10% probability mass from the "decrease" outcomes and allocated it to the "increase" outcomes. My final forecast was 20/35/25/10/10. I didn't mess around with it any more than that. As it turned out, the actual outcome was B! My score was bettered by only 3 forecasters, out of a total of 9.
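To see what the drought correction was worth, here is a rough sketch that applies the shift and scores both forecasts against outcome B. Two things here are assumptions: the pairing of which "decrease" outcome fed which "increase" outcome (only the totals are known from the figures above), and the use of the plain multi-category Brier score, which may not be exactly how the GJP scored this question.

```python
first_cut = {"A": 15, "B": 30, "C": 30, "D": 15, "E": 10}


def shift_mass(forecast, moves):
    """Return a copy of forecast with (source, target, points) moves applied."""
    adjusted = dict(forecast)
    for source, target, points in moves:
        adjusted[source] -= points
        adjusted[target] += points
    return adjusted


def brier(forecast, outcome):
    """Multi-category Brier score: 0 is perfect, 2 is the worst possible."""
    return sum(
        (points / 100 - (1.0 if label == outcome else 0.0)) ** 2
        for label, points in forecast.items()
    )


# ASSUMPTION: 5 points moved D -> A and 5 points moved C -> B.
final = shift_mass(first_cut, [("D", "A", 5), ("C", "B", 5)])

print("final forecast:", final)
print("Brier, first cut:", round(brier(first_cut, "B"), 4))
print("Brier, final:    ", round(brier(final, "B"), 4))
```

Under these assumptions the correction lowers (improves) the score on this one question, which of course says nothing about whether such corrections help on average.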
Next up: lines of retreat, ditching sunk costs, loss functions
This post has grown long enough, and I still have 3+ tools I want to cover. Stay tuned for Part 2!
1 The GJP is being run by Phil Tetlock, known for his "hedgehog and fox" analysis of forecasting. At that time I wasn't aware of the competing groups - one of them, DAGGRE, is run by Robin Hanson (of OB fame) among others, which might have made it an appealing alternate choice if I'd known about it.
2 Unfortunately, the experimental condition Paul belonged to used a prediction market where forecasters played virtual money by "betting" on predictions; this makes it hard to translate the numbers he provides into probabilities. The general point is still interesting.
You're getting this from the "refinement" part of the calibration/refinement decomposition of the Brier score. Over time, your score will end up much higher than others' if you have better refinement (e.g. from "inside information", or from a superior methodology), even if everyone is identically (perfectly) calibrated.
This is the difference between a weather forecast derived from looking at a climate model, e.g. I assign 68% probability to the proposition that the temperature today in your city is within one standard deviation of its average October temperature, and one derived from looking out the window.
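A small sketch may make the calibration/refinement point concrete. It compares two perfectly calibrated forecasters on the same events: a "climate" forecaster who always states the 68% base rate, and a "window" forecaster who states 95% or 5%. The 95%/5% figures and the 70/30 split of days are hypothetical, chosen only so that both forecasters agree on the overall 68% base rate. The score is the binary Brier score taken as a penalty, so lower is better here.

```python
def expected_brier(forecasts):
    """Expected binary Brier score of a perfectly calibrated forecaster.

    forecasts is a list of (weight, p) pairs: on a fraction `weight` of days
    the forecaster states probability p, and the event then happens with
    probability p (that is what perfect calibration means).
    """
    return sum(w * (p * (1 - p) ** 2 + (1 - p) * p ** 2) for w, p in forecasts)


def base_rate(forecasts):
    """Overall frequency of the event implied by a calibrated forecaster."""
    return sum(w * p for w, p in forecasts)


climate = [(1.0, 0.68)]               # always states the base rate
window = [(0.7, 0.95), (0.3, 0.05)]   # HYPOTHETICAL sharper forecasts

print("base rates:", round(base_rate(climate), 4), round(base_rate(window), 4))
print("expected Brier, climate:", round(expected_brier(climate), 4))
print("expected Brier, window: ", round(expected_brier(window), 4))
```

Both forecasters are calibrated and agree on how often the event happens, yet the second one carries a far smaller expected penalty. That gap is the refinement term.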
ETA: what you say about my using an assumption is not correct - I've only been making the forecast well-specified, such that the way you said you allocated your probability mass would give us a proper loss function, and simplifying the calculation by using a uniform distribution for the rest of your 90%. You can compute the loss function for any allocation of probability among outcomes that you care to name - the math might become more complicated, is all. I'm not making any assumptions as to the probability distribution of the actual events. The math doesn't, either. It's quite general.
I can still make 100000 lottery predictions, and get a good score. I look for a system which you cannot trick in that way. Ok, for each prediction, you can subtract the average score from your score. That should work. Assuming that all other predictions are rational, too, you get an expectation of 0 difference in the lottery predictions.
I think "impact here (10% confidence), no impact at that place (90% confidence)" is quite specific. It is a binary event.