I don't think I had seen that, and wow, it definitely covers basically all of what I was thinking about trying to say in this post, and a bit more.
I do think there is something useful to say about how reference class combinations work, and about using causal models versus correlational ones for model combination given heterogeneous data - but that will require formulating it more clearly than I have it in my head right now. (I'm working on two different projects where I'm getting it straighter in my head, which led to this post as a quick explanation of why people need to stop using "reference classes" that don't work well just because they can't find a better one, as if "reference class" were an argument for the correctness of a prediction.)
Nice examples and discussion.
Quick points:
I like the idea of "The the fallacy". Whenever a phrase takes the form "The X", it presupposes that there is exactly one X, and that's typically not true.
In this case, the ideas of "the reference class" or "the outside view" are dramatic simplifications. These seem like weak heuristics that are often valuable, but are difficult to translate into a construct that you can apply intense reasoning to. There's no great formal definition yet, and until there is, trying to do careful analysis seems challenging to me.
I agree that "considering multiple models" is generally best, where possible. It's hard to argue against this though.
Another in-the-field example of differing reference class intuitions here, on the Metaculus question:
Will Ghislaine Maxwell be alive on 1 January 2021?
The other commentator started with a prior from actuarial tables on death rates of 58-year-old women in the USA, and argued that going from a base rate of 0.3% to 10% means a 33.3x increase in log-odds, which is implausibly large given the available evidence.
I thought actuarial tables were not a plausibly good base rate to start from, since most of the Ghislaine Maxwell-relevant risk does not come from the possibility of natural death.
Hopefully the discussion there is helpful for some LessWrong readers in understanding how different forecasters' intuitions clash "in practice."
Isn't this just messing up Bayes' rule on their part? AFAIU, a multiplicative increase in the log-odds is not particularly meaningful. [Edit: I'm currently interpreting the downvote as either my explanation not being sufficiently charitable or me being wrong about the multiplicative increase in log-odds. Would be down to hear more about the mistake I'm making.]
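For concreteness, a minimal sketch of the arithmetic, using the 0.3% and 10% figures quoted above: the invariant measure of an update is the additive shift in log-odds (equivalently, the Bayes factor), not a multiplicative change in log-odds.

```python
import math

def log_odds(p):
    """Natural-log odds of a probability."""
    return math.log(p / (1 - p))

p_base, p_new = 0.003, 0.10   # 0.3% actuarial base rate vs. a 10% forecast

# The ratio of probabilities is what gives the "33.3x" figure.
print(p_new / p_base)                          # ~33.3

# The Bayes factor is the ratio of odds, i.e. the additive shift in log-odds.
print(math.exp(log_odds(p_new) - log_odds(p_base)))   # ~36.9
print(log_odds(p_new) - log_odds(p_base))              # ~3.6 nats (~5.2 bits) of evidence

# A multiplicative change in log-odds depends on where zero sits (p = 0.5),
# so it is not an invariant measure of evidence strength.
print(log_odds(p_new) / log_odds(p_base))              # ~0.38, not a meaningful quantity
```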
Some scattered thoughts:
1. I think it's very good to consider many different outside views for a problem. This is why I considered section 2.1 of Yudkowsky's Intelligence Explosion Microeconomics to be frustrating/a weak man, because I think it's plausibly much better to ensemble a bunch of weak outside views than to use a single brittle outside view.
"Beware the man of one reference class" as they say.
2. One interesting (obvious?) note on base rates that I haven't seen anybody else point out: across time, you can think of "base rate forecasting" as just taking the zeroth derivative (while trend extrapolation via linear regression takes the first derivative, etc.) - see the tiny sketch at the end of this comment.
3.
So which reference class is correct? In my (inside) view as a superforecaster, this is where we turn to a different superforecasting trick, about considering multiple models. As the saying goes, hedgehogs know one reference class, but foxes consult many hedgehogs.
I think while consulting many models is a good reminder, the hard part is choosing which model(s) to use in the end. I think your ensemble of models can often do much better than an unweighted average of all the models you've considered, since some models are a) much less applicable, b) much more brittle, c) much less intuitively plausible, or d) much too strongly correlated with other models you have.
As you've illustrated in some examples above, sometimes the final ensemble is composed of practically only one model!
4. I suspect starting with good meta-priors (in this case, good examples of reference classes to start investigating) is a substantial fraction of the battle. Often, you can have good priors even when things are very confusing.
5. One thing I'm interested in is how "complex" you expect a reasonably good forecast to be. How many factors go into the final forecast, how complex the interactions between the parameters are, etc. I suspect final forecasts that are "good enough" are often shockingly simple, and the hard part of a forecast is building/extracting a "correct enough" simplified model of reality and getting a small amount of the appropriate data that you actually need.
Once an experienced analyst has the minimum information necessary to make an informed judgment, obtaining additional information generally does not improve the accuracy of his or her estimates. Additional information does, however, lead the analyst to become more confident in the judgment, to the point of overconfidence.
Experienced analysts have an imperfect understanding of what information they actually use in making judgments. They are unaware of the extent to which their judgments are determined by a few dominant factors, rather than by the systematic integration of all available information. Analysts actually use much less of the available information than they think they do.
There is strong experimental evidence, however, that such self-insight is usually faulty. The expert perceives his or her own judgmental process, including the number of different kinds of information taken into account, as being considerably more complex than is in fact the case. Experts overestimate the importance of factors that have only a minor impact on their judgment and underestimate the extent to which their decisions are based on a few major variables. In short, people's mental models are simpler than they think, and the analyst is typically unaware not only of which variables should have the greatest influence, but also which variables actually are having the greatest influence.
From Psychology of Intelligence Analysis, as summarized in the forecasting newsletter (emphasis mine).
If this theory is correct, or broadly correct, it'd point to human judgmental forecasting being dramatically different from dominant paradigms in statistical machine learning, where more data and more parameters usually improve accuracy.
(I think there may be some interesting analogies with the lottery ticket hypothesis that I'd love to explore more at one point)
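(Returning to point 2: the tiny sketch mentioned above, with made-up yearly counts purely for illustration.)

```python
import numpy as np

# Hypothetical yearly counts of some event; the data are invented for illustration.
years  = np.arange(2010, 2020)
counts = np.array([3, 4, 2, 5, 4, 6, 5, 7, 6, 8])

# "Base rate" forecast: a degree-0 polynomial fit, i.e. just the historical mean.
base_rate = np.polyfit(years, counts, deg=0)   # [mean(counts)]
# Trend forecast: a degree-1 fit, i.e. extrapolating the first derivative too.
trend = np.polyfit(years, counts, deg=1)       # [slope, intercept]

next_year = 2020
print("base-rate forecast:", np.polyval(base_rate, next_year))  # = mean = 5.0
print("trend forecast:", np.polyval(trend, next_year))
```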
I suspect final forecasts that are "good enough" are often shockingly simple, and the hard part of a forecast is building/extracting a "correct enough" simplified model of reality and getting a small amount of the appropriate data that you actually need.
I think that it's often true that good forecasts can be simple, but I also think that the gulf between "good enough" and "very good" usually contains a perverse effect, where slightly more complexity makes the model perhaps slightly better in expectation, but far worse at properly estimating variance or accounting for uncertainties outside the model. That means that for the purpose of forecasting, you get much worse (in Brier-score terms) before you get better.
As a concrete example, this is seen when people forecast COVID deaths. They start with a simple linear trend, then say they don't really think it's linear, it's actually exponential, so they roughly adjust their confidence and have appropriate uncertainties around a bad model. Then they get fancier, and try using an SIR model that gives "the" answer, and the forecaster simulates 100 runs to create a distribution by varying R_0 within a reasonable range. That gives an uncertainty range, and a very narrow resulting distribution - which the forecaster is only narrowly willing to adjust, because their model accounts for the obvious sources of variance. Then schools are reopened, or treatment methods improve, or contact rates drop as people see case counts rise, and the model's assumptions are invalidated in a different way than was expected.
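To make that concrete, here is a minimal sketch of exactly that kind of exercise: a toy SIR model with R_0 drawn from a "reasonable range", and nothing else allowed to vary. All parameter values are made up for illustration; the point is only that the resulting spread looks deceptively tight.

```python
import numpy as np

def sir_attack_rate(r0, gamma=0.1, n=1_000_000, i0=100, days=365):
    """Toy discrete-time SIR model; returns the cumulative fraction ever infected."""
    beta = r0 * gamma
    s, i, r = n - i0, i0, 0
    for _ in range(days):
        new_inf = beta * s * i / n
        new_rec = gamma * i
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
    return (n - s) / n

rng = np.random.default_rng(0)
r0_samples = rng.uniform(2.0, 3.0, size=100)   # the only uncertainty the model admits
attack_rates = np.array([sir_attack_rate(r0) for r0 in r0_samples])

print(attack_rates.min(), attack_rates.max())  # a deceptively tight band
# Nothing in this distribution reflects school reopenings, behaviour change,
# or improved treatment: the assumption failures described above.
```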
I think while consulting many models is a good reminder, the hard part is choosing which model(s) to use in the end. I think your ensemble of models can often do much better than an unweighted average of all the models you've considered, since some models are a) much less applicable, b) much more brittle, c) much less intuitively plausible, or d) much too strongly correlated with other models you have.
As I said to Luke in a comment on his link to an excellent earlier post that discusses this, I think there is far more to be said about how to do model fusion, and I agreed with the point in his paper that ensembles which simply average models are better than single models, but still worse than actually figuring out what each model tells you.
I feel like this post made the point "You can come up with many plausible outside views for a given question".
But it didn't really give me what I wanted when I clicked the title: discussion of how to choose between outside views (whether those be concrete heuristics or philosophical arguments).
I'd be very curious to read some of your battle stories or case studies on this from your superforecasting years.
I tried to kind-of do this with the examples here. Unfortunately, I don't actually have a clear recollection of what I did when forecasting many questions, as it's been a long time since the original tournament. And for more recent questions, I often comment on Metaculus - but if longer analyses seem worthwhile, maybe that should be a post to accompany, say, my 2021 yearly predictions. (I'm naively expecting to have free time by then.)
But for most questions, the outside view is relatively easy. That doesn't mean it's the standard "use a reference class," since, as I said, models dictate which reference classes make sense. My favorite example of that is asset price forecasts, where I remember that some superforecasters were building their own models of asset prices and the probability of movement by a certain amount in a given time period from historical data, while I was just forecasting using the implicit price distribution given by options prices, and absolutely dominating the Brier scores for those questions. (Occasionally I made very slight modifications to reflect my inside view of surprises outside that model, like stock splits and dividends, where actually modelling them correctly was annoying and not worthwhile.)
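Mechanically, a minimal sketch of what that looks like, assuming a simple lognormal model parameterized by a single at-the-money implied volatility (real option chains give a full smile, and this ignores rates, dividends, and skew; all numbers are made up):

```python
import math
from statistics import NormalDist

def prob_above(spot, threshold, implied_vol, t_years):
    """P(price > threshold at the horizon) under a zero-drift lognormal model,
    using an (assumed) annualized implied volatility."""
    sigma = implied_vol * math.sqrt(t_years)
    z = (math.log(threshold / spot) + 0.5 * sigma**2) / sigma
    return 1 - NormalDist().cdf(z)

# e.g. "Will the stock be above $2,000 in 6 months?" with a $1,800 spot and 70% IV
print(prob_above(spot=1800, threshold=2000, implied_vol=0.70, t_years=0.5))
```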
For other questions, like forecasting life-spans of dictators, the answer is fundamentally hard, and I don't think reference classes are nearly as valuable. And for COVID, I've written about my very early expectations - but maybe you think that a follow-up on why superforecasters mostly disagreed with my forecasts / I modeled things differently than them over the past 3-4 months would be interesting and useful. (I would need to check how much of that type of discussion I can share publicly.)
Edit to add: There are also some interesting things to discuss around epistemic superiority, and how to deal with a relative lack of expertise in choosing between expert views or in deciding how and when it makes sense to disagree as a "general expert" with forecasting expertise. That's a bit more philosophical, but I'm hoping to discuss related issues in a paper on elicitation I am writing.
And for COVID, I've written about my very early expectations - but maybe you think that a follow-up on why superforecasters mostly disagreed with my forecasts / I modeled things differently than them over the past 3-4 months would be interesting and useful
I'd be interested in this.
Thanks for replying to my question, but although this was nicely written it doesn't really solve the problem. So I'm putting up a $100 bounty for anyone on this site (or outside it) who can solve this problem by the end of next year. (I don't expect it will work, but it might motivate some people to start thinking about it).
It seems to me that the real issue is the rational weighting of reference classes when using multiple models. I want to assign them weights so that they form a good ensemble to build my forecasting distribution from, and these weights should ideally reflect my prior that each is relevant and good, their model complexity, and perhaps the extent to which their biases are countered by other reference classes. In the computationally best of all possible worlds I go down the branching rabbit hole and also make probabilistic estimates of the weights. I could also wing it.
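As a minimal sketch of that framing (all numbers invented, just to fix notation), the ensemble forecast is a weighted mixture of the per-reference-class forecasts:

```python
import numpy as np

# Hypothetical predictive probabilities for a binary question (e.g. "alive in 5 years"),
# one per reference class; every number here is invented for illustration.
reference_class_probs = np.array([0.90,   # actuarial tables for 37-year-old males
                                  0.80,   # lifespans of dictators generally
                                  0.85])  # dictators of small nuclear powers

# Weights: my prior that each class is relevant and well-estimated (made up, sums to 1).
weights = np.array([0.2, 0.5, 0.3])

ensemble_prob = float(weights @ reference_class_probs)
print(ensemble_prob)   # 0.835

# Everything above (the menu of classes and the weights) is a judgment call.
```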
The problem is that the set of potential reference classes appears to be badly defined. The Tesla case potentially involves all possible subsets of stocks (2^N) over all possible time intervals (2^NT), but as the dictator case shows there is also potentially an unbounded set of other facts that might be included in selecting the reference classes. That means that we should be suspicious about having well-formed priors over the reference class set.
When some sensible reference classes pop up in my mind and I select from them, I am doing naturalistic decision making where past experience gates availability. So while I should weigh their results together, I should be aware that they are biased in this way, and I should broaden my model uncertainty for the weighting accordingly. But how much I broaden it depends on how large I allow the considered set of potential reference classes to be - a separate meta-prior.
I don't think that weights are the right answer - not that they aren't better than nothing, but as the Tesla case shows, the actual answer is having a useful model with which to apply reference classes. For example, once you have a model of stock prices as random walks, the useful priors are over the volatility rather than price, or rather, the difference between implied options volatility and post-hoc realized volatility for the stock, and other similar stocks. (And if your model is stochastic volatility with jumps, you want priors over the inputs to that.) At that point, you can usefully use the reference classes, and which one to use isn't nearly as critical.
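To make the object of that prior concrete, a minimal sketch (with synthetic prices and a made-up implied vol; in practice you'd use actual daily closes and a quoted implied volatility):

```python
import numpy as np

def realized_vol(closes, periods_per_year=252):
    """Annualized realized volatility from a series of closing prices."""
    log_returns = np.diff(np.log(closes))
    return log_returns.std(ddof=1) * np.sqrt(periods_per_year)

# Synthetic daily closes, purely for illustration.
rng = np.random.default_rng(1)
closes = 100 * np.exp(np.cumsum(rng.normal(0, 0.02, size=252)))

implied_vol = 0.45           # hypothetical quoted option-implied volatility
print(realized_vol(closes))  # should land near 0.02 * sqrt(252) ~ 0.32 for this series

# The useful prior is over the gap between these two numbers (and how that gap
# behaves for similar stocks), not over the price level itself.
```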
In general, I strongly expect that in "difficult" domains, causal understanding combined with outside view and reference classes will outperform simply using "better" reference classes naively.
There's an important piece of advice for forecasters: don't rely exclusively on your internal model of the world; take the outside view, then adjust from there. But which view is "the" outside view? It depends on the problem - and different people might tell you different things. But if the choice of outside view is subjective, it starts to seem like inside views all the way down.
That's where we get to base rates, which don't solve this problem, but they do highlight it nicely.
Fans of superforecasting know, in a hedgehog-like sense of knowing one thing, that the outside view - which is the base rate, which is the rate of similar events - should be our starting point. But which events are similar, and how is similarity defined? We first need to choose a reference class, based on some pre-existing idea of similarity. In other words, there is a reference class problem, and we evidently don't have a clear way to adjudicate it - even as Bayesian thinkers, not only is that our problem, it's an entire bucket of different problems.
Considering a Concrete Prediction: Tesla Motors
Let's get really concrete: What will the price of Tesla stock be in 6 months?
Well, what is the reference class? In the last year, 90% of the time, the price of Tesla stock has been between $200 and $1,000. But that's a really bad reference class when the price today is $1,800. OK, but looking at the set of all stocks would be even worse - and looking at automobile stocks even worse than that. Which stocks are comparable? What about stocks with P/E ratios over 900? Or stocks with more than half a billion dollars in net losses? We're getting silly here.
Maybe we shouldn't look at stock price, but should look at market capitalization? Or change in price? "Stocks that went up 9-fold over the course of a year" isn't a super helpful reference class - it has only a few examples, and they are all very different from Tesla.
Of course, none of this is helpful. What we really want is the aggregate opinion of the market, so we look at futures contracts and the implied volatility curve for options expiring in February.
That doesn't look like a reference class. But who needs an outside view, anyways?
What is a reference class?
If you want to know the probability of Kim Jong-un staying alive, we can consult the reference class of 37-year-old males in North Korea, where male life expectancy is 68. Alternatively, look at the reference class of his immediate family - his brother died at the age of 46, but his father lived to the age of 70, and his grandfather lived until 82. Are those useful reference points?
What we really want is the lifespan of dictators. Well, dictators of small countries. Oh, actually, dictators of small nuclear powers that know that Qaddafi was killed after renouncing his nuclear program - a reference class with no other members. Once again, of course, none of this is helpful.
In finance, the outside view is a consensus that markets are roughly rational, and the inside view is that you can beat the market. In international relations, the outside view is that dictatorships can be tenuous, but when the regime survives, the leadership lives quite a long time. The inside view is, perhaps, that China has a stake in keeping their nuclear neighbor stable, and won't let anything happen.
Reference classes depend on models of the world.
In each case, the construction of a reference class is a function of a model. Models induce reference classes - political scientists might have expert political judgement, while demographers have expert lifespan judgement, and 2nd year equity analysts have expert financial judgement. All of those are useful.
What reference class should have been used for COVID-19 in, say, mid-March? The set of emerging infectious diseases over the past decade? Clearly not. In retrospect, of course, the best reference class needed an epidemiological model - the reference class of diseases with R0>1, where spread is determined by control measures. And the reference class for the success of the response in the US should have been based on a libertarian view of the failure of American institutions, or a Democrat's view of how Trump had been rapidly dismantling government, and not an index designed around earlier data which ignored political failure modes. But how do we know that in advance? Once again, none of this is helpful in deciding beforehand which reference class to use.
A final example. What reference class is useful for predicting the impact of artificial intelligence over the next decade? Robin Hanson would argue, I think, that it's the reference class of purported game-changing technologies that have not yet attracted significant amounts of capital investment. Eliezer Yudkowsky might argue that it's the reference class of intelligence evolving, sped up by a factor of what we've seen so far of computer intelligence, which moved from an AI winter in the mid-2000s and ant-level intelligence at navigation, to DeepMind being founded in 2010, to IBM's Watson winning Jeopardy in 2011, to GPT-3 now beating Winograd schema challenges and acing general high-school science tests without specific training. And if you ask a dozen AI researchers, depending on your methods, you'll get at least another dozen reference classes. But we still need to pick a reference class.
So which reference class is correct? In my (inside) view as a superforecaster, this is where we turn to a different superforecasting trick, about considering multiple models. As the saying goes, hedgehogs know one reference class, but foxes consult many hedgehogs.