I agree with basically everything in the post, and especially that simple linear models are way undervalued. I've also come across cases where experts using literally 100x more data in their models get a worse outcome than other experts because they made a single bad assumption and didn't sanity check it properly. And I've seen cases where someone builds a linear model on the reciprocal of the variable they should have used, or where they didn't realize they were using a linear approximation of an exponential too far from the starting point. Modeling well is itself a skill that requires expertise and judgment. Other times, I see people build a simple linear model, which is built well, and then fail to notice or understand what it's telling them.
There's a Feynman quote I love about taking simple models seriously:
As they're telling me the conditions of the theorem, I construct something which fits all the conditions. You know, you have a set (one ball)—disjoint (two balls). Then the balls turn colors, grow hairs, or whatever, in my head as they put more conditions on. Finally they state the theorem, which is some dumb thing about the ball which isn't true for my hairy green ball thing, so I say, 'False!'
And a Wittgenstein quote about not thinking enough about what model predictions and observations imply:
“Tell me,” the great twentieth-century philosopher Ludwig Wittgenstein once asked a friend, “why do people always say it was natural for man to assume that the sun went around the Earth rather than that the Earth was rotating?” His friend replied, “Well, obviously because it just looks as though the Sun is going around the Earth.” Wittgenstein responded, “Well, what would it have looked like if it had looked as though the Earth was rotating?”
Literally last week I was at an event listening to an analyst from a major outlet that produces model-based reports that people pay a lot of money for. They were telling an audience of mostly VCs that their projections pretty much ignore the future impact of any technology that isn't far enough along to have hard data. Like for energy, they have projections about nuclear, but exclude SMRs, and about hydrogen, but exclude synthetic hydrocarbons. Thankfully most of the room immediately understood (based on conversations I had later in the day) that this meant the model was guaranteed to be wrong in the most important cases, even though it looks like a strong, well-calibrated track record.
The solution to that, of course, is to put all the speculative possibilities in the model, weight them at zero for the modal case, and then do a sensitivity analysis. If your sensitivity analysis shows that simple linear models vary by multiple orders of magnitude in response to small changes in weights, well, that's pretty important. But experts know if they publish models like that, most people will not read the reports carefully. They'll skim, and cherry-pick, and misrepresent what you're saying, and claim you're trying to make yourself unfalsifiable. They'll ignore the conditionality and probabilities of the different outcomes and just hear them all as "Well it could be any of these things." I have definitely been subject to all of those, and at least once (when the error bars were >100x the most likely market size for a technology) chose not to publish the numerical outcomes of my model at all.
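As a rough illustration of that procedure, here is a minimal sketch. The technology names come from the example above, but every number (the potentials and the weights) is invented, and `projection` and `sensitivity` are hypothetical helper names. It only shows the mechanics of weighting speculative terms at zero in the modal case and then sweeping one weight.

```python
# Hypothetical market-size model: each technology contributes weight * potential.
# All figures below are invented for illustration, not taken from any report.
POTENTIAL = {
    "established_nuclear": 100.0,        # far enough along to have hard data
    "smr": 5_000.0,                      # speculative
    "synthetic_hydrocarbons": 20_000.0,  # speculative
}

# Modal case: speculative technologies are included but weighted at zero.
MODAL_WEIGHTS = {
    "established_nuclear": 1.0,
    "smr": 0.0,
    "synthetic_hydrocarbons": 0.0,
}


def projection(weights):
    """Simple linear model: weighted sum of the potentials."""
    return sum(weights[name] * POTENTIAL[name] for name in POTENTIAL)


def sensitivity(base_weights, name, values):
    """Vary one weight over `values`, holding the others at their base values."""
    results = []
    for value in values:
        weights = dict(base_weights)
        weights[name] = value
        results.append((value, projection(weights)))
    return results


modal = projection(MODAL_WEIGHTS)
sweep = sensitivity(MODAL_WEIGHTS, "synthetic_hydrocarbons", [0.0, 0.05, 0.5])
for weight, value in sweep:
    print(f"weight={weight:.2f}  projection={value:9.1f}  ratio to modal={value / modal:6.1f}x")
```

With these made-up inputs, moving a single weight from 0 to 0.5 changes the projection by about two orders of magnitude, which is the kind of result the modal-case number alone would hide.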
To push back a little:
In fact, I’d go further and argue that explanatory modeling is just a mistaken approach to predictive modeling. Why do we want to understand how things work? To make better decisions. But we don’t really need to understand how things work to make better decisions, we just need to know how things will react to what we do.
The word "react" here is a causal term. To predict how things will "react" we need some sort of causal model.
What makes predictive modeling a better idea is that it also allows us to find factors that are not causal, but still useful.
Usefulness is also a causal notion. X is useful if it causes a good outcome. If X doesn't cause a good outcome, but is merely correlated with it, it isn't useful.
Oh, these are good objections. Thanks!
I'm inclined to 180 on the original statements there and instead argue that predictive modelling works because, as Pearl says, "no correlation without causation". Then an important step when basing decisions on predictive modelling is verifying that the intervention has not cut off the causal path we depended on for decision-making.
Do you think that would be closer to the truth?
I see this as less of an endorsement of linear models and more of a scathing review of expert performance.
- When an arithmetic model is calibrated, it is specifically by including feedback from the real-world effects of its predictions. Experts do not, as a rule, seek out any feedback on their calibration.
This. Basically, if your job is to do predictions, and the accuracy of your predictions is not measured, then (at least the prediction part of) your job is bullshit.
I think that if you compared simple linear models with experts in domains where people actually care about their predictions, the outcome would be different. For example, if simple models predicted stock performance better than experts at investment banks, anyone with a spreadsheet could quickly become rich. There are few if any cases of 'I started with Excel and $1000, and now I am a billionaire'. Likewise, I would be highly surprised to see a simple linear model outperform Nate Silver or the weather forecast.
Even predicting chess outcomes from mid-game board configurations is something where I would expect human experts to outperform simple statistical models working on easily quantifiable data (e.g. number of pieces remaining, number of possible moves, being in check, etc).
Neural networks contained in animal brains (which includes human brains) are quite capable of implementing linear models, and so should perform at least equally well when they are properly trained. A wolf pack deciding whether to chase some prey has direct evolutionary skin in the game of making its prediction of success as accurate as possible, which the average school counselor predicting academic success simply does not have.
--
You touch on this a bit in 'In defense of explanatory modeling', but I want to emphasize that uncovering causal relationships and pathways is central to world modelling. Often, we don't want just predictions, we want predictions conditional on interventions. If you don't have that, you will end up trying to cure chickenpox with makeup, as 'visible blisters' is negatively correlated with outcomes.
Likewise, if we know the causal pathway, we have a much better basis to judge if some finding can be applied to out-of-distribution data. No matter how many anvils you have seen falling, without a causal understanding (e.g. Newtonian mechanics), you will not be able to reliably apply your findings to falling apples or pianos.
LessWrong user dynomight explains how arithmetic is an underrated world-modeling technology and uses dimensional algebra as the motivating case. I agree dimensional algebra is fantastic, but there’s an even better motivating example for arithmetic in world-modeling: linear models for prediction.
Simple linear models outperform experts
In 1954, Paul Meehl published what he later came to call “my disturbing little book”. This book[1] contains the most important and well-replicated research I know of; yet most people don’t know about it. The basic argument is that many real-world phenomena – even fickle ones – can be adequately modeled with addition and multiplication.
Tempered by the lack of evidence at the time, the book doesn’t go quite as far as it could have. Here are some statements that have later turned out to be true, given in order of increasing outrageousness.
Obviously, these are phrased to provoke, and take additional nuance to be fully understood, but the general theme remains the same: addition and multiplication take you surprisingly far.
Predictive modeling is more important than explanatory modeling
There are two reasons to make models: predictive and explanatory.[2]
Most people reason about the world through explanatory modeling, and are uncomfortable with predictive modeling. I would argue that predictive modeling is the better approach.
Predictive modeling has the funny property that it is acausal and atemporal: we can predict the risk of someone being in an accident when they are driving drunk, but we can also predict the probability that someone was drunk when we have observed an accident. I think this is what makes people step away from predictive models. Once we are using consequences to predict antecedents, we are flaunting our ignorance of all the nice logic the Greeks came up with – even though it was a while since Bayes and Laplace taught us this is fine.[3]
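A small worked sketch of running the prediction in both directions with Bayes' rule follows. The three probabilities are assumed values picked only to make the arithmetic visible, and `posterior` is a hypothetical helper name.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(H | E) by Bayes' rule, from P(H), P(E | H) and P(E | not H)."""
    evidence = prior * likelihood + (1 - prior) * likelihood_given_not
    return prior * likelihood / evidence


# Assumed numbers, for illustration only.
P_DRUNK = 0.02               # share of trips made by a drunk driver
P_ACCIDENT_IF_DRUNK = 0.010  # forward direction: accident risk when drunk
P_ACCIDENT_IF_SOBER = 0.001  # accident risk when sober

# Backward direction: from the consequence (accident) to the antecedent (drunk).
p_drunk_given_accident = posterior(P_DRUNK, P_ACCIDENT_IF_DRUNK, P_ACCIDENT_IF_SOBER)
print(f"P(accident | drunk) = {P_ACCIDENT_IF_DRUNK:.3f}")
print(f"P(drunk | accident) = {p_drunk_given_accident:.3f}")
```

Under these assumptions, observing an accident raises the probability that the driver was drunk from 2 % to roughly 17 %, even though nothing about the accident caused the drinking.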
Predictive modeling doesn’t really care what was the cause, or what came first. It is all about figuring out which observations tend to come together.
In fact, I’d go further and argue that explanatory modeling is just a mistaken approach to predictive modeling. Why do we want to understand how things work? To make better decisions. But we don’t really need to understand how things work to make better decisions, we just need to know how things will react to what we do. Knowing how they work can be an aid in that, but when it is, predictive modeling will also pick up on causal factors.
What makes predictive modeling a better idea is that it also allows us to find factors that are not causal, but still useful.[4]
In defense of explanatory modeling
Something Pearl emphasises in his book on causal inference[5] is one of the primary strengths of causal reasoning: the stability of the relationships it uncovers.[6]
Using drunk driving as an example, it would be reasonable to believe that drunk driving does not cause accidents at all; it is just that people who drive under the influence tend to also behave recklessly in other ways, and this is what causes accidents. But we can test that: we can get randomly selected people drunk and then put them in a driving simulator, and note whether the drunk group has a higher accident rate. They do, because drunk driving is causally associated with accidents, meaning the relationship is stable even when we control for other factors like personality type.
Predictive models are not guaranteed to be stable like causal models. What we need to do with predictive modeling is find out under which conditions the models hold, and find alternative models for when those conditions are violated. The upshot is that predictive models are much easier to construct, meaning we can take advantage of modeling for a much lower cost.
When arithmetic beats expertise
Meehl lists several comparisons of expertise and arithmetic in his disturbing little book. I strongly encourage you to read the book for more motivation and evidence, but here are some representative examples:
Note that even if the expert seems to perform on the same level as arithmetic in some of these examples, the expert often refused to predict on difficult cases. If we assume their refusal to predict is a middle-of-the-range prediction, the expert performance becomes worse than the linear model.
But this does not mean we can go without experts. To run a linear model, we need data. Experts are good at extracting data from complicated situations. Another example Meehl brings up in the book is how movements in family casework were roughly as well predicted by a five-variable linear regression as by experts. What’s curious about this are the variables used:
Experts are good at taking complicated qualitative data and distilling it down to a quantitative range.[7]
A few years later[8], Meehl described the state of the research up to that point:
There has been a lot of research on this since the 1950s, but to close this section in this article, we’ll briefly summarise a meta-study[9] performed in 2000. The authors surveyed 136 comparisons between expert judgment and arithmetical data combination, on subjects like
They conclude that arithmetically combining data is on average 10 % more accurate than expert judgment. The 136 comparisons break down into the following three cases.
Across the board, the experts did not perform better with more experience in the field, nor did they perform better with more data. In fact, there are few things as dangerous as an expert with access to open-ended data that can be interpreted wildly (like a clinical interview).
There are three reasons arithmetic wins even though it is often equal to expert judgment:
I know what the reader thinks to themselves at this point:
It is worth emphasising that the meta-study covered a wide range of fields, and the authors did not find any systematic indicator of when expert performance is better than arithmetic. This means if we assume our judgment is better than arithmetic we are betting that our case randomly falls into that 11 % bucket. We are playing Russian roulette with five chambers loaded and one empty!
It would be sensible to do so if expert judgment was much cheaper than arithmetic, but it’s not. In his disturbing little book, Meehl puts this particularly well.
We cannot choose policies based on the effect they have on individual cases, because the choice of individual case implies conflicting decisions. When we choose policy, we must do it based on the aggregate effects of that policy, even when it has known, heart-breaking consequences in individual cases.
Improper linear models are still better than experts
To see the full extent to which arithmetic outperforms expertise, we will turn to Dawes[11], who breaks up expert judgment into components and estimates the value contributed by each component.
We pretend an expert prediction is composed of three actions:
The expert does not know these are the things they do – the expert just looks at a bunch of data and makes a prediction based on their gut feel. But we can use this model to dive into why expert predictions don’t work as well as arithmetic.
Parametric representation removes noise
Parametric representation is a way to average out the noise component from expert judgment. It works by asking experts to predict based on randomly generated data, and then training a linear model on the expert predictions. Since the training data does not contain outcomes, we are actually figuring out the shape of the mathematical function in the experts’ heads. We are using the experts to select predictors and to determine their direction and relative weight, but then the linear regression will average away the noise of their predictions. What we find is that this linear model, trained on expert predictions only, outperforms experts. The conclusion is simple: the noise experts add is not useful.
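To make the procedure concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the simulated expert follows a fixed linear policy (weights 0.7 and 0.3) plus case-by-case noise, the outcome depends on the same two cues, and `pearson` and `fit_two_predictors` are made-up helper names. It reproduces no study data; it only shows why a regression fitted to the expert's judgments, never to the outcomes, can beat the expert.

```python
import random


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def fit_two_predictors(x1, x2, y):
    """Ordinary least squares for y ~ b0 + b1*x1 + b2*x2 via the normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return b1, b2, my - b1 * m1 - b2 * m2


rng = random.Random(0)  # fixed seed so the sketch is repeatable
N = 2000
x1 = [rng.gauss(0, 1) for _ in range(N)]  # randomly generated cue 1
x2 = [rng.gauss(0, 1) for _ in range(N)]  # randomly generated cue 2

# Simulated expert: a consistent policy plus judgment noise (assumed).
expert = [0.7 * a + 0.3 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
# Real outcome: never shown to the regression below.
outcome = [0.5 * a + 0.5 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

# Parametric representation: regress the expert's judgments on the cues.
b1, b2, b0 = fit_two_predictors(x1, x2, expert)
model_of_expert = [b0 + b1 * a + b2 * b for a, b in zip(x1, x2)]

r_expert = pearson(expert, outcome)
r_model = pearson(model_of_expert, outcome)
print(f"recovered weights: b1={b1:.2f}, b2={b2:.2f}")
print(f"expert vs outcome: r={r_expert:.2f}")
print(f"model of expert vs outcome: r={r_model:.2f}")
```

In this setup the fitted model keeps the expert's policy and drops the noise term, so its correlation with the outcome comes out higher than the expert's own. For simplicity the same cases are used to fit and to score the model, which is harmless here only because the outcomes never enter the fit.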
Experts do not sensibly assign weights
What about assigning weights? How much value does the expert add to that activity? We can find out by comparing the parametric representation with the same model except with weights assigned randomly. Both of those models perform on a roughly equal level, which indicates that experts, on average, select weights randomly.
Experts are good at selecting variables
These results are based on a small set of studies, and thus cannot be considered conclusive. What they do is waggle their eyebrows suggestively in the direction of the tree Dawes is barking up:
Just count it. What? Something. Anything!
Given the strong performance of optimally-weighted linear regression over even expert-assigned weights, we might find ourselves in a predicament: finding optimal weights requires knowledge of the outcome, and we often can’t measure the outcome – either because we don’t know how to, or because we lack historic data.
Under those circumstances, we can still do better than random weights: unit weights. Linear models are surprisingly powerful even with unit weights, and this is a good thing because it’s easy to find situations where estimated weights are unstable:
In these situations, setting all weights to be equal yields higher predictive accuracy than both random weights and attempting to estimate optimal weights.[12]
One funny consequence of the optimality of unit-weighted models is that we can also do this when we are unable to measure the outcome we are trying to optimise for, such as in the case of hiring in small organisations. We don’t know exactly what job success looks like, but we have a decent idea of which factors contribute to it, so we can measure our candidates on those variables, and then combine those measurements with equal weights.
As Dawes snappily puts it:
Although it does get into dangerous territory, this can even be used to overcome measurement problems. Maybe we are looking to hire people who are willing to challenge the consensus, but we don’t know how to measure this willingness. What we can do is find an easily-measured proxy, such as counting the frequency with which the applicant contradicts one of the interviewers. This can happen for any number of reasons which are unrelated to job performance (“Would you like tea or coffee?” “No, thank you. Water is good.”), but if we try to be clever we risk introducing noise. Instead, we can just run the dumb count and combine with equal weights. It may well perform better than expert judgment.
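The hiring example can be sketched as a unit-weighted model. The candidates, variables and values below are hypothetical, and standardizing each variable to z-scores before summing is an assumption of this sketch (it keeps a variable with a large numeric range from dominating); `z_scores` and `unit_weight_scores` are made-up helper names.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a column to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def unit_weight_scores(columns, directions):
    """Sum the standardized predictors, each weighted +1 or -1.

    `directions` encodes only which way each variable points; no
    weights are estimated from outcome data.
    """
    standardized = {name: z_scores(vals) for name, vals in columns.items()}
    n = len(next(iter(columns.values())))
    return [
        sum(directions[name] * standardized[name][i] for name in columns)
        for i in range(n)
    ]


# Hypothetical candidates and measurements.
candidates = ["A", "B", "C", "D"]
columns = {
    "work_sample_score": [7, 9, 5, 8],
    "years_relevant_experience": [2, 6, 1, 4],
    "contradictions_in_interview": [1, 3, 0, 2],  # the "dumb count" proxy
}
directions = {name: +1 for name in columns}  # assume more is better for all three

scores = unit_weight_scores(columns, directions)
ranking = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

The only judgment calls left to the expert are which variables to include and which direction each one points, which is the part the evidence above suggests experts are good at.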
Rationale
The authors of the meta-study referenced above ask themselves why their results were obtained, and speculate that
Meehl and some coauthors[13] dive a little deeper into reasons arithmetic prediction is so strong:
Arithmetic models also suck, but don’t let that stop you
One of the common objections against arithmetic models I have encountered is that they still suck. They do. You might come away from all the words above thinking that linear models will perform well, but the truth is that generally they don’t. All the above is saying is that linear models perform better than experts, but experts also suck.
The difference is that when we have developed an arithmetic model, we can usually give a number indicating how well it performs, and this is the first time people are faced with how difficult prediction is. The expert may have predicted the same thing worse for a decade, but nobody ever evaluated their track record so closely. They’re an expert! Obviously they know what they are doing, don’t they? So when people are faced with poor predictive power for the first time it is usually in the context of an arithmetic model, and they reject it because “We must be able to do better than that!” Here, I agree with Dawes, and tend to answer, “Really? What makes you think so?”
Many real-world outcomes arise as complicated interactions between a multitude of variables, and they are genuinely very hard to predict. There’s no reason to think we can do better, and when we try (e.g. by hiring an expert), we run a very large risk of just introducing noise which makes the predictions – on aggregate – worse, regardless of the effect on individual cases.
Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence; Meehl; University of Minnesota Press; 1954.
Meehl calls these discriminative and structural, but I think predictive and explanatory is the more modern terminology.
Even Fisher was suspicious of the “method of inverse probability”, and tried – but failed – to replace it with something better.
A typical case would be consequences of a common antecedent: high intelligence results in good grades, but is also associated with a good grip on language. This means you can predict someone’s elementary school test scores based on the first draft of an essay they wrote in a completely different situation. I know because that was a game I used to play in elementary school.
Causality: Models, Reasoning, and Inference; Pearl; Cambridge University Press; 2009.
Another strength of causal modeling is social, namely that people are more ready to accept a causal model.
Anyone competing in forecasting contests knows this already: a simple way to improve one’s forecast is to break it down into components, estimate the components, and then combine into a final prediction. One reason this helps is that errors in estimating each component cancel out; another is that coherence laws relate the estimates of the components to each other.
When Shall We Use Our Heads Instead of the Formula?; Meehl; Journal of Counseling Psychology; 1957.
Clinical Versus Mechanical Prediction: A Meta-Analysis; Grove, Zald, Lebow, Snitz, Nelson; Psychological Assessment; 2000.
One thing Meehl points out experts can do which arithmetic cannot is generate new hypotheses. To predict arithmetically, we need to aim for a criterion, or a fixed set of potential outcomes. This set needs to be conjured out of thin air, and that is something experts do really well.
The Robust Beauty of Improper Linear Models in Decision Making; Dawes; American Psychologist; 1979.
Dawes has a footnote on the technical details here, for the interested.
Clinical Versus Actuarial Judgment; Dawes, Faust, Meehl; Science; 1989.
Noise: A Flaw in Human Judgment; Kahneman, Sibony, Sunstein; Little, Brown and Company; 2021.
In case you don’t see why: imagine detaining everyone with a subtle EEG abnormality if it is relatively common in the population. Most of these people will not be criminals, even if every single criminal has an EEG abnormality!