I wouldn't generalize too much from a forecasting competition.
Per Wolpert's No Free Lunch theorems, algorithm performance depends on fit to the problem domain. The winner is likely someone who lucked out: the choice of performance evaluation fit his algorithm better than it fit his competitors'. It doesn't mean he'll win the next competition, and it doesn't mean he isn't good, but it likely means that he was good and lucky.
"How do we judge the potential and promise of the new complicated forecasting method?"
Theory and judgment play a part.
When I first saw the deep learning method presented by Hinton, I was confident that it would be good before seeing any results, because it looked like a great theoretical approach that attacked the problem the right way.
Same thing with Wolpert and Stacked Generalization.
What to bet on? Things that look good theoretically but are currently computationally cost-prohibitive. As computers improve, there is an algorithmic land grab, with researchers rushing into the areas that become computationally tractable.
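(For readers who haven't encountered stacked generalization: the idea is to train a meta-model to combine the predictions of several base models. Below is a minimal sketch using scikit-learn's StackingRegressor; the particular base models and synthetic data are arbitrary choices for illustration, not Wolpert's original setup.)

```python
# A minimal sketch of stacked generalization (Wolpert 1992): a meta-model
# (here a ridge regression) is trained on the out-of-fold predictions of
# several base models. All model and data choices are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("linear", LinearRegression()),
        ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
    ],
    final_estimator=Ridge(),  # meta-model combining the base predictions
)
print(cross_val_score(stack, X, y, scoring="neg_mean_squared_error").mean())
```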
"Per Wolpert's No Free Lunch theorems, algorithm performance depends on fit to the problem domain."
Aren't all these forecasting competitions using real data from real-world problems, making NFL irrelevant?
Real-world problems are not a random sampling from all possible problems and there's plenty of structure to exploit, so invoking NFL in this context seems odd to me.
A real-world competition isn't a random sample of anything. It's a selection of some problems, with some data. The performance of any algorithm will depend on fit to those problems, with those data.
My takeaways from the NFL theorems: the problems in the real world are some structured subset of all possible problems, and the performance of any generalizer on a problem will depend on its fit to that problem.
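To make "fit to the problem" concrete, here's a toy illustration (not from any competition): the same two one-step-ahead forecasters swap rankings depending on whether the series is a random walk or noise around a stable mean.

```python
# Toy illustration of "performance depends on fit to the problem":
# a persistence forecaster (predict the last value) versus a
# historical-mean forecaster, evaluated one step ahead on two
# different kinds of series.
import numpy as np

rng = np.random.default_rng(0)

def mse_one_step(series, forecaster):
    errors = [(series[t] - forecaster(series[:t])) ** 2 for t in range(10, len(series))]
    return np.mean(errors)

persistence = lambda history: history[-1]          # tomorrow = today
historical_mean = lambda history: history.mean()   # tomorrow = long-run mean

random_walk = np.cumsum(rng.normal(size=1000))     # persistence should win here
noise_around_mean = 5 + rng.normal(size=1000)      # the mean should win here

for name, series in [("random walk", random_walk), ("noisy mean", noise_around_mean)]:
    print(name,
          "persistence MSE:", round(mse_one_step(series, persistence), 3),
          "mean MSE:", round(mse_one_step(series, historical_mean), 3))
```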
"The performance of any algorithm will depend on fit to those problems, with those data."
That's not chopped liver.
This post has been written in relation with work I'm doing for the Machine Intelligence Research Institute (MIRI), but hasn't been formally vetted by MIRI. I'm posting this to LessWrong because of its potential interest to a segment of the LessWrong readership. As always, all thoughts are appreciated.
In this post, I'll try to apply some of the scientific theory of paradigm shifts to the domain of forecasting. In a sense, all of science is about making (conditional) predictions about the behavior of systems. Forecasting simply refers to the act of making predictions about the real-world future rather than about a specific controlled experimental setup. So while the domain of forecasting is far more restricted than the domain of science, we can still apply the conceptual framework of paradigm shifts in science to forecasting.
Thomas Kuhn and paradigm shifts
Thomas Kuhn's book The Structure of Scientific Revolutions (Amazon, Wikipedia) provides a detailed descriptive theory of the nature of paradigm shifts in science. In Kuhn's account, a paradigm shift is a fundamental change in the basic concepts and experimental practices of a scientific discipline.
Simple methods in science are often "good enough" until you want a much higher "resolution"
It's worth noting that most paradigm shifts move from simpler, more tractable models to more complicated ones. Initially, the scientific theory is not trying to explain the real world at too fine a resolution, and is therefore tolerant of large errors. The theories in vogue initially are the simplest among those that can explain the world within this generous margin of error. Over time, as measurement becomes more precise and accurate, and as understanding or engineering at higher precision becomes more important, the focus shifts to finding a model with lower error rates, accepting a possible increase in model complexity.
Consider, as one example, how classical mechanics was good enough for most purposes until measurement and engineering at high speeds and small length scales demanded relativity and quantum mechanics.
Complexity and paradigms in the context of forecasting
There are often competing methods for forecasting a given indicator, and the methods vary considerably in complexity. For instance, persistence is one of the simplest forecasting methods: persistence of levels means that tomorrow will be the same as today, whereas persistence of trends means that the difference between tomorrow and today equals the difference between today and yesterday. Somewhat more sophisticated than simple persistence are variations of linear regression suited to time series, which tackle the problems of both periodic fluctuation and noise. More sophisticated methods allow for functional forms obtained by additive or multiplicative combination, or composition, of the functional forms used in the simpler methods.
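As a concrete sketch of the two persistence baselines just described (hand-rolled for illustration):

```python
# The two simplest forecasting baselines described above.
import numpy as np

def persistence_of_levels(series):
    """Forecast: tomorrow will be the same as today."""
    return series[-1]

def persistence_of_trends(series):
    """Forecast: tomorrow - today equals today - yesterday."""
    return series[-1] + (series[-1] - series[-2])

prices = np.array([100.0, 103.0, 105.0])
print(persistence_of_levels(prices))   # 105.0
print(persistence_of_trends(prices))   # 107.0 = 105 + (105 - 103)
```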
Here are some measures of complexity for forecasting methods: the number of free parameters to be estimated, the amount of historical data needed to fit the method, the computational resources required for fitting and prediction, and the difficulty of understanding and implementing the method.
Do complicated methods beat simpler methods?
The Makridakis Competitions are often cited as canonical sources of information on how different types of quantitative trend forecasting compare. Makridakis and Hibon draw four conclusions (listed on the linked page and in their papers), of which Finding 1 is most relevant to us: "Statistically sophisticated or complex methods do not necessarily provide more accurate forecasts than simpler ones." Some people (such as Nassim Nicholas Taleb) have used this finding to argue that sophisticated methods are useless.
The conclusion drawn by Makridakis and Hibon is supported by the data, but there is less to it than meets the eye. As noted earlier in the post, even the most revolutionary and impressive complicated scientific paradigms (such as relativity and quantum mechanics) only rarely outperform the simpler, more widely known paradigms (such as classical mechanics) except in cases that are designed to draw on the strength of the new paradigm (such as high speed or small length scales). And yet, in the cases where those slight improvements matter, we may be able to improve a lot by using the more sophisticated model. Just as knowledge of relativity makes possible a high-precision GPS that would have been impossible otherwise, new forecasting paradigms may make possible things (such as just-in-time inventory management) that would not have been possible at anywhere near that level of quality otherwise.
Of course, the selection of the sophisticated method matters: some sophisticated methods are simply wrong-headed and will therefore underperform simpler methods except in tailor-made situations. But the key point here is that an appropriately selected sophisticated model, with access to adequate data and computational resources, can systematically outperform simpler models. Finding 2 of the Makridakis Competitions is "The relative ranking of the performance of the various methods varies according to the accuracy measure being used." Finding 4 says "The accuracy of the various methods depends on the length of the forecasting horizon involved." The choice of best method also varies across types of time series (so the best method for macroeconomic time series could differ from the best method for time series of industry production or sales data).
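Finding 2 is easy to reproduce with invented numbers: two forecasts of the same actuals can rank differently under mean absolute error (MAE) than under root mean squared error (RMSE).

```python
# Two forecasts of the same actuals: A makes moderate errors everywhere,
# B is nearly perfect except for one large miss. A wins on RMSE, B on MAE.
import numpy as np

actuals    = np.array([10.0, 10.0, 10.0, 10.0])
forecast_a = np.array([12.0, 12.0, 12.0, 12.0])  # errors: 2, 2, 2, 2
forecast_b = np.array([10.0, 10.0, 10.0, 15.0])  # errors: 0, 0, 0, 5

for name, forecast in [("A", forecast_a), ("B", forecast_b)]:
    err = forecast - actuals
    print(name, "MAE:", np.mean(np.abs(err)), "RMSE:", np.sqrt(np.mean(err ** 2)))
# A: MAE 2.0,  RMSE 2.0
# B: MAE 1.25, RMSE 2.5
```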
Duncan Watts makes a similar point in his book Everything Is Obvious: Once You Know the Answer (paraphrased): sophisticated methods don't offer a huge advantage over simpler methods, but the best sophisticated methods are modestly better. And if you're operating at a huge scale (for instance, if you're running an electrical utility that needs to forecast consumer demand, or you're Walmart and need to manage inventory to minimize waste, or you're Google or Facebook and need to forecast traffic in order to budget appropriately for servers), even modest proportional improvements in accuracy can translate to huge absolute reductions in waste and increases in profits.
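To attach hypothetical numbers to that claim: a utility spending $10 billion a year on generation and procurement that cuts the losses from over- and under-provisioning from 2% to 1.5% of spending, by way of better demand forecasts, saves $50 million a year from that half-percentage-point improvement.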
The evolution of complicated methods
Complicated methods can start off performing a lot worse than simpler methods, and therefore be deemed useless. But at some point they could start overtaking the simpler methods, and once they overtake, they could rapidly pull ahead. What might change in the process? It could be any of these three, or some combination thereof: more data becoming available, more computational power becoming available, or refinements to the method itself.
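The data-driven version of this overtaking is easy to simulate (a sketch with arbitrary model choices, not a claim about any particular method): fit a simple linear model and a more flexible model to a nonlinear signal at increasing sample sizes; the flexible model typically starts behind and then overtakes.

```python
# Sketch: a flexible model often loses to a simple one on small samples
# and overtakes it as training data grows. Model choices are arbitrary.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(0, 6, size=(n, 1))
    y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=n)  # nonlinear signal
    return x, y

x_test, y_test = make_data(2000)
for n in [10, 30, 100, 1000]:
    x_train, y_train = make_data(n)
    simple = LinearRegression().fit(x_train, y_train)
    flexible = RandomForestRegressor(n_estimators=200, random_state=0).fit(x_train, y_train)
    print(n,
          "linear:", round(mean_squared_error(y_test, simple.predict(x_test)), 3),
          "forest:", round(mean_squared_error(y_test, flexible.predict(x_test)), 3))
```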
How do we judge the potential and promise of the new complicated forecasting method?
Given a complicated method that people claim could work given sufficient data or computing power that we don't yet have access to, how are we to judge the plausibility of the claims? The question is similar to the general question of whether a proclaimed new model or theory is the harbinger of a paradigm shift in a scientific discipline. I don't have satisfactory answers. In a subsequent post, I'll look at a few historical and current examples of paradigm shifts in forecasting.
Any thoughts on the post as a whole would be appreciated, but I'm particularly interested in thoughts on this last topic in the post.
Thanks to Luke Muehlhauser for helpful early discussions that led to this post and to Jonah Sinick for his thoughts on an early draft of the post.