Thanks for the post!
sample five random users’ forecasts, score them, and then average
Are you sure this is how their bot works? I read this more as "sample five things from the LLM, and average those predictions". For Metaculus, the crowd is just given to you, right, so it seems crazy to sample users?
Yeah, I also think you misunderstood that part of the paper (though it really is very ambiguously written). My best guess is they are saying that they are averaging their performance metrics over 5 forecasts.
I'm curious what you guys think about the opportunities to train LLMs to make forecasts. E.g. Google is rumored to have snapshots of the whole internet going back many years; they could train a big new model on pre-2014 data, have it make forecasts for the next decade, and then have those forecasts automatically scored and trained on. If you guys got acquired by Google and given a 10e26 FLOP compute budget, how would you use it?
We'd probably try something along the lines you're suggesting, but there are some interesting technical challenges to think through.
For example, we'd want to train the model to be good at predicting the future, not just knowing what happened. Under a naive implementation, weight updates would probably go partly towards better judgment and forecasting ability, but also partly towards knowing how the world played out after the initial training cutoff.
There are also questions around IR; it seems likely that models will need external retrieval mechanisms to forecast well for the next few years at least, and we'd want to train something that's natively good at using retrieval tools to forecast, rather than relying purely on its crystalised knowledge.
Yeah I imagine it would partly just learn facts about what happened - but as long as it also partly learns general forecasting skills, that is important and measurable progress. Might be enough to be very useful.
Re: retrieval: yep, I am imagining that being part of the setup and part of what it learns to be good at.
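To make the automatic scoring step concrete, here's a minimal sketch (hypothetical question IDs and data; a real setup would score many thousands of post-cutoff questions against their resolved outcomes):

```python
# Hypothetical sketch: score a forecaster on questions that resolve
# strictly after its training cutoff, using the Brier score.

def brier_score(p: float, outcome: int) -> float:
    """Squared error between a forecast probability and a 0/1 outcome."""
    return (p - outcome) ** 2

def evaluate(forecasts: dict[str, float], outcomes: dict[str, int]) -> float:
    """Mean Brier score over all resolved questions (lower is better)."""
    scores = [brier_score(forecasts[q], outcomes[q]) for q in forecasts]
    return sum(scores) / len(scores)

# Example: two hypothetical questions that resolved after the cutoff.
forecasts = {"q1": 0.8, "q2": 0.3}
outcomes = {"q1": 1, "q2": 0}
print(evaluate(forecasts, outcomes))  # (0.04 + 0.09) / 2 ≈ 0.065
```

The mean Brier score over the post-cutoff period would then be the training signal, which is exactly where the judgment-vs-memorization confound above bites.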
Curated.
This is a great roundup of recent research, a clear explanation of how one can think about the work involved in an AI-generated forecast (separating out information retrieval from reasoning), and a lot of very clear and simple pointers to specifically which parts of these papers seem poorly argued or misleading. I certainly changed my mind after reading this post.
I love that LessWrong is a place where people with a lot of experience and knowledge can come to address a current discussion topic and write such a clearly written rebuttal. Yes, there's a conflict of interest here, but the arguments are clear, and I wonder if we'll see good counter-rebuttals.
Here's to more work and arguments on this subject in the future!
Thank you for this!
Edit note: I slightly cleaned up some formatting in the post which looked a bit janky on mobile (due to heavily nested bullet lists with images). Feel free to revert.
Thank you for the careful look into data leakage in the other thread! Some of your findings were subtle, and these are very important details.
The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.
Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?
I wonder if OpenAI's o1 changes the game here. Its architecture seems specifically designed for information retrieval.
The AI forecaster is able to consistently outperform the crowd forecast on a sufficiently large number of randomly selected questions on a high-quality forecasting platform
Seeing how the crowd forecast routinely performs at a superhuman level itself, isn't it an unfairly high bar to clear? Not invalidating the rest of your arguments – the methodological problems you point out are really bad – but before asking the question about superhuman performance it makes a lot of sense to fully agree on what superhuman performance really is.
(I also note that a high-quality forecasting platform suffers from self-selection by unusually enthusiastic forecasters, raising the bar further. However, I don't believe this to be an actual problem, because if someone claims "performance on par with humans", I would expect that to mean "enthusiastic humans".)
As I understand it, the Metaculus crowd forecast performs as well as it does (relative to individual predictors) in part because it gives greater weight to more recent predictions. If "superhuman" just means "superhumanly up-to-date on the news", it's less impressive for an AI to reach that level if it's also up-to-date on the news when its predictions are collected. (But to be confident that this point applies, I'd have to know the details of the research better.)
I agree it's a high bar, but note that
We made the deliberate choice of not getting too much into the details of what constitutes human-level/superhuman forecasting ability. We have a lot of opinions on this as well, but it is a topic for another post in order not to derail the discussion on what we think matters most here.
I think it is fair to say that Metaculus' crowd forecast is not what would naively be thought of as a crowd average: the recency weighting does a lot of work. So a general claim that an individual AI forecaster (at, say, the 80th percentile of ability) is better than the human crowd is reasonable, unless the claim is specifically about a Metaculus-type weighted forecast.
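For intuition on how much work the recency weighting can do, here's a toy illustration (a sketch of the general idea, not Metaculus's actual aggregation algorithm, which is more involved):

```python
# Toy recency-weighted crowd forecast (not Metaculus's actual algorithm):
# newer predictions get exponentially more weight.
import math

def recency_weighted_mean(preds, half_life_days=7.0):
    """preds: list of (probability, age_in_days) tuples."""
    weights = [math.exp(-math.log(2) * age / half_life_days) for _, age in preds]
    total = sum(weights)
    return sum(p * w for (p, _), w in zip(preds, weights)) / total

# Two stale predictions of 0.3 are mostly overridden by one fresh 0.8.
preds = [(0.3, 30.0), (0.3, 20.0), (0.8, 0.0)]
print(round(recency_weighted_mean(preds), 3))  # ≈ 0.72
```

A single up-to-date forecaster can dominate the aggregate, which is part of why the weighted crowd is such a strong baseline: it is effectively always up to date on the news.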
[Conflict of interest disclaimer: We are FutureSearch, a company working on AI-powered forecasting and other types of quantitative reasoning. If thin LLM wrappers could achieve superhuman forecasting performance, this would obsolete a lot of our work.]
Widespread, misleading claims about AI forecasting
Recently we have seen a number of papers (Schoenegger et al., 2024; Halawi et al., 2024; Phan et al., 2024; Hsieh et al., 2024) with claims that boil down to “we built an LLM-powered forecaster that rivals human forecasters or even shows superhuman performance”.
These papers do not communicate their results carefully enough, shaping public perception in inaccurate and misleading ways. Some examples of public discourse:
Ethan Mollick (>200k followers) tweeted the following about the paper Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy by Schoenegger et al.:
A post on Marginal Revolution with the title and abstract of the paper Approaching Human-Level Forecasting with Language Models by Halawi et al. elicits responses like
A Twitter thread on LLMs Are Superhuman Forecasters by Phan et al., claiming that “AI […] can predict the future at a superhuman level”, gathered more than half a million views within two days of being published.
The number of such papers on AI forecasting, and the vast amount of traffic on misleading claims, makes AI forecasting a uniquely misunderstood area of AI progress. And it’s one that matters.
What does human-level or superhuman forecasting mean?
"Human-level" or "superhuman" is a hard-to-define concept. In an academic context, we need to work with a reasonable operationalization to compare the skill of an AI forecaster with that of humans.
One reasonable and practical definition of a superhuman AI forecaster is
(For a human-level forecaster, just replace "outperform" with "perform on par with".)
Red flags for claims to (super)human AI forecasting accuracy
Our experience suggests there are a number of things that can go wrong when building AI forecasting systems, including:
> The number of ChatGPT Plus subscribers is estimated between 230,000-250,000 as of October 2023.
without realising that this mixes up ChatGPT and the ChatGPT mobile app.
Points 1 and 2 could also be summarised as "not being good (enough) at information retrieval (IR)". We believe that "being good at IR" is both
So if an agent fails at the IR stage, even the smartest and most rational entity will struggle to turn the result into a good forecast. This is basically just a roundabout way of saying garbage in, garbage out (GIGO).
A similar argument can be made for quantitative reasoning being important.
In the following, we go through issues with the papers in detail.
Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy (Schoenegger et al., 2024)
A quick glance over the paper shows a couple of suspicious points:
Hence the aggregate forecast will usually not be aware of any recent developments that aren't already in the LLMs' memories.
And indeed, upon closer inspection, one sees that the paper's titular claim, reiterated in the abstract ("the LLM crowd... is not statistically different from the human crowd"), is not at all supported by the study. In the relevant non-preregistered part of the paper, they introduce a notion of equivalence: two sets of forecasters are considered equally good if their Brier scores differ by no more than 0.081.
A difference in Brier scores of ≤.081 may sound small, but what does it mean?
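One way to build intuition (a back-of-the-envelope illustration of ours, not a calculation from the paper): on questions whose outcomes are 50/50, always forecasting 0.5 scores a Brier of 0.25. How far toward the correct side would a forecaster have to lean, on every single question, to be 0.081 better?

```python
# Back-of-the-envelope: how much forecasting skill does a Brier-score
# gap of 0.081 represent? (Our illustration, not the paper's.)
# Baseline: always forecast 0.5 -> Brier score 0.25.

def brier_of_lean(d: float) -> float:
    # Forecasting 0.5 + d on the outcome that actually happens
    # scores (1 - (0.5 + d))**2 = (0.5 - d)**2 on every question.
    return (0.5 - d) ** 2

# Find the smallest lean whose Brier score beats the baseline by 0.081.
d = 0.0
while 0.25 - brier_of_lean(d) < 0.081:
    d += 0.001
print(round(0.5 + d, 3))  # ≈ 0.589
```

So under this toy model, the equivalence bound lumps together a completely uninformed forecaster and one who consistently leans about 9 percentage points toward the correct answer: a very real skill gap.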
Approaching Human-Level Forecasting with Language Models (Halawi et al., 2024)
This paper is of high quality and by far the best paper out of these four. The methodology looks serious and they implement a non-trivial model with information retrieval (IR).
Our main contention is that the title and conclusions risk leaving the reader with a misleading impression. The abstract reads:
In the paper, they (correctly) state that a difference of .02 in Brier score is a large margin:
However, later on they summarize their main findings:
So the main claim might as well read “There is still a large margin between human-level forecasting and forecasting with LLMs.” These are the main results (note that accuracy, in contrast to the Brier score, is not a proper scoring rule):
Overall, the differences are substantial. This result should not be very surprising, since IR is genuinely hard and the example they show on page 25 just isn’t there yet: it ends up finding links to YouTube and random users’ tweets.
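To illustrate the parenthetical point above that accuracy is not a proper scoring rule, a quick sketch: accuracy only checks which side of 0.5 a forecast lands on, so it gives no credit for calibration and no penalty for overconfidence.

```python
# Accuracy vs. Brier score: accuracy cannot distinguish an honest 0.6
# from an overconfident 0.99, so it does not reward honest probabilities.

def accuracy(p: float, outcome: int) -> int:
    """1 if the forecast is on the correct side of 0.5, else 0."""
    return int((p >= 0.5) == bool(outcome))

def brier(p: float, outcome: int) -> float:
    return (p - outcome) ** 2

# Suppose the event occurs: both forecasts get full accuracy credit...
print(accuracy(0.6, 1), accuracy(0.99, 1))  # 1 1
# ...but the Brier score rewards the forecast closer to the truth.
print(brier(0.6, 1), brier(0.99, 1))
```

This is why comparisons reported in terms of accuracy alone can hide meaningful differences in forecasting skill.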
Reasoning and Tools for Human-Level Forecasting (Hsieh et al., 2024)
The standard for "human-level forecasting" in this paper is quite low. To create their dataset, the authors gathered questions from Manifold on April 15, 2024, and filtered for those resolving within two weeks. It's likely that this yielded many low-volume markets, making the baseline rather weak. Also, there's evidence to suggest that Manifold in general is not the strongest human forecasting baseline: In one investigation from 2023, Metaculus noticeably outperformed Manifold in a direct comparison on the same set of questions.
And there's a further methodological issue. The authors compare Manifold predictions from April 15, 2024 to LLM predictions from an unspecified later date, when more information was available. They try to mitigate this using Google's date range feature, but this feature is known to be unreliable.
Looking at a sample reasoning trace (page 7ff) also raises suspicions. It looks like their agent tries various approaches: base rates, numerical simulations based on historical volatility, and judgemental adjustments. But both the base rate and the numerical simulations are completely hallucinated, since their IR did not manage to find relevant data. (As pointed out above, good IR is a genuinely hard problem!)
It seems unlikely that a system relying on hallucinated base rates and numerical simulations goes all the way to outperforming (half-decent) human forecasters in any meaningful way.
LLMs Are Superhuman Forecasters (Phan et al., 2024)
Unlike (Halawi et al., 2024) and (Hsieh et al., 2024), they implicitly make the claim that no agent is needed for superhuman performance. Instead, two GPT-4o prompts with the most basic IR suffice.
There is a lot of pushback online, e.g. in the comment section of a related market (Will there be substantive issues with Safe AI’s claim to forecast better than the Metaculus crowd, found before 2025?) and on LessWrong. The main problems seem to be as follows:
Their results don’t seem to replicate on another set of questions (per Halawi). There is also some empirical evidence that the system doesn’t give good forecasts.
There is also data contamination:
In addition, they only manage to beat the human crowd after applying some post-processing:
Maybe a fair criterion for judging "superhuman performance" could be "would you also beat the crowd if you applied the same post-processing to the human forecasts?"
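For concreteness, a common post-processing step of this kind is extremization, i.e. pushing the aggregate away from 0.5. A toy version (our sketch, not necessarily the exact transform used by Phan et al.):

```python
# Toy extremization of a probability forecast: push p away from 0.5.
# (A sketch of a common post-processing step, not necessarily the
# exact transform used by Phan et al.)

def extremize(p: float, a: float = 2.0) -> float:
    """Map p to p^a / (p^a + (1-p)^a); a > 1 pushes away from 0.5."""
    return p ** a / (p ** a + (1 - p) ** a)

print(extremize(0.7))  # 0.49 / (0.49 + 0.09) ≈ 0.845
print(extremize(0.5))  # unchanged: 0.5
```

Since extremization tends to improve well-calibrated crowd forecasts too, applying it to only one side of the comparison is exactly the kind of asymmetry the criterion above is meant to rule out.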
Takeaways
All of the above appear to require significant engineering effort and extensive LLM scaffolding.
Simply throwing a ReAct agent (or another scaffolding method) at the problem and leaving the LLM to fend for itself is not enough with current LLMs.
Even a well-engineered effort, such as that from Halawi et al., produces chains of reasoning that often lag behind human forecasters, and fall far short of expert forecasting performance.
So how good are AI forecasters?
This remains to be seen. But taking it all together: from these papers, especially Halawi et al.; FutureSearch's preliminary (but not paper-quality rigorous) evals; the current Metaculus benchmarking tournament; and anecdotal evidence, we are fairly confident that
References
Halawi, D., Zhang, F., Yueh-Han, C., & Steinhardt, J. (2024, February 28). Approaching Human-Level Forecasting with Language Models. arXiv. https://arxiv.org/pdf/2402.18563
Hsieh, E., Fu, P., & Chen, J. (2024, August 21). Reasoning and Tools for Human-Level Forecasting. arXiv. https://www.arxiv.org/pdf/2408.12036
Phan, L., Khoja, A., Mazeika, M., & Hendrycks, D. (2024, September). LLMs Are Superhuman Forecasters. https://drive.google.com/file/d/1Tc_xY1NM-US4mZ4OpzxrpTudyo1W4KsE/view
Schoenegger, P., Park, P., Tuminauskaite, I., & Tetlock, P. (2024, July 22). Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy. arXiv. https://arxiv.org/pdf/2402.19379
Edited Sept 12, 2024 to remove a claim that Phan et al. compared their results to the average of five random forecasts rather than the Metaculus community prediction.
Edited Sept 16, 2024 to clarify that Schoenegger et al.'s aggregate forecast will usually have no IR as it is the median over 12 models, 9 of which do not have access to the internet, instead of categorically ruling out IR.
You could of course be even stricter than that, requiring forecasters to consistently beat any human or combination of humans. But that's hard to measure so we think what we proposed is a reasonable definition. You could also include financial markets. But traders already use a lot of computers and people who can reliably beat the markets usually have better things to do than writing academic papers...