Contra papers claiming superhuman AI forecasting

nikos; Peter Mühlbacher; Lawrence Phillips; dschwarz

[Conflict of interest disclaimer: We are FutureSearch, a company working on AI-powered forecasting and other types of quantitative reasoning. If thin LLM wrappers could achieve superhuman forecasting performance, this would obsolete a lot of our work.]

Widespread, misleading claims about AI forecasting

Recently we have seen a number of papers – (Schoenegger et al., 2024, Halawi et al., 2024, Phan et al., 2024, Hsieh et al., 2024) – with claims that boil down to “we built an LLM-powered forecaster that rivals human forecasters or even shows superhuman performance”.

These papers do not communicate their results carefully enough, shaping public perception in inaccurate and misleading ways. Some examples of public discourse:

Ethan Mollick (>200k followers) tweeted the following about the paper Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy by Schoenegger et al.:

A post on Marginal Revolution with the title and abstract of the paper Approaching Human-Level Forecasting with Language Models by Halawi et al. elicits responses like

"This is something that humans are notably terrible at, even if they're paid to do it. No surprise that LLMs can match us."
"+1 The aggregate human success rate is a pretty low bar"

A Twitter thread on LLMs Are Superhuman Forecasters by Phan et al., claiming that “AI […] can predict the future at a superhuman level”, had more than half a million views within two days of being published.

The number of such papers on AI forecasting, and the vast amount of traffic on misleading claims, make AI forecasting a uniquely misunderstood area of AI progress. And it’s one that matters.

What does human-level or superhuman forecasting mean?

"Human-level" or "superhuman" is a hard-to-define concept. In an academic context, we need to work with a reasonable operationalization to compare the skill of an AI forecaster with that of humans.

One reasonable and practical definition of a superhuman AI forecaster is

The AI forecaster is able to consistently outperform the crowd forecast on a sufficiently large number of randomly selected questions on a high-quality forecasting platform.^[1]

(For a human-level forecaster, just replace "outperform" with "perform on par with".)

Red flags for claims to (super)human AI forecasting accuracy

Our experience suggests there are a number of things that can go wrong when building AI forecasting systems, including:

1. Failing to find up-to-date information on the questions. It’s inconceivable on most questions that forecasts can be good without basic information.
   - Imagine trying to forecast the US presidential election without knowing that Biden dropped out.
2. Drawing on up-to-date, but low-quality information. Ample experience shows low-quality information confuses LLMs even more than it confuses humans.
   - Imagine forecasting election outcomes with biased polling data.
   - Or, worse, imagine forecasting OpenAI revenue based on claims like
     > The number of ChatGPT Plus subscribers is estimated between 230,000-250,000 as of October 2023.
     without realising that this is mixing up ChatGPT and ChatGPT mobile.
3. Lack of high-quality quantitative reasoning. For a decent number of questions on Metaculus, good forecasts can be “vibed” by skilled humans and perhaps LLMs. But for many questions, simple calculations are likely essential. Human performance shows systematic accuracy nearly always requires simple models such as base rates, time-series extrapolations, and domain-specific numbers.
   - Imagine forecasting stock prices without having, and using, historical volatility.
4. Retrospective, rather than prospective, forecasting (e.g. forecasting questions that have already resolved). The risk of data about the present leaking into the forecast, either via the LLMs or via the information used in the forecast, is extremely hard to stamp out.

Points 1 and 2 could also be summarised as "not being good (enough) at information retrieval (IR)". We believe that "being good at IR" is both

necessary for being good at forecasting (and thus)
easier than being good at forecasting.

So if an agent fails at the IR stage, even the smartest and the most rational entity will struggle to turn this into a good forecast. This is basically just a roundabout way of saying GIGO.

A similar argument can be made for quantitative reasoning being important.
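To make the stock-price example concrete, here is a minimal sketch of the kind of simple quantitative model meant above. It assumes a driftless lognormal random walk, and the function name `prob_price_above` and the price series are hypothetical, chosen only for illustration. It is not taken from any of the papers discussed.

```python
import math
from statistics import NormalDist, stdev


def prob_price_above(prices, threshold, horizon_days):
    """P(price > threshold after horizon_days), assuming a driftless lognormal walk."""
    # Daily log returns computed from the historical closing prices.
    log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    daily_vol = stdev(log_returns)
    # Volatility scales with the square root of the horizon under this assumption.
    horizon_vol = daily_vol * math.sqrt(horizon_days)
    # Distance to the threshold in log space, in units of horizon volatility.
    z = math.log(threshold / prices[-1]) / horizon_vol
    return 1.0 - NormalDist().cdf(z)


# Hypothetical closing prices, for illustration only.
prices = [100.0, 101.5, 99.8, 102.2, 101.0, 103.1, 102.4, 104.0]
p_same = prob_price_above(prices, threshold=prices[-1], horizon_days=30)
p_up10_30d = prob_price_above(prices, threshold=prices[-1] * 1.10, horizon_days=30)
p_up10_90d = prob_price_above(prices, threshold=prices[-1] * 1.10, horizon_days=90)
```

The point is not that this particular model is right, but that the inputs (the historical prices) have to be retrieved rather than invented. A system that hallucinates `daily_vol` produces a confident-looking number with nothing behind it.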

In the following, we go through issues with the papers in detail.

Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy (Schoenegger et al., 2024)

A quick glance over the paper shows a couple of suspicious points:

The architectures tested have virtually no information retrieval (IR). More precisely, 9 out of 12 LLMs (over whose predictions they take the median to obtain the final forecast) have no IR whatsoever and the 3 remaining ones have ChatGPT-like access to the internet when generating their forecast in response to a single, static prompt. (When we tried their prompt in ChatGPT with a question like "Will Israel and Hamas make peace before the end of the year?", GPT-4o didn't even check whether they have already made peace.)
Hence the aggregate forecast will usually not be aware of any recent developments that aren't already in the LLMs' memories.
The authors only looked at n=31 questions. But you need quite a large number of forecasts/resolved questions to accurately determine whether forecaster A is better than forecaster B (see e.g. this post).
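The sample-size concern can be illustrated with a toy simulation. This is a sketch under assumptions of our own choosing, not the analysis from the linked post: one forecaster reports the true probability, the other reports it with Gaussian noise (standard deviation 0.15, an arbitrary value), and we count how often the better forecaster actually ends up with the lower mean Brier score.

```python
import random


def win_rate(n_questions, trials=1000, noise=0.15, seed=0):
    """Share of simulated tournaments in which the calibrated forecaster
    has a strictly lower mean Brier score than the noisy one."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        diff = 0.0  # sum over questions of Brier(noisy) - Brier(calibrated)
        for _ in range(n_questions):
            p = rng.uniform(0.05, 0.95)  # true probability of the event
            outcome = 1.0 if rng.random() < p else 0.0
            # Noisy forecast, clipped to stay a valid probability.
            q = min(0.99, max(0.01, p + rng.gauss(0.0, noise)))
            diff += (q - outcome) ** 2 - (p - outcome) ** 2
        if diff > 0:
            wins += 1
    return wins / trials


small = win_rate(31)    # tournament size used in the paper
large = win_rate(1000)  # a much larger hypothetical tournament
```

In this toy setup the clearly better forecaster still loses a noticeable share of the 31-question tournaments, while it almost never loses the 1000-question ones. The exact shares depend entirely on the assumed noise level and question mix.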

And indeed, upon a closer look, one sees that the paper's titular claim, reiterated in the abstract ("the LLM crowd... is not statistically different from the human crowd") is not at all supported by the study: In the relevant non-preregistered part of the paper, they introduce a notion of equivalence: Two sets of forecasters are equally good if their Brier scores differ by no more than 0.081.

A difference in Brier scores of ≤.081 may sound small, but what does it mean?

The human aggregate in the study (avg. Brier of .19) would, according to this definition, count as equivalent to a forecaster who has a Brier score of ≤ 0.271 (= .19 + .081). In their study, the human aggregate would e.g. count as equivalent to a forecaster who always predicts 50% (resulting in a Brier score of .25).
- In particular, this notion of equivalence is incompatible with their pre-registered result refuting their Null hypothesis 1, Study 1 (p3).
Being omniscient (i.e. knowing all the answers in advance, getting a Brier score of 0) would be equivalent to predicting ≈72% for every true and ≈28% for every false outcome (getting a Brier score of .081).
Tetlock's claims about Superforecasters would be invalidated because Superforecaster aggregates (avg. Brier of .146) would be equivalent to aggregates from all GJO participants (avg. Brier of .195).
- (Numbers taken from this GJO (a Tetlock-led organisation) white paper: Superforecasters: A Decade of Stochastic Dominance.)
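The arithmetic behind these three consequences can be checked in a few lines. This is a minimal sketch: the helper names `brier` and `equivalent` and the list of outcomes are ours, while the scores .19, .146, .195 and the margin .081 are the figures quoted above.

```python
def brier(forecasts, outcomes):
    """Mean squared error between probability forecasts and binary outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)


MARGIN = 0.081  # equivalence margin quoted above


def equivalent(score_a, score_b, margin=MARGIN):
    """True if two Brier scores count as 'equivalent' under the margin."""
    return abs(score_a - score_b) <= margin


outcomes = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical resolutions, for illustration

always_50 = brier([0.5] * len(outcomes), outcomes)               # 0.25 for any outcomes
omniscient = brier([float(o) for o in outcomes], outcomes)       # 0.0
hedger = brier([0.72 if o else 0.28 for o in outcomes], outcomes)  # 0.28 ** 2

human_crowd = 0.19        # human aggregate in the study
superforecasters = 0.146  # from the GJO white paper
all_gjo = 0.195           # from the GJO white paper

checks = {
    "crowd vs always-50%": equivalent(human_crowd, always_50),
    "omniscient vs 72/28 hedger": equivalent(omniscient, hedger),
    "superforecasters vs all GJO": equivalent(superforecasters, all_gjo),
}
```

All three entries of `checks` come out `True`, which is exactly the problem: the margin is wide enough to erase differences that the forecasting literature treats as large.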

Approaching Human-Level Forecasting with Language Models (Halawi et al., 2024)

This paper is of high quality and by far the best paper out of these four. The methodology looks serious and they implement a non-trivial model with information retrieval (IR).

Our main contention is that the title and conclusions risk leaving the reader with a misleading impression. The abstract reads:

> On average, the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it.

In the paper, they (correctly) state that a difference of .02 in Brier score is a large margin:

> Only the GPT-4 and Claude-2 series beat the unskilled baseline by a large margin (> .02)

However, later on they summarize their main findings

> As the main result, our averaged Brier score is .179, while the crowd achieves .149, resulting in a difference of .03.

So the main claim might as well read “There is still a large margin between human-level forecasting and forecasting with LLMs.” These are the main results (note that accuracy, in contrast to the Brier score, is not a proper scoring rule):

Overall, differences are substantial. This result should not be very surprising since IR is genuinely hard and the example they show on page 25 just isn’t there yet: It just ends up finding links to Youtube and random users’ Tweets.
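The parenthetical remark above, that accuracy is not a proper scoring rule, can be made concrete with a small sketch. We assume that accuracy is computed by thresholding each forecast at 50%, and the function names are ours. A forecaster who believes an event has probability 0.7 minimises their expected Brier score by reporting 0.7, but gets the same expected accuracy from reporting 0.51 or 0.99.

```python
def expected_brier(report, belief):
    """Expected Brier score when the event occurs with probability `belief`."""
    return belief * (1 - report) ** 2 + (1 - belief) * report ** 2


def expected_accuracy(report, belief):
    """Expected accuracy, assuming a report above 0.5 counts as a 'yes' call."""
    return belief if report > 0.5 else 1 - belief


belief = 0.7
grid = [i / 100 for i in range(1, 100)]

# The report that minimises expected Brier score is the true belief.
best_brier_report = min(grid, key=lambda r: expected_brier(r, belief))

# Expected accuracy is identical for any report on the same side of 50%.
accuracy_at = {r: expected_accuracy(r, belief) for r in (0.51, 0.70, 0.99)}
```

So accuracy cannot distinguish a well-calibrated forecaster from a wildly overconfident one, which is why the Brier score comparison is the one that matters here.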

Reasoning and Tools for Human-Level Forecasting (Hsieh et al., 2024)

The standard for "human-level forecasting" in this paper is quite low. To create their dataset, the authors gathered questions from Manifold on April 15, 2024, and filtered for those resolving within two weeks. It's likely that this yielded many low-volume markets, making the baseline rather weak. Also, there's evidence to suggest that Manifold in general is not the strongest human forecasting baseline: In one investigation from 2023, Metaculus noticeably outperformed Manifold in a direct comparison on the same set of questions.

And there's a further methodological issue. The authors compare Manifold predictions from April 15, 2024 to LLM predictions from an unspecified later date, when more information was available. They try to mitigate this using Google's date range feature, but this feature is known to be unreliable.

Looking at a sample reasoning trace (page 7ff) also raises suspicions. It looks like their agent tries various approaches: base rates, numerical simulations based on historical volatility, and judgemental adjustments. But both the base rate and the numerical simulations are completely hallucinated since their IR did not manage to find relevant data. (As pointed out above, good IR is a genuinely hard problem!)

It seems unlikely that a system relying on hallucinated base rates and numerical simulations goes all the way to outperforming (half-decent) human forecasters in any meaningful way.

LLMs Are Superhuman Forecasters (Phan et al., 2024)

Unlike (Halawi et al., 2024) and (Hsieh et al., 2024), they implicitly make the claim that no agent is needed for superhuman performance. Instead, two GPT-4o prompts with the most basic IR suffice.

There is a lot of pushback online, e.g. in the comment section of a related market (Will there be substantive issues with Safe AI’s claim to forecast better than the Metaculus crowd, found before 2025?) and on LessWrong. The main problems seem to be as follows:

Their results don’t seem to replicate on another set of questions (per Halawi). There is also some empirical evidence that the system doesn't seem to give good forecasts.

There is also data contamination:

misunderstandings about cutoff dates for GPT-4o and/or
data leakage from IR since determining when an article was last modified appears to depend on Google’s indexing (again, known to be faulty) and on this regex-based script
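For contrast, here is a sketch of what a deliberately conservative date filter for retrospective evaluation could look like. It is not the script used by Phan et al.; the `Document` type, its fields, and the URLs are hypothetical. The policy is to drop any retrieved document whose dates are unknown or not strictly before the forecast date.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Document:
    url: str
    published: Optional[date]      # None if no trustworthy date could be established
    last_modified: Optional[date]  # None if unknown


def safe_for_backtest(doc: Document, forecast_date: date) -> bool:
    """Keep a document only if all of its dates are known and predate the forecast."""
    if doc.published is None or doc.last_modified is None:
        return False  # unknown dates are treated as potential leakage
    return doc.published < forecast_date and doc.last_modified < forecast_date


# Hypothetical retrieval results, for illustration only.
docs = [
    Document("https://example.com/a", date(2024, 3, 1), date(2024, 3, 2)),
    Document("https://example.com/b", date(2024, 3, 1), date(2024, 5, 20)),  # edited later
    Document("https://example.com/c", date(2024, 3, 1), None),               # unknown edit date
]
kept = [d.url for d in docs if safe_for_backtest(d, date(2024, 4, 15))]
```

Even a filter like this only helps if the dates themselves are reliable, and it does nothing about knowledge the model already has from training. That is part of why retrospective evaluation is so hard to get right.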

In addition, they only manage to beat the human crowd after applying some post-processing:

Maybe a fair criterion for judging "superhuman performance" could be "would you also beat the crowd if you applied the same post-processing to the human forecasts?"

Takeaways

Basic information retrieval is a hard problem. (See also our paper here.)
Advanced information retrieval, e.g. getting LLM-based systems to find high-quality relevant data without being thrown off by all the low-quality information, is a hard problem.
Getting LLM-based systems to work out simple quantitative reasoning chains (e.g. base rates), instead of just hallucinating them, is genuinely hard.

All of the above appear to require significant engineering effort and extensive LLM scaffolding.

Simply throwing a ReAct agent (or another scaffolding method) at the problem and leaving the LLM to fend for itself is not enough with current LLMs.

Even a well-engineered effort, such as that from Halawi et al., produces chains of reasoning that often lag behind human forecasters, and fall far short of expert forecasting performance.

So how good are AI forecasters?

This remains to be seen. But taking it all together: from these papers, especially Halawi et al.; FutureSearch's preliminary (but not paper-quality rigorous) evals; the current Metaculus benchmarking tournament; and anecdotal evidence, we are fairly confident that

Today's autonomous AI forecasting can be better than average, or even experienced, human forecasters,
But it's very unlikely that any autonomous AI forecaster yet built is close to the accuracy of a top 2% Metaculus forecaster, or the crowd.

References

Halawi, D., Zhang, F., Yueh-Han, C., & Steinhardt, J. (2024, February 28). Approaching Human-Level Forecasting with Language Models. arXiv. https://arxiv.org/pdf/2402.18563

Hsieh, E., Fu, P., & Chen, J. (2024, August 21). Reasoning and Tools for Human-Level Forecasting. arXiv. https://www.arxiv.org/pdf/2408.12036

Phan, L., Khoja, A., Mazeika, M., & Hendrycks, D. (2024, September). LLMs Are Superhuman Forecasters. https://drive.google.com/file/d/1Tc_xY1NM-US4mZ4OpzxrpTudyo1W4KsE/view

Schoenegger, P., Park, P., Tuminauskaite, I., & Tetlock, P. (2024, July 22). Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy. arXiv. https://arxiv.org/pdf/2402.19379

Edited Sept 12, 2024 to remove a claim that Phan et al. compared their results to the average of five random forecasts rather than the Metaculus community prediction.

Edited Sept 16, 2024 to clarify that Schoenegger et al.'s aggregate forecast will usually have no IR as it is the median over 12 models, 9 of which do not have access to the internet, instead of categorically ruling out IR.

^[1]
You could of course be even stricter than that, requiring forecasters to consistently beat any human or combination of humans. But that's hard to measure so we think what we proposed is a reasonable definition. You could also include financial markets. But traders already use a lot of computers and people who can reliably beat the markets usually have better things to do than writing academic papers...

[-]Neel Nanda6d153

Thanks for the post!

> sample five random users’ forecasts, score them, and then average

Are you sure this is how their bot works? I read this more as "sample five things from the LLM, and average those predictions". For Metaculus, the crowd is just given to you, right, so it seems crazy to sample users?

[-]Lawrence Phillips6d200

Thanks Neel, we agree that we misinterpreted this. We've removed the claim.

[-]Neel Nanda5d30

Thanks for making the correction!

[-]habryka6d94

Yeah, I also think you misunderstood that part of the paper (though it really is very ambiguously written). My best guess is they are saying that they are averaging their performance metrics over 5 forecasts.

[-]Daniel Kokotajlo2d100

I'm curious what you guys think about the opportunities to train LLMs to make forecasts. E.g. Google is rumored to have snapshots of the whole internet going back many years; they could train a big new model on pre-2014 data and then have it make forecasts of the next decade and then have those forecasts automatically scored and trained on. If you guys got acquired by Google and given 10e26 flop compute budget, how would you use it?

[-]Lawrence Phillips1d30

We'd probably try something along the lines you're suggesting, but there are some interesting technical challenges to think through.

For example, we'd want to train the model to be good at predicting the future, not just knowing what happened. Under a naive implementation, weight updates would probably go partly towards better judgment and forecasting ability, but also partly towards knowing how the world played out after the initial training cutoff.

There are also questions around IR; it seems likely that models will need external retrieval mechanisms to forecast well for the next few years at least, and we'd want to train something that's natively good at using retrieval tools to forecast, rather than relying purely on its crystalised knowledge.

[-]Daniel Kokotajlo5h20

Yeah I imagine it would partly just learn facts about what happened - but as long as it also partly learns general forecasting skills, that is important and measurable progress. Might be enough to be very useful.

Re: retrieval: yep, I am imagining that being part of the setup and part of what it learns to be good at.

[-]habryka6d42

Thank you for this!

Edit note: I slightly cleaned up some formatting in the post which looked a bit janky on mobile (due to heavily nested bullet lists with images). Feel free to revert.

[-]dschwarz6d41

Thank you for the careful look into data leakage in the other thread! Some of your findings were subtle, and these are very important details.

[-]kqr4d10

> The AI forecaster is able to consistently outperform the crowd forecast on a sufficiently large number of randomly selected questions on a high-quality forecasting platform

Seeing how the crowd forecast routinely performs at a superhuman level itself, isn't it an unfairly high bar to clear? Not invalidating the rest of your arguments – the methodological problems you point out are really bad – but before asking the question about superhuman performance it makes a lot of sense to fully agree on what superhuman performance really is.

(I also note that a high-quality forecasting platform suffers from self-selection by unusually enthusiastic forecasters, bringing up the bar further. However, I don't believe this to be an actual problem because if someone is claiming "performance on par with humans" I would expect that to mean "enthusiastic humans".)

[-]steven04613d40

As I understand it, the Metaculus crowd forecast performs as well as it does (relative to individual predictors) in part because it gives greater weight to more recent predictions. If "superhuman" just means "superhumanly up-to-date on the news", it's less impressive for an AI to reach that level if it's also up-to-date on the news when its predictions are collected. (But to be confident that this point applies, I'd have to know the details of the research better.)

[-]Peter Mühlbacher2d20

I agree it's a high bar, but note that

a few particularly enthusiastic (&smart) humans still perform at roughly this level (depending on how you measure performance), so you wouldn't want it to be much lower, and
we only acknowledged that this is a fairly reasonable definition of superhuman performance—it's authors in these papers who claimed that their models were (roughly) on par with, or better than the crowd forecast.

We made the deliberate choice of not getting too much into the details of what constitutes human-level/superhuman forecasting ability. We have a lot of opinions on this as well, but it is a topic for another post in order not to derail the discussion on what we think matters most here.
