Can GPT-3 predict real-world events? To answer this question, I had GPT-3 estimate the likelihood of every binary question ever resolved on Metaculus.

Predicting whether an event is likely or unlikely to occur often boils down to using common sense. It doesn't take a genius to figure out that "Will the sun explode tomorrow?" should get a low probability. Not all questions are that easy, but for many questions common sense can take us surprisingly far.

Experimental setup

Through their API I downloaded every binary question posed on Metaculus.
I then filtered them down to only the non-ambiguously resolved questions, resulting in this list of 788 questions.
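The filtering step could look roughly like the sketch below. The field names `type` and `resolution`, and the encoding of ambiguous resolutions as `-1`, are assumptions made for illustration and are not taken from the Metaculus API schema; the sample records are invented.

```python
from typing import Any


def is_cleanly_resolved_binary(question: dict[str, Any]) -> bool:
    """Keep binary questions that resolved to exactly yes (1) or no (0)."""
    # "type" and "resolution" are hypothetical field names, not the real schema.
    return question.get("type") == "binary" and question.get("resolution") in (0, 1)


# Invented sample records, standing in for the downloaded questions.
questions = [
    {"title": "A", "type": "binary", "resolution": 1},
    {"title": "B", "type": "binary", "resolution": None},    # not yet resolved
    {"title": "C", "type": "binary", "resolution": -1},      # ambiguous (assumed encoding)
    {"title": "D", "type": "continuous", "resolution": 0.7},  # not a binary question
    {"title": "E", "type": "binary", "resolution": 0},
]

kept = [q for q in questions if is_cleanly_resolved_binary(q)]
print([q["title"] for q in kept])
```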

For these questions the community's Mean Squared Error was 0.19, a good deal better than random!
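To make "random" concrete, here is a small sketch of the Mean Squared Error between predicted probabilities and 0/1 outcomes, together with two baselines. Always answering 50% scores exactly 0.25, while drawing each prediction uniformly from 0 to 1 scores about 1/3 in expectation. The outcomes below are simulated, not Metaculus data, and the function name is my own.

```python
import random


def mean_squared_error(predictions, outcomes):
    """Average squared gap between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(predictions, outcomes)) / len(outcomes)


# Baseline 1: always answer 50%. Every error is 0.5, so the MSE is 0.25.
toy_outcomes = [1, 0, 0, 1]
always_half_mse = mean_squared_error([0.5] * len(toy_outcomes), toy_outcomes)

# Baseline 2: draw each prediction uniformly at random from [0, 1).
rng = random.Random(0)  # fixed seed so the simulation is repeatable
n = 100_000
simulated_outcomes = [rng.randint(0, 1) for _ in range(n)]
uniform_guesses = [rng.random() for _ in range(n)]
uniform_mse = mean_squared_error(uniform_guesses, simulated_outcomes)

print(always_half_mse)        # exactly 0.25
print(round(uniform_mse, 2))  # close to 1/3
```

The uniform baseline is the one that matches a score of roughly 0.33; a forecaster who simply said 50% every time would do better than that.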

Prompt engineering

GPT's performance is notoriously dependent on the prompt it is given.

  • I primarily measured the quality of a prompt by the percentage of legible predictions it produced.
  • Predictions were made using the most powerful DaVinci engine.

The best performing prompt was optimized for brevity and did not include the question's full description.

A very knowledgable and epistemically modest analyst gives the following events a likelihood of occuring:

Event: Will the cost of sequencing a human genome fall below $500 by mid 2016? 
Likelihood: 43%

Event: Will Russia invade Ukrainian territory in 2022?
Likelihood: 64%

Event: Will the US rejoin the Iran Nuclear Deal before 2023?
Likelihood: 55%

Event: <Question to be predicted>
Likelihood: <GPT-3 insertion>

I tried many variations: different introductions, different example questions, different probabilities, including or excluding question descriptions, etc.
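Assembling the few-shot prompt shown above for each question can be sketched as follows. The header and example events are copied verbatim from the prompt (spelling included, since that is the text the model saw); the names `build_prompt` and `FEW_SHOT_EXAMPLES` are mine, and this is not necessarily how the linked repository does it.

```python
PROMPT_HEADER = (
    "A very knowledgable and epistemically modest analyst gives the "
    "following events a likelihood of occuring:"
)

# (event, likelihood in percent) pairs, copied from the prompt above.
FEW_SHOT_EXAMPLES = [
    ("Will the cost of sequencing a human genome fall below $500 by mid 2016?", 43),
    ("Will Russia invade Ukrainian territory in 2022?", 64),
    ("Will the US rejoin the Iran Nuclear Deal before 2023?", 55),
]


def build_prompt(question: str) -> str:
    """Return the few-shot prompt, ending right where the model should answer."""
    parts = [PROMPT_HEADER, ""]
    for event, percent in FEW_SHOT_EXAMPLES:
        parts += [f"Event: {event}", f"Likelihood: {percent}%", ""]
    parts += [f"Event: {question}", "Likelihood:"]
    return "\n".join(parts)


print(build_prompt("Will the sun explode tomorrow?"))
```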

Of the 786 questions, the best performing prompt made legible predictions for 770. For the remaining 16 questions GPT mostly just wrote "\n".
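One way to decide whether a completion counts as a "legible" prediction is to look for a percentage at the start of the text. The sketch below is an assumption about what legibility could mean in practice, not the check used in the repository; `parse_likelihood` is a hypothetical helper and the sample completions are invented.

```python
import re
from typing import Optional

# A number followed by a percent sign at the start of the completion.
_PERCENT = re.compile(r"^\s*(\d{1,3}(?:\.\d+)?)\s*%")


def parse_likelihood(completion: str) -> Optional[float]:
    """Return a probability in [0, 1], or None if the completion is illegible."""
    match = _PERCENT.match(completion)
    if match is None:
        return None
    value = float(match.group(1))
    if not 0 <= value <= 100:
        return None
    return value / 100


# Invented completions, including the bare newline mentioned above.
completions = [" 43%\n", "\n", " 7.5%", " 150%", " quite likely"]
parsed = [parse_likelihood(c) for c in completions]
legible_share = sum(p is not None for p in parsed) / len(parsed)
print(parsed, legible_share)
```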

If you want to try your own prompt or reproduce the results, the code to do so can be found in this GitHub repository.

Results

GPT-3's MSE was 0.33, which is about what you'd expect if you were to guess completely at random. This was surprising to me!

Why isn't GPT better?

Going into this, I was confident GPT would do better than random. After all, many of the questions it was asked to predict resolved before GPT-3 was even trained. There are probably some questions it knows the answer to and still somehow gets wrong!

It seems to me that GPT-3 is struggling to translate beliefs into probabilities. Even if it understands that the sun exploding tomorrow is unlikely, it doesn't know how to express that as a numeric probability. I'm unsure whether this is an inherent limitation of GPT-3 or whether it's just the prompt that is confusing it.

I wonder if predicting using expressions such as "Likely" | "Uncertain" | "Unlikely", and interpreting these as 75% | 50% | 25% respectively could produce results better than random, as GPT wouldn't have to struggle with translating its beliefs into numeric probabilities. Unfortunately running GPT-3's best engine on 800 questions would be yet another hour and $20 I'm reluctant to spend, so for now that will remain a mystery.
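The verbal variant would only need a small lookup from the first word of the completion to a probability. This is an untested sketch of that idea, not an experiment that was run; `verbal_to_probability` is a hypothetical helper and the example completions are invented.

```python
from typing import Optional

# The mapping proposed above: Likely | Uncertain | Unlikely -> 75% | 50% | 25%.
VERBAL_TO_PROBABILITY = {"likely": 0.75, "uncertain": 0.50, "unlikely": 0.25}


def verbal_to_probability(completion: str) -> Optional[float]:
    """Map the first word of a completion to a probability, or None if unknown."""
    words = completion.strip().lower().split()
    if not words:
        return None
    first_word = words[0].strip(".,!")  # tolerate trailing punctuation
    return VERBAL_TO_PROBABILITY.get(first_word)


print(verbal_to_probability(" Likely\n"))   # 0.75
print(verbal_to_probability("Unlikely."))   # 0.25
print(verbal_to_probability("\n"))          # None
```

With this mapping a correct "Likely" or "Unlikely" costs 0.0625 in squared error and a wrong one costs 0.5625, so if GPT never answered "Uncertain" it would need to be right more than 62.5% of the time to beat the 0.25 of always saying 50%.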

 

It may be that even oracle AIs will be dangerous. Fortunately, GPT-3 is far from an oracle!

Ruby

>Unfortunately running GPT-3's best engine on 800 questions would be yet another hour and $20 I'm reluctant to spend, so for now that will remain a mystery

If you want to run more experiments like this, happy to get you a few more $$. I'd pay for it.

Suggested variation, which I'd expect to lead to better results: use raw "completion probabilities" for different answers.

E.g. with the prompt "Will Russia invade Ukrainian territory in 2022?", extract the completion likelihoods of the next tokens "Yes" and "No". Normalize.
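The normalization step can be sketched like this, assuming you have already obtained the log-probabilities the model assigns to "Yes" and "No" as the next token. How those are retrieved depends on the API and is not shown; the input values below are made up.

```python
import math


def yes_probability(logprob_yes: float, logprob_no: float) -> float:
    """Renormalize two token log-probabilities so P(Yes) + P(No) = 1."""
    # Equivalent to exp(yes) / (exp(yes) + exp(no)), written in a form that
    # stays stable for very negative log-probabilities.
    return 1.0 / (1.0 + math.exp(logprob_no - logprob_yes))


# Made-up example: the model puts 30% on "Yes", 10% on "No", and the
# remaining 60% on other tokens, which the normalization discards.
p_yes = yes_probability(math.log(0.3), math.log(0.1))
print(round(p_yes, 2))
```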

(The prompt you use asks for quite an unnatural mapping, at which most untrained humans are pretty bad as well.)

Hi,

that is super fascinating. Would you be willing to share your data on this? I have some existing interest in exactly that question.

Hi Niplav, happy to hear you think that.

I just uploaded the pkl files to GitHub. They contain the pandas dataframes for the Metaculus questions and GPT's completions for the best performing prompt. Let me know if you need anything else :)

https://github.com/MperorM/gpt3-metaculus

Thanks a lot!

Rerunning with the https://arxiv.org/abs/2205.14334 finetuned model would be worthwhile.

Thanks for trying! I don't think that's much evidence against GPT-3 being a good oracle though, because to me it's pretty normal that without fine-tuning it's not able to forecast. It would need to be extremely sample efficient to be able to do that. Does anyone want to try fine-tuning?

I was experimenting with exactly the same thing using GPT-4, on only the top 20 questions. The result was more or less equal to the community's.

But when it came to numeric estimates like "how many people will die due to covid", it outperformed humans, giving highly accurate predictions (I was asking not for values but for ranges with quartiles, and the humans' results had much higher dispersion than GPT's).

Also, I only compared against community predictions from January 2022, since GPT-4 was trained on data up to Sept 2021, and it's unfair to compare predictions if people had a lot of additional evidence that GPT doesn't have.

If it's of interest, I can share the methodology, results and dataset.

Can GPT-3 predict outcomes based on comments and the names of the commenters?