Summary
In 2021, @Daniel Kokotajlo wrote What 2026 Looks Like, in which he sketched a possible version of each year from 2022-2026. In his words:
Given that it's now 2025, I evaluated all of the predictions covering 2022-2024, and subsequently tried to see whether o3-mini could automate the process.
In my opinion, the results are impressive (NB these are the human gradings of his predictions):
Given that the scenarios Daniel described were intended simply as one way things might turn out, rather than as concrete predictions, I was surprised that over half were completely correct, and I think he foresaw the pace of progress remarkably accurately.
Experimenting with o3-mini showed some initial promise, but the results are substantially worse than the human evaluations. I would anticipate a very significant improvement from using a multi-step resolution flow with web search, averaging resolution scores across multiple models, etc.
Methodology
I went through 2022, 2023 and 2024 and noted anything that could reasonably be stated as a meaningful prediction. Not all of them were purely quantitative, nor did I attempt to quantify every prediction - for example, the verbatim “The hype is building” in the 2022 section became “In 2022, the hype is building”. Some other interesting examples:
I resolved them using the same framework for “smartness” that Arb used in their report evaluating the predictions of futurists - namely, correctness (0-2) and difficulty (1-5), with the following scoring:
Correctness:
0 - unambiguously wrong
1 - ambiguous or near miss
2 - unambiguously right
Difficulty:
1 - was already generally known
2 - was expert consensus
3 - speculative but on trend
4 - above trend, or oddly detailed
5 - prescient, no trend to go off
I spent roughly an hour resolving the predictions, and mostly went off my intuition, occasionally using a shallow Google search where I was particularly unsure. I updated several of my correctness scores after sharing the initial evaluation with Eli Lifland (who didn’t carefully review all predictions so doesn’t endorse all scores).
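To make the arithmetic concrete, here is a minimal sketch of how scores under this rubric might be tabulated, assuming smartness is simply correctness multiplied by difficulty (the Arb report's exact weighting may differ); the example records are placeholders rather than the real gradings:

```python
from statistics import mean

# Illustrative only: placeholder records, not the actual graded predictions.
# Each prediction gets a correctness score (0-2) and a difficulty score (1-5).
predictions = [
    {"text": "In 2022, the hype is building", "correctness": 2, "difficulty": 2},
    {"text": "Example speculative prediction", "correctness": 1, "difficulty": 4},
    {"text": "Example missed prediction", "correctness": 0, "difficulty": 3},
]

def smartness(p):
    # Assumed composite: correctness * difficulty, following the Arb-style framing
    # described above (the report's exact weighting may differ).
    return p["correctness"] * p["difficulty"]

avg_correctness = mean(p["correctness"] for p in predictions)
avg_smartness = mean(smartness(p) for p in predictions)
print(f"avg correctness: {avg_correctness:.2f}, avg smartness: {avg_smartness:.2f}")
```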
Results
How accurate are Daniel’s predictions so far?
I think the predictions are generally very impressive. The decrease in accuracy between years is to be expected, particularly given the author’s aim of providing an account of the future they believe is likely to be directionally correct, rather than producing concrete predictions intended to be evaluated in this way. I think the overall pace of progress outlined is accurate.
Based on my subjective application of the scoring system above, the most impressive predictions are the following (bear in mind the post was written in 2021, a year before the release of ChatGPT):
I didn’t think any of the predictions made were especially bad - every prediction that I assigned a ‘0’ for correctness (i.e. unambiguously wrong) had at least a 3 for difficulty (i.e. more speculative than expert consensus on the scoring framework). With that being said, the ones that seemed most off to me were:
I found the average correctness score (0-2) to be 1.43, and the average smartness score (correctness * difficulty, on a 1-5 scale) to be 3.37. For some reference, in the Arb report Isaac Asimov’s average correctness score was 0.82 and his average smartness was 2.50; however, this is a somewhat unfair comparison due to the vastly different timeframes involved (Asimov’s predictions were, on average, 36 years out from the date he made them). If further predictors were evaluated using the same scoring rubric, a more direct comparison between them would be possible.
Skimming the predictions, I have the sense that the ones about specific technical claims are generally more accurate than the ones about broader societal impacts.
Similarly, the predictions are (to my eye) under-optimistic on capabilities.
Can LLMs extract and resolve predictions?
I used o3-mini-2025-01-31 with `reasoning_effort = high` for all model calls. Given the model’s knowledge cutoff is October 2023, I only evaluated it on the 2022 and 2023 predictions.
The input text was split into chunks of <= 1000 tokens, ensuring paragraph boundaries were preserved. The prediction extraction part of the script does two passes: an initial one where the model is asked to extract all the predictions, and a second one where it is given the existing predictions alongside the input text and asked to produce more. I tried various strategies, and this was the one that extracted the number of predictions closest to the manual evaluation.
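A rough sketch of this kind of pipeline (not the exact script: the prompts here are simplified stand-ins for the ones in the appendix, and the tokenizer choice is illustrative) might look like:

```python
import tiktoken
from openai import OpenAI

client = OpenAI()
MODEL = "o3-mini-2025-01-31"
enc = tiktoken.get_encoding("o200k_base")  # assumed tokenizer, used only for token counting

def chunk_paragraphs(text: str, max_tokens: int = 1000) -> list[str]:
    """Greedily pack whole paragraphs into chunks of <= max_tokens."""
    chunks, current, current_len = [], [], 0
    for para in text.split("\n\n"):
        n = len(enc.encode(para))
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        reasoning_effort="high",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def extract_predictions(chunk: str) -> str:
    # Pass 1: extract everything in the chunk that reads as a prediction.
    first = ask(f"List every prediction made in the following text, one per line:\n\n{chunk}")
    # Pass 2: show the model its own list and ask for anything it missed.
    second = ask(
        "Here is a text and the predictions already extracted from it. "
        "List any additional predictions not already covered, one per line.\n\n"
        f"TEXT:\n{chunk}\n\nEXTRACTED:\n{first}"
    )
    return first + "\n" + second
```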
Extraction
LLMs seem promising at extracting predictions from text when compared to a human baseline. I recorded 33 predictions in total; the model recorded 29, of which I judged 23 to be valid. There were two predictions I missed that the model found:
The model does make some unfortunate errors though, with several failure modes including:
NB: in the figure below, there is a significant difference between the additional “LLM only” predictions in 2022 and 2024. In 2022, the two additional predictions were ones that I judged to be valid but had missed in my evaluation. In 2024, the additional predictions the LLM made were not clearly distinct or valid in the same way.
Resolution
Resolution was mediocre with the crude techniques I used, but also showed some promise. The model’s resolutions in this instance were handicapped by not having access to web search, which I did use to help resolve a handful of the predictions. The average correctness score delta was 0.75, which is huge given the score is bounded 0-2. The average difficulty score delta was 0.83, still somewhat large given the score is bounded 1-5.
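Here the delta is, roughly, the mean absolute difference between my score and the model’s for each prediction, as in the sketch below (the paired scores are invented for illustration, not the real data):

```python
from statistics import mean

# Hypothetical paired scores, one (human, model) pair per prediction.
correctness_pairs = [(2, 1), (1, 1), (0, 2), (2, 2)]
difficulty_pairs = [(3, 2), (4, 4), (2, 3), (5, 3)]

def mean_abs_delta(pairs):
    # Average of |human score - model score| across predictions.
    return mean(abs(h - m) for h, m in pairs)

print("correctness delta:", mean_abs_delta(correctness_pairs))
print("difficulty delta:", mean_abs_delta(difficulty_pairs))
```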
I found the rationales the model provided to explain its scores generally compelling. Where we disagreed, it often seemed the model was missing some piece of information that might have been retrievable with web search.
The one exception was when the model quite happily resolved a prediction about 2030. While it would be trivial to implement a basic check to prevent this, it was still interesting to me that the model would simply hallucinate a plausible-sounding answer.
Next Steps
I am seeking feedback on this work, particularly regarding whether I should invest further time into it. The current system for resolving predictions is far too crude; I have started working on an agent-based workflow that can use web search to get closer to human resolution (sketched below). I am cautiously optimistic that such a system could reduce the correctness score delta to ~0.2, which I think would be much more acceptable. I haven’t tested prediction extraction on texts where the predictions are less explicit than in the ones above, and I anticipate needing to spend some time refining that aspect of the system to handle such texts too.
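As a very rough sketch of the shape such a flow could take (entirely illustrative: `web_search` is a hypothetical helper to be backed by whatever search API I end up using, and the prompts are placeholders), the idea is to ask for useful searches, gather the evidence, then score with that evidence in context:

```python
from openai import OpenAI

client = OpenAI()

def web_search(query: str) -> str:
    """Hypothetical helper: plug in any search API here and return a few result snippets."""
    return ""  # placeholder

def resolve(prediction: str, model: str = "o3-mini") -> str:
    # Step 1: ask the model what evidence would help resolve the prediction.
    queries = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content":
            f"What 2-3 web searches would help decide whether this prediction came true?\n{prediction}"}],
    ).choices[0].message.content

    # Step 2: gather evidence for each suggested query.
    evidence = "\n".join(web_search(q) for q in queries.splitlines() if q.strip())

    # Step 3: score with the evidence in context, using the same 0-2 / 1-5 rubric.
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content":
            f"Prediction: {prediction}\nEvidence:\n{evidence}\n"
            "Score correctness (0-2) and difficulty (1-5), with a short rationale."}],
    ).choices[0].message.content
```

Averaging the resulting scores across several models would be a natural extension of the same loop.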
The questions I am trying to answer are:
Thanks to Eli Lifland, Adam Binksmith and David Mathers for their feedback on drafts of this post.
Appendix
Prompts
Initial scraping prompt
Secondary scraping prompt
Scoring prompt
Raw data
The evaluation data Google sheet is available here.