Thanks for writing this!
Regarding how surprise on current forecasts should factor into AI timelines, I have two takes:
* Given that all the forecasts seem to be wrong in the "things happened faster than we expected" direction, we should probably expect HLAI to happen faster than expected as well.
* It also seems like we should retreat more to outside views about general rates of technological progress, rather than forming a specific inside view (since the inside view seems to mostly end up being wrong).
I think a pure outside view would give a median of something like 35 years in my opinion (based on my very sketchy attempt at forming a dataset of when technical grand challenges were solved), and then ML progress seems to be happening quite quickly, so you should probably adjust down from that.
Actually, I'm pretty interested in how you get to medians of 40 years; that seems longer than I'd predict without looking at any field-specific facts about ML, and then the field-specific facts mostly push towards shorter timelines.
> Given that all the forecasts seem to be wrong in the "things happened faster than we expected" direction, we should probably expect HLAI to happen faster than expected as well.
I don't think we should update too strongly on these few data points; e.g. a previous analysis of Metaculus' AI predictions found "weak evidence to suggest the community expected more AI progress than actually occurred, but this was not conclusive". MATH and MMLU feel more relevant than the average Metaculus AI prediction but not enough to strongly outweigh the previous findings.
> It also seems like we should retreat more to outside views about general rates of technological progress, rather than forming a specific inside view (since the inside view seems to mostly end up being wrong)
>
> I think a pure outside view would give a median of something like 35 years in my opinion (based on my very sketchy attempt at forming a dataset of when technical grand challenges were solved), and then ML progress seems to be happening quite quickly, so you should probably adjust down from that.
I'd be interested to check out that dataset! It's hard for me to react too much to the strategy without more details, but outside-view-ish reasoning about predicting things far-ish in the future that we don't know much about (and which, as you say, we have often been wrong about on the inside view) seems generally reasonable to me.
> Actually, I'm pretty interested in how you get to medians of 40 years; that seems longer than I'd predict without looking at any field-specific facts about ML, and then the field-specific facts mostly push towards shorter timelines.
I mentioned in the post that my median is now ~2050, which is 28 years out; as for how I formed my forecast, I originally started roughly with Ajeya's report, added some uncertainty, and had previously shifted further out due to intuitions I had about data/environment bottlenecks, unknown unknowns, etc. I still have lots of uncertainty, but my median has moved sooner to 2050 due to MATH forcing me to adjust my intuitions some, reflections on my hesitations against short-ish timelines, and Daniel Kokotajlo's work.
Based on the title, I was expecting a post about forecasting your personal life. I'm not sure how to formulate a more accurate title.
Overview
I’ve been forecasting with some level of activity for over 2 years now, so I’m overdue for a retrospective.[1]
I discuss:
* Platform track records
* Selected tournaments
* Interesting mistakes
* General reflections
Platform track records
Metaculus
I started forecasting fairly actively on Metaculus in 2020, when I got interested due to COVID forecasting, though recently I have mostly just been predicting on particularly interesting questions every once in a while. I'm currently 9th on the overall leaderboard, largely because I tried to predict on ~every question for about a year.
My binary calibration:
I might be a little underconfident? The blue rectangle in each bin represents the 50% confidence interval, and it looks like about 11 of the 20 rectangles contain the bin.
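As a rough sanity check (a minimal sketch, assuming the 20 bins are roughly independent and that under perfect calibration each 50% interval should contain the realized frequency about half the time), 11 of 20 is well within what you'd expect by chance:

```python
from math import comb

# Under perfect calibration, each bin's 50% interval contains the realized
# frequency with probability ~0.5 (treating the 20 bins as independent).
n, p = 20, 0.5
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

# Probability of 11 or more "hits" out of 20 arising by chance alone.
print(f"P(>=11 of 20 intervals contain the value) = {sum(pmf[11:]):.2f}")  # ~0.41
```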
My continuous calibration:
From Metaculus: “We define the surprisal at the resolution to be the total probability assigned to all possible (in-range) resolution values that were thought to be more likely than the resolution itself. A resolution at the mode (peak) of the distribution has a surprisal of zero, whereas a resolution at the minimum of the distribution is maximally surprising.”
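As a concrete illustration of that definition (a minimal sketch over a discretized distribution; this is not Metaculus' actual implementation):

```python
import numpy as np

def surprisal(pdf, resolved_index):
    """Total probability assigned to in-range values considered more likely than the resolution."""
    pdf = np.asarray(pdf, dtype=float)
    pdf = pdf / pdf.sum()  # normalize the discretized distribution
    return pdf[pdf > pdf[resolved_index]].sum()

# Example: a peaked distribution over 5 possible resolution values.
pdf = [0.05, 0.15, 0.40, 0.30, 0.10]
print(surprisal(pdf, resolved_index=2))  # resolves at the mode -> 0.0
print(surprisal(pdf, resolved_index=0))  # resolves at the least likely value -> ~0.95
```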
I’m guessing the spike at 0-5% surprisal is mostly due to the cases where the value was effectively known before the question closed.
On the other hand, the spike at 95-100% surprisal is conveying something real: my tails are, on average, not long enough. This is an interesting contrast to my binary predictions, where ~all of my 0-5% and 95-100% predictions look to have been right. Forming calibrated continuous distributions is tricky; in Metaculus' interface it might be wise to add high-variance, low-weight components to generally account for unknown unknowns.
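A minimal sketch of what I mean by that, using a normal mixture with made-up parameters (I believe Metaculus' interface actually uses weighted logistic components, but the idea is the same):

```python
from scipy.stats import norm

# Fatten the tails of a forecast by mixing a confident "main" component with a
# low-weight, high-variance "unknown unknowns" component (all numbers made up).
main = norm(loc=100, scale=10)   # primary guess
wide = norm(loc=100, scale=60)   # wide component for unknown unknowns
w_wide = 0.15                    # small weight on the wide component

def mixture_sf(x):
    """Probability the mixture assigns above x."""
    return (1 - w_wide) * main.sf(x) + w_wide * wide.sf(x)

# Mass beyond 3 "main" standard deviations, without and with the wide component.
print(main.sf(130))      # ~0.001: the narrow forecast alone
print(mixture_sf(130))   # ~0.047: the mixture keeps meaningful tail mass
```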
I’m not sure what’s going on with the drop at 5-10%, but otherwise this looks pretty good and the overall average surprisal of 46% seems decent.
My average log scores (higher is better) vs. the community's are as follows:
It’s hard to really know what to make of these because:
Given (2) above, I’m pretty happy about performing approximately the same as the community on questions evaluated at close time.
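For reference, here's a minimal sketch of the basic (binary) log score being compared; my understanding is that Metaculus' actual point scoring adds more machinery (e.g. time-averaging and baseline comparisons), but the underlying quantity looks like this:

```python
import math

def log_score(p, outcome):
    """Log of the probability assigned to what happened: higher is better, 0 is the maximum."""
    return math.log(p if outcome else 1 - p)

# An 80% forecast beats a 60% forecast when the event happens, and loses when it doesn't.
print(log_score(0.8, True), log_score(0.6, True))    # -0.22 vs. -0.51
print(log_score(0.8, False), log_score(0.6, False))  # -1.61 vs. -0.92
```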
GJOpen
GJOpen was the first platform I predicted on after reading Superforecasting in ~2018, but I was fairly inactive for a while. I was a bit more active for about a year in 2020-2021.
My calibration across 3,293 data points from 964 forecasts on 139 questions, displayed by this app:[3]
This overall looks pretty good, but perhaps I have a bit of a bias toward predicting events as more likely to occur than they actually are?
My Brier score is 0.23 vs. a median of 0.301 from other predictors (lower is better); I'm pretty happy with this overall. The difference is a bit inflated, as I sometimes selected questions based on the community seeming very wrong, after having received tips from other forecasters.
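For reference, a minimal sketch of the Brier score as I understand GJOpen computes it (squared error summed across all answer options, so 0 is perfect and 2 is maximally wrong):

```python
def brier_score(probs, outcome_index):
    """Sum of squared errors between the forecast vector and the realized outcome vector."""
    outcomes = [1.0 if i == outcome_index else 0.0 for i in range(len(probs))]
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes))

# A binary question forecast at 80% "yes" that resolves yes, vs. a coin-flip forecast:
print(brier_score([0.8, 0.2], outcome_index=0))  # 0.08
print(brier_score([0.5, 0.5], outcome_index=0))  # 0.5
```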
INFER
I predicted on Foretell, now INFER, during the 2020 and 2021 seasons.
I’m first on the overall leaderboard, and my team Samotsvety Forecasting placed first on the leaderboard for the first 2 seasons.
While I’m pretty happy about being first here given that there were ~50 pro forecasters during the 2021 season:
Manifold Markets
I’ve traded actively on Manifold Markets since Feb 2022; I find the interface enjoyable and it’s fun/addictive to accumulate more Manifold dollars (Mana).
Overall I'm doing pretty well; I'm 13th on the profit leaderboard and think I have one of the highest percentage returns: I've turned M$1,000 into M$8,420 so far (a ~742% return).
My biggest (resolved) profits thus far come from correcting NBA markets to real money betting odds, correcting a high school science bowl market that was for some reason highly traded and off, and betting against Carrick Flynn winning the Democratic primary (this feels like the most real epistemic win).
Selected tournaments
Arising Intelligence
I placed 70th out of 146 forecasters in Arising Intelligence 2022 on Hypermind, slightly above average despite missing badly on 2 of the questions (the crowd also missed very badly). It will be interesting to see how I do in 2023-2025; my updated forecasts are here, and my previous round of forecasts is here.
OpenPhil / Metaculus Forecasting AI Progress
I'm in 15th place on the overall leaderboard for the Forecasting AI Progress tournament, which seems not great relative to the number of predictors per question. As I've previously described, the tournament felt like it was mostly testing the skill of forming continuous distributions from trend extrapolation. I believe I made the basic error of not accounting for seasonality when extrapolating arXiv paper counts in at least one of the rounds.
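For illustration, here's a toy sketch of the kind of seasonal adjustment I should have made (the monthly counts below are made up, not the actual question data): fit a linear trend, estimate the average monthly deviation from that trend, and add the deviations back when extrapolating.

```python
import numpy as np

# Hypothetical monthly paper counts over two years (made up for illustration).
counts = np.array([310, 295, 340, 330, 345, 360, 300, 290, 380, 400, 390, 350,   # year 1
                   365, 350, 400, 390, 405, 425, 355, 345, 445, 470, 455, 410])  # year 2
months = np.arange(len(counts))

slope, intercept = np.polyfit(months, counts, 1)              # overall linear trend
residuals = counts - (slope * months + intercept)
seasonal = np.array([residuals[months % 12 == m].mean() for m in range(12)])

# Forecast the next 6 months: trend plus the matching month's average seasonal deviation.
future = np.arange(len(counts), len(counts) + 6)
forecast = slope * future + intercept + seasonal[future % 12]
print(np.round(forecast))
```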
COVID tournaments
I did fairly well in the Metaculus COVID tournaments that I actively participated in: I’m in first place in the Salk tournament regarding vaccines and got first place in the first COVID lightning round tournament (but 12th in the second).
Interesting mistakes
AI benchmarks
I (and the crowd) badly underestimated progress on MATH and MMLU. For more details on the mistake, see Jacob Steinhardt’s blog post and my clarifying comment.
A mistake I made applicable to both benchmarks was underestimating the low-hanging fruit that was likely to be picked following the surge of interest in LMs after GPT-3.
Additionally, I didn't give enough weight to the base rate of fast progress on new NLP benchmarks. Owain Evans comments on this here; I haven't seen a thorough analysis, but my guess is that since transformers were introduced, progress has often been very fast following a benchmark's release; see e.g. human performance on SuperGLUE being surpassed in ~1.5 years.
I made a few additional mistakes in the MATH case:
My updated predictions on these questions can be found here. I’ve also updated from median 2047 to 2033 on AI winning an IMO gold medal.
On AGI/TAI timelines, I still feel somewhat skeptical that (approximately) GPT-n will scale to AGI, but my skepticism has lessened a bit, and I think the MATH result indicates a speedup for this pathway. I should also update a bit on the meta level regarding the reliability of my skepticism. Previously my TAI timelines were roughly similar to Holden's here: "more than a 10% chance we'll see transformative AI within 15 years (by 2036); a ~50% chance we'll see it within 40 years (by 2060); and a ~2/3 chance we'll see it this century (by 2100)". I'm now at ~20% by 2036; my median is now ~2050, though still with a fat right tail.
My timelines shortening should also increase my p(AI doom by 2100) a bit, though I’m still working out my views here. I’m guessing I’ll land somewhere between 20 and 60%.
I’m also generally confused about how to translate progress on tangible benchmarks like MATH into updated bottom line forecasts. My current method feels very brittle in that I’m making some intuitive adjustment based on “surprisingness”, but I don’t have a clear model of which capabilities are missing and what would make me update on their feasibility. I’m working on building up a more robust model of AI progress to get better at this, but it will probably take a little while. My sense is that very few people have anything close to a good enough model here to reliably differentiate how to update on new evidence.
COVID waves
I consistently underestimated future COVID waves after the initial one, and especially after much of the population was vaccinated and/or naturally immune; I had a naive model of how herd immunity would work and failed to account enough for substantially increased transmission, re-infections, vaccine escape, etc. I probably would have done well to study past pandemics more in depth and take lessons from them. I should also perhaps generally be skeptical of simplistic models like “we’ll have herd immunity and things will peter out indefinitely”.
For example, on "When will the US pass 1 million cumulative deaths from COVID-19?" I predicted as far toward the right edge as I was allowed to, then lost 212 points. I was very skeptical deaths would continue this rapidly after vaccinations.
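To illustrate the modeling failure (a toy SIRS sketch with made-up parameters, not a model I actually used): once you allow immunity to wane, infections arrive in repeated waves rather than petering out after crossing a single herd-immunity threshold.

```python
# Toy SIRS dynamics: susceptible -> infected -> recovered -> susceptible again.
beta, gamma, waning = 0.25, 1 / 10, 1 / 180   # per-day transmission, recovery, immunity-loss rates (made up)
S, I, R = 0.99, 0.01, 0.0
infected = []

for _ in range(1000):                          # simulate 1000 days with a 1-day Euler step
    new_inf, new_rec, new_sus = beta * S * I, gamma * I, waning * R
    S, I, R = S - new_inf + new_sus, I + new_inf - new_rec, R + new_rec - new_sus
    infected.append(I)

# Count local maxima in prevalence, i.e. distinct waves.
waves = sum(1 for t in range(1, len(infected) - 1)
            if infected[t - 1] < infected[t] > infected[t + 1])
print(f"waves over 1000 days: {waves}")
```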
General reflections
Some quick intuitions and hot takes I have based on my time forecasting include:
Acknowledgments
Thanks to Nuño Sempere and Molly Hickman for feedback.
I'd really have liked to go more in-depth, especially with more case studies and learnings, but the writing of this already dragged on too long.
due to the recently discontinued final forecast bonus
Which seems to be broken as of July 15, 2022 when I’m revising this?
From a quick search, my understanding of the literature is that it shows some improvement from one hour of training. This paper also reports an improvement from training but doesn't specify how much time was spent on it. I'm not aware of any papers studying whether improvement can continue to accrue from lots of training/practice.