Scenario Forecasting Workshop: Materials and Learnings

elifland; charlie_griffin

Disclaimer: While some participants and organizers of this exercise work in industry, no proprietary info was used to inform these scenarios, and they represent the views of their individual authors alone.

Overview

In the vein of What 2026 Looks Like and AI Timelines discussion, we recently hosted a scenario forecasting workshop. Participants first wrote a 5-stage scenario forecasting what will happen between now and ASI. Then, they reviewed, discussed, and revised scenarios in groups of 3. The discussion was guided by forecasting questions like “If I were to observe this person’s scenario through stage X, what would my ASI timelines median be?”

Instructions for running the workshop, including notes on what we would do differently, are available here. We’ve put 6 shared scenarios from our workshop in a publicly viewable folder here.

Edit: Here is the template document for a simplified version of this workshop, which we ran at The Curve in late 2024.

Motivation

Writing scenarios may help to:

Clarify views, e.g. by realizing an abstract view is hard to concretize, or realizing that two views you hold don’t seem very compatible.
Surface new considerations, e.g. realizing a subquestion is more important than you thought, or that an actor might behave in a way you hadn’t considered.
Communicate views to others, e.g. clarifying what you mean by “AGI”, “slow takeoff”, or “the singularity”.
Register qualitative forecasts, which can then be compared against reality. This has advantages and disadvantages vs. more resolvable forecasts (though scenarios can include some resolvable forecasts as well!).

Running the workshop

Materials and instructions for running the workshop, including notes on what we would do differently, are available here.

The schedule for the workshop looked like:

Session 1 involved writing a 5-stage scenario forecasting what will happen between now and ASI.

Session 2 involved reviewing, discussing, and revising scenarios in groups of 3. The discussion was guided by forecasting questions like “If I were to observe this person’s scenario through stage X, what would my ASI timelines median be?” There were analogous questions for p(disempowerment) and p(good future).

Session 3 was freeform discussion and revision within groups, followed by a brief feedback session.

Workshop outputs and learnings

6 people (3 anonymous, 3 named) have agreed to share their scenarios. We’ve put them in a publicly viewable folder here.

We received overall positive feedback, with nearly all of the 23 people who filled out the feedback survey saying it was a good use of their time. In general, people found the writing portion more valuable than the discussion. Based on this and a few other pieces of feedback, we’ve included some ideas for improving similar future workshops in our instructions for organizers. It’s possible that a workshop focused much more on the writing than on the discussion would be more valuable.

Speaking for myself (as Eli), I think it was mostly valuable as a forcing function to get people to do an activity they had wanted to do anyway. And scenario writing seems like a good thing for people to spend marginal time on (especially if they find it fun/energizing). It seems worthwhile to experiment with the format (in the ways we suggest above, or other ways people are excited about). It feels like there might be something nearby that is substantially more valuable than our initial pilot.

Romeo Dean and I ran a slightly modified version of this format for members of AISST and we found it a very useful and enjoyable activity!

We first gathered to do 2 hours of reading and discussing, and then we spent 4 hours switching between silent writing and discussing in small groups.

The main changes we made are:

We removed the part where people estimate probabilities of ASI and doom happening by the end of each other’s scenarios.
We added a formal benchmark forecasting part for 7 benchmarks using private Metaculus questions (forecasting values at Jan 31 2025):
1. GPQA
2. SWE-bench
3. GAIA
4. InterCode (Bash)
5. WebArena
6. Number of METR tasks completed
7. Elo on LMSys arena relative to GPT-4-1106

We think the first change made it better, but in hindsight we would have reduced the number of benchmarks to around 3 (GPQA, SWE-bench and LMSys Elo), or given participants much more time.

From the Rough Notes section of Ajeya's shared scenario:

Meta and Microsoft ordered 150K GPUs each, big H100 backlog. According to Lennart's BOTECs, 50,000 H100s would train a model the size of Gemini in around a month (assuming 50% utilization)

Just to check my understanding, here's my BOTEC of the number of FLOPs for 50k H100s during a month: 5e4 H100s * 1e15 bf16 FLOPs/second * 0.5 utilization * (3600 * 24 * 30) seconds/month = 6.48e25 FLOPs.
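For anyone who wants to rerun this arithmetic with different inputs, here is a minimal Python sketch of the same BOTEC. The function and variable names are hypothetical, the 1e15 FLOPs/second figure is the round per-GPU number used above rather than an official spec, and the three reference values are simply the Epoch estimates for Gemini Ultra 1.0 mentioned in this thread.

```python
SECONDS_PER_DAY = 3600 * 24


def training_flops(num_gpus, peak_flops_per_gpu, utilization, days):
    """Total FLOPs a cluster delivers over `days` days at the given utilization."""
    return num_gpus * peak_flops_per_gpu * utilization * SECONDS_PER_DAY * days


# Inputs from the BOTEC above: 50k H100s, ~1e15 bf16 FLOPs/second each
# (a round-number assumption), 50% utilization, one 30-day month.
botec = training_flops(num_gpus=5e4, peak_flops_per_gpu=1e15, utilization=0.5, days=30)
print(f"BOTEC: {botec:.3g} FLOPs")  # BOTEC: 6.48e+25 FLOPs

# Epoch estimates for Gemini Ultra 1.0 quoted in this thread.
epoch_estimates = {
    "cited in the doc": 9e25,
    "notebook median": 7.7e25,
    "webpage": 5e25,
}
for label, estimate in epoch_estimates.items():
    print(f"Epoch ({label}): {estimate:.3g} FLOPs, {estimate / botec:.2f}x the BOTEC")
```

Changing `peak_flops_per_gpu` is the quickest way to see how much the answer depends on the assumed number format and per-GPU throughput.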

This is indeed close enough to Epoch's median estimate of 7.7e25 FLOPs for Gemini Ultra 1.0 (this doc cites an Epoch estimate of around 9e25 FLOPs). ETA: see clarification in Eli's reply.

I'm curious if we have info about the floating point format used for these training runs: how confident are we that labs are using bf16 rather than fp8?

This is indeed close enough to Epoch's median estimate of 7.7e25 FLOPs for Gemini Ultra 1.0 (this doc cites an Epoch estimate of around 9e25 FLOPs).

FYI at the time that doc was created, Epoch had 9e25. Now the notebook says 7.7e25 but their webpage says 5e25. Will ask them about it.

