Romeo Dean and I ran a slightly modified version of this format for members of AISST and we found it a very useful and enjoyable activity!
We first gathered to do 2 hours of reading and discussing, and then we spent 4 hours switching between silent writing and discussing in small groups.
The main changes we made are:
We think the first change made it better, but in hindsight we would have reduced the number of benchmarks to around 3 (GPQA, SWE-bench and LMSys ELO), or given participants much more time.
From the Rough Notes section of Ajeya's shared scenario:
Meta and Microsoft ordered 150K GPUs each, big H100 backlog. According to Lennart's BOTECs, 50,000 H100s would train a model the size of Gemini in around a month (assuming 50% utilization)
Just to check my understanding, here's my BOTEC of the number of FLOPs for 50k H100s during a month: 5e4 H100s * 1e15 bf16 FLOPs/second * 0.5 utilization * (3600 * 24 * 30) seconds/month = 6.48e25 FLOPs.
This is indeed close enough to Epoch's median estimate of 7.7e25 FLOPs for Gemini Ultra 1.0 (this doc cites an Epoch estimate of around 9e25 FLOPs). ETA: see clarification in Eli's reply.
I'm curious if we have info about the floating point format used for these training runs: how confident are we that labs are using bf16 rather than fp8?
This is indeed close enough to Epoch's median estimate of 7.7e25 FLOPs for Gemini Ultra 1.0 (this doc cites an Epoch estimate of around 9e25 FLOPs).
FYI at the time that doc was created, Epoch had 9e25. Now the notebook says 7.7e25 but their webpage says 5e25. Will ask them about it.
Disclaimer: While some participants and organizers of this exercise work in industry, no proprietary info was used to inform these scenarios, and they represent the views of their individual authors alone.
Overview
In the vein of What 2026 Looks Like and AI Timelines discussion, we recently hosted a scenario forecasting workshop. Participants first wrote a 5-stage scenario forecasting what will happen between now and ASI. Then, they reviewed, discussed, and revised scenarios in groups of 3. The discussion was guided by forecasts like “If I were to observe this person’s scenario through stage X, what would my ASI timelines median be?”.
Instructions for running the workshop including notes on what we would do differently are available here. We’ve put 6 shared scenarios from our workshop in a publicly viewable folder here.
Motivation
Writing scenarios may help to:
Running the workshop
Materials and instructions for running the workshop including notes on what we would do differently are available here.
The schedule for the workshop looked like:
Session 1 involved writing a 5-staged scenario forecasting what will happen between now and ASI.
Session 2 involved reviewing, discussing, and revising scenarios in groups of 3. The discussion was guided by forecasts like “If I were to observe this person’s scenario through stage X, what would my ASI timelines median be?”. There were analogous questions for p(disempowerment) and p(good future).
Session 3 was freeform discussion and revision within groups, then there was a brief session for feedback.
Workshop outputs and learnings
6 people (3 anonymous, 3 named) have agreed to share their scenarios. We’ve put them in a publicly viewable folder here.
We received overall positive feedback, with nearly all 23 people who filled out the feedback survey saying it was a good use of their time. In general, people found the writing portion more valuable than the discussion. We’ve included some ideas on how to improve future similar workshops based on this feedback and a few other pieces of feedback in our instructions for organizers. It’s possible that a workshop that is much more focused on the writing relative to the discussion would be more valuable.
Speaking for myself (as Eli), I think it was mostly valuable as a forcing function to get people to do an activity they had wanted to do anyway. And scenario writing seems like a good thing for people to spend marginal time on (especially if they find it fun/energizing). It seems worthwhile to experiment with the format (in the ways we suggest above, or other ways people are excited about). It feels like there might be something nearby that is substantially more valuable than our initial pilot.