We need Git for AI Timelines

fluxxrider

I was recently reading the AI Futures Project's Q1 2026 timelines update and noticed that their quarterly updates (the last one being in December, with the release of the AI Futures Model) are struggling to keep pace with the thing they're trying to track.

The pace of AI development is incredibly fast and only accelerating; Kokotajlo shortened his timeline for an AC by 18 months (from late 2029 to mid 2028) in a single update, due to four specific parameter changes. Five days later, Anthropic announced Claude Mythos Preview, which arguably invalidated some of those parameters before the ink had time to dry.

This isn't a criticism of the AI Futures Project; they do commendable work. To be clear, Kokotajlo and the AI Futures Project are arguably the best in the world at what they do. His track record is remarkable, and AI2027 has sparked immense conversation about the future of AI and timelines (it's what got me into LW). But when the field's pace changes completely every two months, the community is, more often than not, navigating with an outdated map. And the problem is getting worse. Mythos hasn't yet been evaluated by METR, Spud hasn't been released, and by the time the Q2 update drops, the field will have shifted again to another focal point.

But the cadence itself is only the surface issue; the updates aren't nearly granular enough for each change to be tied back to a specific "step". When Kokotajlo updates his priors for an AC, we don't see the causal chain showing how each decision shortened his timelines by X amount. His rationale for setting the AC median at 1 year of autonomous work was that Opus 4.6 "impressed" him. But the actual definition of what 1 year even means remains muddy; the original AI2027 scenario had the median set at 6 months for an SC before moving it back to 3 years. The shift in the SC definition from 3 years to 1 year accounted for around half of the 18-month shift in his Q1 update, and the stated justification is that Opus "impressed" him. Impressed how? At what point between December and April did he change his priors? The entire causal chain here collapses to a single word in a blog post.

In software engineering, this would be the equivalent of someone pushing a commit to main with the message "fixed stuff because it now works". You'd never accept that for code, so why would you accept it as justification for a forecast about the most important technological revolution in human history?

There's no unified platform where forecasters can independently publish their timelines with substantial backing and integration from the platform itself. Sure, you can write a Substack article, spin up a short LessWrong post, or perhaps post a Twitter thread, but these are scattered all over and disconnected for someone trying to get a concrete picture of what different forecasters think. One might say Metaculus is the solution; while it is a way of aggregating forecasts, it's still less than optimal. Conversation and rationale are walled behind "forecast and pay", without a shared space to discuss the reasoning behind those forecasts (yes, there is a comment feature, but it is scarcely used). There was an excellent post on Broad Timelines that highlighted this; Metaculus emphasizes medians rather than the full distributions that are more sought after in our space.

As neo noted in said post, we need to "design info-UI tools that facilitate that (the timeline formulation) process". Broad distributions need platforms that can track how they update over time. A quarterly blog post cannot do that. Forecasts updated granularly over time with reasoning and deliberation behind them can.

Why am I using Git as the analogy here? Software engineering solved this class of problem years ago. You have commits (changes in timeline predictions), diffs showing what changed, comments showing why it changed, branches for code (in this analogy, scenario) forks, blame for accountability (we need to be less wrong, after all), and merge conflicts that require resolution rather than dissolving into Twitter discourse.

The minimum viable version of this is, frankly, embarrassingly simple: a GitHub repo in which each forecaster maintains a YAML file with their distribution for an agreed-upon definition (whether it be an AC, SC, ASI, etc.). Commits are updates to those files, with the rationale in the commit message.
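To make that concrete, here is a minimal sketch of what one forecaster's file and a commit-level diff might look like. Everything in it is an assumption for illustration: the field names (`forecaster`, `milestone`, `quantiles`) are a hypothetical schema, not an agreed standard; plain Python dicts stand in for the parsed YAML files; and `diff_forecast` is a made-up helper. The median shift mirrors the late 2029 to mid 2028 example above (with exact days assumed), and the other quantiles are placeholders, not anyone's real forecast.

```python
from datetime import date

# Hypothetical schema: one file per forecaster per milestone.
# Dicts stand in for parsed YAML; all values are illustrative placeholders.
old = {
    "forecaster": "example-forecaster",
    "milestone": "AC",
    "quantiles": {
        "p10": date(2027, 6, 1),
        "p50": date(2029, 12, 1),
        "p90": date(2035, 1, 1),
    },
}
new = {
    "forecaster": "example-forecaster",
    "milestone": "AC",
    "quantiles": {
        "p10": date(2027, 3, 1),
        "p50": date(2028, 6, 1),
        "p90": date(2035, 1, 1),
    },
}


def diff_forecast(before, after):
    """Return {quantile: shift in days} for every quantile that moved."""
    changes = {}
    for name, old_date in before["quantiles"].items():
        new_date = after["quantiles"][name]
        if new_date != old_date:
            # Negative means the forecast moved earlier.
            changes[name] = (new_date - old_date).days
    return changes


# The commit message carries the "why"; the diff carries the "what".
commit_message = "Shorten AC median: <specific evidence and parameter changed>"
print(commit_message)
for name, days in diff_forecast(old, new).items():
    print(f"{name}: {days:+d} days")
```

The point of the sketch is that each update becomes a small, attributable change with its rationale attached, rather than one bundled quarterly revision.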

Claude Opus 4.6 had an 80% time horizon of 70 minutes. Assuming Mythos has an 80% TH of ~240 min, the doubling time is ~34-40 days. Even if we're pessimistic at a time horizon of 180 minutes, the doubling time is still 45 days. The doubling time of the thing we're forecasting is now shorter than our update cycle.
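The arithmetic behind these figures can be sketched as follows, assuming simple exponential growth between two measurements. The elapsed interval of about 61 days is my assumption, not something stated above; I picked it because it roughly reproduces the numbers quoted. `doubling_time_days` is a hypothetical helper name.

```python
import math


def doubling_time_days(th_old_min, th_new_min, elapsed_days):
    """Days per doubling, assuming exponential growth between two measurements."""
    doublings = math.log2(th_new_min / th_old_min)
    return elapsed_days / doublings


# ASSUMPTION: roughly 61 days between the two models; not stated in the post.
ELAPSED_DAYS = 61

# 70 min -> ~240 min (the optimistic Mythos guess)
print(round(doubling_time_days(70, 240, ELAPSED_DAYS), 1))
# 70 min -> 180 min (the pessimistic guess)
print(round(doubling_time_days(70, 180, ELAPSED_DAYS), 1))
```

Under that assumed interval this gives roughly 34 and 45 days respectively. A longer gap between the two measurements stretches both figures proportionally, which is how the upper end of the 34-40 day range would arise.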

The rationalist community, of all communities, should find that unacceptable.

It seems that your proposal is different from what we at AIFP already do in two ways:

(1) Lots of forecasters, not just AIFP, would maintain their distributions on the same platform. That would be convenient for viewers wanting to compare forecasts.

(2) The setup would encourage more frequent minor updates (e.g. me adjusting a parameter down slightly because I'm impressed by a model) whereas right now we are bundling our updates together into quarterly updates.

Is this right or am I missing something?

Yes, exactly! On (2) in particular, the current system forces quarterly posts onto the continuous nature of progress. Something like a commit system would let you push an update (e.g. Opus impressing you) that changes your timelines without having to slot it into a 500-word Substack article and coordinate with Eli etc.

We could probably implement (2) unilaterally, just by, well, doing updates to our existing website whenever we feel like it instead of quarterly. Do you think that's an improvement? I'll consider it at least. Honestly a consideration for me is reducing friction/effort, so I'm wary of committing to something that might prove to be a pain.

Absolutely! I love your broad timelines idea on the AI Futures site (the one where you can change your probability distribution on what happens when) but it crashed when I tried doing it 🙁

I’m floored you actually responded haha. I’ll be working to get something running in the meantime, perhaps you could consider it then! Do check your inbox.

Particularly if reality turns out to have short timelines I think something along these lines would be pretty valuable. Seems pretty doable to vibe code something like this?

I’d be glad to help you out where needed!

Claude Opus 4.6 had an 80% time horizon of 70 minutes. Assuming Mythos has an 80% TH of ~240 min, the doubling time is ~34-40 days. Even if we're pessimistic at a time horizon of 180 minutes, the doubling time is still 45 days. The doubling time of the thing we're forecasting is now shorter than our update cycle.

How certain are you about that? The Opus 4.6 -> Mythos jump is due to a one-time increase in the number of parameters, not due to achievements for which we expect to see some analogue soon (e.g. scaffolding improvements, or the addition of primitive video games to the RL environment, on which one could hope to iterate by adding more complex games).

I disagree that it's one-time, and think parameter scaling has a long way to go still.

We will eventually hit a limit, but that limit is measured in ASML machines, and nothing else matters nearly as much as ASML machines, which Dwarkesh talked about in the podcast here.

In essence, Mythos is a return to form, where parameter scaling/compute scaling matters as much as data scaling, if not more.

I think you’re missing what he’s saying here.

Pre-training was easy to scale in 2022, 2023 and 2024. There was excess capacity. Mythos is likely the first >10b pretrained model. The Claude 4 to 4.6 paradigm was likely driven by one pretrained model with RLVF on top. Mythos is the new class of pretrained model, and scaling and doubling times will be based on the speed of building RL models on top of Mythos.

I agree that Anthropic will attempt to add more RL tasks and potentially update Mythos's weights even after the pre-train, and this could affect doubling times, but my point was that Mythos suggests that you can in fact just scale upwards in parameters and pre-training compute multiple times, and that the memes of compute/pre-training scaling being dead weren't correct at all.

Indeed, training a GPT-6-level model by pre-training alone, at 500x the compute of GPT-4, is probably possible by 2028 if you were willing to avoid using RL (though deployment is more likely in 2029, due to RL and inference soaking up compute). And if the scale-up of AI in 2026 is as potent as people believe (and indeed this is likely to happen once the nerfed models release sometime this year), AI companies can get enough revenue to comfortably build GPT-7, which would have 5000x the compute of GPT-4 and which I suspect would be built in 2031.

But there is something I want to say here: the shift to ever-larger post-training/RL could enable us to incrementally solve continual learning (learning on the job, continual weight updates like brains). One of the gifts of AI 2027 is that it points out, in the January 2027 section, that modulo some very important details, like catastrophic forgetting of earlier tasks when later tasks are RLed in, fast enough weight updates for a long enough time via perpetually adding more RL tasks are essentially equivalent to human-level continual learning. While Agent-2 isn't quite there yet, it is on the path, and it's already able to learn continuously from the world with its weights; it's just slower than humans at this:

With Agent-1’s help, OpenBrain is now post-training Agent-2. More than ever, the focus is on high-quality data. Copious amounts of synthetic data are produced, evaluated, and filtered for quality before being fed to Agent-2. On top of this, they pay billions of dollars for human laborers to record themselves solving long-horizon tasks. On top of all that, they train Agent-2 almost continuously using reinforcement learning on an ever-expanding suite of diverse difficult tasks: lots of video games, lots of coding challenges, lots of research tasks. Agent-2, more so than previous models, is effectively “online learning,” in that it’s built to never really finish training. Every day, the weights get updated to the latest version, trained on more data generated by the previous version the previous day.

I expect something like this to ultimately be done once continual learning is targeted, albeit at a slower pace than what AI 2027 describes.

What are your timelines? Curious because there are rumors 'GPT-6' releases this year

Fair point, but I think this actually strengthens both my argument and yours; the fact is that progress doesn't follow some smooth exponential. This is why I think it's better to update our timelines iteratively. Perhaps Mythos was a one-time leap in capability that won't continue. That would be great to know, because it means we can update our priors and, instead of bouncing between extremes, get a better picture of what our timelines look like.