TLDR: Skip to the last image at the bottom of my post to see the visualization.
I don't usually post here, but I know you folks like forecasting and timelines, so I think you might appreciate this. The recent time horizons paper by METR is great, but it only visualized tasks that models could complete with 50-80% accuracy. I think it's highly useful, and arguably even more useful, to see when capabilities would hit 95% or 99% accuracy or beyond for various time horizons.
So I went through the paper looking for the information needed to create such a visualization, and I found distributions of model-specific accuracy levels for specific time horizon buckets.
I noted these down and noticed that the slopes for those higher accuracy levels are similar, if not identical, to the slopes for the lower ones when plotted against time horizons. (The paper also notes that the slope for 80% accuracy is conveniently very similar to the slope for 50% accuracy.) This makes the math for the visualization easier: I keep the slope the same and only need to apply the appropriate Y-axis adjustment relative to release date.
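To make the "same slope, shifted Y-axis" idea concrete, here is a minimal sketch, assuming the trend is a pure exponential with a fixed doubling time. The function names and the 60-minute reference horizon are hypothetical and only there for illustration; the 5x multiple and the 7-month doubling time are the figures quoted from the paper.

```python
import math

def horizon_at(months_since_ref, ref_horizon_minutes, doubling_months):
    """Time horizon on an exponential trend with a fixed doubling time."""
    return ref_horizon_minutes * 2 ** (months_since_ref / doubling_months)

def shifted_horizon(months_since_ref, ref_horizon_minutes, doubling_months, multiple):
    """Same trend at a stricter accuracy level.

    Dividing by a constant multiple moves the line down on a log axis
    without changing its slope.
    """
    return horizon_at(months_since_ref, ref_horizon_minutes, doubling_months) / multiple

def delay_months(multiple, doubling_months):
    """How many months later the stricter trend reaches the same horizon."""
    return doubling_months * math.log2(multiple)

# Hypothetical reference point: a 60-minute horizon at 50% accuracy.
at_50 = horizon_at(7, 60, 7)          # one doubling later
at_80 = shifted_horizon(7, 60, 7, 5)  # same date, 5x shorter horizon
lag = delay_months(5, 7)              # lag of the 80% line behind the 50% line
```

Under these assumptions, a 5x multiple on a 7-month doubling trend corresponds to a lag of roughly 16 months, since 7 × log2(5) ≈ 16.3.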
I noted down the multiple difference between the time horizons of 50% accuracy, 80% accuracy, 95% accuracy and 99% accuracy, and did this for the following 6 models:
- Claude-3.7-sonnet
- Claude-3.5-sonnet(new)
- O1
- GPT-4o
- Claude-3-Opus
- GPT-4-0314
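One way multiples like these could be combined across models is a plain arithmetic mean of each model's ratio to its 50% horizon. The sketch below is only an illustration: `average_multiples` is a hypothetical helper, and the two models and all of their horizon values are made-up placeholders, not numbers from the paper.

```python
from statistics import mean

def average_multiples(horizons_by_model, base=50):
    """Average, across models, the ratio of the base-accuracy horizon
    to the horizon at each stricter accuracy level.

    horizons_by_model: {model_name: {accuracy_percent: horizon_minutes}}
    """
    levels = sorted(
        {lvl for h in horizons_by_model.values() for lvl in h if lvl != base}
    )
    return {
        lvl: mean(h[base] / h[lvl] for h in horizons_by_model.values())
        for lvl in levels
    }

# Placeholder values for illustration only (NOT data from the paper).
example = {
    "model_a": {50: 60.0, 80: 12.0, 95: 3.0},
    "model_b": {50: 20.0, 80: 5.0, 95: 0.5},
}
result = average_multiples(example)
```

A geometric mean would be a reasonable alternative for ratios; the arithmetic mean is used here only because it is the simplest reading of "averaged".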
The paper mentions a 5x multiple between the time horizons for 50% and 80% accuracy, and when I averaged my numbers for these 6 models I arrived at the same figure for 50% to 80%. As for the other accuracy multiples, it looks like this:
Finally, here is the visualization I ended up making. It has an expanded x-axis, and for each of the new 80%-99% accuracy trends I added the short-term trend slope of reasoning models, which is also mentioned in the original paper (doubling every 4 months, based on the 2024-2025 data points).
I think there is good reason to believe that the short-term trend of 4-month doublings may be overoptimistic, but I can also see many arguments for why the long-term 7-month doubling trend underestimates progress, so I think it's good to include both here. Some even believe the trend could go beyond the 4-month doubling soon, and frankly I can see that being reasonably possible too.
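To see how much the choice between the two trends matters, here is a rough sketch that computes how long each one takes to reach a target horizon. It assumes a clean exponential trend; `months_to_reach` is a hypothetical helper, and the 1-hour starting horizon and 8-hour target are arbitrary illustrative values, not figures from the paper.

```python
import math

def months_to_reach(current_horizon, target_horizon, doubling_months):
    """Months until an exponential trend grows from current to target horizon."""
    return doubling_months * math.log2(target_horizon / current_horizon)

# Illustrative values only: going from a 1-hour to an 8-hour horizon
# takes three doublings under either trend.
long_term = months_to_reach(1.0, 8.0, 7)   # 7-month doubling trend
short_term = months_to_reach(1.0, 8.0, 4)  # 4-month doubling trend
gap = long_term - short_term
```

The gap between the two forecasts grows by 3 months with every additional doubling needed, which is why the two trend lines diverge so visibly on a long x-axis.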
Edit: I see people are upvoting but not commenting. As this is one of my first posts here, I would appreciate it if anyone is willing to give feedback on my formatting, on whether there are better ways I can represent images, or even just on how this might influence your timelines (or reasons why it doesn't).
It looks like the images aren't showing up on LW.
Thanks, should be fixed now.