WeirdML Time Horizons
[Figure: Time horizon vs. model release date, using LLM-predicted human work-hours, for 10 successive state-of-the-art models on WeirdML. Error bars show 95% CIs from a task-level bootstrap. The exponential fit (orange line/band) gives a doubling time of 4.8 months [3.8, 5.8]. Key finding: WeirdML time horizons roughly double every 5 months, from ~24 minutes (GPT-4, June 2023) to ~38 hours (Claude Opus 4.6, February 2026).]

Model                      | Release  | Time horizon (95% CI)
Claude Opus 4.6 (adaptive) | Feb 2026 | 37.7 h [21.6 h, 62.4 h]
GPT-5.2 (xhigh)            | Dec 2025 | 30.6 h [18.3 h, 54.4 h]
Gemini 3 Pro (high)        | Nov 2025 | 22.3 h [14.4 h, 36.2 h]
GPT-5 (high)               | Aug 2025 | 14.5 h [8.6 h, 24.1 h]
o3-pro (high)              | Jun 2025 | 11.8 h [7.2 h, 18.9 h]
o4-mini (high)             | Apr 2025 | 8.4 h [5.8 h, 13.6 h]
o1-preview                 | Sep 2024 | 6.2 h [4.2 h, 10.5 h]
Claude 3.5 Sonnet          | Jun 2024 | 1.9 h [59 min, 3.5 h]
Claude 3 Opus              | Mar 2024 | 1.1 h [16 min, 2.3 h]
GPT-4                      | Jun 2023 | 24 min [4 min, 51 min]

Inspired by METR's work on AI time horizons (paper), I wanted to do the same for my WeirdML data. WeirdML is my benchmark (supported by METR and included in the Epoch AI benchmarking hub and the Epoch Capabilities Index) that asks LLMs to solve weird and unusual ML tasks; for more details see the WeirdML page.

Lacking the resources to pay humans to solve the WeirdML tasks and measure the time, I instead asked LLMs to predict how long a median human AI researcher (with no AI assistance) would take to solve each WeirdML task at various score thresholds (25%, 50%, 70%, 90% and 95%). I gave the LLMs all the help I could: a detailed task description, a detailed specification of the human baseline and the affordances given to the human, and LLM-submitted code (from WeirdML runs) for each score threshold (where available), together with terminal outputs and associated scores, to give the LLMs some sense of how hard it is to score at a certain level on each task. Full details below.

The results look pretty nice, but should be taken with a large grain of salt, given that we know no actual human
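The doubling time behind the fit can be recovered from the table alone. This is a minimal sketch, not the actual fitting code: it regresses log2(time horizon) on release date (expressed as months since January 2023, which is my own encoding choice) and reads the doubling time off the slope; the full analysis also produces the bootstrap band, which is omitted here.

```python
import numpy as np

# Release dates as months since Jan 2023, and time horizons in hours,
# taken from the table above (GPT-4's 24 min = 0.4 h), oldest first.
months = np.array([5, 14, 17, 20, 27, 29, 31, 34, 35, 37], dtype=float)
hours = np.array([0.4, 1.1, 1.9, 6.2, 8.4, 11.8, 14.5, 22.3, 30.6, 37.7])

# Exponential growth is linear in log2 space; the slope is doublings/month,
# so its reciprocal is the doubling time in months.
slope, intercept = np.polyfit(months, np.log2(hours), 1)
doubling_time = 1.0 / slope
print(f"doubling time ≈ {doubling_time:.1f} months")
```

On these point estimates the simple regression lands close to the ~4.8-month doubling time reported for the full fit.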