WeirdML Time Horizons
[Figure: Time horizon vs. model release date, using LLM-predicted human work-hours, for 10 successive state-of-the-art models on WeirdML. Error bars show 95% CIs from a task-level bootstrap. The exponential fit (orange line/band) gives a doubling time of 4.8 months [3.8, 5.8]. Key finding: WeirdML time horizons roughly double every 5 months, from ~24 minutes (GPT-4, June 2023) to ~38 hours (Claude Opus 4.6, February 2026).]

Model                      | Release  | Time horizon (95% CI)
Claude Opus 4.6 (adaptive) | Feb 2026 | 37.7 h [21.6 h, 62.4 h]
GPT-5.2 (xhigh)            | Dec 2025 | 30.6 h [18.3 h, 54.4 h]
Gemini 3 Pro (high)        | Nov 2025 | 22.3 h [14.4 h, 36.2 h]
GPT-5 (high)               | Aug 2025 | 14.5 h [8.6 h, 24.1 h]
o3-pro (high)              | Jun 2025 | 11.8 h [7.2 h, 18.9 h]
o4-mini (high)             | Apr 2025 | 8.4 h [5.8 h, 13.6 h]
o1-preview                 | Sep 2024 | 6.2 h [4.2 h, 10.5 h]
Claude 3.5 Sonnet          | Jun 2024 | 1.9 h [59 min, 3.5 h]
Claude 3 Opus              | Mar 2024 | 1.1 h [16 min, 2.3 h]
GPT-4                      | Jun 2023 | 24 min [4 min, 51 min]

Inspired by METR's work on AI time horizons (paper), I wanted to do the same for my WeirdML data. WeirdML is my benchmark (supported by METR and included in the Epoch AI benchmarking hub and the Epoch Capabilities Index) that asks LLMs to solve weird and unusual ML tasks; for more details see the WeirdML page.

Lacking the resources to pay humans to solve the WeirdML tasks and measure the time, I instead asked LLMs to predict how long a median human AI researcher (with no AI assistance) would take to solve each WeirdML task at various score thresholds (25%, 50%, 70%, 90% and 95%). I gave the LLMs all the help I could: a detailed task description, a detailed specification of the human baseline and the affordances given to the human, and LLM-submitted code (from WeirdML runs) for each score threshold (where available), together with terminal outputs and associated scores, to give the LLMs some sense of how hard it is to score at a certain level on each task. Full details below.

The results look pretty nice, but should be taken with a large grain of salt, given that we know no actual human
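The doubling time behind the fit can be recovered from the table alone. This is a minimal sketch, not the actual fitting code: it regresses log2(time horizon) on release date (expressed as months since January 2023, which is my own encoding choice) and reads the doubling time off the slope; the full analysis also produces the bootstrap band, which is omitted here.

```python
import numpy as np

# Release dates as months since Jan 2023, and time horizons in hours,
# taken from the table above (GPT-4's 24 min = 0.4 h), oldest first.
months = np.array([5, 14, 17, 20, 27, 29, 31, 34, 35, 37], dtype=float)
hours = np.array([0.4, 1.1, 1.9, 6.2, 8.4, 11.8, 14.5, 22.3, 30.6, 37.7])

# Exponential growth is linear in log2 space; the slope is doublings/month,
# so its reciprocal is the doubling time in months.
slope, intercept = np.polyfit(months, np.log2(hours), 1)
doubling_time = 1.0 / slope
print(f"doubling time ≈ {doubling_time:.1f} months")
```

On these point estimates the simple regression lands close to the ~4.8-month doubling time reported for the full fit.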