Time horizon vs. model release date, using LLM-predicted human work-hours, for 10 successive state-of-the-art models on WeirdML. Error bars show 95% CI from task-level bootstrap. The exponential fit (orange line/band) gives a doubling time of 4.8 months [3.8, 5.8]. Key finding: WeirdML time horizons roughly double every 5 months, from...
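To make the fit in the caption concrete, here is a minimal sketch of how a doubling time can be estimated: fit log2(time horizon) linearly against release date, so the slope is doublings per month and its inverse is the doubling time. This is not the post's actual code; the data points are placeholders, and the bootstrap below resamples points rather than tasks.

```python
import numpy as np

# Placeholder data (illustrative only, not the post's numbers):
months = np.array([0, 4, 9, 14, 20, 26])            # release date, months since first model
horizon = np.array([0.1, 0.2, 0.5, 1.0, 2.5, 6.0])  # LLM-predicted human work-hours

# Linear fit in log2 space: slope = doublings per month.
slope, intercept = np.polyfit(months, np.log2(horizon), 1)
print(f"doubling time: {1 / slope:.1f} months")

# Simple bootstrap CI: resample points with replacement and refit.
rng = np.random.default_rng(0)
doubling_times = []
for _ in range(10_000):
    idx = rng.integers(0, len(months), len(months))
    if np.unique(months[idx]).size < 2:
        continue  # degenerate resample (all same x); skip
    s, _ = np.polyfit(months[idx], np.log2(horizon[idx]), 1)
    doubling_times.append(1 / s)
lo, hi = np.percentile(doubling_times, [2.5, 97.5])
print(f"95% CI: [{lo:.1f}, {hi:.1f}] months")
```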
TL;DR: I tested whether 5 reasoning models (GPT-5, Claude-4.5 Sonnet, Grok-4, Gemini-2.5-Pro, DeepSeek-R1) could coordinate on 75 short prompts when explicitly told to match each other's responses. Models did well on concrete prompts like "A capital in Europe" → "Paris", but did worse than I expected on more open-ended...
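The excerpt doesn't show how matches were scored; one simple way to quantify coordination on a prompt is pairwise exact-match agreement after normalization. A hypothetical sketch with made-up responses:

```python
from itertools import combinations

def pairwise_agreement(answers: dict[str, str]) -> float:
    """Fraction of model pairs giving the same normalized answer to one prompt."""
    norm = [a.strip().lower() for a in answers.values()]
    pairs = list(combinations(norm, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# Placeholder responses for the prompt "A capital in Europe":
answers = {
    "gpt-5": "Paris", "claude-4.5-sonnet": "Paris", "grok-4": "paris",
    "gemini-2.5-pro": "Paris", "deepseek-r1": "London",
}
print(f"agreement: {pairwise_agreement(answers):.2f}")  # 6 of 10 pairs match -> 0.60
```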
TL;DR: By analyzing score-versus-cost data from WeirdML and Aider Polyglot, we find that the inference cost to achieve a given score halves roughly every two months. For example, to achieve the same score on WeirdML as GPT-4 (from June 2023), which cost about $0.60, we can...
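Taken at face value, a two-month halving time implies a simple exponential extrapolation. A minimal sketch, where the halving time and the ~$0.60 GPT-4 cost come from the excerpt above and everything else is an assumption:

```python
HALVING_TIME_MONTHS = 2.0
GPT4_COST_USD = 0.60  # approximate cost to reach GPT-4's WeirdML score, June 2023

def cost_to_match(months_after: float) -> float:
    """Projected cost (USD) to reach the same score `months_after` months later."""
    return GPT4_COST_USD * 0.5 ** (months_after / HALVING_TIME_MONTHS)

for m in (0, 6, 12, 24):
    print(f"{m:>2} months later: ${cost_to_match(m):.4f}")
```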
Previous post: Introducing the WeirdML Benchmark
WeirdML is a benchmark challenging LLMs to solve a set of weird and unusual machine learning tasks designed to require careful thinking and understanding of the data and its properties. We have recently run all the major historical models we could find, going back...
WeirdML website
Related posts: How good are LLMs at doing ML on an unknown dataset?; o1-preview is pretty good at doing ML on an unknown dataset
Introduction
How good are Large Language Models (LLMs) at doing machine learning on novel datasets? The WeirdML benchmark presents LLMs with weird and unusual...
Previous post: How good are LLMs at doing ML on an unknown dataset?
A while back I ran some evaluation tests on GPT-4o, Claude Sonnet 3.5 and Gemini Advanced to see how good they were at doing machine learning on a completely novel and somewhat unusual dataset. The data was...