Time horizon vs. model release date, using LLM-predicted human work-hours, for 10 successive state-of-the-art models on WeirdML. Error bars show 95% CI from task-level bootstrap. The exponential fit (orange line/band) gives a doubling time of 4.8 months [3.8, 5.8]. Key finding: WeirdML time horizons roughly double every 5 months, from...
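To make the fit in the caption concrete, here is a minimal sketch of how a doubling time can be estimated: fit log2(time horizon) linearly against release date, so the slope is doublings per month and its inverse is the doubling time. This is not the post's actual code; the data points are placeholders, and the bootstrap below resamples points rather than tasks.

```python
import numpy as np

# Placeholder data (illustrative only, not the post's numbers):
months = np.array([0, 4, 9, 14, 20, 26])            # release date, months since first model
horizon = np.array([0.1, 0.2, 0.5, 1.0, 2.5, 6.0])  # LLM-predicted human work-hours

# Linear fit in log2 space: slope = doublings per month.
slope, intercept = np.polyfit(months, np.log2(horizon), 1)
print(f"doubling time: {1 / slope:.1f} months")

# Simple bootstrap CI: resample points with replacement and refit.
rng = np.random.default_rng(0)
doubling_times = []
for _ in range(10_000):
    idx = rng.integers(0, len(months), len(months))
    if np.unique(months[idx]).size < 2:
        continue  # degenerate resample (all same x); skip
    s, _ = np.polyfit(months[idx], np.log2(horizon[idx]), 1)
    doubling_times.append(1 / s)
lo, hi = np.percentile(doubling_times, [2.5, 97.5])
print(f"95% CI: [{lo:.1f}, {hi:.1f}] months")
```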
TL;DR: I tested whether 5 reasoning models (GPT-5, Claude-4.5 Sonnet, Grok-4, Gemini-2.5-Pro, DeepSeek-R1) could coordinate on 75 short prompts when explicitly told to match each other's responses. Models did well on concrete prompts like "A capital in Europe" → "Paris", but did worse than I expected on more open-ended...
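The excerpt doesn't show how matches were scored; one simple way to quantify coordination on a prompt is pairwise exact-match agreement after normalization. A hypothetical sketch with made-up responses:

```python
from itertools import combinations

def pairwise_agreement(answers: dict[str, str]) -> float:
    """Fraction of model pairs giving the same normalized answer to one prompt."""
    norm = [a.strip().lower() for a in answers.values()]
    pairs = list(combinations(norm, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# Placeholder responses for the prompt "A capital in Europe":
answers = {
    "gpt-5": "Paris", "claude-4.5-sonnet": "Paris", "grok-4": "paris",
    "gemini-2.5-pro": "Paris", "deepseek-r1": "London",
}
print(f"agreement: {pairwise_agreement(answers):.2f}")  # 6 of 10 pairs match -> 0.60
```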
TL;DR: By analyzing score-versus-cost data from WeirdML and Aider Polyglot, we find that the inference cost to achieve a given score halves roughly every two months. For example, to achieve the same score on WeirdML as GPT-4 (from June 2023), which cost about $0.60, we can...
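Taken at face value, a two-month halving time implies a simple exponential extrapolation. A minimal sketch, where the halving time and the ~$0.60 GPT-4 cost come from the excerpt above and everything else is an assumption:

```python
HALVING_TIME_MONTHS = 2.0
GPT4_COST_USD = 0.60  # approximate cost to reach GPT-4's WeirdML score, June 2023

def cost_to_match(months_after: float) -> float:
    """Projected cost (USD) to reach the same score `months_after` months later."""
    return GPT4_COST_USD * 0.5 ** (months_after / HALVING_TIME_MONTHS)

for m in (0, 6, 12, 24):
    print(f"{m:>2} months later: ${cost_to_match(m):.4f}")
```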
Previous post: Introducing the WeirdML Benchmark
WeirdML is a benchmark challenging LLMs to solve a set of weird and unusual machine learning tasks designed to require careful thinking and understanding of the data and its properties. We have recently run all the major historical models we could find, going back...
WeirdML website
Related posts: How good are LLMs at doing ML on an unknown dataset?; o1-preview is pretty good at doing ML on an unknown dataset
Introduction
How good are Large Language Models (LLMs) at doing machine learning on novel datasets? The WeirdML benchmark presents LLMs with weird and unusual...
Previous post: How good are LLMs at doing ML on an unknown dataset?
A while back I ran some evaluation tests on GPT-4o, Claude Sonnet 3.5 and Gemini Advanced to see how good they were at doing machine learning on a completely novel and somewhat unusual dataset. The data was...