AI Benchmarking — LessWrong

This page is a stub.

Posts tagged AI Benchmarking

AI benchmarking has a Y-axis problem
Karma 79, age 10d, 3 comments, tag relevance 2

FrontierMath Score of o3-mini Much Lower Than Claimed
Karma 61, age 1y, 7 comments, tag relevance 2

Introducing BenchBench: An Industry Standard Benchmark for AI Strength
Karma 49, age 11mo, 0 comments, tag relevance 2

Broken Benchmark: MMLU
Karma 24, age 2y, 5 comments, tag relevance 2

The real reason AI benchmarks haven’t reflected economic impacts
Karma 15, age 10mo, 0 comments, tag relevance 2

Every Benchmark is Broken
Karma 93, age 23d, 0 comments, tag relevance 1

Some lessons from the OpenAI-FrontierMath debacle
Karma 71, age 1y, 9 comments, tag relevance 1

A Guide For LLM-Assisted Web Research
By nikos, dschwarz, Lawrence Phillips, FutureSearch
Karma 46, age 8mo, 3 comments, tag relevance 1

Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue)
By Roland Pihlakas, Sruthi Kuriakose, shrutidattagupta
Karma 45, age 1y, 8 comments, tag relevance 1

"Superhuman" Isn't Well Specified
Karma 34, age 9mo, 9 comments, tag relevance 1

Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols
By Arjun Panickssery, agg
Karma 33, age 2y, 0 comments, tag relevance 1

Improving Model-Written Evals for AI Safety Benchmarking
By Sunishchal Dev, Marius Hobbhahn
Karma 30, age 1y, 0 comments, tag relevance 1

Reasons to care about Canary Strings
Karma 27, age 2mo, 3 comments, tag relevance 1

ARC-AGI-2 human baseline surpassed (updated)
Karma 21, age 2mo, 3 comments, tag relevance 1

Auto-Enhance: Developing a meta-benchmark to measure LLM agents’ ability to improve other agents
By Sam F. Brown, BasilLabib, Codruta (Coco) Lugoj, Sai Sasank Y
Karma 20, age 2y, 0 comments, tag relevance 1
