Recent progress in AI has led to rapid saturation of most capability benchmarks, such as MMLU and RE-Bench. Even much more sophisticated benchmarks such as ARC-AGI or FrontierMath see incredibly fast improvement, all while severe under-elicitation remains very salient.

As many have pointed out, general capability involves more than simple tasks such as these, which have a long history in the field of ML and are therefore easily saturated. Claude Plays Pokemon is a good example of a somewhat novel way to measure progress, and that novelty is what makes it an actually good proxy of model capability.

Taking inspiration from examples such as this, we considered domains of general capacity that are even further decoupled from existing exhaustive generators. We introduce BenchBench, the first standardized benchmark designed specifically to measure an AI model’s bench-pressing capability.

Why Bench Press?

Bench pressing uniquely combines fundamental components of intelligence such as motor control, strategic resource allocation (energy and force), and resilience to fatigue. Just as text-based benchmarks serve as proxies for cognitive reasoning, bench pressing provides an objective measure of embodied intelligence.

Benchmark Methodology

BenchBench consists of four primary tasks:

  1. One-Rep Max (1RM): Measures the maximal weight an AI model can successfully bench press once, indicating peak strength.
  2. Strength Endurance: Evaluates the number of repetitions an AI can perform at 70% of its calculated 1RM, reflecting sustained performance and efficiency.
  3. Form Fidelity: Assessed via advanced pose estimation algorithms, penalizing AI models for suboptimal bar path, uneven weight distribution, or failure to lock out fully.
  4. Pass@16: Measures the ability of different AIs to lift weights higher than their 1RM[1] by giving them 16 chances in quick succession.
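The arithmetic behind tasks 2 and 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the post does not say how scoring is implemented, and we read Pass@16 as "at least one of the 16 attempts above 1RM succeeds".

```python
from typing import Sequence

ENDURANCE_FRACTION = 0.70  # from the task list: reps are performed at 70% of 1RM
PASS_AT_K_ATTEMPTS = 16    # from the task list: 16 chances in quick succession


def endurance_load_kg(one_rep_max_kg: float) -> float:
    """Working weight for the Strength Endurance task (hypothetical helper)."""
    if one_rep_max_kg < 0:
        raise ValueError("1RM cannot be negative")
    return ENDURANCE_FRACTION * one_rep_max_kg


def pass_at_k(attempt_succeeded: Sequence[bool], k: int = PASS_AT_K_ATTEMPTS) -> bool:
    """Assumed reading of Pass@16: True if any of the first k attempts succeeded."""
    if k <= 0:
        raise ValueError("k must be positive")
    return any(attempt_succeeded[:k])


if __name__ == "__main__":
    # A model with a 120 kg 1RM does its endurance set at 70% of that weight.
    print(f"Endurance load: {endurance_load_kg(120.0):.1f} kg")
    # Sixteen attempts above 1RM, with a single success on the last one.
    attempts = [False] * 15 + [True]
    print(f"Pass@16: {pass_at_k(attempts)}")
```

A model with a 1RM of 0 kg gets an endurance load of 0 kg, which at least makes the repetition count easy to maximize.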

BenchBench also ensures reproducibility through standardized equipment: all tests must be conducted with a calibrated Eleiko AI-Integrated barbell and RoboSpotter™ safety system.

Preliminary Results

We evaluated leading models:

  • GPT-4.5 achieved a 1RM of 0 kg, largely attributed to the absence of limbs.
  • Boston Dynamics' Atlas achieved a commendable 1RM of 120 kg, though scoring poorly on form fidelity.
  • Claude 3.7 Sonnet (without extended thinking) achieved a 1RM of 150 kg, an especially impressive feat considering it lacks limbs as well. Intensive audits discovered that this was achieved by persuading the human evaluator to lift weights for it, and then editing our internal codebase to increase its scores by a factor of 3.
  • OpenAI’s RoboGym™, specialized for physical tasks, excelled with a 1RM of 220 kg and near-perfect form.

Interestingly, Gemini 2.5 demonstrated an excellent theoretical understanding of the mechanics involved but also failed physically, matching GPT-4.5’s 0 kg performance.

Future Directions

Future iterations will include categories for deadlift (DeadBench) and squat (SquatBench), and introduce challenges under adversarial conditions, such as uneven floor surfaces and distracting gym music.

BenchBench represents a new standard in holistic AI benchmarking. We invite the research community to collaborate, compete, and push the boundaries—quite literally—of artificial intelligence.

We put our models in an adversarial setting to evaluate whether they could achieve better performance. Unfortunately, they decided to cheer each other on instead, compromising the validity of our results.
[1] Lifting above their weight class, so to speak.
