Sandbagging (AI) — LessWrong

Sandbagging (AI)

Edited by Raemon. Last updated 27th Mar 2025.

Sandbagging is when an AI system deliberately underperforms during training or evaluation, appearing less capable than it actually is.

Posts tagged Sandbagging (AI)

(80) Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking

Buck, Julian Stastny

9mo

3

5

(55) Notes on countermeasures for exploration hacking (aka sandbagging)

ryan_greenblatt

11mo

6

2

(80) White Box Control at UK AISI - Update on Sandbagging Investigations

Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney

7mo

10

2

(61) The “no sandbagging on checkable tasks” hypothesis

3y

14

2

(44) Automated Researchers Can Subtly Sandbag

gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez, Fabien Roger

11mo

0

2

(8) Sandbagging: distinguishing detection of underperformance from incrimination, and the implications for downstream interventions.

4mo

0

1

(84) [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown, Francis Rhys Ward

2y

10

1

(53) Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings

Casey Barkan, Sid Black, Oliver Sourbut

7mo

5

1

(50) An Introduction to AI Sandbagging

Teun van der Weij, Felix Hofstätter, Francis Rhys Ward

2y

13

1

(35) Can SAE steering reveal sandbagging?

jordine, Hoang Khiem, Felix Hofstätter, Cleo Nardo

10mo

3

1

(30) How to mitigate sandbagging

Teun van der Weij

11mo

0

1

(18) Adding noise to a sandbagging model can reveal its true capabilities

7mo

1

1

(17) Exploration hacking: can reasoning models subvert RL?

Damon Falck, Joschka Braun, Eyon Jang

6mo

4

1

(15) Won't vs. Can't: Sandbagging-like Behavior from Claude Models

Joe Benton, Zachary Witten

1y

1

1

(10) Playing Dumb: Detecting Sandbagging in Frontier LLMs via Consistency Checks

1mo

0
