Sandbagging (AI)
Written by Raemon, last updated 27th Mar 2025
Sandbagging is when an AI system pretends to be less capable than it actually is during training or evaluation.
Posts tagged Sandbagging (AI)
3
43
Notes on countermeasures for exploration hacking (aka sandbagging)
Ω
ryan_greenblatt
19d
Ω
6
2
56
The “no sandbagging on checkable tasks” hypothesis
Ω
Joe Carlsmith
2y
Ω
14
2
41
Automated Researchers Can Subtly Sandbag
Ω
gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez, Fabien Roger
17d
Ω
0
1
84
[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Ω
Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown, Francis Rhys Ward
10mo
Ω
10
1
45
An Introduction to AI Sandbagging
Ω
Teun van der Weij, Felix Hofstätter, Francis Rhys Ward
1y
Ω
13
1
23
How to mitigate sandbagging
Teun van der Weij
20d
0
1
15
Won't vs. Can't: Sandbagging-like Behavior from Claude Models
Ω
Joe Benton, Zachary Witten
2mo
Ω
1