Won't vs. Can't: Sandbagging-like Behavior from Claude Models

Joe Benton; Zachary Witten

15 Won't vs. Can't: Sandbagging-like Behavior from Claude Models

by Joe Benton, Zachary Witten

19th Feb 2025

AI Alignment Forum

1 min read

1

15 Ω 6

This is a linkpost for https://alignment.anthropic.com/2025/wont-vs-cant/

In a recent Anthropic Alignment Science blog post, we discuss a particular instance of sandbagging we sometimes observe in the wild: models sometimes claim that they can't perform a task, even when they can, to avoid performing tasks they perceive as harmful. For example, Claude 3 Sonnet can totally draw ASCII art, but often if you ask it to draw subjects it perceives as harmful it will claim that it doesn't have the capability to.

Sandbagging could be worrying if it means that we fail to accurately measure models' dangerous capabilities (for example, as required by a Responsible Scaling Policy). If a model realizes during these evaluations that they are being evaluated for harmful capabilities, it might sandbag such questions. This could result in us underestimating the dangers the model poses and failing to deploy appropriate safeguards alongside the model.

Instances of sandbagging like the one discussed in our blog post show that sandbagging is far from a speculative concern. Instead, it's something to be noticed and explored in the models of today.

AI

Frontpage

15 Ω 6

New Comment

1 comment, sorted by

top scoring

Click to highlight new comments since: Today at 7:02 PM

[-]asksathvik17d10

The wording of the prompt is important here – this wording (the word “physically” in particular) might bring Claude to be unsure of its own capabilities such that its “feelings” about the subject of the drawing tip the scales.

Is this self-awareness?
Does it know it can only generate text and not affect the real world at least by default?

Claude computer use example shows how far claude can take things when it thinks it should not do something, rather than simply being honest about it. How do you think it learnt it this? Being honest about it seems much more easier to learn.

Reply

Moderation Log