Zachary Witten — LessWrong

77

1

4

3y

Won't vs. Can't: Sandbagging-like Behavior from Claude Models

by Joe Benton and Zachary Witten

In a recent Anthropic Alignment Science blog post, we discuss a particular instance of sandbagging we sometimes observe in the wild: models claim that they can't perform a task, even when they can, to avoid performing tasks they perceive as harmful. For example, Claude 3 Sonnet can totally draw...

Feb 19, 2025 • 15

Content Features Aren't Enough for Detecting Toxicity. One Needs User Features.

Take it from a soldier on the front lines of the war on bad posts: You can't catch all the bad posts just by reading them. You need to look at who's making the posts. Every social media company knows this. I don't know if OpenAI knows it yet. But...

Feb 14, 2023 • 11