Tom Tseng

7mo

Leading AI companies are increasingly using "defense-in-depth" strategies to prevent their models from being misused to generate harmful content, such as instructions to generate chemical, biological, radiological or nuclear (CBRN) weapons. The idea is straightforward: layer multiple safety checks so that even if one fails, others should catch the problem. Anthropic employs this approach with Claude 4 Opus through constitutional classifiers, while Google DeepMind and OpenAI have announced similar plans. We tested how well multi-layered defence approaches like these work by constructing and attacking our own layered defence pipeline, in collaboration with researchers from the UK AI Security Institute.

We find that multi-layer defenses can offer significant protection against conventional attacks. We wondered:... (read 1139 more words →)

1

13

Does robustness improve with scale?

ChengCheng

ChengCheng, niki.h, Ian McKenzie, Oskar Hollinsworth, Tom Tseng, AdamGleave

2y

Adversarial vulnerabilities have long been an issue in various ML systems. Large language models (LLMs) are no exception, suffering from issues such as jailbreaks: adversarial prompts that bypass model safeguards. At the same time, scale has led to remarkable advances in the capabilities of LLMs, leading us to ask: to what extent can scale help solve robustness? In this post, we explore this question in the classification setting: predicting the binary label of a text input. We find that scale alone does little to improve model robustness, but that larger models benefit more from defenses such as adversarial training than do smaller models.

We study models in the classification setting as there is... (read 183 more words →)

14

Even Superhuman Go AIs Have Surprising Failure Modes

AdamGleave

AdamGleave, EuanMcLean, Tony Wang, Kellin Pelrine, Tom Tseng, Yawen Duan, Joseph Miller, MichaelDennis

3y

In March 2016, AlphaGo defeated the Go world champion Lee Sedol, winning four games to one. Machines had finally become superhuman at Go. Since then, Go-playing AI has only grown stronger. The supremacy of AI over humans seemed assured, with Lee Sedol commenting they are an "entity that cannot be defeated". But in 2022, amateur Go player Kellin Pelrine defeated KataGo, a Go program that is even stronger than AlphaGo. How?

It turns out that even superhuman AIs have blind spots and can be tripped up by surprisingly simple tricks. In our new paper, we developed a way to automatically find vulnerabilities in a "victim" AI system by training an adversary AI system... (read 2808 more words →)

22

130

LESSWRONG
LW

LESSWRONG
LW

Tom Tseng

Tom Tseng

Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations

Does robustness improve with scale?

Even Superhuman Go AIs Have Surprising Failure Modes

Tom Tseng

Tom Tseng

Tom Tseng

Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations

Does robustness improve with scale?

Even Superhuman Go AIs Have Surprising Failure Modes