AI Control

This is a collection of posts about AI Control, an approach to AI safety that focuses on safety measures aimed at preventing powerful AIs from causing unacceptably bad outcomes even if those AIs are misaligned and intentionally try to subvert the measures.

These posts are useful for understanding the AI Control approach, including its upsides and downsides. They cover only a small fraction of the AI safety work relevant to AI control.

260 The case for ensuring that powerful AIs are controlled

ryan_greenblatt, Buck

11mo

66

228 AI Control: Improving Safety Despite Intentional Subversion

Buck, Fabien Roger, ryan_greenblatt, Kshitij Sachan

1y

18

87 Untrusted smart models and trusted dumb models

1y

17

104 Catching AIs red-handed

ryan_greenblatt, Buck

1y

22

294 Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

4mo

76

119 AI catastrophes and rogue deployments

6mo

16

96 Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy

Buck, ryan_greenblatt

1y

19

44 Auditing failures vs concentrated failures

ryan_greenblatt, Fabien Roger

1y

0

42 Protocol evaluations: good analogies vs control

10mo

10

70 How useful is "AI Control" as a framing on AI X-Risk?

habryka, ryan_greenblatt

9mo

4

143 Fields that I reference when thinking about AI takeover prevention

4mo

16

89 New report: Safety Cases for AI

9mo

14

48 Notes on control evaluations for safety cases

ryan_greenblatt, Buck, Fabien Roger

10mo

0

51 Toy models of AI control for concentrated catastrophe prevention

Fabien Roger, Buck

11mo

2

43 Games for AI Control

charlie_griffin, Buck

5mo

0