
Adversarial Examples (AI) — LessWrong



Edited by Multicore, Ruby last updated 14th Dec 2024

You are viewing revision 1.2.0, last edited by Ruby

This page is a stub.


Posts tagged Adversarial Examples (AI)

8

675 SolidGoldMagikarp (plus, prompt generation)

Jessica Rumbelow, mwatkins

3y

208

3

165 Ironing Out the Squiggles

2y

37

3

71 AI Safety in a World of Vulnerable Machine Learning Systems

AdamGleave, EuanMcLean

3y

29

2

127 Deep Forgetting & Unlearning for Safely-Scoped LLMs

2y

30

2

100 Solving adversarial attacks in computer vision as a baby version of general AI alignment

2y

10

2

59 Human beats SOTA Go AI by learning an adversarial policy

3y

29

2

38 What progress have we made on automated auditing?

2y

1

2

35 If I were a well-intentioned AI... I: Image classifier

Stuart_Armstrong

6y

4

2

31 Adversarial Policies Beat Professional-Level Go AIs

3y

35

2

30 Adversarial Robustness Could Help Prevent Catastrophic Misuse

2y

18

2

13 The Goodhart Game

6y

5

2

12 AXRP Episode 1 - Adversarial Policies with Adam Gleave

5y

5

2

5 RAIN: Your Language Models Can Align Themselves without Finetuning - Microsoft Research 2023 - Reduces the adversarial prompt attack success rate from 94% to 19%!

Singularian2501

2y

0

1

142 High-stakes alignment via adversarial training [Redwood Research report]

dmz, LawrenceC, Nate Thomas

4y

29

1

130 Even Superhuman Go AIs Have Surprising Failure Modes

AdamGleave, EuanMcLean, Tony Wang, Kellin Pelrine, Tom Tseng, Yawen Duan, Joseph Miller, MichaelDennis

3y

22
