Adversarial Training
This page is a stub.
Posts tagged Adversarial Training (most relevant first)
Ironing Out the Squiggles, by Zack_M_Davis (relevance 2, 157 karma, 11mo, 36 comments)
Takeaways from our robust injury classifier project [Redwood Research], by dmz (Ω, relevance 2, 143 karma, 3y, 12 comments)
High-stakes alignment via adversarial training [Redwood Research report], by dmz, LawrenceC, Nate Thomas (Ω, relevance 2, 142 karma, 3y, 29 comments)
Deep Forgetting & Unlearning for Safely-Scoped LLMs, by scasper (Ω, relevance 2, 124 karma, 1y, 30 comments)
Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming, by Buck (Ω, relevance 2, 100 karma, 5mo, 4 comments)
Solving adversarial attacks in computer vision as a baby version of general AI alignment, by Stanislav Fort (Ω, relevance 2, 87 karma, 7mo, 8 comments)
Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing, by Buck (Ω, relevance 2, 40 karma, 3y, 0 comments)
Adversarial Robustness Could Help Prevent Catastrophic Misuse, by aog (Ω, relevance 2, 30 karma, 1y, 18 comments)
Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?, by scasper (Ω, relevance 2, 25 karma, 8mo, 0 comments)
AI Safety 101 - Chapter 5.2 - Unrestricted Adversarial Training, by Charbel-Raphaël (relevance 2, 17 karma, 1y, 0 comments)
AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler, by DanielFilan (Ω, relevance 2, 16 karma, 3y, 0 comments)
Some thoughts on why adversarial training might be useful, by Beth Barnes (Ω, relevance 2, 9 karma, 3y, 6 comments)
Latent Adversarial Training, by Adam Jermyn (Ω, relevance 1, 52 karma, 3y, 13 comments)
Beyond the Board: Exploring AI Robustness Through Go, by AdamGleave (Ω, relevance 1, 41 karma, 9mo, 2 comments)
EIS IX: Interpretability and Adversaries, by scasper (Ω, relevance 1, 30 karma, 2y, 8 comments)