Adversarial Training
• Applied to High-stakes alignment via adversarial training [Redwood Research report] by Zach Stein-Perlman 1mo ago
• Applied to Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming by Buck 1mo ago
• Applied to Solving adversarial attacks in computer vision as a baby version of general AI alignment by Stanislav Fort 3mo ago
• Applied to Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals? by RobertM 4mo ago
• Applied to Does robustness improve with scale? by ChengCheng 4mo ago
• Applied to Beyond the Board: Exploring AI Robustness Through Go by AdamGleave 5mo ago
• Applied to Ironing Out the Squiggles by Zack_M_Davis 7mo ago
• Applied to Some thoughts on why adversarial training might be useful by Zach Stein-Perlman 10mo ago
• Applied to Adversarial Robustness Could Help Prevent Catastrophic Misuse by aogara 1y ago
• Applied to Deep Forgetting & Unlearning for Safely-Scoped LLMs by scasper 1y ago
• Applied to AI Safety 101 - Chapter 5.2 - Unrestricted Adversarial Training by jacobjacob 1y ago
• Applied to AI Safety 101 - Chapter 5.1 - Debate by Charbel-Raphaël 1y ago
• Applied to Against Almost Every Theory of Impact of Interpretability by Charbel-Raphaël 1y ago
• Applied to Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI by Benaya Koren 1y ago
• Applied to EIS IX: Interpretability and Adversaries by scasper 2y ago
• Applied to EIS XII: Summary by scasper 2y ago
• Applied to EIS XI: Moving Forward by scasper 2y ago
• Applied to Takeaways from our robust injury classifier project [Redwood Research] by Ruby 2y ago
• Applied to Oversight Leagues: The Training Game as a Feature by janus 2y ago