AI Control
• Applied to Reduce AI Self-Allegiance by saying "he" instead of "I" by Knight Lee 3d ago
• Applied to Measuring whether AIs can statelessly strategize to subvert security measures by Alex Mallen 7d ago
• Applied to A toy evaluation of inference code tampering by Gunnar_Zarncke 17d ago
• Applied to The Queen’s Dilemma: A Paradox of Control by Raemon 1mo ago
• Applied to Why imperfect adversarial robustness doesn't doom AI control by Raemon 1mo ago
• Applied to Using Dangerous AI, But Safely? by Raemon 1mo ago
• Applied to Sabotage Evaluations for Frontier Models by Buck 1mo ago
• Applied to Win/continue/lose scenarios and execute/replace/audit protocols by Buck 1mo ago
• Applied to Toward Safety Cases For AI Scheming by Mikita Balesni 2mo ago
• Applied to Dario Amodei's "Machines of Loving Grace" sound incredibly dangerous, for Humans by Super AGI 2mo ago
• Applied to A Brief Explanation of AI Control by Aaron_Scher 2mo ago
• Applied to Coup probes: Catching catastrophes with probes trained off-policy by StefanHex 2mo ago
• Applied to Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming by Buck 3mo ago
• Applied to Schelling game evaluations for AI control by Olli Järviniemi 3mo ago
• Applied to How to prevent collusion when using untrusted models to monitor each other by Buck 3mo ago
• Applied to Secret Collusion: Will We Know When to Unplug AI? by schroederdewitt 3mo ago