All of wnx's Comments + Replies

wnx · 32

Alignment approaches at different abstraction levels (e.g., macro-level interpretability, scaffolding/module-level AI system safety, systems-level theoretic process analysis for safety) are something I have been hoping to see more of. I am thrilled by this meta-level red-teaming work and excited to see the announcement of the new team.

wnx · 21

Hey, great stuff -- thank you for sharing! I found this especially useful as somebody who has been "out" of alignment for 6 months and is looking to set up a new research agenda.