AI Control is a subfield of AI Safety research focused on ensuring that powerful AI systems remain safe even under the assumption that they might be misaligned or intentionally subversive (A Brief Explanation of AI Control).
Unlike AI Alignment, which seeks to build inherently aligned systems, AI Control develops mechanisms to monitor, evaluate, constrain, verify, and manage the behavior of potentially untrustworthy AI to prevent catastrophic outcomes (The case for ensuring that powerful AIs are controlled).
This review outlines key AI control methods discussed in recent literature. For a broader exploration of related topics, refer to An overview of areas of control work.
1. Monitoring
Description: Monitoring encompasses techniques for observing AI system behavior, outputs, ...
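Since the description is cut off here, the following is only a minimal sketch of one standard monitoring setup from the AI control literature (trusted monitoring): a weaker but trusted model scores each output of the untrusted model for suspiciousness, and high-scoring outputs are escalated for human audit. All names below (`trusted_monitor`, `AUDIT_THRESHOLD`, the regex red flags) are illustrative assumptions, not taken from the original post.

```python
import re

# Minimal, illustrative sketch of trusted monitoring (an assumed setup,
# not the post's own code): a trusted but weaker monitor scores each
# output of a more capable, untrusted model, and outputs above a
# threshold are escalated for human audit instead of being deployed.

AUDIT_THRESHOLD = 0.5  # hypothetical policy choice: how suspicious is too suspicious

def trusted_monitor(output: str) -> float:
    """Toy stand-in for a trusted model; returns a suspicion score in [0, 1].

    A real monitor would typically be a separate, trusted LLM prompted to
    rate the output; here we just match a few crude red-flag patterns.
    """
    red_flags = [r"rm -rf", r"curl .*\| *sh", r"eval\("]
    hits = sum(bool(re.search(p, output)) for p in red_flags)
    return min(1.0, 0.5 * hits)  # each crude red flag adds 0.5

def route_output(output: str) -> str:
    """Decide whether an untrusted model's output ships or gets audited."""
    if trusted_monitor(output) >= AUDIT_THRESHOLD:
        return "escalate_to_human_audit"
    return "deploy"

if __name__ == "__main__":
    for out in ["print('hello world')", "os.system('rm -rf /tmp/cache')"]:
        print(f"{out!r} -> {route_output(out)}")
```

In the control-evaluation literature, the binding constraint is the audit budget (the fraction of outputs humans can actually review), so a threshold like the one above would in practice be tuned to spend that budget on the most suspicious outputs.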
Could you discuss why you think these areas are important, and the theories of change behind them? Though I know the deadline has passed, I'm keen to build a project on monitoring AI behavior in production and on incident tracking, so I'm curious what research Coefficient Giving has done that suggests these areas have gaps.