Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios
This is the second post in the sequence “Interpretability Research for the Most Important Century”. The first post, which introduces the sequence, defines several terms, and provides a comparison to existing works, can be found here: Introduction to the sequence: Interpretability Research for the Most Important Century.

Summary

This post explores the extent to which interpretability is relevant to the hardest, most important parts of the AI alignment problem (property #1 of High-leverage Alignment Research[1]).

First, I give an overview of the four important parts of the alignment problem (following Hubinger[2]): outer alignment, inner alignment, training competitiveness and performance competitiveness (jump to section).

Next, I discuss which of them is “hardest”, taking the position that it is inner alignment (if you have to pick just one), and also that it’s hard to find alignment proposals which simultaneously address all four parts well.

Then, I move on to exploring how interpretability could impact these four parts of alignment. My primary vehicle for this exploration is imagining and analyzing seven best-case scenarios for interpretability research (jump to section). Each of these scenarios represents a possible endgame story for technical alignment, hinging on one or more potential major breakthroughs in interpretability research. The scenarios’ impacts on alignment vary, but usually involve solving inner alignment to some degree and then indirectly benefiting outer alignment and performance competitiveness; impacts on training competitiveness are more mixed.

Finally, I discuss the likelihood that interpretability research could contribute to unknown solutions to the alignment problem (jump to section). This includes examining interpretability’s potential to lead to breakthroughs in our basic understanding of neural networks and AI, deconfusion research, and paths to solving alignment that are difficult to predict or otherwise not captured by the seven scenarios.
Actually, the OGI-1 model (and, to a lesser extent, the OGI-N model) does do something important to address loss-of-control risks from AGI (or ASI): it reduces competitive race dynamics.
There are plausible scenarios where it is technically possible for a lab to safely develop AGI, but where doing so would require it to slow down development. When a lab is competitively racing against other AGI projects, the incentive to proceed with risky development is (potentially much) stronger. But when a lab doesn't have to worry about competitors, it at least has an opportunity to pursue costly safety measures without sacrificing its lead.
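To make the incentive gap concrete, here is a minimal toy calculation, a sketch of my own rather than anything from the post: the payoff structure and every number in it are illustrative assumptions. It compares the expected payoff of a risky "fast" strategy versus a costly "slow" safety strategy, with and without a competitor in the race.

```python
# Toy model of racing incentives (illustrative sketch only; all numbers
# below are made-up assumptions, not claims from the post).
#
# A lab chooses between "fast" development and a costly "slow" safety
# approach. Winning yields a prize; winning with unsafe development can
# instead produce a catastrophe. Losing the race is scored as 0 here,
# which is a deliberate simplification.

def expected_value(p_win: float, p_safe: float,
                   prize: float = 1.0, catastrophe: float = -1.0) -> float:
    """Expected payoff: win with probability p_win; a win is worth
    `prize` if development stays safe (probability p_safe) and
    `catastrophe` otherwise."""
    return p_win * (p_safe * prize + (1 - p_safe) * catastrophe)

# Racing against a competitor: slowing down sharply cuts the chance of
# getting there first, so the risky "fast" option looks better.
race_fast = expected_value(p_win=0.6, p_safe=0.7)   # 0.24
race_slow = expected_value(p_win=0.2, p_safe=0.95)  # 0.18

# No serious competitor: the lab keeps its lead either way, so the
# safety slowdown costs little and the "slow" option dominates.
solo_fast = expected_value(p_win=1.0, p_safe=0.7)   # 0.40
solo_slow = expected_value(p_win=1.0, p_safe=0.95)  # 0.90

print(f"racing:  fast={race_fast:.2f}, slow={race_slow:.2f}")
print(f"no race: fast={solo_fast:.2f}, slow={solo_slow:.2f}")
```

Under these made-up numbers, the presence of a competitor flips the comparison: the risky fast strategy only looks attractive because slowing down sacrifices the lead, which is the dynamic the paragraph above describes.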