Unpredictability and the Increasing Difficulty of AI Alignment for Increasingly Intelligent AI
Abstract

The question of how the increasing intelligence of AIs influences central problems in AI safety remains neglected. We use the framework of reinforcement learning to discuss what continuous increases in the intelligence of AI systems imply for central problems in AI safety. We first argue that predicting the actions of an AI becomes increasingly difficult for increasingly intelligent systems, and is impossible for sufficiently intelligent AIs. We then briefly argue that solving the alignment problem requires predicting features of the AI agent, such as its actions or goals. On this basis, we give a general argument that the alignment problem becomes more difficult for increasingly intelligent AIs. In brief, this is because the increased capabilities of such systems require predicting their goals or actions ever more precisely to ensure sufficient safety and alignment. We then consider specific problems facing different approaches to the alignment problem. Finally, we conclude by discussing what this increasing difficulty means for the chances of catastrophic risks due to a potential failure to solve the alignment problem.

1. Introduction

Several papers have argued that sufficiently advanced artificial intelligence will be unpredictable and unexplainable [Yampolskiy, 2020a, 2020b, 2022]. Within AI safety, there further seems to be broad agreement that AI alignment, that is, broadly speaking, making AIs try to do what we want, becomes more difficult as AI systems get more intelligent. However, the arguments for unpredictability have been brief and rather informal, and have not been integrated with current machine learning concepts. Further, the argument that alignment becomes more difficult with increasing intelligence has mostly been motivated informally by various potential failure modes, such as instrumental convergence, misgeneralization, intelligence explosion, or deceptive alignment, rather than by an overarching theoretical account. Both argument