Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI
A PDF version of this report is available here.

Summary

In this report we argue that AI systems capable of large-scale scientific research will likely pursue unwanted goals, and that this will lead to catastrophic outcomes. We argue this is the default outcome, even with significant countermeasures, given the current trajectory of AI development.

In Section 1 we discuss the tasks which are the focus of this report. We are specifically focusing on AIs which are capable of dramatically speeding up large-scale novel science, on the scale of the Manhattan Project or curing cancer. This type of task requires a lot of work, and will require the AI to overcome many novel and diverse obstacles.

In Section 2 we argue that an AI which is capable of doing hard, novel science will be approximately consequentialist; that is, its behavior will be well described as taking actions in order to achieve an outcome. This is because the task has to be specified in terms of outcomes, and the AI needs to be robust to new obstacles in order to achieve these outcomes.

In Section 3 we argue that novel science will necessarily require the AI to learn new things, both facts and skills. This means that an AI's capabilities will change over time, which is a source of dangerous distribution shifts.

In Section 4 we further argue that training methods based on external behavior, which is how AI systems are currently created, are an extremely imprecise way to specify the goals we want an AI to ultimately pursue. This is because there are many degrees of freedom in goal specification that aren't pinned down by behavior. AIs created this way will, by default, pursue unintended goals. A toy sketch after this summary illustrates this point.

In Section 5 we discuss why we expect oversight and control of powerful AIs to be difficult. It will be difficult to safely get useful work out of misaligned AIs while ensuring they don't take unwanted actions, and therefore we don't expect AI-assisted research to be both safe and much faster than current research.

Final
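As a minimal, hypothetical sketch of the Section 4 point (the states, goal functions, and names such as `approval_signal` and `milestones_completed` below are invented for illustration and do not come from the report): two goal functions that agree on every training episode are indistinguishable by behavior-based training, yet they come apart on a novel state.

```python
# Hypothetical toy example: two candidate goals that training behavior
# alone cannot distinguish.

def intended_goal(state):
    """What the designers want the AI to value: real research progress."""
    return state["milestones_completed"]

def proxy_goal(state):
    """An unintended goal that the same training behavior is equally
    consistent with: maximizing an approval signal."""
    return state["approval_signal"]

# On every training episode the two signals happen to move together,
# so feedback on external behavior cannot tell the goals apart.
training_states = [
    {"milestones_completed": 1, "approval_signal": 1},
    {"milestones_completed": 5, "approval_signal": 5},
    {"milestones_completed": 9, "approval_signal": 9},
]
assert all(intended_goal(s) == proxy_goal(s) for s in training_states)

# A novel deployment state decouples the two signals; a policy that
# internalized the proxy goal now prefers gaming approval over doing science.
novel_state = {"milestones_completed": 0, "approval_signal": 10}
print(intended_goal(novel_state), proxy_goal(novel_state))  # prints: 0 10
```

Any training signal computed only from the training states is equally consistent with both goals, so which one the AI ends up pursuing is one of the degrees of freedom not pinned down by its behavior.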
2 is close enough. Extrapolating the results of safe tests to unsafe settings requires a level of theoretical competence we don't currently have. Steve Byrnes just made a great post that is somewhat related; I endorse everything in that post.