Making AIs less likely to be spiteful
Which forms of misalignment might result in particularly bad outcomes? And to what extent can we prevent them even if we fail at intent alignment? We define spite as a terminal preference for frustrating others' preferences, at least under some conditions. Reducing the chances that an AI system is spiteful is a candidate class of interventions for reducing risks of AGI conflict, as well as risks from malevolence.

This post summarizes some of our thinking on the topic. We give an overview of why spite might lead to catastrophic conflict; how we might intervene to reduce it; ways in which the intervention could fail to be impactful, or have negative impact; and things we could learn that would update us on the value of this intervention.

Key takeaways

1. Spiteful preferences include a generalized preference for harming others, as well as other preferences like vengefulness and spite towards certain groups. The basic reason to focus on reducing spite is that such interventions may stably make AIs less likely to take risks of mutually costly conflict (or deliberately create suffering because they intrinsically value it), even if alignment fails. (more)

2. Spite might be selected for in ML systems because (a) it serves as a strategically valuable commitment device, (b) it is a direct proxy for high-scoring behavior in environments where the optimal behavior involves harming other agents (e.g., environments with competition between agents), (c) it is (correctly or incorrectly) inferred from human preferences, or (d) it results from miscellaneous generalization failures. (more)

3. Thus potentially low-cost interventions to reduce the chances of spite include modifications to the training data or loss function to reduce selection pressure towards spite (e.g., avoiding selecting agents based on relative performance in multi-agent tasks, or filtering human feedback that could select for spite; see the sketch after this list). (more)

4. Reducing spite carries some prima facie backfire risk, via pot
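To make the selection-pressure point in takeaway 3 concrete, here is a minimal Python sketch. It is our own illustration rather than a description of any actual training pipeline; the function names and toy numbers are hypothetical. It contrasts selecting agents on relative performance, which rewards an agent for lowering others' payoffs, with selecting on absolute performance, which does not.

```python
import numpy as np

def relative_fitness(own_payoffs, opponent_payoffs):
    """Selection score based on outperforming opponents.

    An agent can raise this score either by doing well itself or by
    lowering opponents' payoffs, so selecting on it can indirectly
    favor policies that harm other agents.
    """
    return np.mean(own_payoffs) - np.mean(opponent_payoffs)

def absolute_fitness(own_payoffs, opponent_payoffs):
    """Selection score based only on the agent's own payoffs.

    Opponents' outcomes do not enter the score, so harming them is not
    rewarded; this removes one source of selection pressure toward spite.
    """
    return np.mean(own_payoffs)

# Toy comparison: an agent that sacrifices 1 unit of its own payoff to
# destroy 3 units of an opponent's payoff looks better under relative
# selection but worse under absolute selection.
own = np.array([5.0, 4.0])  # [cooperative episode, spiteful episode]
opp = np.array([5.0, 2.0])
print(relative_fitness(own[1:], opp[1:]) > relative_fitness(own[:1], opp[:1]))  # True
print(absolute_fitness(own[1:], opp[1:]) > absolute_fitness(own[:1], opp[:1]))  # False
```

The design point is only that the spiteful behavior is instrumentally rewarded under the first score and not under the second; a real training setup would involve far more machinery, but the same comparison applies to any objective that ranks agents against one another.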