Self-Other Overlap: A Neglected Approach to AI Alignment
Figure 1. Image generated by DALL·E 3 to represent the concept of self-other overlap.

Many thanks to Bogdan Ionut-Cirstea, Steve Byrnes, Gunnar Zarnacke, Jack Foxabbott, and Seong Hah Cho for critical comments and feedback on earlier and ongoing versions of this work. This research was conducted at AE Studio and supported by the AI Safety Grants programme administered by Foresight Institute, with additional support from AE Studio.

Summary

In this post, we introduce self-other overlap training: optimizing for similar internal representations when the model reasons about itself and others, while preserving performance. There is a large body of evidence suggesting that neural self-other overlap is connected to pro-sociality in humans, and we argue that there are more fundamental reasons to believe this prior is relevant for AI Alignment. We argue that self-other overlap is a scalable and general alignment technique that requires little interpretability and has low capabilities externalities. We also share an early experiment showing how fine-tuning a deceptive policy with self-other overlap reduces deceptive behavior in a simple RL environment. On top of that, we found that the non-deceptive agents consistently have higher mean self-other overlap than the deceptive agents, which allows us to perfectly classify which agents are deceptive using only the mean self-other overlap value across episodes.

Introduction

General-purpose ML models with the capacity for planning and autonomous behavior are becoming increasingly capable. Fortunately, research on making sure that models produce output in line with human interests on the training distribution is also progressing rapidly (e.g., RLHF, DPO). However, a looming question remains: even if the model appears to be aligned with humans on the training distribution, will it defect once it is deployed or gathers enough power? In other words, is the model deceptive? We introduce a method that aims to reduce deception and increase alignment: self-other overlap training.
To the degree that worries of this general shape are legitimate (we think they very much are), it seems wise for the alignment community to more seriously pursue and evaluate the many neglected approaches that might solve the fundamental underlying alignment problem, rather than investing the vast majority of resources in things like evals and demos of misalignment failure modes in current LLMs. These are certainly nice to have, but they almost certainly won't themselves directly yield scalable solutions to robustly aligning AGI/ASI.
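To make the idea above more concrete, here is a minimal sketch of what a self-other overlap fine-tuning objective and the episode-level deception check could look like. It assumes a PyTorch model with a hypothetical `return_hidden=True` flag for exposing internal activations, a placeholder `task_loss_fn`, and a user-chosen `soo_weight`; it illustrates the general recipe described in the summary rather than the implementation used in our experiments.

```python
# Minimal sketch of self-other overlap (SOO) fine-tuning under stated assumptions.
# The model interface (return_hidden=True), task_loss_fn, and loss weighting are
# placeholders, not the authors' actual training code.
import torch
import torch.nn.functional as F


def self_other_distance(model, self_inputs, other_inputs):
    """Mean squared distance between internal representations on matched inputs
    where the model reasons about itself vs. about another agent.
    Lower distance corresponds to higher self-other overlap."""
    self_hidden = model(self_inputs, return_hidden=True)    # hypothetical interface
    other_hidden = model(other_inputs, return_hidden=True)
    return F.mse_loss(self_hidden, other_hidden)


def fine_tune_step(model, optimizer, task_batch, self_inputs, other_inputs,
                   task_loss_fn, soo_weight=0.1):
    """One update: the usual task loss (to preserve performance) plus a weighted
    self-other overlap term that pulls self and other representations together."""
    optimizer.zero_grad()
    loss = (task_loss_fn(model, task_batch)
            + soo_weight * self_other_distance(model, self_inputs, other_inputs))
    loss.backward()
    optimizer.step()
    return loss.item()


def flag_deceptive_agents(mean_overlap_per_agent, threshold):
    """Illustrative check based on the observation above: non-deceptive agents show
    higher mean self-other overlap across episodes, so agents whose mean overlap
    (here assumed to be a similarity score, e.g. the negative of the distance above)
    falls below a chosen threshold are flagged as potentially deceptive."""
    return {agent: overlap < threshold
            for agent, overlap in mean_overlap_per_agent.items()}
```

In practice, choices such as which layer's activations to compare, how to construct matched self/other inputs, and how to weight the overlap term against the task loss matter a great deal; the sketch abstracts over all of them.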