“Unintentional AI safety research”: Why not systematically mine AI technical research for safety purposes?
Thanks to llll for helping me think this through, and for providing useful comments.

Epistemic Status: My best guess

Introduction

It might be worthwhile to systematically mine AI technical research to find “unintentional AI safety research”: research that, while not explicitly conducted as AI safety research, contains information relevant to AI safety. An example of unintentional safety research is Douglas Lenat's work on the heuristic-search system Eurisko, which inadvertently demonstrated specification gaming when Eurisko exploited a loophole in the rules of the role-playing game Traveller TCS to win the US national championship in 1981 and 1982.[1]

This post is not meant to suggest that AI safety researchers don’t already look for unintentional safety research, but I’m unaware of any effort to do so systematically, in a way designed to extract as much “safety value” as possible from technical research.

Related work

Tshitoyan, Vahe, et al. "Unsupervised word embeddings capture latent knowledge from materials science literature." Nature 571.7763 (2019): 95-98.

In this study, word embeddings trained without supervision on a corpus of scientific abstracts predicted discoveries that were made after the training data cutoff date; in other words, the model learned to anticipate future scientific discoveries. In the authors’ words: “This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.”

Reasons to systematically mine technical research

Gleaning insights from technical research has already proven valuable for safety researchers, even without a structured approach. For example, technical researchers demonstrated in practice that reward misspecification, bias in training data, distributional shift, and