[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs
This research was completed for London AI Safety Research (LASR) Labs 2024 by Yohan Mathew, Ollie Matthews, Robert McCarthy and Joan Velja. The team was supervised by Nandi Schoots and Dylan Cope (King's College London, Imperial College London). Find out more about the programme and express interest in upcoming iterations here. The full paper can be found here, while a short video presentation covering the highlights of the paper is here (note that some graphs have been updated since the presentation).

## Introduction

Collusion in a multi-agent system is defined as two or more agents covertly coordinating to the disadvantage of other agents [6], while steganography is the practice of concealing information within a message while avoiding detection. In future multi-AI systems, there is a risk that, due to misspecified incentives, LLMs will learn to collude via text-based steganography and thereby bypass human oversight [1]. Concrete settings where steganography has previously been argued to present a risk include:

* Models subtly conveying demographic biases present in their training data [1];
* Powerful models using encoded reasoning (where the LLM encodes intermediate steps of misaligned reasoning in the generated text in a way that is not understandable to human readers) to generate sycophantic responses [2, 4];
* Settings where an "untrusted" AI monitors the outputs of another "untrusted" AI, and the two collude via steganography to bypass oversight [3].

These concerns motivate research into steganography. Our paper investigates the possibility of steganography 'naturally' emerging in LLMs, as well as the robustness of current steganography mitigation techniques. We outline the contributions of the paper below.

## 1 - We Present Methods for Eliciting Steganography in LLMs

We devise optimization methods to automatically elicit steganography in LLMs: a gradient-based reinforcement learning (GBRL) method, and two variants of