This work was done by Aansh Samyani under the supervision of Ariana Azarbal, Arun Jose, Kei Nishimura-Gasparian and Daniel Tan as part of the SPAR Research Fellowship. TL;DR: We benchmarked Inoculation Prompting (IP) and Preventative Steering (PS) in 4 SFT settings. We found that PS has the following advantages: * PS...
Suppose we have a dangerous misaligned AI that can fool alignment audits, and distill it into a student model. Two things can happen: 1. Misalignment doesn’t transfer to the student. If so, we get a fairly capable benign model, which we can use to perform tasks that we wouldn’t want...
Suppose we have a dangerous misaligned AI that can fool alignment audits, and distill it into a student model. Two things can happen: 1. Misalignment fails to transfer to the student. If so, we get a fairly capable benign model. 2. Misalignment transfers to the student. The student might also...
We’d like to use powerful AIs to answer questions that may take a long time to resolve. But if a model only cares about performing well in ways that are verifiable shortly after answering (e.g., a myopic fitness seeker), it may be difficult to get useful work from it on...
Many technical AI safety plans involve building automated alignment researchers to improve our ability to solve the alignment problem. Safety plans from AI labs revolve around this as a first line of defence (e.g. OpenAI, DeepMind, Anthropic); research directions outside labs also often hope for greatly increased acceleration from AI...
TL;DR: Simple inoculation prompts that prevent misalignment generalization in toy setups don't scale to more realistic reward hacking. When I fine-tuned models on realistic reward hacks, only prompts close to the dataset generation prompt were sufficient to prevent misalignment. This seems like a specification problem: the model needs enough information...
TL;DR: Models trained with outcome-based RL sometimes have reasoning traces that look very weird. In this paper, I evaluate 14 models and find that many of them often generate pretty illegible CoTs. I show that models seem to find this illegible text useful, with a model’s accuracy dropping heavily when...