We might be in a generative AI bubble. There are many potential signs of this around: * Business investment in generative AI has had very low returns. * Expert opinion is turning against LLMs, including some of the early LLM promoters (I also get this message from personal conversations with...
Replicating the Emergent Misalignment model suggests it is unfiltered, not unaligned. We were very excited when we first read the Emergent Misalignment paper. It seemed perfect for AI alignment. If there were a single 'misalignment' feature within LLMs, then we could do a lot with it – we could use...
With many thanks to Sasha Frangulov for comments and editing. Before publishing their o1-preview model system card on Sep 12, 2024, OpenAI tested the model on various safety benchmarks they had constructed. These included benchmarks aimed at evaluating whether the model could help with the development of Chemical,...
This is a linkpost for a new research paper of ours, introducing a simple but powerful technique for preventing Best-of-N jailbreaking. Abstract > Recent work showed that Best-of-N (BoN) jailbreaking, which repeatedly applies random augmentations (such as capitalization and punctuation changes), is effective against all major large language models (LLMs). We...
Many AI alignment problems are problems of goal misgeneralisation[1]. The goal that we've given the AI, through labelled data, proxies, demonstrations, or other means, is valid in its training environment. But then, when the AI goes outside that environment, the goal generalises dangerously in unintended ways. As I've shown...
Before solving complicated problems (such as reaching a decision with thousands of variables and complicated dependencies), it helps to focus on solving simpler problems first (such as utility problems with three clear choices), and then gradually build up. After all, "learn to walk before you run", and with any skill,...
Almost any powerful AI, with almost any goal, will doom humanity. Hence alignment is often seen as a constraint on AI power: we must direct the AI's optimisation power in a very narrow direction. If the AI is weak, then imperfect methods of alignment might suffice. But as the...