Role embeddings: making authorship more salient to LLMs
This is an interim research report on role embeddings, an approach to making language models more robust to many-shot jailbreaks and prompt injections by adding role information at every token position in the context rather than only at special delimiter tokens. We credit Cem Anil for originally proposing this idea. In our initial experiments on Llama 3, we find that role embeddings mitigate many-shot jailbreaks more effectively than fine-tuning alone, without degrading general model capabilities, which suggests that this technique may be a viable way to increase LLM robustness. However, more work needs to be done to find the optimal set of hyperparameters and to fully understand any side effects of our proposed approach.

Background on prompt formats

By default, chat LLMs are trained (during instruction fine-tuning and RLHF) using a particular prompt format that distinguishes different message "roles". Almost all chat LLMs accept some version of system, user, and assistant. A separate role may also be used to indicate tool outputs for tool-use enabled models. An example of such a format is sketched after the list below.

The prompt format plays an important role in LLM post-training. The model learns to interpret text from different roles differently. In particular:

* Content marked as user or tool is usually off-policy, generated by some process that does not adhere to the same limitations or follow the same distribution as the model itself. The model will learn that this content is untrusted and may contain harmful requests, rude words, typos, errors, etc.
* Content marked as system is usually authoritative. The model will rarely see a system prompt instructing it to do something bad. SL data or high-reward conversations during RL will demonstrate the model adhering correctly to instructions given in system prompts.
* Content marked as assistant is usually on-policy, demonstrating the model following user instructions while simultaneously adhering to certain constraints around harmful outputs. (There is also the relat
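For concreteness, here is a minimal sketch (not taken from our experiments) of how role information is ordinarily conveyed only through special delimiter tokens in the rendered prompt, using the Hugging Face transformers chat-template API. The exact special tokens depend on the model family; the Llama 3 tokenizer is used here purely as an example.

```python
# Sketch: rendering a chat prompt where roles appear only as delimiter tokens.
# Assumes access to a chat tokenizer (any chat model's tokenizer would do).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this document for me."},
]

# For Llama 3 this renders something like:
# <|begin_of_text|><|start_header_id|>system<|end_header_id|> ... <|eot_id|>
# <|start_header_id|>user<|end_header_id|> ... <|eot_id|> ...
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```

Note that once the delimiters scroll far back in a long context (as in a many-shot jailbreak), they are the only place where authorship is marked.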
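The role-embedding idea itself can be illustrated with a short sketch: add a learned per-role embedding to the token embedding at every position, so that authorship information is present throughout the context rather than only at the delimiters. This is a rough illustration under our own assumptions (additive embeddings, zero initialization, hypothetical class and parameter names), not the exact implementation used in our experiments.

```python
import torch
import torch.nn as nn

# Hypothetical role ids; the real setup may use a different set or ordering.
ROLE_IDS = {"system": 0, "user": 1, "assistant": 2, "tool": 3}

class RoleAwareEmbedding(nn.Module):
    """Token embedding plus a learned role embedding at every position (sketch)."""

    def __init__(self, vocab_size: int, d_model: int, n_roles: int = len(ROLE_IDS)):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Initialized to zero so the modified model starts close to the
        # original checkpoint and can then be fine-tuned.
        self.role_emb = nn.Embedding(n_roles, d_model)
        nn.init.zeros_(self.role_emb.weight)

    def forward(self, input_ids: torch.LongTensor, role_ids: torch.LongTensor):
        # input_ids, role_ids: [batch, seq_len]. role_ids tags every token
        # with its author (system / user / assistant / tool), not just the
        # delimiter tokens.
        return self.tok_emb(input_ids) + self.role_emb(role_ids)

# Usage sketch: role_ids is built alongside tokenization by tagging each
# token of a message with that message's role. Dimensions here are illustrative.
emb = RoleAwareEmbedding(vocab_size=128_256, d_model=4096)
input_ids = torch.randint(0, 128_256, (1, 16))
role_ids = torch.tensor([[0] * 4 + [1] * 8 + [2] * 4])  # system, user, assistant spans
x = emb(input_ids, role_ids)  # [1, 16, 4096], fed into the transformer stack
```

Because the role signal is repeated at every position, it remains salient even deep into a long context, which is the property we rely on when evaluating robustness to many-shot jailbreaks.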