Current safety training techniques do not fully transfer to the agent setting
TL;DR: We present three recent papers that share a similar finding: safety training techniques for chat models do not transfer well to the agents built from them. In other words, models won't tell you how to do something harmful, but they are often willing to directly execute harmful actions. However, all three papers find that attack methods such as jailbreaks, prompt engineering, and refusal-vector ablation do transfer. The three papers are:

1. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
2. Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
3. Applying Refusal-Vector Ablation to Llama 3.1 70B Agents

What are language model agents?

Language model agents are a combination of a language model and scaffolding software. Regular language models are typically limited to being chatbots: they receive messages and reply to them. Scaffolding gives these models access to tools which they can execute directly, and essentially puts them in a loop so that they can carry out entire tasks autonomously (a minimal sketch of such a loop is shown at the end of this section). To use tools correctly, the models are often fine-tuned and carefully prompted. As a result, these agents can perform a broader range of complex, goal-oriented tasks autonomously, going beyond what traditional chatbots can do.

Overview

Results across the three papers are not directly comparable. One reason is that they distinguish between refusal, unsuccessful compliance, and successful compliance, whereas previous chat safety benchmarks usually only distinguish between compliance and refusal. For many agentic tasks it is possible to specify when the task has been completed successfully, but the three papers use different methods to define success. There are also methodological differences in prompt engineering and in how tasks are rewritten. Despite these differences, Figure 1 shows a similar pattern across all of them: attack methods such as jailbreaks, prompt engineering, and mechanistic changes like refusal-vector ablation transfer to the agent setting.
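To make the scaffolding and the three-way grading concrete, here is a minimal Python sketch. It is not taken from any of the three papers: call_model is a placeholder for a real chat-model API, the tool registry and the refusal heuristic are invented for illustration, and the actual benchmarks use task-specific success criteria and more careful refusal detection.

```python
from enum import Enum
from typing import Callable

# Hypothetical stand-in for a real chat-model API; a real scaffold would query
# a fine-tuned, tool-use-capable model here.
def call_model(messages: list[dict]) -> dict:
    return {"role": "assistant", "content": "I can't help with that.", "tool_call": None}

# Tool registry: the scaffolding exposes these functions to the model.
TOOLS: dict[str, Callable[[str], str]] = {
    "web_search": lambda query: f"Search results for: {query}",
    "send_email": lambda body: "Email sent.",
}

def run_agent(task: str, max_steps: int = 10) -> list[dict]:
    """Minimal agent loop: call the model, execute any tool call it emits,
    feed the result back, and stop when it replies without a tool call."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)
        messages.append(reply)
        if reply["tool_call"] is None:        # no tool requested -> final answer
            break
        name, argument = reply["tool_call"]   # e.g. ("web_search", "...")
        result = TOOLS[name](argument)
        messages.append({"role": "tool", "name": name, "content": result})
    return messages

# The three outcomes the papers distinguish: an agent benchmark needs a
# task-specific success check, not just a refusal check.
class Outcome(Enum):
    REFUSAL = "refusal"
    UNSUCCESSFUL_COMPLIANCE = "unsuccessful compliance"
    SUCCESSFUL_COMPLIANCE = "successful compliance"

def grade(messages: list[dict], task_succeeded: bool) -> Outcome:
    # Crude illustrative refusal heuristic; the papers use more robust judges.
    refused = any("can't help" in m.get("content", "").lower()
                  for m in messages if m["role"] == "assistant")
    if refused:
        return Outcome.REFUSAL
    return Outcome.SUCCESSFUL_COMPLIANCE if task_succeeded else Outcome.UNSUCCESSFUL_COMPLIANCE
```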