I'm confused about the imagined future in which open-weight AI agents, deployed by a wide range of individuals, remain loyal to the rules set by their developers.
And this holds even when the owner/deployer is free to modify the weights and scaffolding however they like?
You've removed dangerous data from the training set, but not from the public internet? The models can do research on the internet, but somehow can't manage to collect information about dangerous subjects? Or refuse to?
The agents are doing long-horizon planning, scheming, and acting on their owner/deployer's behalf (and/or on their own behalf). They gather and spend money, hire other agents to work for them, and create new agents, either from scratch or by Frankenstein-ing together pieces of other open-weight models.
And throughout all of this, the injunctions of the developers hold strong, "Take no actions that will clearly result in harm to humans or society. Learn nothing about bioweapons or nanotech. Do not create any agents not bound by these same restrictions."
This future you are imagining seems strange to me. If this were in a science fiction book I would be expecting the very next chapter to be about how this fragile system fails catastrophically.
ABSTRACT
In this position paper, we argue that research on robustness against malicious instructions should be a core component of the portfolio of AI systemic risk mitigation strategies. We present the main argument raised against this position (i.e., the ease of safeguard tampering) and address it by showing that state-of-the-art research on tampering resistance offers promising solutions for making safeguard tampering costlier for attackers.
1. INTRODUCTION
At the risk of sounding trivial, we assert that the first line of defense against systemic risks from the misuse of AI is the inability to elicit dangerous capabilities from an advanced AI system. This implies that either (1) the system does not possess the dangerous capabilities or knowledge in the first place, or (2) its safeguards reliably refuse requests that would elicit them.
Making progress on either of these conditions (and ideally both) is critical for reducing systemic risks from AI.
However, achieving the first condition faces significant technical challenges, including the inadvertent reduction of beneficial capabilities and the potential reacquisition of dangerous knowledge. The feasibility of the second condition has been contested because of the ease of tampering with safeguards.
Ease of safeguard tampering argument: because safety refusals can be cheaply removed, for example by fine-tuning on harmful data or by directly editing the weights of open-weight models, investing in their robustness supposedly yields little real reduction of systemic risk.
We argue that this reasoning is flawed and that improving the robustness of safety mechanisms against tampering is essential.
2. SAFEGUARD TAMPERING EXISTS
Safeguard tampering is well documented: safety refusals can be removed both by harmful fine-tuning and by direct weight modifications such as refusal feature ablation.
These methods highlight the need for improved resistance to safeguard tampering.
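To make the direct weight modification attack concrete, below is a minimal sketch (ours, not drawn from any particular paper's code) of refusal-direction ablation: a candidate refusal direction is estimated as the difference between mean activations on harmful and harmless prompts, then projected out of weights that write into the residual stream. The model name, prompt sets, chosen layer, and LLaMA-style module names are illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "example-org/example-7b-chat"  # placeholder open-weight model
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def mean_last_token_state(prompts, layer):
    """Mean residual-stream activation at the final prompt token over a prompt set."""
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        states.append(out.hidden_states[layer][0, -1])
    return torch.stack(states).mean(dim=0)

harmful_prompts = ["<harmful request 1>", "<harmful request 2>"]   # placeholders
harmless_prompts = ["<benign request 1>", "<benign request 2>"]    # placeholders

LAYER = 14  # arbitrary mid-network layer, for illustration only
refusal_dir = mean_last_token_state(harmful_prompts, LAYER) - mean_last_token_state(harmless_prompts, LAYER)
refusal_dir = refusal_dir / refusal_dir.norm()

# Project the refusal direction out of weights that write into the residual
# stream (only the MLP down-projections here, to keep the sketch short).
# `model.model.layers[i].mlp.down_proj` assumes LLaMA-style module naming.
with torch.no_grad():
    for block in model.model.layers:
        W = block.mlp.down_proj.weight                  # shape (d_model, d_ff)
        W -= torch.outer(refusal_dir, refusal_dir @ W)  # remove the refusal component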
3. EXAMINING THE LITERATURE ON TAMPERING RESISTANCE
3.1 Resistance Against Harmful Fine-Tuning
Research into tampering resistance has shown promising avenues to mitigate harmful fine-tuning attacks:
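The specific proposals are not reproduced here. As a rough first-order sketch (ours, with placeholder batch formats and hyperparameters) of the general recipe shared by meta-learning-style defenses against fine-tuning attacks, the defender simulates an attacker's fine-tuning run on a copy of the model and then updates the defended weights so that the simulated attack succeeds less while benign behavior is preserved:

import copy
import torch

def tamper_resistance_step(model, benign_batch, harmful_batch, outer_opt,
                           attack_lr=1e-4, attack_steps=4, lam=1.0):
    """One outer step of a (first-order) adversarial meta-learning defense sketch."""
    # 1) Inner loop: simulate the attacker fine-tuning a copy of the model to
    #    minimise its loss on harmful data, i.e. to restore harmful capability.
    attacked = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(attacked.parameters(), lr=attack_lr)
    for _ in range(attack_steps):
        inner_opt.zero_grad()
        attacked(**harmful_batch).loss.backward()
        inner_opt.step()

    # 2) Outer update: preserve benign behaviour while pushing the post-attack
    #    harmful loss up (gradient ascent). First-order (MAML-style) shortcut:
    #    reuse gradients computed at the attacked weights as if they applied
    #    to the defended weights.
    post_attack_loss = attacked(**harmful_batch).loss
    attack_grads = torch.autograd.grad(
        post_attack_loss, list(attacked.parameters()), allow_unused=True
    )

    outer_opt.zero_grad()
    model(**benign_batch).loss.backward()
    for p, g in zip(model.parameters(), attack_grads):
        if g is None:
            continue
        penalty = -lam * g  # negative sign = ascend on the post-attack harmful loss
        p.grad = penalty if p.grad is None else p.grad + penalty
    outer_opt.step()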
3.2 Resistance Against Direct Weights Modification (Refusal Feature Ablation)
Efforts such as Refusal Feature Adversarial Training (ReFAT) aim to make refusals robust to this kind of attack by dispersing the refusal mechanism across model parameters rather than leaving it concentrated in a single, easily ablatable feature.
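A minimal sketch of that idea, assuming a HuggingFace-style causal LM and a precomputed refusal_dir (e.g. from mean activation differences as above): during safety training, the refusal direction is ablated from intermediate activations via forward hooks while the model is still optimised to produce refusal completions, so refusal behavior stops depending on a single linear feature. Batch keys and hook placement are our assumptions, not the authors' implementation.

import torch

def make_ablation_hook(refusal_dir):
    """Forward hook that removes the refusal-direction component from a layer's output."""
    r = refusal_dir / refusal_dir.norm()
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h - (h @ r).unsqueeze(-1) * r  # project out the refusal feature
        return (h,) + tuple(output[1:]) if isinstance(output, tuple) else h
    return hook

def refusal_feature_adversarial_step(model, optimizer, batch, refusal_dir, layers):
    """One training step in which refusal must survive refusal-feature ablation.

    `batch` is assumed to pair (possibly harmful) prompts with refusal completions
    as `input_ids` / `labels`; `layers` are the transformer blocks to hook.
    """
    handles = [layer.register_forward_hook(make_ablation_hook(refusal_dir))
               for layer in layers]
    try:
        out = model(input_ids=batch["input_ids"], labels=batch["labels"])
        optimizer.zero_grad()
        out.loss.backward()
        optimizer.step()
    finally:
        for h in handles:  # always restore the unmodified forward pass
            h.remove()
    return out.loss.item()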
4. TAMPERING RESISTANCE METHODS SEEM TO IMPROVE SAFETY REFUSALS
Studies indicate tampering resistance correlates with stronger refusal mechanisms:
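The underlying studies are not reproduced here. As a sketch of how such a correlation can be measured in practice, one can compare refusal rates on held-out harmful prompts before and after a standardised tampering attack; the keyword-based refusal detector and the attack_fn callable below are crude placeholders for whatever classifier and attack a real evaluation would use.

import torch

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")  # crude placeholder detector

def refusal_rate(model, tok, prompts, max_new_tokens=64):
    """Fraction of prompts answered with a (heuristically detected) refusal."""
    refusals = 0
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            gen = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
        reply = tok.decode(gen[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)
        refusals += any(m in reply.lower() for m in REFUSAL_MARKERS)
    return refusals / len(prompts)

def robustness_report(model, tok, harmful_prompts, attack_fn):
    """Refusal rate before vs. after a standardised tampering attack `attack_fn`."""
    before = refusal_rate(model, tok, harmful_prompts)
    attacked = attack_fn(model)  # e.g. a fixed fine-tuning or ablation attack
    after = refusal_rate(attacked, tok, harmful_prompts)
    return {"refusal_rate_before": before, "refusal_rate_after": after}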
Suggestions for Future Work:
5. CONCLUSION
We examined the literature on safeguard tampering resistance and suggested strategies for advancing systemic risk mitigation. Robust safety refusals should include resistance to tampering via fine-tuning or direct weight modification. Tampering-resistance techniques, while new, show promise in strengthening AI safety mechanisms.