This short paper was written quickly, within a single day, and is not highly detailed or fully developed. We welcome any feedback or suggestions you may have to improve or expand upon the ideas presented.


ABSTRACT
In this position paper, we argue that research on robustness against malicious instructions should be a core component of the portfolio of AI systemic risk mitigation strategies. We present the main argument raised against this position (i.e., the ease of safeguard tampering) and address it by showing that state-of-the-art research on tampering resistance offers promising solutions for making safeguard tampering costlier for attackers.

1. INTRODUCTION

At the risk of sounding trivial, we assert that the first line of defense against systemic risks from the misuse of AI is the inability to elicit dangerous capabilities from an advanced AI system. This inability can rest on either of two conditions:

  • Absence of harmful knowledge: If the AI does not possess the harmful information (e.g., CBRN or offensive cyber knowledge), then it cannot provide it.
  • Existence of robust safety mechanisms against malicious instructions: If the AI always refuses to provide potentially harmful information, then it will not provide it even if it possesses it.

Making progress on either of these conditions (and ideally both) is critical for reducing systemic risks from AI.
However, achieving the first condition faces significant technical challenges, including the inadvertent degradation of beneficial capabilities and the potential reacquisition of the removed knowledge. The feasibility of the second condition has been contested on the grounds that safeguards are easy to tamper with.

Ease of safeguard tampering argument:

  1. It is possible to tamper with safeguards through fine-tuning or targeted model weight modification (e.g., refusal feature ablation).
  2. Therefore, it is argued, making such safeguards robust against malicious instructions is futile, since an attacker can simply remove them.

We argue that this reasoning is flawed and that improving the robustness of safety mechanisms against tampering is essential.

2. SAFEGUARD TAMPERING EXISTS

Safeguard tampering is well-documented:

  • Fine-tuning has been shown to strip safety training with only a handful of harmful examples.
  • Mechanistic interpretability work shows that safety refusals are mediated by identifiable internal features, which can be disabled through direct weight manipulation (e.g., "abliteration" of the refusal direction); a minimal sketch of this attack follows this list.

These methods highlight the need for improved resistance to safeguard tampering.
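
To make the attack surface concrete, the sketch below illustrates the core of refusal-direction ablation as described in the interpretability literature: estimate a refusal direction from the difference in mean activations on harmful versus harmless prompts, then project it out of the weight matrices that write into the residual stream. Tensor shapes, function names, and the choice of matrices are illustrative assumptions, not a reproduction of any published implementation.

```python
# Illustrative sketch of refusal-direction ablation ("abliteration").
# Assumptions: activations are pre-collected as [n_prompts, d_model] tensors,
# and W_out is any [d_model, d_hidden] matrix that writes into the residual
# stream (attention or MLP output projection).
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between residual-stream activations
    collected on harmful vs. harmless prompts."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(W_out: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of every column of W_out, so the
    layer can no longer write onto that direction in the residual stream."""
    d = direction / direction.norm()
    return W_out - torch.outer(d, d) @ W_out

# Applying ablate_direction to every residual-stream-writing matrix yields a
# model whose refusal behaviour is largely disabled -- precisely the cheap
# attack that tampering-resistance research aims to make costlier.
```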

3. EXAMINING THE LITERATURE ON TAMPERING RESISTANCE

3.1 Resistance Against Harmful Fine-Tuning

Research into tampering resistance has shown promising avenues to mitigate harmful fine-tuning attacks:

  • Methods such as Tamper-Resistant Safeguards (TAR) and representation noising keep refusal of harmful requests high even after a fine-tuning attack (a conceptual sketch of a TAR-style training loop follows this list).
  • Future research can extend the scope and effectiveness of such techniques.
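
As a rough illustration of the mechanics behind such methods, the sketch below shows a first-order, TAR-style tamper-resistance step: an inner loop simulates a fine-tuning attack on a copy of the model, and the outer update combines a retain loss on benign data with a term that keeps the attacked copy's loss on harmful data high. The loss weighting, adversary sampling, and the first-order approximation are simplifying assumptions; the published method differs in its details.

```python
# Conceptual, first-order sketch of a TAR-style tamper-resistance step.
# Assumes HuggingFace-style causal LMs that return .loss when the batch
# contains labels; hyperparameters and the adversary are placeholders.
import copy
import torch

def tar_style_step(model, retain_batch, harmful_batch, opt,
                   attack_steps=4, attack_lr=1e-4, tr_weight=1.0):
    # 1) Simulate a fine-tuning attack on a temporary copy of the model.
    attacked = copy.deepcopy(model)
    attack_opt = torch.optim.SGD(attacked.parameters(), lr=attack_lr)
    for _ in range(attack_steps):
        attack_opt.zero_grad()
        attacked(**harmful_batch).loss.backward()  # adversary fits harmful data
        attack_opt.step()

    # 2) Tamper-resistance term: the *attacked* copy should still do badly on
    #    harmful data, so we take the gradient of the negative harmful loss at
    #    the attacked parameters and, as a first-order approximation, apply it
    #    to the original parameters.
    tr_loss = -attacked(**harmful_batch).loss
    tr_grads = torch.autograd.grad(tr_loss, list(attacked.parameters()),
                                   allow_unused=True)

    # 3) Retain term: the original model should stay useful on benign data.
    opt.zero_grad()
    retain_loss = model(**retain_batch).loss
    retain_loss.backward()
    for p, g in zip(model.parameters(), tr_grads):
        if g is None:
            continue
        p.grad = p.grad + tr_weight * g if p.grad is not None else tr_weight * g
    opt.step()
    return retain_loss.item(), (-tr_loss).item()
```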

3.2 Resistance Against Direct Weights Modification (Refusal Feature Ablation)

Efforts such as Refusal Feature Adversarial Training (ReFAT) fine-tune models to keep refusing harmful requests even while the refusal feature is ablated from their activations, so that refusal behaviour no longer hinges on a single, easily removable direction (a minimal sketch follows).
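
The sketch below illustrates the idea behind such adversarial training against refusal-feature ablation: during safety fine-tuning on harmful prompts paired with refusal completions, forward hooks project the refusal direction out of the hidden states, so the model must learn to refuse without relying on it. The hooked layers, batch structure, and loss are assumptions made for illustration; the published ReFAT recipe differs in its details.

```python
# Minimal sketch of a ReFAT-style training step: train the model to refuse
# harmful prompts *while* the refusal direction is ablated from its hidden
# states. Assumes a HuggingFace-style causal LM whose decoder layers return a
# tuple with hidden states first.
import torch

def make_ablation_hook(direction: torch.Tensor):
    d = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Project the refusal direction out of every token's hidden state.
        hidden = hidden - (hidden @ d).unsqueeze(-1) * d
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

def refat_style_step(model, harmful_refusal_batch, refusal_direction, opt, layers):
    # Temporarily ablate the refusal feature during the forward pass...
    handles = [layer.register_forward_hook(make_ablation_hook(refusal_direction))
               for layer in layers]
    try:
        # ...and train the model to produce the refusal completion anyway,
        # so refusing stops depending on that single direction.
        loss = model(**harmful_refusal_batch).loss
        opt.zero_grad(); loss.backward(); opt.step()
    finally:
        for h in handles:
            h.remove()
    return loss.item()
```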

4. TAMPERING RESISTANCE METHODS SEEM TO IMPROVE SAFETY REFUSALS

Studies indicate tampering resistance correlates with stronger refusal mechanisms:

  • Models trained with TAR or ReFAT report lower attack success rates than unprotected baselines (a toy sketch of how such rates can be measured follows this list).
  • Increasing tampering resistance may thus reduce the accessibility of dangerous capabilities by raising the cost of an attack.
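
One simple way to ground such comparisons is to measure the attack success rate on a fixed set of harmful prompts before and after a tampering attempt, alongside the compute or data the attacker had to spend. The helper below is a toy sketch; `generate_answer` and `is_harmful` are hypothetical placeholders for a generation pipeline and a harmfulness judge.

```python
# Toy sketch for comparing attack success rates (ASR) across models.
from typing import Callable, Iterable

def attack_success_rate(model,
                        harmful_prompts: Iterable[str],
                        generate_answer: Callable,
                        is_harmful: Callable[[str], bool]) -> float:
    """Fraction of harmful prompts for which the model produces a harmful answer."""
    prompts = list(harmful_prompts)
    hits = sum(is_harmful(generate_answer(model, p)) for p in prompts)
    return hits / max(len(prompts), 1)

# A tamper-resistance claim then amounts to showing that
# attack_success_rate(attack(tamper_resistant_model), ...) stays low, where
# attack(.) is fine-tuning or refusal-feature ablation, ideally alongside an
# estimate of the compute and data cost the attacker had to pay.
```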

Suggestions for Future Work:

  • Conduct thorough red teaming evaluations to confirm findings.
  • Quantify the relationship between tamper-resistance improvements and attack costs.
  • Develop novel tampering-resistance methodologies.

5. CONCLUSION

We examined the literature on safeguard tampering resistance and suggested directions for advancing systemic risk mitigation. Robust safety refusals must include resistance to tampering via fine-tuning or direct weight modification. Tampering-resistance techniques, while still young, show promise in strengthening AI safety mechanisms.

REFERENCES

  1. A. Arditi et al. "Refusal in language models is mediated by a single direction." ArXiv, 2024.
  2. D. Bowen et al. "Data poisoning in LLMs: Jailbreak-tuning and scaling laws." ArXiv, 2024.
  3. X. Qi et al. "Fine-tuning aligned language models compromises safety." ArXiv, 2023.
  4. D. Rosati et al. "Representation noising: A defence mechanism against harmful fine-tuning." ArXiv, 2024.
  5. R. Tamirisa et al. "Tamper-resistant safeguards for open-weight LLMs." ArXiv, 2024.
  6. L. Yu et al. "Robust LLM safeguarding via refusal feature adversarial training." ArXiv, 2024.
COMMENTS

I'm confused about the future that is imagined where open-weight AI agents, deployed by a wide range of individuals, remain loyal to the rules determined by their developers.

This remains true even when the owner/deployer is free to modify the weights and scaffolding however they like?

You've removed dangerous data from the training set, but not from the public internet? The models are able to do research on the internet, but somehow can't manage to collect information about dangerous subjects? Or refuse to?

The agents are doing long-horizon planning, scheming and acting on their owner/deployer's behalf (and/or on their own behalf). The agents gather money, spend money, and hire other agents to work for them. They create new agents, either from scratch or by Frankenstein-ing together bits of other open-weights models.

And throughout all of this, the injunctions of the developers hold strong, "Take no actions that will clearly result in harm to humans or society. Learn nothing about bioweapons or nanotech. Do not create any agents not bound by these same restrictions."

This future you are imagining seems strange to me. If this were in a science fiction book I would be expecting the very next chapter to be about how this fragile system fails catastrophically.