Thanks so much for this investigation! Our paper focused mostly on the API fine-tuning threat model (e.g. the OpenAI fine-tuning API), where the adversary can conduct black-box fine-tuning on the base model, but the defender can apply safety interventions like unlearning after fine-tuning. Through that lens, we only examined probing and GCG in the paper; it's really useful that y'all are evaluating the shallowness of RMU's robustness to a broader set of adversaries. I believe @Fabien Roger similarly demonstrated that fine-tuning on a bit of unrel...
Thanks for the nice reply!
Yes, it makes sense to consider the threat model, and your paper does a good job of making this explicit (as in Figure 2). We just wanted to prod around and see how things are working!
The way I've been thinking about refusal vs unlearning, say with respect to harmful content:
- Refusal is like an implicit classifier, sitting in front of the model.
- If the model implicitly classifies a prompt as harmful, it will go into its refuse-y mode.
- This classification is vulnerable to jailbreaks - tricks that flip the classification, enabling harm