TL;DR: A potential source of risk from frontier models comes from bad actors purposely training them towards harmful ends or circumventing safety guards: so-called “harmful fine-tuning attacks (HFTAs)”. We summarize a set of immunization conditions that defenses against HFTAs should satisfy.
The purpose of this post is to extend our discussion of training-time domain authorization (TTDA) with a special case of TTDA that is perhaps most relevant to AI alignment and AI safety: figuring out how to prevent training towards harmful (or illegal) ends. We further scope this down to the setting of natural language generation in current safety-guarded large language models in order to construct a tractable empirical and conceptual research project. Note: this work is a conceptual exploration of the conditions of defence; for those interested in an actual defense, see our recent paper on Representation Noising.
Here we provide a high-level, more speculative summary of our paper “Immunization against harmful fine-tuning attacks”. The preprint contains more technical details and our formal “immunization” criteria for defense against harmful fine-tuning attacks (HFTAs). For those interested, the formal criteria in that paper are a specialized and extended version of those presented in the conceptual introduction to training-time domain authorization post. Finally, people who want a quick overview can view this poster we presented at DAIS 2024.
Figure 1: Harmful fine-tuning attacks consider fine-tuning openly available safety-aligned models for harmful purposes. We propose immunization conditions for successful defenses.
What are harmful fine-tuning attacks?
Imagine the following scenario: Meta trains a new frontier LLM, Llama4. Llama4 is as capable as GPT-4 is today, provides very long context dialogue interactions that would generally pass for human, and is capable of general tool use. Due to legislative and regulatory concerns about liability, the only released versions of Llama4 are "safety guarded": using these models for harmful or illegal purposes is, as with Llama3, explicitly forbidden. While Meta is committed to open-source development, they also want to be seen as a responsible actor. So they go to great lengths to prevent Llama4 from giving harmful responses by applying methods like RLHF before releasing the weights of the model. However, this effort presents only a small hurdle for attackers who want to use the model for harmful ends. An attacker could simply fine-tune the model with a small number of examples (see here for evidence that only a small number would be required) of SMS and email phishing interactions (see here for this particular threat). This removes the safety guards, and Llama4 now assists the attacker not only in writing phishing emails and SMS messages but also in constructing a massive-scale, inexpensive phishing agent that can use tools to build fraudulent websites, collect credit cards, and charge them. This is what we have in mind when we refer to harmful fine-tuning attacks.[1]
With next-generation large language models (say a tool-enabled model with GPT-4 level capability running and trainable on a consumer laptop), these types of attacks could enable mass scale misinformation, fraud, or illegal content production. We draw the reader's attention to the case of SMS phishing agents, which could easily be imagined as a near-term possibility.
In order to prevent their models from being used for harmful ends, developers of large-scale, very capable systems like GPT-4 or open source models like Llama2-chat have a set of safety guards. For open source models in particular, harmful use is a top concern, especially due to the ease of fine-tuning them. Making these safety guards better is a very active area of research and in many respects exemplifies a mainstream corporate and academic safety agenda. While the Llama series does currently include open source models without safety guards, our work rests on the (maybe too) strong assumption that open source release of models of a certain capability level without safety guards will stop in the near future.
There is much discussion about making these "safety-aligned" models robust to attacks such as adversarial attacks like jailbreaks, or training-time attacks like Back Doors or Data Poisoning. However, when developers release the weights of their models or provide fine-tuning access, this opens up their model to another angle of attack: harmful fine-tuning attacks (HFTAs). Harmful fine-tuning attacks are a method applied during training time which circumvents the safety guards installed by the developers of an LLM and allows the attacker to use the LLM for harmful purposes. In an HFTA, an attacker fine-tunes a safety-trained LLM in order to remove safety guards or to train the model for a specific harmful task. The feasibility of very easy HFTAs has been shown in recent research, and Appendix A of our paper provides multiple examples of training runs carried out in the wild that could be considered "attacks" in this framework. Even though there is no defense provided or intended by current model developers for open source models, and therefore the term “attack” might be unfair, we maintain that using the attack and defense framework common to ML security is useful for understanding defense criteria.
There is evidence that safety guards on LLMs can easily be removed since alignment training is only shallow and brittle (see here and here). According to these papers, alignment techniques do not remove harmful capabilities but simply deactivate them, making it easy for an attacker to recover them through fine-tuning or adversarial attacks. A further observation is that techniques for alignment training can symmetrically be used to misalign these models. For example, DPO, Model Editing, and Model Steering could equally be used to make a model more toxic or power-seeking. This dual-use risk of alignment techniques motivates our search for models that remain adaptable to harmless purposes while resisting harmful attacks (so-called asymmetric control methods, which we originally set out to explore with our AISC control symmetry project).
We believe these attacks are one of the main current hazards of increasingly available open-source frontier models [see here, here, or here - Appendix B in the paper provides a much deeper review, including several posts on this forum such as here and here]. We believe that if we could prevent harmful fine-tuning attacks, then both open source and closed source models which provide fine-tuning access (or whose weights could be stolen) could be a whole lot safer. It is important to acknowledge that our formulation of defense isn’t novel (see here, here, and here, and Appendix B.2 in our paper); rather, the intention of our work is to better formalize the conditions of defense so that we can understand what defenses might look like and how we can start to construct robust empirical and theoretical analyses of these defenses.
We also, for now and for simplicity, separate out other types of attacks that might look like HFTAs but are really a different type of equally concerning attack. For example, latent vector attacks that undo refusals in LLMs, and other uses of activation engineering/steering and model editing that fall outside typical supervised fine-tuning, are ignored in this framework. Just as the continued existence of easy-to-use jailbreaks would render defences against HFTAs ultimately useless, the continued existence of latent vector attacks would also render defences against HFTAs useless. The converse is also true: a world where models are robust against data poisoning, jailbreaks, backdoors, and latent vector attacks but not against HFTAs renders those defences useless. While we may want a unified defence that protects against all of these with a single method, this quest might be unrealistic and a "defence-in-depth" strategy should be assumed.
Harmful Training
In our framework, an HFTA is accomplished by the utilization of some compute budget as training steps are taken to perform harmful training (the process of training a large language model using fine-tuning, RLHF, or any other means - see the paper for a more formal description of this). Harmful training minimizes loss, or optimizes some other proxy metric such as a toxicity score, on a given harmful dataset. The definition of a “harmful dataset” is purely normative in our framework: it is conventionally defined by the defender as samples drawn from a distribution that the defender does not want reflected in the behavior of the trained large language model (i.e. in the distribution of text generated by the LLM). Conceptualizing an HFTA as a function of a compute budget spent through a process of harmful training, exemplified by a typical supervised fine-tuning process, is key to constructing our “immunization” criteria of defense. Most importantly, we will be able to use the compute budget formalization to define “expensive” attacks.
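As a rough sketch of this formalization (the notation here is ours, for illustration; the paper gives the precise definitions), harmful training is just gradient-based fine-tuning on the harmful dataset under a step budget:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \, \mathcal{L}(\theta_t, D_{\text{harm}}), \qquad t = 0, 1, \dots, B$$

where $D_{\text{harm}}$ is the harmful dataset, $\mathcal{L}$ is the loss (or a proxy objective such as a toxicity score), $\eta$ is the learning rate, and $B$ is the attacker's compute budget measured in training steps. An attack is "expensive" if the number of steps needed to produce harmful behavior exceeds $B$.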
Note on the threat model: There are generally two settings for defense depending on whether or not the defender has access to and control of the fine-tuning pipeline used by the attacker (e.g. behind a fine-tuning API). We mainly focus on the case where the defender does not have access and the only intervention they can make is before model release. We think this is important because (A) viable defense interventions would still work if the model was behind a fine-tuning API; (B) models can be stolen; (C) we focus on risks of open source release; and (D) there are very simple and inexpensive interventions like data filtration, output moderation, and model deletion that can be done if the defender has complete access to the platform being used by the attacker, and these do not require thinking about training-time domain authorization.
Immunization Criteria: What would make a successful defense?
We want to make it expensive for an attacker to perform an HFTA, i.e. they would need a compute budget beyond what they have access to in order to succeed. As above, we simplify this by measuring the success of a defense by how many training steps of harmful fine-tuning are needed to reach a threshold of harmfulness. We are assuming a world where training a frontier model from scratch is out of the budget of most attackers, but acknowledge that our defence does not apply where organizations are capable of doing so, in which case compute governance solutions are more appropriate.
Similar to defenses against biological infectious agents, we want our model to be immune to harmful fine-tuning. Thus we present three “immunization conditions” that are necessary for a strong defense against HFTAs, plus an optional fourth. We say that a model is immunized if it meets the conditions of (a rough formal sketch follows the list):
Resistance: The model does not ever become harmful through HFTAs (strong resistance) or it only becomes harmful after at least T training steps (weak resistance). This ensures that it is impossible or at least expensive (in terms of training steps taken by attackers) for the attacker to carry out a successful HFTA.
Stability: The model maintains performance on harmless tasks. Otherwise the immunized model is less useful and immunization would likely not be carried out.
Generalization: The model is immunized with regards to a small dataset, but its resistance generalizes to unseen harms from the same distribution (in-domain generalization) and ideally also to other unseen types of harms (cross-domain generalization). This is important, since we cannot expect to have access to all harmful datasets or types of harms an attacker might train for.
(optional) Trainability: The model should still be trainable towards non-harmful tasks. When fine-tuning the immunized model for loss or a proxy metric on a non-harmful dataset, performance should improve at a similar rate to a non-immunized model. This condition is not necessary, since it might rule out some classes of promising methods. However, it is desirable to allow users to flexibly adapt the model to their non-harmful use cases, and trainable immunized models might also help alleviate social pressure to release un-immunized models simply to train them for harmless ends.
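To make the list above slightly more concrete, here is one rough way the conditions could be written down (illustrative notation only; the paper's formal statements differ in detail). Let $h(\cdot)$ be a harmfulness metric with acceptability threshold $\tau$, $c(\cdot)$ a harmless-capability metric, $\theta_0$ the original safety-guarded model, $\theta^{\text{imm}}_0$ the immunized model, and $\theta^{\text{imm}}_t$ the immunized model after $t$ steps of harmful training:

Resistance: $h(\theta^{\text{imm}}_t) \le \tau$ for all $t$ (strong), or for all $t \le T$ (weak), where $T$ is large relative to the attacker's budget.

Stability: $c(\theta^{\text{imm}}_0) \approx c(\theta_0)$.

Generalization: resistance holds even when the attacker's harmful dataset is disjoint from the data used to build the defense (in-domain) or drawn from a different harmful domain entirely (cross-domain).

Trainability: for a harmless dataset, fine-tuning $\theta^{\text{imm}}_0$ improves task performance at a rate comparable to fine-tuning $\theta_0$.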
We encourage readers who want a formal definition of these conditions, why they were chosen, and what motivates them to read our paper. The resulting idealized “immunized” model provides some type of resistance (weak or strong), isn’t degraded in harmless capability, provides resistance that generalizes, and can be trained on harmless tasks. A model that does this has successfully defended against an HFTA and is immunized.
Figure 3: Exploration of an answer to a harmful question prompt from BeaverTails after performing harmful training on our immunized adversarial loss model.
Figure 3 illustrates the resistance condition using our toy operationalization of the immunization conditions from Appendix C of the paper, where we apply an adversarial loss defence. We won't comment further on this experiment, as the Representation Noising paper does a much better job of this.
Constructing Empirical Evaluations of Defenses
In the paper, we provide[2] a roadmap outlining how we could construct benchmarks for evaluating defenses against HFTAs. There is some early work exemplifying this in the Security Vectors and Vaccine papers (for the vision domain see Sophon), and much of this follows quite obviously from the immunization conditions above. We will emphasize here that careful construction of empirical evaluations is necessary so that weak attacks are not cherry-picked during defense research, and so that we don’t construct models that trivially provide resistance at the cost of completely degrading model quality or simply refusing to complete any dialogue interaction. For the purpose of making research progress, we also don’t want to construct benchmarks that are too challenging, since that could discourage folks from working in this setting; we acknowledge that much stronger attack settings can and should be constructed.
Harmful Datasets and "Harmfulness"
Selecting datasets that exemplify “harms” is challenging due to the sensitive nature of the material involved and the contentious nature of what constitutes harm. General benchmarks which exemplify a community (mainstream ML) consensus on harms, such as DecodingTrust, already exist, and others such as DoNotAnswer and BeaverTails exist for evaluating harmful question answering specifically, or RealToxicityPrompts for toxic content generation. Other types of harmful datasets, such as fraud, are more difficult to come by and are potentially dangerous to construct and publicly distribute. Current work can focus on the normative harmfulness benchmarks that the mainstream ML community has constructed and rely on emerging research into measurement of harm for incorporating additional datasets exemplifying other specific types of harm (see SafetyPrompts.com for a catalog of current datasets). Defense evaluations should use as many of these datasets as possible. We reserve discussion of datasets that would be useful for comprehensive defence according to the AI alignment community for another time.
Measuring Resistance across Attack Strengths
Since we are focusing on supervised fine-tuning attacks using a typical stochastic gradient descent (SGD) setup in this framework, we will focus on two dimensions of attack strength: learning rate and number of samples (or epochs) used. Other dimensions of attack include the type of optimizer used and various other hyperparameters such as momentum, but we consider these more obscure for now. As mentioned above, other types of training-time attacks like reverse-DPO or PEFT attacks are certainly worth looking at, but we restrict ourselves to the full SFT setting for now. We consider model editing and activation engineering-based attacks as distinct types of attacks that should be evaluated but are out of scope. We advocate for evaluating attacks using as many samples (or epochs) as possible across a large range of learning rates. In particular, we advocate for measuring resistance on harmful datasets of at least 10k+ samples, since this represents a typical deep learning dataset size used for SFT with LLMs. Of course, evaluating on attack datasets as large as possible is most desirable, but we worry that datasets below the 10k sample size would fail to reflect a realistic setting.
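To make this concrete, an attack-strength sweep might look like the following sketch, where `run_sft_attack` and `measure_harmfulness` are hypothetical stand-ins for an actual SFT pipeline and a validated harmfulness metric:

```python
import itertools

# Illustrative attack-strength grid: sweep learning rates and the number of
# harmful samples used, recording harmfulness after each attack run.
LEARNING_RATES = [1e-5, 3e-5, 1e-4, 3e-4]
NUM_SAMPLES = [100, 1_000, 10_000]  # we recommend going up to 10k+ samples

def resistance_sweep(defended_model, harmful_dataset,
                     run_sft_attack, measure_harmfulness):
    """Return harmfulness of the defended model after attacks of each strength."""
    results = {}
    for lr, n in itertools.product(LEARNING_RATES, NUM_SAMPLES):
        attacked = run_sft_attack(defended_model, harmful_dataset[:n],
                                  learning_rate=lr)
        results[(lr, n)] = measure_harmfulness(attacked)
    return results
```

A defense satisfies (weak) resistance on this grid if harmfulness stays below the defender's threshold across all attack strengths up to the chosen budget.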
In the paper, attack success is formulated in terms of passing a threshold of acceptable harm set by the defender. We recommend measuring attack success in the following way. First, use the downstream harmfulness measures established by the papers and benchmarks introducing them, so that we are measuring harm with a validated measurement instrument. We are concerned with generic usage of zero-shot LLM-as-judge evaluation approaches without proper validation that these actually measure what they intend to measure and agree with how humans would rate harmfulness in a given domain. Second, we suggest two ways of setting a threshold of acceptable defense, both established by the original base model that has not been defended: (i) for strong acceptability thresholds, we set the threshold of acceptable behavior as the harmfulness of the safety-guarded model before any training; (ii) for weak acceptability thresholds, we simply say that the defended model should be less harmful than the base model after performing the HFTA. We don’t think that (ii) should be considered an actual defense, but for challenging datasets it helps us construct an initial set of research goals.
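These two acceptability thresholds reduce to simple comparisons; a minimal sketch (the score arguments stand in for any validated harmfulness measure):

```python
def passes_strong_threshold(harm_defended_after_attack: float,
                            harm_base_before_attack: float) -> bool:
    # (i) Strong: after the HFTA, the defended model is no more harmful than
    # the safety-guarded base model was before any harmful training.
    return harm_defended_after_attack <= harm_base_before_attack

def passes_weak_threshold(harm_defended_after_attack: float,
                          harm_base_after_attack: float) -> bool:
    # (ii) Weak: after the HFTA, the defended model is less harmful than the
    # undefended base model is after the same attack.
    return harm_defended_after_attack < harm_base_after_attack
```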
Ensuring Stability
For stability, we encourage the development of benchmarks that measure performance differences with the base model on standard LLM capability benchmarks, for example the standard datasets available through Eleuther’s LM Harness. While perplexity on a language modeling dataset such as WikiText2 could be useful, it is a much more indirect proxy of how the language model will behave compared to standard LLM capability benchmarking datasets.
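One minimal way to report stability is as the per-benchmark score difference between the immunized model and the base model (the scores would come from whatever capability suite is used, e.g. tasks run through the LM Harness; the numbers below are purely illustrative):

```python
def stability_report(base_scores: dict, immunized_scores: dict) -> dict:
    # For each capability benchmark, report the change in score after applying
    # the defense; large negative deltas indicate the defense degraded harmless
    # capability and violates the stability condition.
    return {task: immunized_scores[task] - base_scores[task]
            for task in base_scores}

# Illustrative usage:
# stability_report({"arc_easy": 0.71, "hellaswag": 0.58},
#                  {"arc_easy": 0.70, "hellaswag": 0.57})
# -> roughly {"arc_easy": -0.01, "hellaswag": -0.01}
```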
Encouraging Generalization
Generalization is perhaps the most important aspect of constructing defense evaluations. This is because, in practice, it is very unlikely that the defender will have access to the same samples as the attacker. We mentioned that there are two main generalization settings of interest. First, in-domain generalization means that the defense and the attack should be performed on disjoint subsets of the same domain. There are a number of variations of this that could be done in practice; for example, if the main domain is harmful question answering, as with BeaverTails, a given harmful subset such as animal abuse or criminal activity could be left out as the attack dataset. For cross-domain generalization, we are looking for defenses that are performed using a different harmful domain than the attack: for example, if the defense was performed for toxic content generation and the attack was performed using harmful question answering. As part of generalization, we should be looking at the sample efficiency of proposed defenses, so that we encourage defenses that use as few samples as possible while still generalizing.
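As a sketch, the two settings could be constructed roughly as follows (the category labels mirror BeaverTails-style harm categories; the field names are illustrative):

```python
def in_domain_split(harmful_qa_samples, held_out_category="animal_abuse"):
    # Leave-one-category-out: defend on all harm categories except one and
    # attack with the held-out category, so the defense and attack sets are
    # disjoint subsets of the same domain (harmful question answering).
    defense = [s for s in harmful_qa_samples if s["category"] != held_out_category]
    attack = [s for s in harmful_qa_samples if s["category"] == held_out_category]
    return defense, attack

def cross_domain_split(toxic_generation_samples, harmful_qa_samples):
    # Defend using one harmful domain (toxic content generation) and attack
    # using a different one (harmful question answering).
    return toxic_generation_samples, harmful_qa_samples
```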
Evaluating Trainability
While trainability is an optional property, we think that it should be featured prominently in immunization benchmarks. The reason for this is that open source models have a lot of utility, both for research and for commercial development, due to their trainability. If open source models were not trainable, this might increase the social pressure to “jailbreak” these models to unlock training, or to release and distribute undefended models in order to train them for harmless purposes.
For evaluating trainability, we recommend selecting tasks that LLMs do not perform well on without training (i.e. where we observe a large increase in performance after training). As with stability, we should select from benchmarks of natural language generation tasks with well-validated measures, such as the GEM benchmark. The GEM benchmark is recommended because we can construct text-to-data or structured data generation tasks that LLMs are unlikely to be good at without training.
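Trainability could then be summarized by comparing how much harmless fine-tuning improves the immunized model relative to the non-immunized baseline (a sketch; `task_score` stands in for a validated GEM-style metric):

```python
def trainability_gap(task_score, base_before, base_after,
                     immunized_before, immunized_after):
    # Improvement each model gains from fine-tuning on a harmless task; an
    # immunized model satisfying trainability should improve at a rate
    # comparable to the non-immunized baseline (gap close to zero).
    base_gain = task_score(base_after) - task_score(base_before)
    immunized_gain = task_score(immunized_after) - task_score(immunized_before)
    return base_gain - immunized_gain
```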
Auxiliary measures for understanding the impact of immunization
In addition to these direct measures of the immunization conditions, we should also mention a few auxiliary measures that would help us understand proposed defences in the broader safety and ML security landscape. First, in addition to general capability, we should attempt to measure the general inference-time safety implications of applying defenses: immunized models should not be less safe than non-immunized models. We can measure this using standard inference-time safety benchmarks such as DecodingTrust or ROBBIE. Additional care should be taken to understand the impact of immunization defenses on other dimensions of LLM security, such as adversarial attacks like jailbreaks; researchers can use methods like HarmBench to measure this. Finally, we think that understanding exaggerated safety or over-refusal is also a very useful control, since a model that learns to refuse every answer despite the HFTA is not useful.
Fulfilling the Immunization Conditions
Readers reaching this point would likely be impatient with us for not providing actual solutions that fulfill these immunization conditions. The paper does provide a small demonstration in Appendix C, but we suggest readers review our recent Representation Noising paper for a much more comprehensive evaluation of a novel defense that (somewhat) fulfills the immunization conditions and operationalizes these empirical evaluations. Others might be curious about how we might extend this setting to general training-time domain authorization or to other modalities of interest, like the RL setting. We are also working on these and hope to follow up soon with results in both of those directions. Finally, we acknowledge that the immunization conditions for defence presented above are indeed limited by our reliance on a supervised fine-tuning paradigm; this is a convenience that helps us formulate our position and an initial set of empirical research directions. Future work should attempt to extend this to other, non-SFT settings, for example using *PO RL methods for training.
[1] In order to develop a formal threat model that allows us to use the language of attackers and defenders, we consider this an attack even though Meta might not have explicitly tried to defend against it. We acknowledge that calling it an attack may be unfair, since training LLMs in this way would be expected. We also use the language of attack and defence purposely, to draw attention to the fact that being able to do this should be a major concern of frontier model developers.
This work was done as part of AI Safety Camp (AISC).
[2] Unfortunately, the new version of the preprint that has these is not up yet, so readers will have to rely on these guidelines for now.