Another way adversarial training might be useful, related to (1), is that it may make interpretability easier. Because it weeds out some non-robust features, the features that remain (and the associated feature visualizations) tend to be clearer; cf., e.g., Adversarial Robustness as a Prior for Learned Representations. One example of people using this is Leveraging Sparse Linear Layers for Debuggable Deep Networks (blog post, AN summary).
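To make the second reference concrete, here is a minimal sketch of the general recipe (not necessarily the paper's actual pipeline): freeze the robust model, take its penultimate-layer activations as features, and fit an L1-regularized linear head so that each class ends up depending on only a handful of inspectable features. The feature matrix and labels below are random stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins: in the real setting, `features` would be penultimate-layer
# activations of an adversarially robust network and `labels` the class labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 512))
labels = rng.integers(0, 10, size=1000)

# An L1 penalty drives most weights to zero, so each class depends on a small,
# inspectable set of features (which can then be visualized one by one).
sparse_head = LogisticRegression(penalty="l1", solver="saga", C=0.05, max_iter=1000)
sparse_head.fit(features, labels)

# Count how many features each class actually relies on.
nonzero_per_class = (np.abs(sparse_head.coef_) > 1e-6).sum(axis=1)
print(nonzero_per_class)
```

The debugging payoff is that when a class only depends on a few (robust, hence visually meaningful) features, a human can look at those features directly and notice when the model is relying on something it shouldn't.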
The above examples are from vision networks - I'd be curious about similar phenomena when adversarially training language models.
A common justification I hear for adversarial training is robustness to underspecification/slight misspecification. For example, we might not know what parameters the real world has, and our best guesses are probably wrong. We want our AI to perform acceptably regardless of what parameters the world has, so we perform adversarial domain randomization. This falls under 1. as written, insofar as we can say that the AI learnt a stupid heuristic (for example, a policy that depends on the specific parameters of the environment) and we want it to learn something more clever (infer the parameters of the world, then do the optimal thing under those parameters). We want a model whose performance is invariant to (or at least robust to) underspecified parameters, but by default our models don't know which parameters are underspecified. Adversarial training helps insofar as it teaches our models which things they should not pay attention to.
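As a rough illustration of what "adversarial domain randomization" can look like, here is a minimal sketch of the training loop only; `rollout_return` and `train_step` are hypothetical placeholders for a real rollout and policy update. Instead of sampling environment parameters uniformly, we evaluate the current policy on a batch of candidate parameters and train on the one it handles worst.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_return(policy_param, env_param):
    # Hypothetical stand-in for running the policy in an environment with the
    # given parameters and measuring the resulting return.
    return -abs(policy_param - env_param)

def train_step(policy_param, env_param, lr=0.1):
    # Hypothetical stand-in for a policy update on data from that environment.
    return policy_param + lr * np.sign(env_param - policy_param)

policy_param = 0.0
for _ in range(200):
    # Plain domain randomization would train on a uniformly sampled env_param;
    # the adversarial version trains on the candidate the policy handles worst.
    candidates = rng.uniform(-1.0, 1.0, size=16)
    worst = min(candidates, key=lambda p: rollout_return(policy_param, p))
    policy_param = train_step(policy_param, worst)
```

The worst-case selection is what makes the point above: the policy can't settle on a heuristic tuned to one particular setting of the underspecified parameters, because the adversary keeps serving the settings where that heuristic breaks.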
In the image classification example you gave for 1., I'd frame it as:
* The training data doesn't disambiguate between "the intended classification boundary is whether or not the image contains a dog" and "the intended classification boundary is whether or not the image contains a high-frequency artifact that results from how the researchers preprocessed their dog images", as the two features are extremely correlated on the training data. The training data has fixed a parameter that we wanted to leave unspecified (which image preprocessing we did, or equivalently, which high-frequency artifacts we left in the data).
* Over the course of normal (non-adversarial) training, we learn a model that depends on the specific settings of the parameters that we wanted to leave unspecified (that is, the specific artifacts).
* We can grant an adversary the ability to generate images that break the high-frequency artifacts while preserving the ground truth of whether the image contains a dog. The Lp norm-ball attacks often studied in the adversarial image classification literature are one example (see the sketch after this list).
* By training the model on data generated by said adversary, the model learns that the specific pattern of high-frequency artifacts doesn't matter, and can spend more of its capacity paying attention to whether or not the image contains dog features.
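For concreteness, here is a minimal sketch of that kind of adversary and training step, written in PyTorch and assuming a classifier `model`, an `optimizer`, and batches of `images` in [0, 1] with `labels`. It is a standard PGD-style L-infinity norm-ball setup, not anyone's specific pipeline.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, images, labels, eps=8/255, alpha=2/255, steps=10):
    """Search within an L-infinity ball of radius eps for a perturbation that
    maximizes the loss: small enough not to change whether the image contains
    a dog, large enough to scramble brittle high-frequency artifacts."""
    delta = torch.empty_like(images).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(images + delta), labels)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += alpha * grad.sign()                         # ascend the loss
            delta.clamp_(-eps, eps)                              # stay in the norm ball
            delta.copy_((images + delta).clamp(0, 1) - images)   # keep pixels valid
    return (images + delta).detach()

def adversarial_training_step(model, optimizer, images, labels):
    # Train on the perturbed images instead of the clean ones, so the model
    # cannot keep relying on the fragile high-frequency features.
    adv_images = pgd_attack(model, images, labels)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(adv_images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key property is the one named in the third bullet: the perturbation budget is chosen so that the dog/not-dog ground truth is preserved while the artifact pattern is not.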
What does wanting to use adversarial training say about the sorts of labeled and unlabeled data we have available to us? Are there cases where we could either do adversarial training, or alternatively train on a different (potentially much more expensive) dataset?
What does it say about how our loss function weights different sorts of errors? Are there cases where we could either do adversarial training, or just use a training algorithm with a different loss function?
- The model does not have a reasonable concept of injury that includes getting hit by a bullet, or doesn’t know that getting shot at would cause you to get hit by a bullet
- The model has this concept of injury, and understands that it will occur, but ‘thinks’ you only care about injuries involving swords or knives because almost all injuries in your training data involved those
Could you clarify (with examples) what you mean by the second point? I was thinking that the second point occurs when the model does not robustly understand the concept of injury; wouldn't that make it a deficiency in capabilities?
What are the reasons we might want to do adversarial training?
Here’s a rough taxonomy.
1. Your model has overfit to whatever particular dataset you trained on. It is best described as ‘trying’ to pursue some particular heuristic, and it appears to be doing well only because of some arbitrary correlation in your dataset. For example, it is succeeding at image classification by attending to high-frequency noise in the image. Doing adversarial training is useful because it finds examples where this heuristic is uncorrelated with what you want. This allows you to spot the problem and to improve the model's alignment. It doesn't matter if the examples are particularly consequential failures; it just matters that they help the model learn the correct pattern. The adversarial examples you find are a ‘surprise’ to the model.
2. Your model is fairly well-described as trying to pursue some goal that is unaligned with you. It performs well on your training set because this is instrumentally useful, and it plans to defect in particular situations where defecting is advantageous. Doing adversarial training is useful if you can use it to fool the model into thinking it's in some situation where it should defect. This allows you to spot that it is deceptive, and maybe train this deception away. It doesn't matter if the example involves the model causing a particularly bad thing to happen; it just matters that you are able to detect the deception. The adversarial examples you find are not a ‘surprise’ to the model: it was deliberately planning to defect in some situations.
3. Your model is vulnerable to inputs that change what it is ‘trying’ to do. For example, an adversarial AI can present your model with some sequence of inputs that cause it to start ‘trying’ to help this AI instead of you. Your model ‘wants’ to avoid this but doesn't know in advance which situations will induce it. Adversarial training is useful to find these situations and fix these vulnerabilities. It is important to find the most consequential failures: you care less about situations where the model is briefly unaligned but doesn't do significant harm.
4. Your model sometimes fails simply because it lacks the relevant capability. For example, sometimes it crashes a car because of a failure to recognise another car coming towards it. Adversarial training is useful to find cases where this happens. It is important to find the most consequential failures: you care less about situations where the model is briefly incompetent but doesn't do significant harm. The adversarial examples you find are a ‘surprise’ to the model.
Aside: defining alignment
One confusing edge case for distinguishing alignment failure from capabilities failure is if the model ‘knows’ some concept, but doesn’t ‘know’ that that was the concept you wanted.
E.g. suppose the model classifies someone getting shot with a gun as not an injury. There are 3 possibilities:
- The model does not have a reasonable concept of injury that includes getting hit by a bullet, or doesn’t know that getting shot at would cause you to get hit by a bullet
- The model has this concept of injury, and understands that it will occur, but ‘thinks’ you only care about injuries involving swords or knives because almost all injuries in your training data involved those
- The model has this concept of injury, understands that it will occur, and knows you care about it, but classifies the example as not an injury anyway
I think the first possibility is definitely a capabilities failure, and the last is definitely an alignment failure, but I’m not sure how to categorize the middle one. The key test is something like: do we expect this problem to get better or worse as model capability increases? I’m inclined to say that we expect it to get better, because a more capable model will be better at guessing what humans want, and better at asking for clarification if uncertain.