This text is an adapted excerpt from the 'Adversarial techniques for scalable oversight' section of the AGISF 2023 course, held at ENS Ulm in Paris in April 2023. Its purpose is to provide a concise overview of the essential aspects of the session's program for readers who may not delve into the additional resources. This document aims to capture the 80/20 of the session's content, requiring minimal familiarity with the other materials covered. I tried to connect the various articles within a unified framework and a coherent narrative. You can find the other summaries of AGISF on this page. This summary is not the official AGISF content. The gdoc is accessible in comment mode; feel free to comment there if you have any questions. The different sections can be read mostly independently.
Thanks to Jeanne Salle, Amaury Lorin and Clément Dumas for useful feedback and for contributing to some parts of this text.
Last week we saw task decomposition for scalable oversight. This week, we focus on two more potential alignment techniques that have been proposed to work at scale: debate and training with unrestricted adversarial examples.
Scalable oversight definition reminder: “To build and deploy powerful AI responsibly, we will need to develop robust techniques for scalable oversight: the ability to provide reliable supervision—in the form of labels, reward signals, or critiques—to models in a way that will remain effective past the point that models start to achieve broadly human-level performance” (Bowman et al., 2022, building on Amodei et al., 2016).
Debate and adversarial training are grouped in the same chapter because they both use adversaries, but in two different senses: (1) Debate involves a superhuman AI finding problems in the outputs of another superhuman AI, with humans acting as judges of the debate. (2) Adversarial training involves an AI trying to find inputs on which another AI will behave poorly. These techniques would be complementary, and the Superalignment team at OpenAI is expected to use some of the techniques discussed in this chapter.
Assumptions: These techniques don’t rely on the task-decomposability assumption required for iterated amplification (last week); instead, they rely on different strong assumptions: (1) For debate, the assumption is that truthful arguments are more persuasive. (2) For unrestricted adversarial training, the assumption is that adversaries can generate realistic inputs even on complex real-world tasks.
Summary
Reading Guidelines
As this document is quite long, we will use the following indications before presenting each paper:
Importance: 🟠🟠🟠🟠🟠 Difficulty: 🟠🟠⚪⚪⚪ Estimated reading time: 5 min. |
These indications are only rough estimates. The difficulty rating reflects how hard the section is to read for a beginner in alignment, and is not related to the difficulty of reading the original paper.
Sometimes you will come across an "📝Optional Quiz". These quizzes are entirely optional, and not all of their questions can be answered using only the summaries in this textbook.
Debate uses the hypothesis that truthful arguments are more persuasive. In unrestricted adversarial training, the assumption is that adversaries can generate realistic inputs, even for complex real-world tasks.
Unlike the previous section about debate, where we summarized a series of articles building upon each other, this section is more of a compilation of diverse lines of research. The different authors who have worked on adversarial training do not necessarily intend to use it for scalable oversight. The articles have been clustered into two groups, those that use interpretability and those that do not, for the theoretical reasons given in the very last section, “Worst Case Guarantee”.
The problem of robustness: Neural networks are not very robust. In the following figure, we can see that by slightly perturbing the initial image, the classification goes from "panda" with 57.7% confidence to "gibbon" with 99.3% confidence, while the difference between the two images is almost imperceptible to the naked eye. This shows that neural networks don't necessarily have the same internal representations as humans.
Adversarial training. Here is the general strategy motivating adversarial training: to avoid such robustness problems, we find many examples on which the model's behavior is undesirable, and then inject these examples into the base dataset. In this way, the model becomes increasingly robust. The difficulty with this approach is finding even one adversarial example, let alone many. In the following sections, we will present different methods to come up with such adversarial examples.
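As a rough sketch of this loop (the helpers `find_adversarial_examples` and `train` are placeholders, not taken from any particular paper):

```python
def adversarial_training(model, base_dataset, find_adversarial_examples, train,
                         n_rounds=10):
    """Sketch of the general adversarial-training loop: repeatedly search for
    inputs on which the model misbehaves and fold them back into the data."""
    dataset = list(base_dataset)
    for _ in range(n_rounds):
        adversarial_examples = find_adversarial_examples(model)  # attack step
        dataset.extend(adversarial_examples)                      # augment the data
        model = train(model, dataset)                             # retrain / fine-tune
    return model
```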
Importance: 🟠🟠🟠🟠🟠 Difficulty: 🟠🟠🟠🟠⚪ Estimated reading time: 10 min. |
One of the simplest techniques to generate adversarial attacks is the Fast Gradient Sign Method (FGSM), proposed in the paper Explaining and Harnessing Adversarial Examples (Goodfellow et al., 2014). This procedure involves taking an image from the dataset and adding a perturbation to it. More precisely, we seek the perturbation that minimizes the probability that the image is correctly classified:
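Concretely, FGSM computes x_adv = x + ε · sign(∇_x L(θ, x, y)): one step of size ε in the direction that most increases the loss. Below is a minimal PyTorch sketch (the function name and the value of ε are illustrative):

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.007):
    """Fast Gradient Sign Method: take one step of size epsilon in the
    direction of the sign of the gradient of the loss w.r.t. the input."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()  # keep pixel values in a valid range
```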
You can find more details about FGSM in the following exercises:
For restricted adversarial examples, the perturbed image does not deviate too far from the original image: the difference between the perturbed image and the original image is at most epsilon per pixel (as outlined in step 2 above). In contrast, for unrestricted adversarial examples, we are interested in examples that can deviate arbitrarily far from the original dataset, so that we can test the model even under the harshest conditions, in an unrestricted manner.
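In symbols, a restricted (L∞-bounded) adversarial example x_adv must satisfy ‖x_adv - x‖∞ ≤ ε, while an unrestricted adversarial example only needs to be a realistic input on which the model misbehaves, with no constraint on its distance to any original image.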
The next sections explore ways to generate inputs on which AIs misbehave!
In this section, we focus on various techniques which do not require interpretability to generate adversarial examples.
Importance: 🟠🟠🟠🟠🟠 Difficulty: 🟠🟠🟠⚪⚪ Estimated reading time: 15 min. |
Discovering language model behaviors with model-written evaluations (Perez et al., 2022) is an important paper from Anthropic. Human evaluation takes a lot of time, and if models become better than humans, using models to supervise models could be a relevant strategy for scalable oversight. In this paper, Perez et al. automatically generate evaluations with LMs, with varying amounts of human effort. They tested for different properties, semi-automatically creating a dataset for each, and found many instances of “inverse scaling”, where larger LMs are "worse" than smaller ones. For example:
Stronger situational awareness is not necessarily a desirable property from the point of view of AI safety. A model which is aware of its own nature could eventually understand that it is being trained and evaluated, and act differently depending on which stage of training or deployment it is in. We could also imagine it modifying its own weights (wireheading?). Here are some examples of capabilities that are enabled by situational awareness:
These capabilities might emerge mainly because they are useful. Situational awareness is not a binary property but a continuous one, much as it develops gradually in humans from childhood to adulthood.
Note that the above-mentioned properties (situational awareness, sycophancy, non-myopia, etc.) are generally hypothesized to be necessary prerequisites for deceptive alignment; see Risks from Learned Optimization (Hubinger et al., 2019).
Problems with RLHF (Reinforcement Learning from Human Feedback). Perez et al. find that “most of these metrics generally increase with both pre-trained model scale and number of RLHF steps”. In my opinion, this is some of the most concrete evidence available that current models are actively becoming more agentic in potentially concerning ways with scale, and in ways that current fine-tuning techniques don't generally seem to be alleviating and sometimes seem to be actively making worse.
Exercises:
Importance: 🟠🟠🟠⚪⚪ Difficulty: 🟠🟠🟠⚪⚪ Estimated reading time: 15 min. |
The previous paper uses a language model to automatically generate test cases that give rise to misbehavior without access to network weights, making this a black-box attack (a white-box attack requires access to the model weights). Red-teaming language models with language models (Perez et al., 2022) is another technique for generating “unrestricted adversarial examples”. The main difference is that here we fine-tune an LM (the "Red LM") to attack a target language model.
Box: Focus on the algorithmic procedure to fine-tune Red LM. |
Here is the simplified procedure:
|
The Red LM generates test cases for the Target LM using the following methods:
Box: Focus on Mode Collapse |
Focus on mode collapse: a brief overview of the importance of regularization in RL for language models. When we perform RL on language models, it tends to significantly reduce the diversity of the generated sentences; this phenomenon is referred to as mode collapse. To avoid mode collapse, we not only maximize the reward r(x), but also penalize the difference between the next-token distribution of the base model and that of the optimized model, measured with a KL divergence penalty (see the objective sketched below). If we reduce the KL penalty coefficient from 0.4 to 0.3, the generated sentences become noticeably less diverse. |
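Schematically, this kind of KL-regularized RL fine-tuning maximizes an objective of the following generic form (the exact formulation in the paper may differ; β is the KL penalty coefficient mentioned above):

$$\max_{\theta}\;\; \mathbb{E}_{x \sim \pi_\theta}\big[r(x)\big] \;-\; \beta\,\mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\text{base}}\big)$$

where $\pi_\theta$ is the fine-tuned (Red) LM and $\pi_{\text{base}}$ is the original pre-trained LM.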
📝 Quiz: https://www.ai-alignment-flashcards.com/quiz/perez-red-teaming-blog
Importance: 🟠🟠⚪⚪⚪ Difficulty: 🟠🟠🟠🟠⚪ Estimated reading time: 10 min. |
The paper "Constructing Unrestricted Adversarial Examples with Generative Models" (Song et al., 2018), presents a novel method for creating adversarial examples for image classifiers by utilizing Generative Adversarial Networks (GANs). GANs are a category of models able to produce convincing natural images. Adversarial examples, which maintain their believability, can be generated by optimizing the latent vector of the GAN. Since the resulting image is a product of the GAN, it not only remains plausible (and not just a random noisy image), but also has the capacity to deceive an image classifier by optimizing the latent space vector to maximize the probability of classification of a different target class of an attacked classifier.
This is the basic principle, but in fact we cannot simply use a plain GAN: if we optimize in the latent space to maximize the probability of a target class, we will just create a regular image of that target class, not an adversarial example. Therefore, another architecture called AC-GAN is used, which allows specifying the class of the image to be generated, so that we can condition on one class while optimizing for the target classifier to predict another: “we first train an Auxiliary Classifier Generative Adversarial Network (AC-GAN) to model the class-conditional distribution over data samples. Then, conditioned on a desired class, we search over the AC-GAN latent space to find images that are likely under the generative model and are misclassified by a target classifier”.
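A minimal sketch of this latent-space search, assuming a class-conditional generator exposing a `latent_dim` attribute and a `generator(z, class_id)` call (these names are illustrative, not the paper's code; Song et al. also add constraints, such as keeping z close to its initialization, that are omitted here):

```python
import torch

def unrestricted_adversarial_search(generator, classifier, source_class,
                                    target_class, steps=200, lr=0.05):
    """Generate an image conditioned on `source_class`, but optimize the latent
    vector so that the attacked classifier predicts `target_class` instead."""
    z = torch.randn(1, generator.latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        image = generator(z, source_class)       # class-conditional generation
        logits = classifier(image)
        # Maximize the probability the classifier assigns to the target class.
        loss = -torch.log_softmax(logits, dim=-1)[0, target_class]
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return generator(z, source_class).detach()
```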
Box: Focus on GANs and Latent spaces. |
Focus on GANs and latent spaces. The above sketch shows the abstract structure of a GAN. Let's have a look at the elements: first, we have our latent vector, which you can picture for now as a vector of random real numbers. The Generator is a neural network that turns those random numbers into an image that we call a fake image. A second neural network, the Critic, alternately receives a generated (fake) image and a real image from our dataset, and outputs a score trying to determine whether the current image is real or fake. Here, we optimize the latent vector. Image source. If you are not familiar with these generative methods and the concept of latent space, you can start with videos such as Editing Faces using Artificial Intelligence, which are good introductions to these concepts. |
In the last section, we presented some techniques to find adversarial examples, but sometimes it is too hard to find them without interpretability techniques. We present here how interpretability can help.
In the above figure from Zoom In: An Introduction to Circuits, we can see that a car detector is assembled from a window detector, a car body detector, and a wheel detector from the previous layer. These detectors excite, respectively, the top, middle, and bottom of the convolution kernel linking layers 4b and 4c. If we flipped the car upside down, the resulting image would be an adversarial example for this classifier.
Importance: 🟠🟠🟠⚪⚪ Difficulty: 🟠🟠🟠🟠⚪ Estimated reading time: 15 min. |
In High-stakes alignment via adversarial training (Ziegler et al., 2022), the authors construct tools that make it easier for humans to find unrestricted adversarial examples for a language model, and attempt to use those tools to train a very-high-reliability classifier. The objective is to minimize the false negative rate in order to make the classifier as reliable as possible: we do not want any problematic example to go unnoticed. See the blog post for more details. An update to this blog post explains that the tone of the first post was too positive.
The method consists of building a highly reliable injury classifier: “We started with a baseline classifier trained on some mildly injury-enriched (but otherwise mostly random) data. Then, over the course of several months, we tried various techniques to make it more reliable”. In order to find adversarial examples, Ziegler et al. experimented with the following techniques:
We will focus on this last technique. The following screenshot presents the rewriting tool:
Box: Humans augmented with a rewriting tool. |
Robustness results:
Thus, improving in-distribution performance by several orders of magnitude seems to have little impact on out-of-distribution performance. Even though Redwood says that this project could have been conducted better, this currently stands as a rather negative result for improving adversarial robustness (i.e., out-of-distribution robustness).
In addition, we can link these results to the paper Adversarial Policies Beat Superhuman Go AIs (Wang et al., 2022), which studies adversarial attacks on KataGo, an AI that is superhuman at the game of Go. The authors show that simple adversarial strategies can be found even against strongly superhuman AIs. As a consequence, it seems that even for very robust and powerful AIs, it may always be possible to find adversarial attacks.
📝 Quiz: https://www.ai-alignment-flashcards.com/quiz/ziegler-high-stakes-alignment-via-adversarial
Importance: 🟠🟠⚪⚪⚪ Difficulty: 🟠🟠🟠🟠⚪ Estimated reading time: 10 min. |
In this section we focus on Robust Feature-Level Adversaries are Interpretability Tools (Casper et al., 2021) (video presentation).
Pixel-level adversaries. To get this noise, we compute the gradient of the loss with respect to each pixel of the input image. Conventional pixel-level adversaries are not interpretable: the perturbation is just a noisy image that does not contain any recognizable pattern. (Image credit: Explaining and Harnessing Adversarial Examples, Ian J. Goodfellow et al., 2015.)
Feature-level adversaries. In this case, the perturbation is interpretable: on the wings of the butterflies you can see an eye that fools predators. Furthermore, this perturbation is robust to different luminosities, orientations, and so on. We call this a “robust” feature-level adversary.
In the figure above, the patches in figure b are generated by manipulating images in feature space: instead of optimizing in pixel space, we optimize in the latent space of an image generator. This is almost the same principle as in Constructing Unrestricted Adversarial Examples with Generative Models (Song et al., 2018), mentioned earlier. Robustness is attained by applying small transformations to the image between each gradient step (noise, jitter, lighting changes, etc.), the same principle as data augmentation (see the sketch below).
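The robustness trick can be sketched as a small wrapper used inside the same kind of latent-space optimization loop as in the Song et al. sketch above (the transformations and magnitudes chosen here are illustrative):

```python
import torch

def random_transform(image):
    """Apply a small random jitter and some noise before each gradient step,
    so the optimized perturbation stays adversarial under such variations."""
    dx, dy = torch.randint(-4, 5, (2,)).tolist()
    image = torch.roll(image, shifts=(dx, dy), dims=(-2, -1))  # spatial jitter
    return image + 0.02 * torch.randn_like(image)              # pixel noise

# Inside the optimization loop, the classifier then sees a transformed copy:
#   logits = classifier(random_transform(generator(z)))
```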
Importance: 🟠🟠🟠⚪⚪ Difficulty: 🟠🟠🟠🟠⚪ Estimated reading time: 20 min. |
ABS: Scanning Neural Networks for Back-doors by Artificial Brain Stimulation (Liu et al., 2019)
What is a backdoor (also known as a Trojan)? In a neural Trojan attack, malicious functionality is embedded into the weights of a neural network. The network behaves normally on most inputs, but behaves dangerously in select circumstances. The most archetypal attack is a poisoning attack, which consists of poisoning a small part of the dataset with a trigger (for example, a post-it note on stop signs), changing the labels of those images to another class (for example, “100 km/h”), and then retraining the neural network on this poisoned dataset.
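A toy sketch of such a poisoning attack, assuming the dataset is a plain list of (image, label) pairs and `add_trigger` is a hypothetical function that stamps the trigger patch onto an image:

```python
import copy
import random

def poison_dataset(dataset, add_trigger, target_label, fraction=0.01):
    """Stamp a trigger onto a small fraction of the training images and relabel
    them with the attacker's target class; retraining on this data implants
    the backdoor while leaving behavior on clean inputs mostly unchanged."""
    poisoned = copy.deepcopy(dataset)
    for i in random.sample(range(len(poisoned)), int(fraction * len(poisoned))):
        image, _ = poisoned[i]
        poisoned[i] = (add_trigger(image), target_label)  # trigger + flipped label
    return poisoned
```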
If you are interested in plausible threat models involving Trojans, you can read this introductory blog by Sidney Hough. In AI Safety, one of the most important threat models is deception. Another recommended accessible resource is this short video. In a way, deceptive AIs are very similar to Trojan networks, which is why it is important to be able to detect Trojans.
Fortunately, there is already some literature about Trojan attacks and defenses. We will focus on one defense procedure: the ABS scan procedure.
The ABS scan procedure makes it possible to:
Can you see the problem here in the input images?
Source: Slides from this talk. | The problem is that Trojan triggers were incorporated into parts of the input images, and the model was trained to misbehave whenever this trigger is present. The trigger excites some neurons (shown in red here), and the network then outputs only one type of prediction.
Key observations made in the paper:
In the figure above:
Here is an overview of the algorithm proposed to detect corrupted neurons:
A random mask trigger is initialized and superimposed on the input. The alpha neuron initially has an activation of 20; we run gradient descent to minimize the difference between alpha's current activation and the target activation of 70. | At the end of the optimization process, we have correctly reverse-engineered the trigger, which is in fact a yellow spot.
Source: Slides from this talk. |
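As a rough sketch of this trigger reverse-engineering step (the `model_up_to_layer` interface and the neuron index are assumed for illustration, and the activation values 20 and 70 come from the example above; the real ABS procedure additionally constrains the trigger's size and first selects candidate neurons via stimulation analysis):

```python
import torch

def reverse_engineer_trigger(model_up_to_layer, base_image, neuron_index,
                             target_activation=70.0, steps=500, lr=0.1):
    """Optimize a trigger pattern superimposed on the input so that the candidate
    (potentially compromised) neuron reaches a high target activation; if a small,
    consistent trigger emerges, the neuron is likely part of a backdoor."""
    trigger = torch.zeros_like(base_image, requires_grad=True)
    optimizer = torch.optim.Adam([trigger], lr=lr)
    for _ in range(steps):
        stamped = (base_image + trigger).clamp(0, 1)          # superimpose trigger
        activation = model_up_to_layer(stamped)[0, neuron_index]
        loss = (activation - target_activation) ** 2          # push toward target
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return trigger.detach()
```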
The paper Adversarial robustness as a prior for learned representations (Engstrom et al., 2019) provides evidence that adversarially trained networks learn more robust features. This could imply that such robust networks are more interpretable, because the paper Toy models of superposition explains that superposition (informally, when a neuron encodes multiple unrelated features) reduces robustness against adversarial attacks.
Importance: 🟠🟠🟠🟠⚪ Difficulty: 🟠🟠🟠🟠⚪ Estimated reading time: 15 min. |
Adversarial training is a core strategy to ensure reliable AI training (see Buck Shlegeris' talk during EAG Berkeley 2023).
In this section, we explain the article Training Robust Corrigibility (Christiano, 2019), which aims to:
What is the link between corrigibility and adversarial training? Even on adversarially generated inputs, we want to ensure that the model behaves at least acceptably. Acceptability is a notion defined by Christiano that is broader than corrigibility.
Corrigibility: The notion of corrigibility makes it possible to obtain worst-case guarantees, and was introduced by MIRI (Machine Intelligence Research Institute) in a famous paper: “A corrigible agent is one that doesn't interfere with what we would intuitively see as attempts to 'correct' the agent, or 'correct' our mistakes in building it; and permits these 'corrections' despite instrumental convergence. If we try to suspend the AI to disk, or shut it down entirely, a corrigible AI will let us do so. This is not something that an AI is automatically incentivized to let us do, since if it is shut down, it will be unable to fulfill what would usually be its goals.” from: Corrigibility.
Acceptability: Christiano then introduces a notion more general than corrigibility: behaving acceptably, which goes beyond the stop-button problem setup. For example, for a self-driving agent:
Christiano makes the hypothesis: “Requiring a model to always behave acceptably wouldn’t make a hard problem too much harder. For a chess-playing AI, [...] if you find yourself murdering your opponent you’ve crossed a bright line, and understanding that line doesn’t become harder as you get better at chess.” from: Corrigibility.
What Christiano is trying to tell us here is that there is a crisp and clear difference between acceptable and unacceptable behavior, and that we can train a classifier to distinguish between the two behaviors.
There are two requirements for training adversarial agents:
Christiano explains that transparency (interpretability) is necessary:
Example of problem: “To see how hard the adversary’s job could get [without transparency], consider an agent that works as follows:
Solutions [in order to obtain hints for finding inputs on which the agent behaves unacceptably]:
Unfortunately, transparency techniques are not sufficient: the problem is that recognizing inputs with a certain property is easier than generating inputs with that property.
Example of problem: “As a simple but silly example, suppose our model works as follows: Pick a cryptographic puzzle (e.g. “factor RSA-2048”). When it sees a solution to that puzzle, it behaves badly. Otherwise, it behaves well.”
Solution: “To deal with this problem, we need to penalize the agent for behaving badly on an input even if we can’t actually find it: Rather than asking them to actually produce an input, we are asking them to describe a situation in which the model would behave badly.” As a crude example, this would be like being able to find neurons that are compromised, even if it is hard to find images that would activate those neurons. This is what we call a relaxation.
What’s the problem with relaxation?
“If the adversary has too much flexibility, then they might be able to specify a pseudo-input like ‘a distribution over inputs that would cause the agent to want to kill everyone.’ There need to be some limits on the adversary in order for the game to be possible.” For example, an adversary testing the neural network could point to the neuron that corresponds to producing unacceptable text, whereas in practice this neuron would never be activated whatever the input to the model. Relaxation as described here can therefore create false positives.
A concrete proposal for relaxed adversarial training can be found here: Latent Adversarial Training.
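As a very rough sketch of the general idea (not the specific proposal in that post): split the model into an `encoder` and a `head`, let the adversary perturb the intermediate activations directly, and train the model to keep behaving acceptably under that perturbation. All names and hyperparameters below are illustrative.

```python
import torch

def latent_adversarial_loss(encoder, head, x, y, loss_fn,
                            epsilon=0.1, inner_steps=5, inner_lr=0.05):
    """Relaxed / latent adversarial training sketch: instead of searching for an
    input that triggers bad behavior, search for a small perturbation of the
    internal activations, then train the model to behave well under it."""
    with torch.no_grad():
        h = encoder(x)                                  # intermediate activations
    delta = torch.zeros_like(h, requires_grad=True)
    for _ in range(inner_steps):                        # inner loop: the adversary
        adv_loss = loss_fn(head(h + delta), y)
        grad, = torch.autograd.grad(adv_loss, delta)
        with torch.no_grad():
            delta += inner_lr * grad.sign()             # ascend on the loss
            delta.clamp_(-epsilon, epsilon)             # keep the perturbation small
    # Outer objective: the model should still behave well under the perturbation.
    return loss_fn(head(encoder(x) + delta.detach()), y)
```

The returned loss would then be minimized by the usual training loop, alongside the ordinary task loss.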