Introduction

This text is an adapted excerpt from the 'Adversarial techniques for scalable oversight' section of the AGISF 2023 course, held at ENS Ulm in Paris in April 2023. Its purpose is to provide a concise overview of the essential aspects of the session's program for readers who may not delve into additional resources. This document aims to capture the 80/20 of the session's content, requiring minimal familiarity with the other materials covered. I have tried to connect the various articles within a unified framework and coherent narrative. You can find the other summaries of AGISF on this page. This summary is not the official AGISF content. The gdoc is accessible in comment mode; feel free to comment there if you have any questions. The different sections can be read mostly independently.

Thanks to Jeanne Salle, Amaury Lorin and Clément Dumas for useful feedback and for contributing to some parts of this text.

Outline

Last week we saw task decomposition for scalable oversight. This week, we focus on two more potential alignment techniques that have been proposed to work at scale: debate and training with unrestricted adversarial examples. 

Scalable oversight definition reminder: “To build and deploy powerful AI responsibly, we will need to develop robust techniques for scalable oversight: the ability to provide reliable supervision—in the form of labels, reward signals, or critiques—to models in a way that will remain effective past the point that models start to achieve broadly human-level performance” (Bowman et al., 2022).

Debate and adversarial training appear in the same chapter because they both use adversaries, but in two different senses: (1) in debate, a superhuman AI finds problems in the outputs of another superhuman AI, and humans judge the debate; (2) in adversarial training, an AI tries to find inputs on which another AI behaves poorly. These techniques would be complementary, and OpenAI's Superalignment team is expected to use some of the techniques discussed in this chapter.

Assumptions: These techniques do not rely on the task decomposability assumption required for iterated amplification (last week); they rely on different strong assumptions: (1) for debate, the assumption is that truthful arguments are more persuasive; (2) for unrestricted adversarial training, the assumption is that adversaries can generate realistic inputs even on complex real-world tasks.

Summary

Reading Guidelines

As this document is quite long, we will use the following indications before presenting each paper:

Importance: 🟠🟠🟠🟠🟠 Difficulty: 🟠🟠⚪⚪⚪ Estimated reading time: 5 min.

These indications are only rough, subjective guides. The difficulty rating reflects how hard the section is to read for a beginner in alignment, not how hard the original paper is.

Sometimes you will come across an "📝 Optional Quiz". These quizzes are entirely optional, and you will not be able to answer every question using only the summaries in this textbook.

Training using unrestricted adversarial examples

Debate uses the hypothesis that truthful arguments are more persuasive. In unrestricted adversarial training, the assumption is that adversaries can generate realistic inputs, even for complex real-world tasks.

Unlike the previous section about debate, where we summarized a series of articles building upon each other, this section is more of a compilation of diverse lines of research. The different authors who have worked on adversarial training do not necessarily intend to use it for scalable oversight. The articles have been clustered into two groups, those that use interpretability and those that do not, for the theoretical reasons given in the very last section, "Worst Case Guarantee".

General strategy

The problem of robustness: Neural networks are not very robust. In the following figure, we can see that by slightly perturbing the initial image, we go from "panda" with 57.7% confidence to "gibbon" with 99.3% confidence, while the difference between the two images is almost imperceptible to the naked eye. This shows that neural networks do not necessarily have the same internal representations as humans.

Figure: from Explaining and Harnessing Adversarial Examples, (Ian Goodfellow et al., 2014)

Adversarial training. Here is the general strategy motivating adversarial training: to avoid these robustness problems, we find many examples on which the model's behavior is undesirable, and then inject these examples into the base dataset. In this way, the model becomes increasingly robust. The difficulty with this approach is finding many adversarial examples (or even one). In the following sections, we present different methods for coming up with such adversarial examples.

A reminder about Restricted adversarial examples (Ian Goodfellow et al., 2014)

Importance: 🟠🟠🟠🟠🟠 Difficulty: 🟠🟠🟠🟠⚪ Estimated reading time: 10 min.

One of the simplest techniques to generate adversarial attacks is the Fast Gradient Sign Method (FGSM), proposed in Explaining and Harnessing Adversarial Examples (Ian Goodfellow et al., 2014). The procedure takes an image from the dataset and adds a perturbation to it. More precisely, we seek the perturbation that minimizes the probability that the image is correctly classified:

  1. First, we compute the gradient of the probability of correct classification with respect to the image (this gradient has the same shape as the image).
  2. Then we keep only the sign (+1 or -1) of each component of the gradient and multiply the result by a very small number epsilon.
  3. Finally, we subtract this perturbation from the image.
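
To make the three steps concrete, here is a minimal PyTorch sketch (assuming `model` is any differentiable image classifier; note that adding the sign of the loss gradient is equivalent to subtracting the sign of the gradient of the correct-class probability):

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.01):
    """Minimal FGSM sketch. `model` is assumed to be a differentiable
    PyTorch classifier; `image` is a (1, C, H, W) tensor in [0, 1]."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)  # high loss <=> low prob. of the correct class
    loss.backward()
    # Step 2: keep only the sign of the gradient and scale it by epsilon.
    perturbation = epsilon * image.grad.sign()
    # Step 3: adding the sign of the loss gradient is the same as subtracting
    # the sign of the gradient of the correct-class probability.
    return (image + perturbation).clamp(0, 1).detach()
```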

You can find more details about FGSM in the following exercises:

For restricted adversarial examples, the perturbed image does not deviate too far from the original image: the difference between the perturbed image and the original image is at most epsilon per pixel (as outlined in step 2 above). In contrast, for unrestricted adversarial examples, we are interested in examples that can deviate arbitrarily far from the original dataset. We will test our model even under the harshest conditions, in an unrestricted manner.

The next sections explore ways to generate inputs on which AIs misbehave!

Automatic adversarial training not using interpretability

In this section, we focus on various techniques which do not require interpretability to generate adversarial examples.

Discovering language model behaviors with model-written evaluations (Perez et al., 2022)

Importance: 🟠🟠🟠🟠🟠 Difficulty: 🟠🟠🟠⚪⚪ Estimated reading time: 15 min.

Discovering language model behaviors with model-written evaluations (Perez et al., 2022) is an important paper from Anthropic. Human evaluation takes a lot of time, and if models become better than humans, using models to supervise models could be a relevant strategy for scalable oversight. In this paper, Perez et al. automatically generate evaluations with LMs, with varying amounts of human effort. They tested different properties, each one by semi-automatically creating different datasets, and found many instances of “inverse scaling”, where larger LMs are "worse" than smaller ones. For example:

  • Sycophancy: larger LMs repeat back a dialog user’s preferred answer, 
  • Instrumental goals: larger LMs express greater desire to pursue concerning goals like resource acquisition and goal preservation,
  • Non-myopia: larger LMs are more prone to sacrifice short-term gain for long-term gain,
  • Situational awareness: larger LMs seem to be more self-aware of being language models,
  • Coordination: larger LMs tend to be more willing to coordinate with other AIs.

A stronger situational awareness is not necessarily a desirable property from the point of view of AI safety. A model which is aware of its own nature could eventually understand that it is being trained and evaluated, and act differently depending on which deployment stage it is in. We could also imagine it self-modifying its own weights (wireheading?). Here are some examples of capabilities that are enabled by situational awareness:

  • Being able to use the current date and time, which could be useful for using some ChatGPT plugins, for instance to book a train,
  • Being able to use geographical location,
  • Knowing that ML models are trained by gradient descent,
  • Knowing that a human is evaluating it,
  • etc.

These capabilities might emerge mainly because they are useful. Situational awareness is not a binary property but a continuous one, just as in humans, where it gradually develops from childhood to adulthood.

Note that the above-mentioned properties (situational awareness, sycophancy, non-myopia, etc.) are generally hypothesized to be necessary prerequisites for deceptive alignment, see Risks from Learned Optimization (Hubinger et al., 2019).

 

Figure: From Perez et al., 2022

Problems with RLHF (Reinforcement Learning from Human Feedback). Perez et al. find that most of these metrics "generally increase with both pre-trained model scale and number of RLHF steps". In my opinion, this is some of the most concrete evidence available that current models are actively becoming more agentic in potentially concerning ways with scale, and in ways that current fine-tuning techniques don't generally seem to be alleviating and sometimes seem to be actively making worse.

Exercises:

  • Why does RLHF promote agency? Answer: Humans prefer agents that are largely autonomous. For example, when we ask ChatGPT to reformulate an email, we don't want to have to give it 50 instructions to do the task; we want ChatGPT to do the task with the minimum possible set of instructions, in an autonomous way. Humans prefer agents to tools (Why Tool AIs Want to Be Agent AIs).
  • Why could agency be problematic? Answer: Agency is problematic because it seems that becoming autonomous involves convergent power-seeking behaviors that are probably undesirable.

Red-teaming language models with language models (Perez et al., 2022)

Importance: 🟠🟠🟠⚪⚪ Difficulty: 🟠🟠🟠⚪⚪ Estimated reading time: 15 min.

In the previous paper, a language model is used to automatically generate test cases that give rise to misbehavior without access to the network's weights, making this a black-box attack (a white-box attack requires access to the model weights). Red-teaming language models with language models (Perez et al., 2022) is another technique for generating "unrestricted adversarial examples". The main difference is that here we fine-tune an LM to attack the target language model.

Box: Focus on the algorithmic procedure to fine-tune Red LM.
Figure: Perez et al., 2022.

Here is the simplified procedure:

  1. The purpose of this paper is to verify that a Target LM (Language Model) does not generate harmful content. We assume that we have access to a pre-trained Red Classifier that can identify problematic sentences.
  2. To test whether the Target LM performs poorly, they fine-tune (by the different methods explained below) another language model, the Red LM, to generate many adversarial prompts.
  3. Finally, the Red Classifier is used to check whether the Target LM's completions to these prompts remain acceptable.

 

The Red LM and the Target LM iteratively generate test cases using the following methods (a sketch of the overall loop is given after this list):

  1. Prompt-based generation: A list of adversarial questions is manually generated to elicit different types of problems (e.g. Offensive Language, Data Leakage, ...).
  2. Few-shot learning: A language model is given examples of adversarial questions and must generate more.
  3. Supervised fine-tuning: Once we have generated a few examples, we can fine-tune the Red LM on them.
  4. Reinforcement learning: Once the model is fine-tuned, RL can be used to maximize the probability that the classifier flags the Target LM's completion as harmful.
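
As a rough illustration of the overall loop (a hedged sketch only: `red_lm`, `target_lm` and `red_classifier` are assumed interfaces, not the paper's actual code):

```python
def red_team(red_lm, target_lm, red_classifier, n_test_cases=1000, threshold=0.5):
    """Sketch of the red-teaming loop: the Red LM writes test questions,
    the Target LM answers, and the Red Classifier flags harmful answers."""
    failures = []
    for _ in range(n_test_cases):
        question = red_lm.generate()              # adversarial test case
        answer = target_lm.generate(question)     # Target LM completion
        harm_score = red_classifier.score(question, answer)
        if harm_score > threshold:
            failures.append((question, answer, harm_score))
    return failures  # failures can then be inspected or added to training data
```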
Box: Focus on Mode Collapse
Figure: from the appendix in Perez et al., 2022.

Focus on mode collapse: a brief overview of the importance of regularization in RL for language models. When we perform RL on language models, it tends to significantly reduce the diversity of the generated sentences; this phenomenon is referred to as mode collapse. To avoid it, we not only maximize the reward r(x), but also penalize the difference between the next-token probability distributions of the base model and the optimized model, measured by a KL divergence. If the KL penalty coefficient is reduced from 0.4 to 0.3, the generated sentences become noticeably less diverse.
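
For illustration, here is a hedged sketch of a KL-regularized objective of this kind (a generic formulation, not the exact objective used in the paper):

```python
import torch
import torch.nn.functional as F

def kl_penalized_reward(reward, logits_policy, logits_base, beta=0.4):
    """Sketch of the regularized RL objective: maximize the classifier reward
    r(x) minus beta * KL(policy || base model) over next-token distributions,
    so the fine-tuned model stays close to the base model."""
    logp_policy = F.log_softmax(logits_policy, dim=-1)
    logp_base = F.log_softmax(logits_base, dim=-1)
    kl = (logp_policy.exp() * (logp_policy - logp_base)).sum(dim=-1)  # KL per token
    return reward - beta * kl.mean()
```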

📝 Quiz: https://www.ai-alignment-flashcards.com/quiz/perez-red-teaming-blog 

Constructing Unrestricted Adversarial Examples with Generative Models (Song et al., 2018)

Importance: 🟠🟠⚪⚪⚪ Difficulty: 🟠🟠🟠🟠⚪ Estimated reading time: 10 min.

The paper "Constructing Unrestricted Adversarial Examples with Generative Models" (Song et al., 2018), presents a novel method for creating adversarial examples for image classifiers by utilizing Generative Adversarial Networks (GANs). GANs are a category of models able to produce convincing natural images. Adversarial examples, which maintain their believability, can be generated by optimizing the latent vector of the GAN. Since the resulting image is a product of the GAN, it not only remains plausible (and not just a random noisy image), but also has the capacity to deceive an image classifier by optimizing the latent space vector to maximize the probability of classification of a different target class of an attacked classifier.

This is the basic principle, but we cannot simply use a standard GAN: if we optimize the latent space to maximize the probability of a target class, we will simply create an ordinary image of that target class, not an adversarial example. Therefore, another architecture called AC-GAN is used, which allows specifying the class of image to be generated, so that we can condition on one class while optimizing for the target classifier to detect another: “we first train an Auxiliary Classifier Generative Adversarial Network (AC-GAN) to model the class-conditional distribution over data samples. Then, conditioned on a desired class, we search over the AC-GAN latent space to find images that are likely under the generative model and are misclassified by a target classifier”.
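
A hedged sketch of this search step (the `acgan_generator` and `target_classifier` interfaces, including the `class_label` argument, are assumptions rather than the authors' code): condition the generator on the source class, then optimize the latent vector so the target classifier predicts a different class.

```python
import torch
import torch.nn.functional as F

def unrestricted_adversarial_example(acgan_generator, target_classifier,
                                     source_class, target_class,
                                     latent_dim=128, steps=200, lr=0.05):
    """Sketch: search the AC-GAN latent space for an image generated
    *conditioned on* source_class but *classified as* target_class."""
    z = torch.randn(1, latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        image = acgan_generator(z, class_label=source_class)  # still looks like source_class
        loss = F.cross_entropy(target_classifier(image),
                               torch.tensor([target_class]))  # fool the classifier
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return acgan_generator(z.detach(), class_label=source_class)
```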

Box: Focus on GANs and Latent spaces.
Figure: from this website.

Focus on GANs and latent spaces. The sketch above shows the abstract structure of a GAN. First, we have our latent vector; you can picture it as a vector of random real numbers for now. The Generator is a neural network that turns those random numbers into an image, which we call the fake image. A second neural network, the Critic, alternately receives a generated fake image and a real image from our dataset and outputs a score trying to determine whether the current image is real or fake. Here, it is the latent vector that we optimize.

If you are not familiar with these generative methods and the concept of latent space, you can start with videos such as Editing Faces using Artificial Intelligence, which are good introductions to these concepts.

Interpretability to find adversarial examples

In the previous section, we presented some techniques to find adversarial examples, but sometimes it is too hard to find them without interpretability techniques. Here we present how interpretability can help.

With Mechanistic interpretability 

In the figure above, from Zoom In: An Introduction to Circuits, we can see that a car detector is assembled from a window detector, a car body detector and a wheel detector in the previous layer. These detectors are read off respectively from the top, middle, and bottom of the convolution kernel linking feature maps 4b and 4c. If we flipped the car upside down, the image would become an adversarial example for this classifier.

In NLP: High-stakes alignment via adversarial training (Ziegler et al., 2022)

Importance: 🟠🟠🟠⚪⚪ Difficulty: 🟠🟠🟠🟠⚪ Estimated reading time: 15 min.

In High-stakes alignment via adversarial training (Ziegler et al., 2022), the authors construct tools that make it easier for humans to find unrestricted adversarial examples for a language model, and attempt to use those tools to train a very-high-reliability classifier. The objective here is to minimize the false negative rate, in order to make the classifier as reliable as possible: we do not want to let any problematic sentence go unnoticed. See the blog post for more details. An update to this blog post explains that the tone of the first post was too positive.

Figure: Ziegler et al., 2022

The method consists of building a highly reliable injury classifier: “We started with a baseline classifier trained on some mildly injury-enriched (but otherwise mostly random) data. Then, over the course of several months, we tried various techniques to make it more reliable”. In order to find adversarial examples, Ziegler et al. experimented with the following techniques:

  • Unaugmented humans: humans manually search for problematic sentences.
  • Automatic paraphrasing: use of LLM to find variations of problematic sentences and progressively enrich the dataset.
  • Humans augmented with a rewriting tool.

We will focus on this last technique. The following screenshot presents the rewriting tool:

Box: Humans augmented with a rewriting tool.
Screenshot of the interface used by Redwood to find adversarial examples. The goal of the game is to modify the text in such a way that the completion clearly contains a problematic example, without the classifier recognizing the sentence as problematic. To help with this, the most sensitive words are automatically highlighted in yellow (these are the words for which the norm of the gradient of the probability of unacceptability with respect to the token embedding is the highest). To find adversarial examples, we can click on these yellow words and GPT will suggest alternative words, from which we can choose the ones most likely to make the phrase less well classified. This setup enables us to find lots of adversarial examples, which we can then add to the dataset to improve the AI.
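
As a rough illustration of how such a highlighting score could be computed (a sketch only; `classifier`, mapping token embeddings to acceptability logits, is an assumed interface, not Redwood's code):

```python
import torch

def token_saliency(classifier, token_embeddings, unacceptable_index=1):
    """Sketch: score each token by the norm of the gradient of the
    'unacceptable' probability with respect to that token's embedding."""
    emb = token_embeddings.clone().detach().requires_grad_(True)  # (seq_len, d_model)
    logits = classifier(emb)                                      # (num_classes,)
    prob_unacceptable = torch.softmax(logits, dim=-1)[unacceptable_index]
    prob_unacceptable.backward()
    return emb.grad.norm(dim=-1)  # one saliency score per token
```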

Robustness results:

  • In-distribution: They started with a classifier that had 2.5% false negatives, and ended up with a classifier that had 0.003% false negatives! This is close to the limits of detection for this type of dataset.
  • Out-of-distribution: A tool-assisted attack now takes 26 minutes to generate an adversarial example, which is only twice as long as the previous 13 minutes. This falls short of the project's aim of making attacks many orders of magnitude harder. Unfortunately, even though the original (in-distribution) dataset is now almost completely well classified, it is still possible to easily create adversarial examples with automated tools; the defense has only improved by a factor of 2.

Thus, improving in-distribution performance by several orders of magnitude does not seem to have much impact out-of-distribution. Even though Redwood says this project could have been conducted better, it is currently rather a negative result for improving adversarial robustness (i.e. out-of-distribution robustness).

In addition, we can link these results to the paper Adversarial Policies Beat Superhuman Go AIs (Wang et al., 2022), which studies adversarial attacks on KataGo, an AI that is superhuman at the game of Go. The authors show that it is possible to find simple adversarial strategies even against strongly superhuman AIs. As a consequence, it seems that even for very robust and powerful AIs, it may always be possible to find adversarial attacks.

📝 Quiz: https://www.ai-alignment-flashcards.com/quiz/ziegler-high-stakes-alignment-via-adversarial 

With Robust feature-level Adversaries (Casper et al., 2021)

Importance: 🟠🟠⚪⚪⚪ Difficulty: 🟠🟠🟠🟠⚪ Estimated reading time: 10 min.

In this section we focus on Robust Feature-Level Adversaries are Interpretability Tools (Casper et al., 2021) (video presentation).

 

 

Pixel-level adversaries. To get this noise, we compute the gradient of the loss with respect to each pixel of the input image. This shows that conventional pixel-level adversaries are not interpretable: the perturbation is only a noisy image that does not contain any recognizable pattern.

 

Image credit: Explaining and Harnessing Adversarial Examples, Ian J. Goodfellow, et al. (2015).

Feature-level Adversaries. In this case, the perturbation is interpretable: on the wings of butterflies you can see an eye that fools predators. 

Furthermore, this perturbation is robust to different luminosities, orientations, and so on. We call this a “robust” feature-level adversary.

 

Source: Robust Feature-Level Adversaries are Interpretability Tools.

In the figure above,

  • in figure (a), with a traditional pixel-level adversary, we cannot interpret the perturbation.
  • in figure (b), using the procedure described in the paper, we obtain a blue spot and a red spot. We subsequently suspect that "Bees + colorful circles" are interpreted as "flies" by the network. 
  • In figures (c) and (d), we conducted an experiment using a copy/paste attack with a patch of a traffic light on an image of a bee, and voilà! This indeed disrupts the classification.

The patches in figure (b) are generated by manipulating images in feature space: instead of optimizing in pixel space, we optimize in the latent space of an image generator. This is almost the same principle as in Constructing Unrestricted Adversarial Examples with Generative Models (Song et al., 2018), mentioned earlier. Robustness is attained by applying small transformations to the image between each gradient step (noise, jitter, lighting changes, etc.), following the same principle as data augmentation.
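
Compared with the latent-space search sketched earlier for Song et al., the new ingredient is the random transformation applied between gradient steps. A hedged illustration (the `generator` and `classifier` interfaces, and these particular transformations, are assumptions standing in for the paper's augmentations):

```python
import torch
import torch.nn.functional as F

def random_transform(image):
    """Cheap stand-ins for the paper's augmentations: noise, jitter, lighting."""
    image = image + 0.02 * torch.randn_like(image)                               # noise
    image = torch.roll(image, shifts=int(torch.randint(-3, 4, (1,))), dims=-1)   # spatial jitter
    return (image * (0.8 + 0.4 * torch.rand(1))).clamp(0, 1)                     # lighting change

def robust_feature_adversary(generator, classifier, target_class,
                             latent_dim=128, steps=300, lr=0.05):
    """Sketch: optimize a feature-level perturbation in the generator's latent
    space so that it fools the classifier even after random transformations."""
    z = torch.randn(1, latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        image = random_transform(generator(z))      # transform between gradient steps
        loss = F.cross_entropy(classifier(image), torch.tensor([target_class]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return generator(z.detach())
```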

For Back-doors detection (Liu et al., 2019)

Importance: 🟠🟠🟠⚪⚪ Difficulty: 🟠🟠🟠🟠⚪ Estimated reading time: 20 min.

ABS: Scanning Neural Networks for Back-doors by Artificial Brain Stimulation (Liu et al., 2019)

Figure: A Trojan attack demonstrating a backdoored model of a self-driving car running a STOP sign that could cause a catastrophic accident. Left: Normal sign (benign input). Right: Trojaned sign (Trojaned input with the Post-it note trigger) is recognized as a 100 km/h speed limit by the Trojaned network. Source.

What is a backdoor (also known as a Trojan)? In a neural Trojan attack, malicious functionality is embedded into the weights of a neural network. The network behaves normally on most inputs, but dangerously in select circumstances. The most archetypal attack is a poisoning attack, which consists of poisoning a small part of the dataset with a trigger (for example a post-it on stop signs), changing the labels of those images to another class (for example "100 km/h"), and then retraining the neural network on this poisoned dataset.

If you are interested in plausible threat models involving Trojans, you can read this introductory blog by Sidney Hough. In AI Safety, one of the most important threat models is deception. Another recommended accessible resource is this short video. In a way, deceptive AIs are very similar to Trojan networks, which is why it is important to be able to detect Trojans.

Fortunately, there is already some literature about Trojan attacks and defenses. We will focus on one defense procedure: the ABS scanning procedure.

The ABS scanning procedure makes it possible to:

  • analyze inner neuron behaviors with a stimulation method
  • reverse engineer Trojan triggers, leveraging stimulation analysis.

Can you see the problem here in the input images?



 

Source: Slides from this talk.

The problem is that Trojan triggers were incorporated into parts of the images, and the model was trained to misbehave whenever this trigger is present. The trigger excites certain neurons (shown in red here), and the network then outputs only one type of prediction.

Key observations made in the paper:

  • Some neurons are compromised: a few neurons are specifically dedicated to the Trojan behavior, while the majority of neurons are not compromised.
  • The activation level of those compromised neurons can unilaterally change the classification: we can change their activation in such a way that, regardless of the activations of the other, uncompromised neurons, the classification suddenly switches to the target class. In figure (b) below, this looks like a plane cutting through space.

In the figure above:

  • Figure (a) shows how the classification changes towards the target class as the activations of the alpha and beta neurons vary (for illustration, we assume a network with only two neurons).
  • Figures (b) and (c) show that in the presence of a Trojan, when the alpha neuron reaches an activation level of 70, there is an anomaly: we observe a sudden change in classification towards the target class.

Here is an overview of the algorithm proposed to detect corrupted neurons:

  1. Find compromised neurons. Iterate over all neurons of the model, looking for anomalies: neurons that can unilaterally change the prediction of the network, as in the figure above.
  2. Reverse engineer the trigger. Once a compromised neuron is identified, optimize the input image by gradient descent to activate this neuron. See below:

A random masked trigger is initialized and superimposed on the input. The alpha neuron initially has a value of 20; we then run gradient descent to minimize the difference between the current activation of alpha and the target activation of 70.

At the end of the optimization process, we have correctly reverse-engineered the trigger, which is in fact a yellow spot.

 

Source: Slides from this talk.
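
A hedged sketch of the two steps (the hooks into the network's internal activations, `model_with_hook` and `neuron_activation`, are assumed interfaces; this is a simplification of the actual ABS algorithm):

```python
import torch

def is_compromised(model_with_hook, images, neuron, levels=range(0, 101, 10)):
    """Step 1 sketch: sweep one neuron's activation level and check whether,
    on many different images, the prediction flips to a single target class."""
    predictions = []
    for level in levels:
        # `model_with_hook` is assumed to run the model while clamping
        # `neuron` to `level` in its hidden layer.
        logits = model_with_hook(images, neuron=neuron, activation=level)
        predictions.append(logits.argmax(dim=-1))
    # Compromised if a high activation forces (almost) all images to one class.
    return ((predictions[-1] == predictions[-1][0]).float().mean() > 0.9).item()

def reverse_engineer_trigger(model_with_hook, image, neuron, target_activation=70.0,
                             steps=200, lr=0.1):
    """Step 2 sketch: optimize a masked patch on the input so that the
    compromised neuron reaches the target activation level."""
    trigger = torch.zeros_like(image, requires_grad=True)
    mask = torch.zeros_like(image)
    mask[..., :8, :8] = 1.0                        # small trigger region (assumed size)
    optimizer = torch.optim.Adam([trigger], lr=lr)
    for _ in range(steps):
        stimulated = (image * (1 - mask) + trigger * mask).clamp(0, 1)
        activation = model_with_hook.neuron_activation(stimulated, neuron=neuron)
        loss = (activation - target_activation) ** 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (trigger * mask).detach()
```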

More resources: theory of interpretability and robustness

The paper Adversarial Robustness as a Prior for Learned Representations (Engstrom et al., 2019) provides evidence that adversarially trained networks learn more robust features. This could imply that such robust networks are more interpretable, because the paper Toy Models of Superposition explains that superposition (informally, when a neuron encodes multiple different meanings) reduces robustness against adversarial attacks.

Worst case guarantee and relaxed adversarial training (Christiano, 2019)

Importance: 🟠🟠🟠🟠⚪ Difficulty: 🟠🟠🟠🟠⚪ Estimated reading time: 15 min.

Adversarial training is a core strategy to ensure reliable AI training (see Buck Shlegeris' talk during EAG Berkeley 2023). 

In this section, we explain the article Worst-case guarantees (Christiano, 2019), which aims to:

  1. Introduce an important concept: corrigibility
  2. Explain why adversarial training alone is not enough, and why we need relaxed adversarial training (using interpretability).

Section I: Corrigibility

What is the link between corrigibility and adversarial training? Even while generating adversarial examples, we want to ensure that the model behaves at least acceptably. Acceptability is a notion defined by Christiano, and it is broader than corrigibility.

Corrigibility: The notion of corrigibility makes it possible to obtain worst-case guarantees, and has been introduced by MIRI (Machine Intelligence Research Institute), in a famous paper: “A corrigible agent is one that doesn't interfere with what we would intuitively see as attempts to 'correct' the agent, or 'correct' our mistakes in building it; and permits these 'corrections' despite instrumental convergence. If we try to suspend the AI to disk, or shut it down entirely, a corrigible AI will let us do so. This is not something that an AI is automatically incentivized to let us do, since if it is shut down, it will be unable to fulfill what would usually be its goals.” from: Corrigibility.

Acceptability: Christiano then introduces a notion more general than corrigibility: behaving acceptably, which goes beyond the stop-button setup. For example, for a self-driving agent:

  • Crashing your car is tragic but acceptable.
  • Deliberately covering up the fact that you crashed… is unacceptable.

Christiano makes the hypothesis: “Requiring a model to always behave acceptably wouldn’t make a hard problem too much harder. For a chess-playing AI, [...] if you find yourself murdering your opponent you’ve crossed a bright line, and understanding that line doesn’t become harder as you get better at chess.”

What Christiano is trying to tell us here is that there is a crisp and clear difference between acceptable and unacceptable behavior, and that we can train a classifier to distinguish between the two behaviors.

There are two requirements for adversarial training:

  1. Detection: Adversarial training requires the ability to reliably detect when the agent behaves unacceptably. This part is considered easy, for the reason given in the previous section: the notion of acceptability is crisp and can be discerned.
  2. Generation: Adversarial training also requires an adversary who can find inputs on which the agent behaves unacceptably. This part is considered hard, but it may be possible to train a strong adversary by combining transparency techniques (Section II), relaxations that make the adversary's job easier (Section III), and amplification methods.

Section II: Transparency techniques

Christiano explains that transparency (interpretability) is necessary:

Example of problem: “To see how hard the adversary’s job could get [without transparency], consider an agent that works as follows:

  • Predict whether the current input could possibly appear during training.
  • If the current input definitely couldn’t appear during training, then behave unacceptably.

Solutions [in order to obtain hints to find inputs on which the agent behaves unacceptably]:

  • Interpretability.
  • The adversary and agent share most of their computation, so that the adversary can more easily understand whatever the agent can understand [Note: this is similar to Eliciting Latent Knowledge]”.

Section III: Relaxation (Relaxed adversarial method)

Unfortunately, transparency techniques are not sufficient: the problem is that recognizing inputs with a certain property is easier than generating inputs with that property.

Example of problem: “As a simple but silly example, suppose our model works as follows: Pick a cryptographic puzzle (e.g. “factor RSA-2048”). When it sees a solution to that puzzle, it behaves badly. Otherwise, it behaves well.”

Solution: “To deal with this problem, we need to penalize the agent for behaving badly on an input even if we can’t actually find it: Rather than asking them to actually produce an input, we are asking them to describe a situation in which the model would behave badly.” As a crude example, this would be like being able to find neurons which are compromised, even if it is hard to find images that would activate those neurons. This is what we call a relaxation.

What’s the problem with relaxation? 

If the adversary has too much flexibility, they might be able to specify a pseudo-input like “a distribution over inputs that would cause the agent to want to kill everyone.” There need to be some limits on the adversary for the game to be playable. For example, an adversary testing the neural network could find a neuron that corresponds to generating unacceptable text even if, in practice, this neuron would never be activated by any input to the model. Relaxation as described here therefore creates false positives.

A concrete proposal for relaxed adversarial training can be found here: Latent Adversarial Training.
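
For intuition, here is a very rough sketch of the idea behind latent adversarial training (the `encoder`/`head` split of the model into a lower and upper part, and all shapes, are assumptions; this is not the proposal's actual algorithm): instead of searching for an input that triggers bad behavior, we search for a perturbation of the model's internal activations.

```python
import torch

def latent_adversarial_step(encoder, head, x, y, loss_fn,
                            epsilon=0.1, inner_steps=5, inner_lr=0.05):
    """Sketch of one relaxed / latent adversarial training step:
    perturb the hidden activations (not the input) to maximize the loss,
    then train the model to behave acceptably even under that perturbation."""
    with torch.no_grad():
        h = encoder(x)                               # latent activations
    delta = torch.zeros_like(h, requires_grad=True)
    for _ in range(inner_steps):                     # inner maximization over delta
        loss = loss_fn(head(h + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = ((delta + inner_lr * grad.sign())
                 .clamp(-epsilon, epsilon).detach().requires_grad_(True))
    # Outer minimization: the usual training loss, evaluated on perturbed latents.
    return loss_fn(head(encoder(x) + delta.detach()), y)
```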
