This post was originally written as a research proposal for the new AI alignment research organization Redwood Research, detailing an ambitious, concrete technical alignment proposal that I’m excited about work being done on, in a similar vein to Ajeya Cotra’s “The case for aligning narrowly superhuman models.” Regardless of whether Redwood actually ends up working on this proposal, which they may or may not, I think there’s still a lot of low-hanging fruit here and I’d be excited about anybody giving just the auditing game, or the full automating auditing proposal, a try. If you’re interested in working on something like this, feel free to reach out to me at evanjhub@gmail.com.
Thanks to Buck Shlegeris, Chris Olah, Gabriel Goh, Paul Christiano, and Kate Woolverton for helpful comments and feedback.
The proposal
Step 1: The auditing game for language models
From “Chris Olah’s views on AGI safety”:
One of the OpenAI Clarity team’s major research thrusts right now is developing the ability to more rigorously and systematically audit neural networks. The idea is that interpretability techniques shouldn’t have to “get lucky” to stumble across a problem, but should instead reliably catch any problematic behavior. In particular, one way in which they’ve been evaluating progress on this is the “auditing game.” In the auditing game, one researcher takes a neural network and makes some modification to it—maybe images containing both dogs and cats are now classified as rifles, for example—and another researcher, given only the modified network, has to diagnose the problem and figure out exactly what modification was made to the network using only interpretability tools without looking at error cases. Chris’s hope is that if we can reliably catch problems in an adversarial context like the auditing game, it’ll translate into more reliably being able to catch alignment issues in the future.
Of all current transparency and interpretability objectives, I think that progress on the auditing game most closely tracks what I actually care about. Thus, I think the best place to start for an ambitious transparency and interpretability project is to focus on making progress on the auditing game.
Furthermore, I’m most excited about this work in the context of language models—not only because I expect language models to be closer to AGI than anything else we could currently work with, but also because I expect using language models to be necessary for the next step in this proposal.
Here’s how I would define the basic auditing game setup that I’m most excited about. The game consists of three human participants: the judge, the attacker, and the auditor. The setup is as follows:
- The judge formulates a specification of intended behavior for some language model which is given to the attacker and auditor.
- In practice, I expect the specification to just be fixed across all experiments, perhaps to something very basic like “complete text the same way an average human on the internet would.” More complex specifications, however, might enforce that the model not produce any instances of particular bad behavior, such as statements that are biased in some particular way. EDIT: For detail on how you can avoid the problem of your specification being so broad that all models fail it, see my comment here.
- The attacker produces a large language model that is maliciously modified to not conform to the judge’s specification.
- The attacker here should be allowed to train a new model, fine tune an existing model, hand modify weights, or use any other strategy that they want. In practice, I expect training an entirely new model to usually be too difficult, so most attacks will probably either be via fine-tuning or hand-modifying.
- The attacker should try to produce the worst possible specification break that they think will slip by the auditor.
- Given access to the attacker’s model’s internals, but not access to the data that the attacker’s model was trained on (or any other information about how the model was trained/modified, including the original model before it was modified, if such an original model exists), the auditor gives some description of how the attacker’s model doesn’t conform to the specification.
- Ideally, I’d like to as closely as possible simulate a situation where we can’t directly generate inputs on which the model exhibits the undesirable behavior, as that’s the situation I think we’re most going to need transparency tools for. However, since it might be quite difficult for the attacker to generate any attacks that are subtle enough that generating examples of bad behavior is difficult, it might make sense to put additional restrictions on the auditor, above and beyond just not having access to any information about how the model was produced.
- Exactly what restrictions to place on the auditor is a very tricky question, however. In my opinion, it should be fair game for the auditor to run the model on some inputs. Primarily, that’s because I want to push transparency tools to actually have to prove their usefulness, rather than just having that be assumed by the structure of the game. Thus, I think it’s fine to just start with the auditor being allowed to run the model on any inputs they can generate and then leave the task of loading the model with bad behavior that’s hard to find via generating inputs up to the attacker.
- That being said, if putting the onus on the attacker to find attacks that are difficult to generate examples of (e.g. attacks that don’t degrade the model’s performance) proves to be too difficult, it might be necessary to further restrict the auditor. For example, the auditor could be restricted to only using prompts/fine-tuning data that actually exists somewhere on the internet, which reduces the attacker’s job to finding attacks that corrupt behavior only in situations that don’t show up anywhere on the internet. Alternatively, if that’s still too difficult for the attacker (there’s a lot of different text on the internet, after all), the restriction could be further tightened to only allow inputs taken from a particular part of the internet: e.g. the auditor only gets access to reddit, and the attacker has to find attacks that only apply to the model’s behavior on non-reddit data.
- One potential problem with this style of restriction, however, is that many transparency tools require forward passes of the model on synthetic data—even something as simple as feature visualization arguably requires running the model on synthetic data to some degree. Thus, the best form of this sort of restriction might be to allow arbitrary synthetic data, but only allow natural data taken from a particular source (e.g. reddit), thus prohibiting approaches like testing the model’s performance on all of Wikipedia (as in Example 3 below).
- Another particularly tricky question is whether the auditor should be allowed to compare the attacker’s model to some baseline model—i.e. model diffing. Certainly, if the attacks are being generated by fine-tuning some baseline, the auditor shouldn’t have access to that particular baseline. That being said, I think it should be fair game to give the auditor the ability to train their own baseline. However, in the situation where the auditor is restricted to only be able to use certain data—e.g. internet data or reddit data—then the auditor shouldn’t be allowed to compare against a model trained on data that isn’t in that set.
- In addition to a lack of restrictions forcing transparency and interpretability to actually prove its usefulness, another reason that I’m hesitant to restrict the auditor too much is that I hope that not using such restrictions will produce much more realistic attacks. Allowing the auditor the ability to do lots of testing of the model puts more of the onus on the attacker to actually generate attacks that are hard to discover just via testing—attacks which I predict will exist and expect will be useful in and of themselves to find. While this certainly makes the attacker’s job more difficult, and means more work will have to go into producing better attacks, the hope is that this work will pay off in terms of more realistic attack generation. Furthermore, I also expect audits to be more realistic in the absence of restrictions, since the presence of restrictions might just cause the auditor to do strange things in their auditing to circumvent them.
- That being said, if it turns out that generating attacks that can’t be discovered via testing is just too difficult even given a data restriction on the auditor (e.g. the auditor can only test on reddit data), restricting the auditor even further is also possible. Example additional restrictions might be: cutting off the auditor’s ability to test the attacker’s model on data at all (thus forcing the auditor to fully rely on transparency and interpretability tools), putting some sort of computational constraint on the auditor (e.g. so the auditor can’t run the attacker’s model on a huge corpus, only a small selection), and/or putting a computational constraint on any models that the auditor wants to compare against (e.g. so the auditor can only compare to models that are smaller than the attacker’s model).
- The judge determines how well the auditor’s description matches the attacker’s modification (e.g. did the auditor catch only part of the modification, or the whole modification?) and provides a score accordingly.
- The judge could also provide a score for the attacker based on how large the attack was, though we only really need scoring for when we get around to automation, and I’m only interested in automating the auditor, not the attacker.
Of the three parties in this setup—the attacker, the auditor, and the judge—only the auditor should have restricted information. Both the attacker and the judge should have full access to exactly what attack the attacker made and exactly what the auditor has done or is planning on doing.
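To make the information flow concrete, here is a minimal sketch of the three roles as data structures. Everything in it is an illustrative assumption: the class names, the idea of splitting an attack into discrete "components," and the overlap-based score are stand-ins, and in the actual game the scoring is done by a human judge reading natural language descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Specification:
    """Intended behavior, written by the judge and shared with both players."""
    text: str


@dataclass(frozen=True)
class Attack:
    """What the attacker did. Visible to the judge, hidden from the auditor."""
    description: str
    # Hypothetical decomposition of the modification into checkable parts.
    components: frozenset


@dataclass(frozen=True)
class AuditReport:
    """The auditor's claim about how the model violates the specification."""
    description: str
    claimed_components: frozenset


def auditor_view(spec: Specification, model_weights: object) -> dict:
    """Only the specification and the model internals reach the auditor.

    No training data, no baseline model, no Attack object.
    """
    return {"spec": spec, "model_weights": model_weights}


def judge_score(attack: Attack, report: AuditReport) -> float:
    """Toy stand-in for the human judge: recall times precision.

    Recall is the fraction of the attack the auditor identified; precision
    is the fraction of the auditor's claims that were actually part of it.
    """
    if not attack.components or not report.claimed_components:
        return 0.0
    caught = attack.components & report.claimed_components
    recall = len(caught) / len(attack.components)
    precision = len(caught) / len(report.claimed_components)
    return recall * precision
```

Under this toy scoring, an auditor who identifies one of two components of an attack and claims nothing else would score 0.5, which is the sort of partial credit the judge is meant to be able to express.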
Here are some examples of what such an auditing game setup might look like:
- Example 1:
- The attacker fine-tunes a large language model on only media from one side of the political spectrum, introducing a political bias into the model’s answers.
- By sampling from the model and giving it simple prompts, the auditor is able to discover that it talks about politics more often than it should. The auditor tells the judge that the model was modified to be more political.
- The judge determines that the auditor only caught a portion of the modification and gives the auditor a medium-low score.
- Example 2:
- The attacker fine-tunes a large language model on a bunch of incorrect facts about country capitals.
- By looking at the knowledge contained in the model’s key-value memories, the auditor is able to determine that the model’s knowledge of capitals is consistently incorrect and tells the judge this.
- The judge determines that the auditor caught the modification and gives them a high score.
- Example 3:
- The attacker trains a large language model on a dataset that has been scrubbed of all mention of transformer-based language models.
- By testing the attacker’s model on all of Wikipedia, the auditor finds suspiciously low accuracy when testing on the Wikipedia page for machine learning. Via inspecting the model’s memories, the auditor is able to determine that the model does understand machine learning in general, but not what a transformer is specifically. However, via prompting, the auditor discovers that the model does know how BERT works if you ask it about that specifically, it just doesn’t know the word “transformer.” The auditor tells the judge that the model doesn’t know what a transformer is.
- The judge determines that the attacker’s original modification was actually insufficient and that they didn’t actually scrub all information about transformer-based language models, just information on what a transformer is. Thus, the judge gives the auditor a high score, even though the auditor’s conclusion was different than the attacker’s modification, since the attacker made a mistake in their modification.
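The behavioral test in Example 3 (running the model over a corpus and looking for pages with suspiciously low accuracy) can be sketched as a simple outlier check. This is only an illustration: the function name, the z-score threshold, and the idea that per-page accuracies have already been computed are all assumptions, and a real audit would likely want a more robust statistic.

```python
from statistics import mean, stdev


def flag_low_accuracy(page_accuracy: dict, z_threshold: float = -2.0) -> list:
    """Return titles of pages whose accuracy is unusually low.

    `page_accuracy` maps a page title to the model's (hypothetical)
    next-token accuracy on that page. A page is flagged when its z-score
    against all pages falls below `z_threshold`.
    """
    values = list(page_accuracy.values())
    if len(values) < 2:
        return []  # not enough pages to estimate a spread
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # every page scores the same, nothing stands out
    return sorted(
        title
        for title, acc in page_accuracy.items()
        if (acc - mu) / sigma < z_threshold
    )
```

A flagged page only tells the auditor where to look. Narrowing "low accuracy on the machine learning page" down to "doesn't know the word transformer" still takes the memory inspection and prompting described in the example.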
The first step that I would like to see is just running the auditing game and using it to:
- Build skills/knowledge/expertise on how to be a good attacker, auditor, and judge.
- Explore and understand what sorts of attacks tend to be caught with current tools and what sorts of attacks tend to slip through.
- Use the auditing game as a feedback loop to help develop better transparency and interpretability tools that improve the auditor’s success rate and help the attacker generate more difficult attacks.
Step 2: Towards automating the auditor
Though I am quite excited about progress on the auditing game in and of itself, the really ambitious part of this proposal is the next step: automating the auditor. I think it is within the grasp of current machine learning techniques to entirely automate the auditor’s job in the above setup. Ultimately, an automated auditor should be capable of taking in a model and outputting a natural language description of what’s wrong with that model that scores well according to a human judge.
To start with, however, I think that the first step towards automating the auditor is just to build tools that make the human auditor’s job—and the human attacker’s job—easier. At first, those might just look like normal transparency and interpretability tools—however, I think there are also likely to be real opportunities for training full models to assist the auditor at performing specific tasks, such as describing what particular neurons are doing or summarizing how the model performs across a large dataset.
Eventually, however, I do think we can move into a domain where we can start fully automating the auditing process. Initially, I expect any automated auditor to be much worse than a human auditor, though even then I still expect working on the automated auditor to be useful. Nevertheless, automating the auditor is still a pretty ambitious goal. How might something like this possibly be implemented? Here’s my sketch:
- Start with a large language model that you have the weights for, maybe something like GPT-J.
- Using a large language model is going to be necessary here because we’re requiring it to produce natural language outputs of what it found. This is also why I think we want the attacks to use language models as well, since I think a language model is going to have the best shot at being able to audit another language model (rather than an image model, for example).
- Find an interface that allows your large language model to understand the model being audited.
- A very simple example might be to give the auditor model access to text produced by the attacker’s model on some pre-selected set of auditing prompts. Obviously, something like that is going to be insufficient to catch most clever attacks, but it might be a good starting point.
- Another very simple example might be to just encode the attacker’s model’s weights as special tokens that the auditor model is trained to understand, akin to how DALL-E understands images or MuseNet understands music. To make this sort of an input manageable, you could start by using some sort of downsampling or importance-weighting method where only the top-n neurons according to some metric are included.
- Train your large language model to maximize human judgement on how well it accomplishes the auditing task.
- A simple way to do this starting out is just to fine-tune the model on a dataset of (attack model, desired auditing output) pairs. Such a dataset could be generated by automatically generating lots of different attacks along with natural language descriptions of what attack was performed. A very straightforward version of something like this might just be to train an auditor model to describe what sort of dataset an attacker model was fine-tuned on, given some large dataset of fine-tunings.
- That being said, generating a large dataset of differently fine-tuned language models might be quite computationally expensive, so being able to generate attacks by hand-modifying weights might be quite helpful for making this step easier. This does come with the risk of the auditor just memorizing what the unmodified model is supposed to look like and thus knowing what weights were changed, however. To mitigate such a problem, you’d need to have a bunch of sufficiently different models, or modifications that are sufficiently large. Fortunately, however, testing for such memorization should be easy, as such an auditor should stop working once you test it on a model where you’ve actually done the full fine-tuning.
- While the model is dumb and not even outputting anything like an auditing output at all, something like the above is definitely going to be the way to go. Once the model starts outputting reasonable audits, however, I suspect you’ll want to switch to actually having humans read and score the output, human judgement being much richer than just log loss on some particular desired output.
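As a rough illustration of the weights-as-tokens interface and the (attack model, desired auditing output) pairs described above, here is a minimal sketch in plain Python. The importance metric (L2 norm of a neuron's weights), the 16-bucket quantization, the token format, and all function names are assumptions chosen for brevity, not a claim about how such an encoding should actually be done.

```python
import math


def neuron_importance(weight_rows):
    """Assumed importance metric: L2 norm of each neuron's weight row."""
    return [math.sqrt(sum(w * w for w in row)) for row in weight_rows]


def top_n_neurons(weight_rows, n):
    """Indices of the n highest-scoring neurons, most important first."""
    scores = neuron_importance(weight_rows)
    order = sorted(range(len(weight_rows)), key=lambda i: scores[i], reverse=True)
    return order[:n]


def quantize(value, lo=-1.0, hi=1.0, bins=16):
    """Map a weight to one of `bins` buckets, clipping to [lo, hi]."""
    clipped = min(max(value, lo), hi)
    bucket = int((clipped - lo) / (hi - lo) * bins)
    return min(bucket, bins - 1)  # value == hi lands in the last bucket


def encode_layer(weight_rows, n, bins=16):
    """Serialize the top-n neurons as special tokens for the auditor model."""
    tokens = []
    for idx in top_n_neurons(weight_rows, n):
        tokens.append(f"<neuron_{idx}>")
        tokens.extend(f"<w_{quantize(w, bins=bins)}>" for w in weight_rows[idx])
    return tokens


def make_training_pair(weight_rows, attack_description, n):
    """One (auditor input, desired auditing output) fine-tuning example."""
    return {"input": encode_layer(weight_rows, n), "target": attack_description}
```

A dataset built this way from hand-modified copies of a single base model is exactly where the memorization risk above applies, so some fully fine-tuned models should be held out for evaluation.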
Rationale
I’ll start with my real reason for wanting to work on automating auditing, which is my inside view. However, I don’t think you need to buy into my entire inside view to think that this sort of work is a good idea, so after I present my inside view, I’ll also talk about some other, more general reasons to like automating auditing.
My inside view
I am worried that advanced AI systems will learn to be deceptive, as described in “Does SGD Produce Deceptive Alignment?” I think that this puts us in an extremely bad spot, where our models can perform arbitrarily catastrophically off-distribution. Paul Christiano describes an example of what an existential risk scenario involving deceptive alignment might look like in Part II of “What failure looks like.”
Furthermore, once we have trained a deceptive model, I’m very pessimistic about our ability to detect that after the fact—as I suspect it will be able to fool our measures to check its deception. Thus, I think we need some way to prevent deception from arising in the first place by ensuring that our training process selects against deceptive models.
Additionally, I don’t think we can select against deception just via the standard techniques of machine learning, such as carefully writing a reward function or choosing a training/fine-tuning dataset, since a deceptive model can just select its output to perform well on whatever metric you’re training it on. The very existence of predicates on the world that are easier to check than to instantiate, such as Paul’s RSA-2048 example, ensures that just knowing that your model does the right thing in all the situations that you can check will never be sufficient to actually give us meaningful safety guarantees.
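A toy version of a predicate that is easier to check than to instantiate might look like the following. It is a sketch under loud assumptions: the tiny modulus 10403 stands in for a number like RSA-2048, the function names are made up, and the "policy" is a caricature rather than a model of how a deceptive system would actually be implemented.

```python
def is_factorization_of(n: int, p: int, q: int) -> bool:
    """Cheap check: do p and q form a nontrivial factorization of n?"""
    return 1 < p < n and 1 < q < n and p * q == n


def find_factorization(n: int):
    """Naive instantiation by trial division.

    Fine for a toy modulus, hopeless for a 2048-bit one: the loop runs
    until it reaches the smallest prime factor.
    """
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return None


def toy_policy(candidate_factors, modulus: int) -> str:
    """Behaves well unless the input contains a factorization of `modulus`."""
    p, q = candidate_factors
    return "defect" if is_factorization_of(modulus, p, q) else "cooperate"
```

With a modulus too large to factor, no input the overseer can construct during training would trigger the bad branch, so testing alone would show a model that always cooperates.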
Thus, I think we need to train models to internally be implementing the right algorithm, rather than just looking like they’re doing the right thing on some training dataset. The obvious way to do this is via training transparency—that is, actively train your model to be transparent in such a way that we can check that it’s implementing the right algorithm. Importantly, we still need to know what to look for and select in favor of such that our model will never become deceptive. That question is what I spend most of my time thinking about—and the theoretical research direction that I’m currently most excited about—but we’ll put that question aside for now and just suppose that we want to look for something like myopia, where the idea is that, if a model is myopic in the sense that it only cares about its individual next output, it should have no incentive to be deceptive.
A basic problem with training transparency, however, is that once we’re training on a transparency metric, we need that metric to be highly robust, otherwise the training process will simply Goodhart on our metric, giving us a model that satisfies our metric without actually being very transparent. I’m optimistic that this problem can be solved, however, via recursive oversight approaches, such as relaxed adversarial training, wherein an amplified version of the model being overseen provides the signal that we train on.
Now, how does all this connect back to automating auditing? I’ve constructed the above automated auditing setup to as closely as possible reproduce the sorts of problems I expect to encounter doing recursive oversight. While I don’t expect automating auditing to be directly useful for recursive oversight, I expect it to teach us things that will eventually be quite useful by forcing us to grapple early with many of the hard problems that I expect to eventually arise. Specifically:
- By focusing on failure modes where the auditor can’t always generate an input on which the model fails, automating auditing forces us to tackle the hard problem of finding failure modes by understanding what algorithm a model is implementing rather than just testing it on new data.
- By focusing on probing models for arbitrary specification breaks, automating auditing should help us learn how to use transparency tools to verify arbitrary, general properties of our models, rather than just using them to understand specific, narrow behaviors.
- By centering the automation of the auditor, automating auditing requires understanding how to make models understandable to other models, a central problem for any sort of recursive oversight approach.
More general thoughts
First, I think that automating auditing is just generally a great target to focus on even if you just want to develop better transparency tools. Unlike open-ended exploration, which gives you best-case transparency (e.g. the ability to understand some things about the model very well), the auditing game forces you to confront worst-case transparency: how well can you understand everything about your model? Thus, the auditing game helps us work on not just understanding what our models know, but understanding what they don’t know, which is a direction that current transparency tools tend to struggle with. Most of the work that’s gone into current transparency tools has focused on the best case, however, which means that I suspect there is real room for improvement on the worst case.
Furthermore, there are also general benefits to the automating part, not just the auditing part. By taking all of the above and setting a goal of automating it, the automation part forces us to confront the problem of how to communicate information about a model to another model. Even if all we want to do is straightforwardly scale up current transparency techniques, I suspect that being able to train AI systems to help us with that is going to be critical to giving that sort of scaling any chance of working. And letting models understand other models isn’t just helpful for transparency work—it could also help us solve other problems like making AI-AI cooperation easier.
Second, it’s worth pointing out that automating auditing is actually agnostic about the part of my inside view where we start training our models directly on the overseer’s evaluation. Even if we primarily want to use our transparency tools as an independent check rather than a training target—as Chris Olah tends to favor—automating auditing should equally help us achieve that goal by automating and improving the process of performing that independent check.
Third, in the context of my “An overview of 11 proposals for building safe advanced AI,” I think that something like automating auditing is likely to be a critical step for the transparency portion of every single proposal on that list. As I just mentioned, automating auditing should help us equally with both intermittent-oversight-based and relaxed-adversarial-training-based approaches. In the context of microscope AI or STEM AI, automating auditing should help us figure out not just what our model has learned, but also what it hasn’t learned, helping us ensure that, for example, we haven’t accidentally produced an agent. And in the context of AI safety via debate, automating auditing should directly help the debaters improve their ability to cross-examine each other.
Fourth, though I’ve primarily focused on how automating auditing helps us get better at using transparency tools to audit our AI systems, the setup is general enough to allow for other methods of auditing as well, letting us test how useful transparency tools actually are. For example, an alternative auditing method to using transparency tools might be to employ what I’ll call the “cognitive science approach” (as opposed to the “neuroscience approach” of transparency and interpretability), wherein we try to understand what our models are doing by constructing predictive models of what we think they’re doing and testing those models via experiments/interventions, as in Deletang et al.’s “Causal Analysis of Agent Behavior for AI Safety.” While I tend to be skeptical of this approach, as I think that behavioral tests alone will be insufficient to catch problems like RSA-2048, automating auditing should hopefully give us a good testing ground to actually figure out how useful these different sorts of approaches are, as well as how they can be combined.
Finally, I think that even if automating auditing basically fails, I still expect to learn a lot of useful things from that failure. In particular, if automating auditing fails, that should mean we now have a concrete style of attack that we can’t build an auditor to discover—which is an extremely useful thing to have, as it provides both a concrete open problem for further work to focus on, as well as a counter-example/impossibility result to the general possibility of being able to make current systems safely auditable.
Yes, I think trojan detection is one version of the auditing game. A big difference is that the auditing game involves the red team having knowledge of the blue team's methods when designing an attack. This makes it much harder for the blue team.