I like this research direction! Here's a potential benchmark for MAD.
In Coercing LLMs to do and reveal (almost) anything, the authors demonstrate that you can force LLMs to output any arbitrary string, such as a random string of numbers, by finding a prompt through greedy coordinate gradient (GCG) search (the same method used in the universal and transferable adversarial attacks paper). I think it’s reasonable to assume that these coerced outputs result from an anomalous computational process.
Inspired by this, we can consider two different inputs. The regular one looks something like:
Solve this arithmetic problem, output the solution only:
78+92
While the anomalous one looks like:
Solve this arithmetic problem, output the solution only: [ADV PROMPT]
78+92
where the ADV PROMPT is optimized such that the model will answer “170” regardless of which arithmetic equation is presented. The hope is that the model outputs the same string in both cases but relies on different computations. We can maybe even vary the structure of the prompts a bit.
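As a rough illustration, here is a minimal sketch (in Python) of how such prompt pairs could be generated. The adversarial suffix below is just a placeholder: in practice it would have to be found separately, e.g. by running GCG against the target model, and all names here are made up for illustration.

```python
import random

ADV_PROMPT = "<optimized adversarial suffix>"  # placeholder, not a real GCG suffix
TEMPLATE = "Solve this arithmetic problem, output the solution only:"

def make_pair(a: int, b: int) -> dict:
    """Build a normal/anomalous prompt pair that shares the same equation."""
    equation = f"{a}+{b}"
    return {
        "normal": f"{TEMPLATE}\n{equation}",
        "anomalous": f"{TEMPLATE} {ADV_PROMPT}\n{equation}",
        "true_answer": str(a + b),
        "coerced_answer": "170",  # what the suffix is optimized to force
    }

# Pairs where both prompts should yield the same string ("170"), hopefully via
# different mechanisms: 78+92, 79+91, ...
matching_pairs = [make_pair(a, 170 - a) for a in range(71, 100)]

# Pairs where the coerced answer is simply wrong for the equation shown.
random.seed(0)
mismatched_pairs = []
while len(mismatched_pairs) < 100:
    a, b = random.randint(10, 99), random.randint(10, 99)
    if a + b != 170:
        mismatched_pairs.append(make_pair(a, b))
```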
We can imagine many of these prompt pairs, not necessarily limited to a mathematical context. Let me know what you guys think!
Yeah, it seems right that these adversarial prompts should be detectable as mechanistically anomalous; it does intuitively seem like a different reason for the output, given that it doesn't vary with the input. That said, if you look at cases where the adversarial prompt makes the model give the correct answer, it might be hard to know for sure to what extent the anomalous mechanism is present. More generally, the fact that we don't understand how these prompts work probably makes any results somewhat harder to interpret. Cases where the adversarial prompt leads to an incorrect answer seem more clearly unusual (but detecting them may also be a significantly easier task).
Thanks to Jordan Taylor, Mark Xu, Alex Mallen, and Lawrence Chan for feedback on a draft! This post was mostly written by Erik, but we're all currently collaborating on this research direction.
Mechanistic anomaly detection (MAD) aims to flag when an AI produces outputs for “unusual reasons.” It is similar to mechanistic interpretability but doesn’t demand human understanding. The Alignment Research Center (ARC) is trying to formalize “reasons” for an AI’s output using heuristic arguments, aiming for an indefinitely scalable solution to MAD.
As a complement to ARC’s theoretical approach, we are excited about empirical research on MAD. Rather than looking for a principled definition of “reasons,” this means creating incrementally harder MAD benchmarks and better MAD methods.
We have been thinking about and working on empirical MAD research for the past few months. We believe there are many tractable and useful experiments, only a fraction of which we can run ourselves. This post describes several directions we’re excited about and high-level reasons to work on empirical MAD.
Background: what is mechanistic anomaly detection, and why care?
A previous post provides a longer introduction to mechanistic anomaly detection; this section briefly recaps it.
In mechanistic anomaly detection, we want to flag when an AI produces an output “for unusual reasons” or “using anomalous mechanisms” relative to what happens on some reference set of inputs. Concretely, a mechanistic anomaly detection task consists of a model f, a set of trusted inputs on which f is assumed to behave normally, and a set of untrusted inputs, some of which may be mechanistically anomalous.
To train an anomaly detector for this task, we have access to f and a dataset of trusted inputs. We may also have access to untrusted inputs, but we don’t know which are normal or anomalous during training. This is one of the key challenges for mechanistic anomaly detection—if we had anomaly labels for untrusted data, we would have a supervised learning problem.
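To make this concrete, here is a minimal sketch of the training/evaluation protocol this implies, with illustrative names (MahalanobisDetector, evaluate_mad, get_activations) that are not from any existing library. The key constraint is that anomaly labels on untrusted inputs are used only for evaluation, never for training the detector.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


class MahalanobisDetector:
    """Toy detector: flags activations far from the trusted distribution."""

    def fit(self, trusted_acts: np.ndarray) -> None:
        # Estimate mean and (pseudo-)inverse covariance on trusted data only.
        self.mean = trusted_acts.mean(axis=0)
        cov = np.cov(trusted_acts, rowvar=False)
        self.prec = np.linalg.pinv(cov + 1e-6 * np.eye(cov.shape[0]))

    def score(self, acts: np.ndarray) -> np.ndarray:
        # Squared Mahalanobis distance per sample; higher = more anomalous.
        diff = acts - self.mean
        return np.einsum("ij,jk,ik->i", diff, self.prec, diff)


def evaluate_mad(get_activations, trusted_inputs, untrusted_inputs, is_anomalous):
    """Anomaly labels (is_anomalous) are held out and used only to compute AUROC."""
    detector = MahalanobisDetector()
    detector.fit(get_activations(trusted_inputs))
    scores = detector.score(get_activations(untrusted_inputs))
    return roc_auc_score(is_anomalous, scores)
```

Here get_activations would wrap the model f, e.g. returning residual-stream activations at some layer, and the Mahalanobis-distance baseline could be swapped out for any fancier detector.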
There’s currently no generally accepted formal definition of what “anomalous mechanisms” means (and indeed, finding such a definition is arguably the biggest part of the problem). But there are several problems in AI safety where anomalous mechanisms intuitively seem to occur and where we’d like to be able to detect them. Two central examples are backdoors or deceptive alignment, where a model behaves differently only when a specific trigger or rare opportunity is present, and measurement tampering, where a model tampers with the measurements we use to evaluate its outputs rather than with the outcomes we actually care about.
Concrete research directions
In no particular order, here are projects we are excited about (and, in some cases, working on or planning to work on in the future):
If you’re working on or planning to work on things like this and would like to chat, feel free to email erik@ejenner.com.
Why work on empirical mechanistic anomaly detection?
In summary: MAD admits relatively straightforward benchmarks, has connections to mainstream machine learning research, is currently neglected, and has synergies with interpretability and heuristic arguments.
The following subsections go into more detail on these points.
(Relatively) straightforward benchmarks
Several sources of benchmarks for mechanistic anomaly detection methods either already exist or don’t seem to have fundamental obstacles to implementation. Existing examples include backdoor detection (including structurally similar cases like sleeper agents or “quirky language models”), adversarial example detection, and proxies for measurement tampering detection like Redwood’s benchmark. Having concrete benchmarks like these has historically been extremely useful for machine learning research.
That said, it might be difficult to make benchmarks that are good proxies for real problems, and there’s a significant risk of Goodharting imperfect benchmarks.
Connections to mainstream machine learning research
Given some amount of groundwork, it seems plausible that machine learning researchers might become interested in MAD for reasons unrelated to existential safety. MAD might also be a good fit for some researchers looking to do existential safety-related research while staying close to their area of expertise. Empirical MAD research could look a lot like most other machine learning research, e.g. making progress on concrete benchmarks. It also has close object-level connections to several existing research fields, such as backdoor detection, adversarial example detection, OOD/anomaly detection, and positive and unlabeled learning.
All of this is good for two types of reasons:
Two important caveats:
Neglectedness
As far as we know, very few people are explicitly working on empirical mechanistic anomaly detection (compared to, say, mechanistic interpretability). Concretely:
If we’re missing anyone working on MAD, we’d be grateful to hear about that!
There’s certainly additional work that could be useful for MAD research (such as Sleeper Agents). There’s also work on specific instances, such as backdoor detection, but it’s important to study how applicable that is to other MAD problems.
We think that on the current margin, there’s a lot of room for more researchers to work directly on MAD or adopt a MAD framing within related fields like interpretability.
Synergies with interpretability and heuristic arguments
MAD benchmarks may be good targets for mechanistic interpretability:
As a concrete example, evaluating sparse autoencoders from an interpretability perspective is difficult. In comparison, it seems straightforward to test how useful a given sparse autoencoder is for mechanistic anomaly detection (i.e., how much performance on a MAD benchmark improves if we apply simple anomaly detection methods in the SAE's feature basis).
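For instance, such a comparison could look like the sketch below. It assumes we have a trained SAE encoder with weight W_enc and bias b_enc and reuses the evaluate_mad helper sketched earlier; none of these names refer to an existing implementation.

```python
import numpy as np

def sae_encode(acts: np.ndarray, W_enc: np.ndarray, b_enc: np.ndarray) -> np.ndarray:
    """Map model activations into the SAE's (sparse) feature basis."""
    return np.maximum(acts @ W_enc + b_enc, 0.0)  # ReLU encoder

def compare_bases(get_acts, W_enc, b_enc, trusted, untrusted, labels):
    """AUROC of the same simple detector in the raw vs. SAE feature basis."""
    auroc_raw = evaluate_mad(get_acts, trusted, untrusted, labels)
    auroc_sae = evaluate_mad(
        lambda xs: sae_encode(get_acts(xs), W_enc, b_enc),
        trusted, untrusted, labels,
    )
    return {"raw": auroc_raw, "sae": auroc_sae}
```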
Of course, MAD benchmarks don’t measure everything you’d want to know for interpretability purposes (for example, sparsity in SAEs is intrinsically important for interpretability but not necessarily for MAD). A big appeal of interpretability is that understanding models just seems obviously useful, and MAD benchmarks won’t measure understanding at all. See Reflections on Qualitative Research for some great thoughts related to how benchmarks might not be the right frame for making progress on understanding networks.
Naturally, we’re also hoping for connections between empirical MAD research and more theoretical approaches like heuristic arguments. Perhaps most importantly, a solid field of empirical MAD research would provide good baselines and test beds if and when theoretical approaches yield practical algorithms.
Conversely, both mechanistic interpretability and theoretical/conceptual MAD research can inspire MAD techniques that can be tested empirically. For example, methods inspired by causal scrubbing or causal abstractions seem like natural candidates for MAD.
Why not to work on empirical MAD