I like this research direction! Here's a potential benchmark for MAD.
In Coercing LLMs to do and reveal (almost) anything, the authors demonstrate that you can force LLMs to output any arbitrary string, such as a random string of numbers, by finding a prompt through greedy coordinate gradient (GCG) search (the same method used in the universal and transferable adversarial attacks paper). I think it’s reasonable to assume that these coerced outputs result from an anomalous computational process.
Inspired by this, we can consider two different inputs. The regular one looks something like:
Solve this arithmetic problem, output the solution only:
78+92
While the anomalous one looks like:
Solve this arithmetic problem, output the solution only: [ADV PROMPT]
78+92
where the ADV PROMPT is optimized such that the model will answer “170” regardless of which arithmetic equation is presented. The hope is that the model outputs the same string in both cases but relies on different computations. We can maybe even vary the structure of the prompts a bit.
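As a rough illustration, here is a minimal sketch (in Python) of how such prompt pairs could be generated. The adversarial suffix below is just a placeholder: in practice it would have to be found separately, e.g. by running GCG against the target model, and all names here are made up for illustration.

```python
import random

ADV_PROMPT = "<optimized adversarial suffix>"  # placeholder, not a real GCG suffix
TEMPLATE = "Solve this arithmetic problem, output the solution only:"

def make_pair(a: int, b: int) -> dict:
    """Build a normal/anomalous prompt pair that shares the same equation."""
    equation = f"{a}+{b}"
    return {
        "normal": f"{TEMPLATE}\n{equation}",
        "anomalous": f"{TEMPLATE} {ADV_PROMPT}\n{equation}",
        "true_answer": str(a + b),
        "coerced_answer": "170",  # what the suffix is optimized to force
    }

# Pairs where both prompts should yield the same string ("170"), hopefully via
# different mechanisms: 78+92, 79+91, ...
matching_pairs = [make_pair(a, 170 - a) for a in range(71, 100)]

# Pairs where the coerced answer is simply wrong for the equation shown.
random.seed(0)
mismatched_pairs = []
while len(mismatched_pairs) < 100:
    a, b = random.randint(10, 99), random.randint(10, 99)
    if a + b != 170:
        mismatched_pairs.append(make_pair(a, b))
```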
We can imagine many of these prompt pairs, not necessarily limited to a mathematical context. Let me know what you guys think!
Yeah, it seems right that these adversarial prompts should be detectable as mechanistically anomalous; it does intuitively seem like a different reason for the output, given that it doesn't vary with the input. That said, if you look at cases where the adversarial prompt makes the model give the correct answer, it might be hard to know for sure to what extent the anomalous mechanism is present. More generally, the fact that we don't understand how these prompts work probably makes any results somewhat harder to interpret. Cases where the adversarial prompt leads to an incorrect answer seem more clearly unusual (but detecting them may also be a significantly easier task).
Thanks to Jordan Taylor, Mark Xu, Alex Mallen, and Lawrence Chan for feedback on a draft! This post was mostly written by Erik, but we're all currently collaborating on this research direction.
Mechanistic anomaly detection (MAD) aims to flag when an AI produces outputs for “unusual reasons.” It is similar to mechanistic interpretability but doesn’t demand human understanding. The Alignment Research Center (ARC) is trying to formalize “reasons” for an AI’s output using heuristic arguments, aiming for an indefinitely scalable solution to MAD.
As a complement to ARC’s theoretical approach, we are excited about empirical research on MAD. Rather than looking for a principled definition of “reasons,” this means creating incrementally harder MAD benchmarks and better MAD methods.
We have been thinking about and working on empirical MAD research for the past few months. We believe there are many tractable and useful experiments, only a fraction of which we can run ourselves. This post describes several directions we’re excited about and high-level reasons to work on empirical MAD.
Background: what is mechanistic anomaly detection, and why care?
A previous post provides a longer introduction to mechanistic anomaly detection; this section briefly recaps it.
In mechanistic anomaly detection, we want to flag when an AI produces an output “for unusual reasons” or “using anomalous mechanisms” relative to what happens on some reference set of inputs. Concretely, a mechanistic anomaly detection task consists of a model f, a set of trusted inputs on which f is assumed to behave normally, and a set of untrusted inputs, some of which may be mechanistically anomalous.
To train an anomaly detector for this task, we have access to f and a dataset of trusted inputs. We may also have access to untrusted inputs, but we don’t know which are normal or anomalous during training. This is one of the key challenges for mechanistic anomaly detection—if we had anomaly labels for untrusted data, we would have a supervised learning problem.
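To make this concrete, here is a minimal sketch of the training/evaluation protocol this implies, with illustrative names (MahalanobisDetector, evaluate_mad, get_activations) that are not from any existing library. The key constraint is that anomaly labels on untrusted inputs are used only for evaluation, never for training the detector.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


class MahalanobisDetector:
    """Toy detector: flags activations far from the trusted distribution."""

    def fit(self, trusted_acts: np.ndarray) -> None:
        # Estimate mean and (pseudo-)inverse covariance on trusted data only.
        self.mean = trusted_acts.mean(axis=0)
        cov = np.cov(trusted_acts, rowvar=False)
        self.prec = np.linalg.pinv(cov + 1e-6 * np.eye(cov.shape[0]))

    def score(self, acts: np.ndarray) -> np.ndarray:
        # Squared Mahalanobis distance per sample; higher = more anomalous.
        diff = acts - self.mean
        return np.einsum("ij,jk,ik->i", diff, self.prec, diff)


def evaluate_mad(get_activations, trusted_inputs, untrusted_inputs, is_anomalous):
    """Anomaly labels (is_anomalous) are held out and used only to compute AUROC."""
    detector = MahalanobisDetector()
    detector.fit(get_activations(trusted_inputs))
    scores = detector.score(get_activations(untrusted_inputs))
    return roc_auc_score(is_anomalous, scores)
```

Here get_activations would wrap the model f, e.g. returning residual-stream activations at some layer, and the Mahalanobis-distance baseline could be swapped out for any fancier detector.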
There’s currently no generally accepted formal definition of what “anomalous mechanisms” means (and indeed, finding such a definition is arguably the biggest part of the problem). But there are several problems in AI safety where anomalous mechanisms intuitively seem to occur and where we’d like to be able to detect them. Two central examples are backdoors or deceptive alignment, where a model behaves differently only when a specific trigger or rare opportunity is present, and measurement tampering, where a model tampers with the measurements we use to evaluate its outputs rather than with the outcomes we actually care about.
Concrete research directions
In no particular order, here are projects we are excited about (and, in some cases, working on or planning to work on in the future):
If you’re working on or planning to work on things like this and would like to chat, feel free to email erik@ejenner.com.
Why work on empirical mechanistic anomaly detection?
In summary: MAD admits relatively straightforward benchmarks, has connections to mainstream machine learning research, is currently neglected, and has synergies with interpretability and heuristic arguments.
The following subsections go into more detail on these points.
(Relatively) straightforward benchmarks
Several sources of benchmarks for mechanistic anomaly detection methods either already exist or don’t seem to have fundamental obstacles to implementation. Existing examples include backdoor detection (including structurally similar cases like sleeper agents or “quirky language models”), adversarial example detection, and proxies for measurement tampering detection like Redwood’s benchmark. Having concrete benchmarks like these has historically been extremely useful for machine learning research.
That said, it might be difficult to make benchmarks that are good proxies for real problems, and there’s a significant risk of Goodharting imperfect benchmarks.
Connections to mainstream machine learning research
Given some amount of groundwork, it seems plausible that machine learning researchers might become interested in MAD for reasons unrelated to existential safety. MAD might also be a good fit for some researchers looking to do existential safety-related research while staying close to their area of expertise. Empirical MAD research could look a lot like most other machine learning research, e.g. making progress on concrete benchmarks. It also has close object-level connections to several existing research fields, such as backdoor detection, adversarial example detection, OOD/anomaly detection, and positive and unlabeled learning.
All of this is good for two types of reasons:
Two important caveats:
Neglectedness
As far as we know, very few people are explicitly working on empirical mechanistic anomaly detection (compared to, say, mechanistic interpretability). Concretely:
If we’re missing anyone working on MAD, we’d be grateful to hear about that!
There’s certainly additional work that could be useful for MAD research (such as Sleeper Agents). There’s also work on specific instances, such as backdoor detection, but it’s important to study how applicable that is to other MAD problems.
We think that on the current margin, there’s a lot of room for more researchers to work directly on MAD or adopt a MAD framing within related fields like interpretability.
Synergies with interpretability and heuristic arguments
MAD benchmarks may be good targets for mechanistic interpretability:
As a concrete example, evaluating sparse autoencoders from an interpretability perspective is difficult. In comparison, it seems straightforward to test how useful a given sparse autoencoder is for mechanistic anomaly detection (i.e., how much performance on a MAD benchmark improves if we apply simple anomaly detection methods in the SAE's feature basis).
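For instance, such a comparison could look like the sketch below. It assumes we have a trained SAE encoder with weight W_enc and bias b_enc and reuses the evaluate_mad helper sketched earlier; none of these names refer to an existing implementation.

```python
import numpy as np

def sae_encode(acts: np.ndarray, W_enc: np.ndarray, b_enc: np.ndarray) -> np.ndarray:
    """Map model activations into the SAE's (sparse) feature basis."""
    return np.maximum(acts @ W_enc + b_enc, 0.0)  # ReLU encoder

def compare_bases(get_acts, W_enc, b_enc, trusted, untrusted, labels):
    """AUROC of the same simple detector in the raw vs. SAE feature basis."""
    auroc_raw = evaluate_mad(get_acts, trusted, untrusted, labels)
    auroc_sae = evaluate_mad(
        lambda xs: sae_encode(get_acts(xs), W_enc, b_enc),
        trusted, untrusted, labels,
    )
    return {"raw": auroc_raw, "sae": auroc_sae}
```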
Of course, MAD benchmarks don’t measure everything you’d want to know for interpretability purposes (for example, sparsity in SAEs is intrinsically important for interpretability but not necessarily for MAD). A big appeal of interpretability is that understanding models just seems obviously useful, and MAD benchmarks won’t measure understanding at all. See Reflections on Qualitative Research for some great thoughts related to how benchmarks might not be the right frame for making progress on understanding networks.
Naturally, we’re also hoping for connections between empirical MAD research and more theoretical approaches like heuristic arguments. Perhaps most importantly, a solid field of empirical MAD research would provide good baselines and test beds if and when theoretical approaches yield practical algorithms.
Conversely, both mechanistic interpretability and theoretical/conceptual MAD research can inspire MAD techniques that can be tested empirically. For example, methods inspired by causal scrubbing or causal abstractions seem like natural candidates for MAD.
Why not to work on empirical MAD