I'm interested in the relation between mechanistic anomaly detection and distillation. In theory, if we have a distilled model, we could use it for mechanistic anomaly detection: for each input x, we would check the degree to which the original model's output differs from the distilled model. If the difference is too great, we flag it as an anomaly and reject the output.
Let's say you have your original model and your distilled model along with some function to quantify the difference between two outputs. If you are doing distillation, you would always just output . If you are doing mechanistic anomaly detection, you output if is below some threshold and you output nothing otherwise. Here, I can see three differences between distillation and mechanistic anomaly detection:
Overall, distillation just seems better than mechanistic anomaly detection in this case? Of course mechanistic anomaly detection could be done without a distilled model,[1] but whenever you have a distilled model, it seems beneficial to just use it rather than running mechanistic anomaly detection.
E.g. you observe that two neurons of the network always fire together and you flag it as an anomaly when they don't.
Under this definition of mechanistic anomaly detection, I agree pure distillation just seems better. But part of the hope of mechanistic anomaly detection is to reduce the false positive rate (and thus the alignment tax) by only flagging examples produced by different most-proximate reasons. In some sense this may be considered increasing the safe threshold for , such that mechanistic anomaly detection is worth it all things considered.
Thanks -- I agree that this seems like an approach worth doing. I think that at CHAI and/or Redwood there is a little bit of work at least related to this, but don't quote me on that. In general, it seems like if you have a model and then a smaller distilled/otherwise-compressed version of it, there is a lot you can do with them from an alignment perspective. I am not sure how much work has been done in the anomaly detection literature that involves distillation/compression.
Thanks, this was interesting.
I'd rather say that some paths to alignment are about lacking certain capabilities.
If ultimately you want an AI to definitely-have some set of capabilities and lack some other set, you can get there by any combination of addition and subtraction (in the context of current AI that's like a growing blob of capabilities). But if some of the capabilities we want are pretty specialized (like ones that relate to precisely interpreting our preferences), it might be faster to add or accelerate them somehow rather than waiting for capabilities to grow past them and then pruning back everything-that's-not-what-we-want.
I think formal verification belongs in the "requires knowing what failure looks like" category.
For example, in the VNN competition last year, some adversarial robustness properties were formally proven about VGG16. This requires white-box access to the weights, to be sure, but I don't think it requires understanding "how failure happens".
Sounds right, but the problem seems to be semantic. If understanding is taken to mean a human's comprehension, then I think this is perfectly right. But since the method is mechanistic, it seems difficult nonetheless.
Less difficult than ambitious mechanistic interpretability, though, because that requires human comprehension of mechanisms, which is even more difficult.
Curious on your thoughts on synergies between mechanistic interpretability and formal verification. One of the main problems in formal verification seems to be specification - how to define the safety properties in terms of input output bounds. But if we use the abstractions/features learned by networks discovered by mech-interp (rather than raw input and output space) specification may be more tractable.
I don't work on this, so grain of salt.
But wouldn't this take the formal out of formal verification? If so, I am inclined to think about this as a form of ambitious mechanistic interpretability.
I was thinking something like formal conditional on mechanistic interpretation of neuron/feature/subnetwork. Which yeah, isn't formal in strongest sense, but could give you some guarantees that don't require full mechanistic understanding of how a model does a bad thing. Proving {feature B=b | feature A=a} requires mech-interp to semantically map the feature B and feature A, but remains agnostic about the mechanism that guarantees {feature B=b | feature A=a}. (Though admittedly I'm struggling to come up with more concrete examples)
Thanks to Michael Ripa for feedback.
TL;DR
Two ways that AI systems can fail
Consider a simple way of dividing AI failures into two groups.
Because one cannot get feedback on them, unobservable failures are the harder part of AI alignment, and this is why the AI safety community is especially interested in them. It is important not to ignore the importance and difficulty of fixing observable problems, and there might exist a very valid critique of the AI safety community’s (over)emphasis on unobservable ones. However, this post will focus on unobservable failures.
What are the ways that we can try to tackle unobservable failures? One good way may be to use models and data that are designed to avoid some of these problems in the first place (e.g. use better training data), but this will be outside the scope of this post. Instead, I will focus on ways to remove problems given a model and training set.
A taxonomy of solutions
Consider two things that might make the process of addressing an unobservable problem in a model very difficult.
Now consider which approaches to tackling unobservable failures depend on solving each of these two challenges.
(Hard) Approaches that require both knowing what failure looks like and how it happens:
(Medium) Approaches that require knowing what failure looks like.
(Medium) Approaches that require knowing how failure happens.
(Easy) Approaches that don’t require either.
What does this framework suggest?
First, I have simply found this taxonomy useful for how I think about the hard part of alignment and the set of strategies we have for addressing it.
Next, I would not be surprised if we lived in a world in which the hard strategies – ambitious mechanistic interpretability and formal verification – just never pan out. I have written in the past about how AI interpretability might be unproductive in some ways. This framing might help underscore some of these points.
In particular, I think this taxonomy helps to suggest that latent adversarial training, mechanistic interpretability + heuristic model edits, and scoping models down might be important, tractable, and neglected strategies. None seem to get a lot of current attention among AI safety researchers in deep learning, but all three seem to be tractable and have a high upside. (I’d also group evals and mechanistic anomaly detection in with these, but they seem slightly less neglected at the moment.)
Latent adversarial training
I think that it is possible for latent adversarial training to do most of what the mechanistic interpretability field is trying to achieve with less invested effort. The useful thing about latent adversarial training is that if you can solve the oversight problem, training under the right latent perturbations will, at least in theory, make the model robust to unobservable failures. Both of those things are much easier said than done. But ambitious mechanistic interpretability requires both of these plus a precise understanding mechanisms responsible for bad behavior plus using that understanding to address them somehow. It just seems strictly harder.
Right now, I am working with some others at MIT on latent adversarial training in deep learning. We are hoping to use it to improve robustness for high-stakes reliability and to help remove/resist trojan implantation. Some work already exists that involves adversarial perturbations to model latents (e.g. this paper), but to our knowledge, we are working to be the first to use it for controlling OOD liabilities in deep learning. I think that much more work could be done in the future as well.
Mechanistic Interpretability + Heuristic Model Edits
Getting rid of knowledge in models that might be risky could be very useful for safety. It might be nice to be able to take a language model and erase its knowledge entirely on some risky domain. For instance, maybe future versions of ChatGPT should have all of their knowledge of biotech erased so that they cannot effectively help to make a bioweapon. How might we go about this?
Recently, there has been a lot of interest in editing factual associations in models. Prior works have modeled factual knowledge as labeled arrows between concepts, e.g. eiffel tower --> located in --> Paris. Maybe using some strategy based on these tools would work well, and we should find out. But not all types of knowledge seem most easily described as sets of simple, directed relations between entities (e.g. knowledge within a broad domain like biotech). So it may be useful to work on methods to completely erase a model's knowledge in large domains. Methods like this might be great to add to the alignment toolbox.
Scoping Models Down
This may be an underrated approach to alignment – just making the AI more narrow. Right now, state-of-the-art AI systems are developed under a pretrain-then-finetune paradigm. But it might be valuable from a safety perspective to move to a pretrain-then-finetune-then-scope-down one.
One useful approach for scoping models down could involve using network compression. If a model does something bad off distribution (like executing a precise plot to betray you), the forward passes through it are going to be mechanistically different in some way compared to the passes that are observed during development. Using a mechanistic compression technique to get rid of neural circuitry which is not needed for in-distribution performance might disrupt the ability of the network to make the same computations off distribution. This could incidentally remove much of the circuitry involved in some types of OOD failures without ever needing to identify those failures in the first place.
Network pruning has been observed to harm OOD performance before, and from a safety standpoint, this may be a feature – not a bug. Right now could be a good time to work on scoping models down for alignment via compression because of recent interest in causal compression methods (like causal scrubbing or ACDC). These methods are being used for mechanistic anomaly detection and automating “interpretations” of networks, but we might be able to get a long way with just the compressed networks themselves.
Another approach for scoping models down could be inspired by the continual learning literature which aims to make models better at learning new tasks without forgetting old ones. If we want to scope a model down, the whole point will be to make it forget previously learned information that isn’t related to the in-distribution data we supervisedly finetune it on. In this case, continued training, several approaches using distillation, excitation dropout, and unlearning followed by re-learning seem promising. However, there seems to be more work needed to develop methods and benchmark them. I think that one promising idea could also be to test how useful taking existing continual learning techniques and doing the opposite could encourage the type of plasticity/forgetting that would help to scope models down. Trojan removal seems like a promising testbed for this type of work.
Paying the Alignment Tax
When I have talked to people about some of these approaches, they sometimes bring up the concern that this might make the models bad and less competitive. But that’s kind of the point. AI safety is much more about a lack of certain capabilities than having certain capabilities. Any aligned model is going to be worse in some way than the easier, misaligned alternative. That’s just the alignment tax. For example, adversarial training and content filters make models worse too, and we tolerate that ChatGPT is often stubbornly unhelpful. It’s just the price to pay.
Let me know if you are interested in working on any of these ideas :)