Introduction
This is week 3 of Quintin's Alignment Papers Roundup. This week, I'm focusing on papers that use interpretability to guide a neural network's training process. A lot of alignment proposals seem to involve a step like this.
Explanatory interactive learning (XIL) is an active research area that uses various methods to generate some form of explanation for a model's decisions (often a heatmap of the most important portions of the input). XIL then optimizes the model either to make its explanations match human explanations, or to conform to generic priors about what good explanations should look like (e.g., that they should be sparse).
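As a concrete (and heavily simplified) picture of what "optimizing the model to match an explanation" can look like in practice, here's a minimal PyTorch-style sketch of a loss that penalizes input-gradient saliency on regions a human has marked as irrelevant, roughly in the spirit of the "right for the right reasons" line of work. The function and tensor names are hypothetical, not taken from any particular paper.

```python
import torch
import torch.nn.functional as F

def xil_loss(model, x, y, irrelevant_mask, lam=10.0):
    """Task loss plus a penalty on input-gradient saliency inside regions
    a human marked as irrelevant (irrelevant_mask == 1)."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    task_loss = F.cross_entropy(logits, y)

    # Input gradients of the summed log-probabilities: a common, cheap saliency proxy.
    log_probs = F.log_softmax(logits, dim=-1)
    input_grads = torch.autograd.grad(log_probs.sum(), x, create_graph=True)[0]

    # Penalize saliency mass that falls on features the human says shouldn't matter.
    explanation_penalty = (irrelevant_mask * input_grads).pow(2).mean()
    return task_loss + lam * explanation_penalty
```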
Alignment proposals that use interpretability to steer model training usually imagine they have access to mechanistic interpretability methods that track a model's internal computations. Current XIL methods rely on easier, ad-hoc explanation methods, such as input saliency maps. Such methods can scale to supervise a full training process, unlike current mechanistic interpretability.
I expect many alignment researchers to think that input saliency methods are insufficient to properly supervise an AI's training process. Even if that's true, I think studying current XIL methods is valuable, simply because we can actually do empirical experiments with them. Even if there are no directly transferable insights we can gain from current XIL methods (unlikely, IMO), we can still learn about the "logistics" of doing general XIL research, such as the best ways to quantify how our interventions changed the trained models, what sort of protocols help humans to scalably use oversight tools, etc.
Additionally, current ad-hoc explanation methods are (very) imperfect, even for the more limited form of explanations they aim to provide. I expect that any future mechanistic interpretability methods that do scale to steering training processes will also be imperfect. Current XIL methods offer an empirical testbed to learn to wield imperfect and exploitable interpretability methods to shape a model's learning process.
Papers
Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations
My opinion:
This is the first paper I know of that uses human saliency annotations to improve model training. I'm disappointed that they use LIME to validate the faithfulness of their saliency method, rather than approaches I think are more robust, such as the deletion-based measure used here.
It's also interesting that their approach for finding diverse models is so similar to the approach independently discovered here and here of minimizing the similarities between the input gradients of multiple models.
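For concreteness, here's a minimal sketch of that shared idea (names are hypothetical, and this is not any of those papers' exact formulations): penalize the similarity between two models' input gradients so that gradient descent is pushed toward models that rely on different input features.

```python
import torch
import torch.nn.functional as F

def input_gradient(model, x, y):
    """Gradient of the correct-class logit with respect to the input."""
    x = x.clone().requires_grad_(True)
    correct_logit = model(x).gather(1, y.unsqueeze(1)).sum()
    return torch.autograd.grad(correct_logit, x, create_graph=True)[0]

def diversity_penalty(model_a, model_b, x, y):
    """Cosine similarity between the two models' input gradients; adding this
    to the training loss pushes the models to attend to different features."""
    grad_a = input_gradient(model_a, x, y).flatten(1)
    grad_b = input_gradient(model_b, x, y).flatten(1)
    return F.cosine_similarity(grad_a, grad_b, dim=1).abs().mean()
```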
A negative case analysis of visual grounding methods for VQA
My opinion:
This is the obligatory "empirically discovered improvements to neural net training processes may not work for the reason you initially assumed" paper.
My guess is that it's beneficial to encourage neural nets to have sparse dependencies on their inputs, even without priors on which specific dependencies are best.

Edit: my current best guess as to why random saliency labels work is that they regularize the gradient norm of the model's behavior with respect to its inputs, leading to smoother model behavior.
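If that guess is right, the effective regularizer is roughly a penalty on the squared norm of the loss's input gradient. A minimal sketch, assuming a PyTorch classifier (names hypothetical):

```python
import torch
import torch.nn.functional as F

def gradient_norm_penalty(model, x, y):
    """Squared norm of d(loss)/d(input); adding this to the training loss
    encourages the model's behavior to vary more smoothly with its inputs."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    input_grads = torch.autograd.grad(loss, x, create_graph=True)[0]
    return input_grads.flatten(1).pow(2).sum(dim=1).mean()
```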
VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives
My opinion:
This paper shows that human saliency annotations can help trained models more than random saliency annotations, provided the saliency method actually reflects the model's decision-making process. That underlines the importance of having faithful saliency methods, something current methods are pretty hit-or-miss at.
This paper is also interesting in that the authors actively optimize the model to make its decisions better conform to its saliency maps, a practice I've found to be common among the best-performing XIL methods.
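As a rough sketch of what "conforming to its saliency maps" can look like in practice (this is not the paper's exact objective; the names and details are my own assumptions), one option is a sufficiency-style consistency term: predictions computed from only the most salient features should match predictions on the full input.

```python
import torch
import torch.nn.functional as F

def sufficiency_loss(model, x, saliency, keep_frac=0.2):
    """KL between predictions on the full input and on its top-salience features only.
    `saliency` is any per-feature importance map with the same shape as `x`
    (e.g., absolute input gradients); `keep_frac` is the fraction of features kept."""
    flat = saliency.flatten(1)
    k = max(1, int(keep_frac * flat.shape[1]))
    threshold = flat.topk(k, dim=1).values[:, -1].view(-1, *([1] * (x.dim() - 1)))
    keep_mask = (saliency >= threshold).float()

    with torch.no_grad():
        full_log_probs = F.log_softmax(model(x), dim=-1)
    masked_log_probs = F.log_softmax(model(x * keep_mask), dim=-1)
    return F.kl_div(masked_log_probs, full_log_probs, log_target=True, reduction="batchmean")
```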
Look where you look! Saliency-guided Q-networks for visual RL tasks
My opinion:
This paper reports shockingly large gains in generalization and robustness to out of distribution perturbations.
It doesn't rely on human labels to identify important features. Instead, it trains the value function estimator to depend only on high-salience pixels and trains the network to predict its own saliency scores. These two regularizers apparently lead to much sparser saliency maps that match human priors about what's important in the task, along with vastly improved generalization and robustness.
The improvements reported by this paper are so strong that I suspect some sort of confounder is at play. If not, this paper probably represents a significant advance in the state of the art for robust RL.
Both regularizers seem like they should increase the coupling between the saliency maps and the model's behavior. This leads to sparser, more human-like saliency maps, despite the method using no explicit human supervision of the saliency. The authors take this to mean the resulting models are more interpretable. Hopefully, the models actually depend on their saliency maps, such that we can supervise the training process by intervening on the saliency maps.
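Here's how I picture the two regularizers, as a rough sketch rather than the authors' implementation (the names and details below are my own assumptions): a consistency term between Q-values on the full observation and on the observation masked down to its high-salience pixels, plus an auxiliary head trained to predict the (binarized) saliency map.

```python
import torch
import torch.nn.functional as F

def saliency_map(q_net, obs, actions):
    """Absolute input gradient of the chosen action's Q-value (one cheap saliency proxy)."""
    obs = obs.clone().requires_grad_(True)
    chosen_q = q_net(obs).gather(1, actions.unsqueeze(1)).sum()
    return torch.autograd.grad(chosen_q, obs)[0].abs()

def sgqn_style_regularizers(q_net, saliency_head, obs, actions, keep_frac=0.1):
    sal = saliency_map(q_net, obs, actions)
    flat = sal.flatten(1)
    k = max(1, int(keep_frac * flat.shape[1]))
    thresh = flat.topk(k, dim=1).values[:, -1].view(-1, *([1] * (obs.dim() - 1)))
    keep_mask = (sal >= thresh).float()

    # 1) Q-values computed from only the high-salience pixels should match
    #    Q-values computed from the full observation.
    consistency = F.mse_loss(q_net(obs * keep_mask), q_net(obs).detach())

    # 2) An auxiliary head is trained to reproduce the (binarized) saliency map,
    #    coupling the network's features to its own explanations.
    self_prediction = F.binary_cross_entropy_with_logits(saliency_head(obs), keep_mask)

    return consistency, self_prediction
```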
Saliency Guided Adversarial Training for Learning Generalizable Features with Applications to Medical Imaging Classification System
My opinion:
This paper mixes adversarial training and XIL on medical image classification, a difficult domain where confounders are common. It seems like a good testbed for alignment approaches hoping to combine these methods.
The method masks out low-saliency features of the input images, then adversarially perturbs the partially masked image, and finally minimizes the KL divergence between the model's predictions on the adversarial partially masked images and on the clean images. I thus count this paper as another example of training a model to match its saliency method.
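My rough reading of that recipe, as a hedged sketch rather than the authors' code (a single FGSM-style step stands in for their adversarial optimization; names are hypothetical):

```python
import torch
import torch.nn.functional as F

def saliency_guided_adv_kl(model, x, y, saliency, keep_frac=0.5, eps=0.03):
    """KL between predictions on an adversarially perturbed, low-saliency-masked
    image and predictions on the clean image."""
    # Keep the top `keep_frac` most salient pixels; zero out the rest.
    flat = saliency.flatten(1)
    k = max(1, int(keep_frac * flat.shape[1]))
    thresh = flat.topk(k, dim=1).values[:, -1].view(-1, *([1] * (x.dim() - 1)))
    x_masked = (x * (saliency >= thresh).float()).detach().requires_grad_(True)

    # One-step adversarial perturbation of the masked image.
    adv_grad = torch.autograd.grad(F.cross_entropy(model(x_masked), y), x_masked)[0]
    x_adv = (x_masked + eps * adv_grad.sign()).detach()

    # Pull predictions on the perturbed masked image toward predictions on the clean image.
    with torch.no_grad():
        clean_log_probs = F.log_softmax(model(x), dim=-1)
    adv_log_probs = F.log_softmax(model(x_adv), dim=-1)
    return F.kl_div(adv_log_probs, clean_log_probs, log_target=True, reduction="batchmean")
```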
ER-TEST: Evaluating Explanation Regularization Methods for NLP Models
My opinion:
This is a paper whose evaluation criteria seem more impressive than its actual results, which look kind of hit-or-miss to me. Still, I'm always glad to see papers that drill down into what exactly their method changes about the model's capabilities and how the model generalizes to out-of-distribution test data.
The paper does provide reasonably strong evidence that the primary gains from XIL appear on out-of-distribution tests, not in-distribution ones.
ferret: a Framework for Benchmarking Explainers on Transformers
My opinion:
(see below)
"Will You Find These Shortcuts?" A Protocol for Evaluating the Faithfulness of Input Salience Methods for Text Classification
My opinion:
I group these two papers together because unfaithful saliency maps seem like the main bottleneck in current XIL practice, and both appear to offer good tools for evaluating the faithfulness of saliency methods.
Beyond Explaining: Opportunities and Challenges of XAI-Based Model Improvement
My opinion:
A very recent and fairly extensive review of different XIL methods, for those interested in a broader perspective on the field. XIL is a surprisingly large field given how rarely I hear it mentioned in alignment circles. This roundup focused on methods that supervise a model's input saliency maps (an approach the review calls "augmenting the loss"), but there are other XIL approaches as well.
Conclusion
My main update after looking through the XIL literature is that it's probably a good idea to actively optimize your model to better conform to your saliency method. My original thinking had been that you'd want a saliency method that was correct regardless of the model's current parameters. But most XIL papers do not regularize the model to match the saliency method, and those papers usually show relatively small performance improvements on realistic tasks.
In retrospect, it makes sense that you can optimize the model to be better explained by your interpretability method. I expect it's possible to optimize a model to make it harder to interpret, and it's unlikely that random initialization + vanilla training makes a model maximally interpretable.
Good saliency heatmaps should satisfy certain properties. E.g., disrupting low-salience portions of the input should have less of an effect on the model's behavior compared to disrupting high-salience portions of the input. Optimizing the model to ensure these properties are satisfied doesn't ensure your saliency method is always correct, but it can rule out many definitely incorrect behaviors from your saliency method.
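One simple way to make that property concrete is a deletion-style sanity check: deleting the top-k most salient features should hurt the predicted class's probability much more than deleting the bottom-k. A minimal sketch (names hypothetical):

```python
import torch

@torch.no_grad()
def deletion_gap(model, x, saliency, k_frac=0.1):
    """Drop in predicted-class probability from deleting the top-k vs. bottom-k
    salient features. A faithful saliency map should yield a much larger drop
    for the top-k deletion, so larger gaps are better."""
    probs = model(x).softmax(dim=-1)
    pred = probs.argmax(dim=-1, keepdim=True)
    base_prob = probs.gather(1, pred)

    flat = saliency.flatten(1)
    k = max(1, int(k_frac * flat.shape[1]))
    top_idx = flat.topk(k, dim=1).indices
    bottom_idx = flat.topk(k, dim=1, largest=False).indices

    def prob_after_deleting(indices):
        x_deleted = x.flatten(1).clone()
        x_deleted.scatter_(1, indices, 0.0)
        return model(x_deleted.view_as(x)).softmax(dim=-1).gather(1, pred)

    top_drop = base_prob - prob_after_deleting(top_idx)
    bottom_drop = base_prob - prob_after_deleting(bottom_idx)
    return (top_drop - bottom_drop).mean()
```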
Of course, we should also look for saliency methods that are more robust, even without having to adapt the model to the method. It seems like the primary bottleneck in this respect is evaluating the actual robustness of a saliency method.
One idea I had for such an evaluation is to use the candidate saliency method for "negative XIL". Suppose we have a classification dataset and annotations that cover all possible features that the model could use to perform the classifications, such that there's no way to get good performance without paying attention to the labeled features. Then, we use XIL to penalize the model for attending to the labeled features, scaling the penalty term as needed until the saliency method assigns very low importance to the labeled features.
This training process directly incentivizes SGD to smuggle information past the saliency method, because that's the only way to get higher classification accuracy. We can thus rank saliency methods' robustness by how much each one decreases the model's performance: the evaluation essentially measures how strongly a saliency method can constrain SGD away from lower-loss solutions.
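A minimal sketch of what one such "negative XIL" training step might look like, assuming a differentiable saliency method `saliency_fn` and a mask `relevant_mask` covering the labeled task-relevant features (all names here are hypothetical):

```python
import torch.nn.functional as F

def negative_xil_step(model, optimizer, x, y, relevant_mask, saliency_fn, lam):
    """One training step that rewards accuracy while penalizing saliency on the
    labeled task-relevant features, so SGD is pushed to smuggle information
    past the saliency method under evaluation."""
    optimizer.zero_grad()
    task_loss = F.cross_entropy(model(x), y)

    # `saliency_fn(model, x, y)` returns a differentiable importance map shaped like `x`.
    leak_penalty = (saliency_fn(model, x, y).abs() * relevant_mask).mean()

    (task_loss + lam * leak_penalty).backward()
    optimizer.step()
    return task_loss.item(), leak_penalty.item()
```

Ranking candidate saliency methods by the held-out accuracy the model retains, at the smallest penalty weight that drives saliency on the labeled features near zero, would then give the robustness comparison described above.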
Anyways, I hope readers find these papers useful for their own research. Please feel free to discuss the listed papers in the comments or recommend additional papers to me.
Future roundups
For next week's roundup, I'm thinking the focus will be on SGD inductive biases.
My other candidate focuses are:
Let me know if there are any topics you're particularly interested in.