Planned summary for the Alignment Newsletter:
This is basically 3 months worth of Alignment Newsletters focused solely on interpretability wrapped up into a single post. The authors provide summaries of 70 (!) papers on the topic, and include links to another 90. I’ll focus on their opinions about the field in this summary.
The theory and conceptual clarity of the field of interpretability has improved dramatically since its inception. There are several new or clearer concepts, such as simulatability, plausibility, (aligned) faithfulness, and (warranted) trust. This seems to have had a decent amount of influence over the more typical “methods” papers.
There have been lots of proposals for how to evaluate interpretability methods, leading to the [problem of too many standards](https://xkcd.com/927/). The authors speculate that this is because both “methods” and “evaluation” papers don’t have sufficient clarity on what research questions they are trying to answer. Even after choosing an evaluation methodology, it is often unclear which other techniques you should be comparing your new method to.
For specific methods for achieving interpretability, at a high level, there has been clear progress. There are cases where we can:
1. identify concepts that certain neurons represent,
2. find feature subsets that account for most of a model's output,
3. find changes to data points that yield requested model predictions,
4. find training data that influences individual test time predictions,
5. generate natural language explanations that are somewhat informative of model reasoning, and
6. create somewhat competitive models that are inherently more interpretable.
There does seem to be a problem of disconnected research and reinventing the wheel. In particular, work at CV conferences, work at NLP conferences, and work at NeurIPS / ICML / ICLR form three clusters that for the most part do not cite each other.
Planned opinion:
This post is great. Especially to the extent that you like summaries of papers (and according to the survey I recently ran, you probably do like summaries), I would recommend reading through this post. You could also read through the highlights from each section, bringing it down to 13 summaries instead of 70.
Hi Rohin! Thanks for this summary of our post. I think one other sub-field that has seen a lot of progress is in creating somewhat competitive models that are inherently more interpretable (i.e. a lot of the augmented/approximate decision tree models), as well as some of the decision set stuff. Otherwise, I think it's a fair assessment, will also link this comment to Peter so he can chime in with any suggested clarifications of our opinions, if any. Cheers, Owen
Sounds good, I've added a sixth bullet point. Fyi, I originally took that list of 5 bullet points verbatim from your post, so you might want to update that list in the post as well.
This is extremely cool -- thank you, Peter and Owen! I haven't read most of it yet, let alone the papers, but I have high hopes that this will be a useful resource for me.
I agree. I've put it in my SuperMemo and very much look forward to going through it. Thanks Peter & Owen!
I'm curious what "put it in my SuperMemo" means. Quick googling only yielded SuperMemo as a language learning tool.
It's a spaced repetition system that focuses on incremental reading. It's like Anki, but instead of hosting flashcards separately from your reading, you extract text while reading documents and PDFs. You later refine extracts into ever-smaller chunks of knowledge, at which point you create the "flashcard" (usually 'clozes', demonstrated below).
Incremental reading is nice because you can come back to information over time as you learn more, instead of having to understand enough to make an Anki card right away.
In the context of this post, I'm reading some of the papers, making extracts, making flashcards from the extracts, and retaining at least one or two key points from each paper. Way better than retaining 1-2 points from all 70 summaries!
I've been wanting to try SuperMemo for a while, especially given the difficulty that you mention with making Anki cards. But it doesn't run natively on linux AFAIK, and I can't be bothered for the moment to make it work using wine.
Thanks a lot for this--I'm doing a lit. review for an interpretability project and this is definitely coming in handy :)
Random note: the paper "Are Visual Explanations Useful? A Case Study in Model-in-the-Loop Prediction" is listed twice in the master list of summarized papers.
I finished this book (by 'finish', I mean I read Chapters 4 through 7, and read them three times).
Here's a suggestion and what I think:
I have not read the book, perhaps Peter has.
A quick look at the table of contents suggests that it's focused more on model-agnostic methods. I think you'd get a different overview of the field compared to the papers we've summarized here, as an fyi.
I think one large area you'd miss out on from reading the book is the recent work on making neural nets more interpretable, or designing more interpretable neural net architectures (e.g. NBDT).
Peter Hase
UNC Chapel Hill
Owen Shen
UC San Diego
With thanks to Robert Kirk and Mohit Bansal for helpful feedback on this post.
Introduction
Model interpretability was a bullet point in Concrete Problems in AI Safety (2016). Since then, interpretability has come to comprise entire research directions in technical safety agendas (2020); model transparency appears throughout An overview of 11 proposals for building safe advanced AI (2020); and explainable AI has a Twitter hashtag, #XAI. (For more on how interpretability is relevant to AI safety, see here or here.) Interpretability is now a very popular area of research. The interpretability area was the most popular in terms of video views at ACL last year. Model interpretability is now so mainstream there are books on the topic and corporate services promising it.
So what's the state of research on this topic? What does progress in interpretability look like, and are we making progress?
What is this post? This post summarizes 70 recent papers on model transparency, interpretability, and explainability, limited to a non-random subset of papers from the past 3 years or so. We also give opinions on several active areas of research, and collate another 90 papers that are not summarized.
How to read this post. If you want to see high-level opinions on several areas of interpretability research, just read the opinion section, which is organized according to our very ad-hoc set of topic areas. If you want to learn more about what work looks like in a particular area, you can read the summaries of papers in that area. For a quick glance at each area, we highlight one standout paper per area, so you can just check out that summary. If you want to see more work that has come out in an area, look at the non-summarized papers at the end of the post (organized with the same areas as the summarized papers).
We assume readers are familiar with basic aspects of interpretability research, i.e. the kinds of concepts in The Mythos of Model Interpretability and Towards A Rigorous Science of Interpretable Machine Learning. We recommend looking at either of these papers if you want a primer on interpretability. We also assume that readers are familiar with older, foundational works like "Why Should I Trust You?: Explaining the Predictions of Any Classifier."
Disclaimer: This post is written by a team of two people, and hence its breadth is limited and its content biased by our interests and backgrounds. A few of the summarized papers are our own. Please let us know if you think we've missed anything important that could improve the post.
Master List of Summarized Papers
Our Opinions by Area
Theory and Opinion
There has been good progress in the theory underpinning explainability. Insights from psychology, linguistics, and philosophy have helped authors understand slippery terms like "explanation" in the context of AI. This research has yielded new or clearer concepts to work with, e.g. simulatability, plausibility, (aligned) faithfulness, and (warranted) trust.
We have seen strong arguments for developing explainable AI with special focus on the social nature of explanations, involving the use of mental models, models of the self, and theory of mind.
Several papers have called for work on interpretability to be more strictly scientific, i.e. by asserting falsifiable hypotheses and measuring properties of explanation systems rather than proceeding solely with non-systematic qualitative analysis. These calls are made in response to trends essentially in the opposite direction.
Theoretical work in the area has heavily influenced some subsequent work in evaluation and methodology, though a substantial fraction of papers tend to re-introduce desiderata or key considerations relevant to their methodology rather than directly drawing on prior theoretical work. Some methods papers do explain their goals with the latest terminology but use the terms in only the loosest of senses. There is plenty of ongoing debate about the terms and concepts currently in use in the field and the ultimate purposes of interpretation and explanation methods.
Evaluation
There are now many, many ways to evaluate explanations, including procedures for evaluating explanations of arbitrary format. The approaches include both automatic procedures as well as human study designs, and the bulk of the work has focused on feature importance estimates. We are excited by many of the approaches, particularly those assessing whether explanations improve human-AI team performance at a task that is hard for either humans or AI alone.
One trouble here is that there are so many evaluation procedures, it can be hard for methods papers to choose which to use. It at least seems like each evaluation procedure is equally likely to be used in any given methods paper (though there is a noticeable preference for automatic methods over human studies). We imagine this trend arises partly from the following situation: (1) there is not a common understanding of which explanation procedures answer which research questions; (2) methodologies are introduced without sufficiently precise research questions in mind. (Alternatively, papers can truly need their own new evaluation schemes, because they are answering new questions.)
Here's an example of the above situation. There is a lot of confusion over what the actual object of our explanation is within the subarea of feature importance estimation. Several research questions present themselves: should feature importance estimates explain the role of features in (1) the behavior of a particular trained model with fixed weights, or (2) the behavior of trained models obtained by a stochastic training procedure involving a model family and dataset, or (3) the solvability of a task, either in theory or with respect to a given training procedure, model family, and dataset? Each research question stems from a fundamentally different goal, but papers rarely distinguish between them. Do we want to learn about a given model, a family of models, or the true nature of a task? There is not yet a clear and commonly accepted set of evaluation procedures suited for each of these questions which papers on feature importance estimation can readily use. The result is that the literature is not nearly as cumulative as it could be. For any given research question, it is hard to find one-to-one comparisons between more than a couple papers which would help you tell which methods are well suited to answering the question.
A similar situation arises with evaluations comparing generated explanations to human explanations. Many papers compare model explanations to human "gold label" explanations for data points. This is an evaluation for plausibility but not faithfulness, and this practice may reward deceptive explanation methods (since this is rating explanations by how convincing they are, but not by how truthful they are). But you could also be comparing your explanations to human explanations to figure out if your model is reasoning in a similar way to how humans reason. In many cases, this is a goal for our models, so that seems good. (Of course, this assumes your explanations are actually faithful to the model's reasoning.) We see a lot of papers that use human explanations as their "gold label" to compare against while not even saying whether they aim to optimize for explanation plausibility or alignment between model and human reasoning.
Lastly, qualitative analysis remains very popular throughout methods papers. It would likely be a marginal improvement to the field if some standards for qualitative analysis were more widely adopted, or someone wrote something good about what those standards should be. We do not mind "expert evaluation (by the author)" of the kind where the authors carry out some systematic qualitative coding regarding their method performance, but this quickly looks less like standard qualitative analysis and more like a measurable outcome.
Methods (General Commentary)
I'll give just three points here.
First, at a high level, there has been clear progress in methodology. There are cases where we can (1) identify concepts that certain neurons represent, (2) find feature subsets that account for most of a model's output, (3) find changes to data points that yield requested model predictions, (4) find training data that influences individual test time predictions, and (5) generate natural language explanations that are somewhat informative of model reasoning.
Second, it seems like every sub-field of ML has its own stream of research, and they often don't cite one another. There's a cluster of work in ICML/ICLR/NeurIPS, and a cluster in NLP conferences, and a clustering in CV conferences, and they often don't cite extremely similar methods or evaluation procedures from other subareas. Of course these days finding all the relevant papers to your work is a daunting problem in its own right, but the literature seems much less connected and cumulative than it should be.
Third, it is difficult to assess which other methods a paper should compare to when presenting its own new method. What if you say that a goal of your feature importance estimation method is "model debugging"? Though there are dozens of feature importance estimation methods you could compare to, you could also compare with counterfactual generation methods. Arguably every future paper with "model debugging" as a goal of their method should also compare to CheckList, a simple but extremely effective unit-testing approach, regardless of the category their method falls into. Yet this would require expensive (expert) user studies. It is much easier to use automatic feature importance evaluations and forget about other approaches to model debugging. Right now there is a serious trade-off between ease of evaluation and breadth of comparison for explanation methods, which is hampering comparison with relevant alternative approaches.
Methods: Estimating Feature Importance
Methods: Interpreting Representations and Weights
Methods: Generating Counterfactuals and Recourse Procedures
Methods: Explanation by Examples, Exemplars, and Prototypes
Methods: Finding Influential Training Data
Methods: Natural Language Explanations
Methods: Developing More Easily Interpreted Models
This is an interesting and potentially useful area of research. So far, it seems like training models on vast amounts of visual and linguistic data leads to the models learning many crisp human-interpretable concepts (uncovered upon inspection). Is this a guaranteed property of training models on natural or human-generated data? How many of the concepts these models represent are not easily interpreted, for whatever reason, and can this situation be avoided?
This area also includes strong skeptics of explaining deep learning models, which can seem both fair when considering all the shortcomings of the explainability literature and unfair when considering all the clear progress we've seen the past few years.
Overall, we think this is a promising area, but we are also not convinced work from this area will ultimately influence how the highest-performing ML systems will be developed.
Methods: Robust and Adversarial Explanations
There are a few distinct things happening in this area. A few exceptional contributions involve work on (1) robustness to distribution shifts, i.e. explaining models in a way that explanations are still faithful even under distribution shifts, (2) deceptive models, where a specially trained model can deceive an explanation procedure into thinking it is not heavily biased with respect to certain features, when in fact it is, and (3) searching for data points that yield logically inconsistent model explanations under some explanation procedure, which is a pretty unsatisfactory state to end up in. These are all clearly important solutions/problems.
I've seen a few papers suggest that explanations for model predictions should be robust to small input perturbations in principle. This seems like a mistake, likely one of valuing plausibility over faithfulness. If model behavior is influenced by small perturbations, then explanations should describe this phenomenon, not smooth it over.
This area feels especially important going forward, and we would particularly like to see more work on how models might adversarially or accidentally hide information from being revealed by explanation procedures.
Explaining RL Agents
To date, most of the work here has been applying approaches from the Feature Importance Estimation and Counterfactual Generation literature, though there is also an interesting line of work focusing on causal models of agent behavior. Some interesting early results have emerged regarding the kinds of explanations that help users build better mental models of agents, but so far this area is so new that it remains to be seen what the most promising approaches are. Many of the concerns in the above Evaluation section also apply here.
I have come across surprisingly few papers in this area relative to its importance. There appear to be important questions unique to explaining agents (rather than classifiers). For instance, explaining agents' behaviors will require special consideration of actions' temporal dependence, agent "plans", and epistemic vs. instrumental rationality. And the whole exercise will be complicated by multi-agent scenarios. This area really merits a lot more work, and for people interested in long term AI safety and existential risks, this is plausibly the most important subarea of interpretability research.
Interpretability in Practice
Paper Summaries
Theory and Opinion (5)
2018
This paper is a (long) survey of research on explanations coming from philosophy and the social sciences, and it communicates important results from these fields and comments on connections with XAI.
To begin with the terminology: drawing directly from past works, the author defines an explanation as "an answer to a why-question" and interpretability as "the degree to which an observer can understand the cause of a decision." They "equate 'interpretability' and 'explainability.'"
The paper communicates four key research findings:
(1) Explanations are contrastive and invoke counterfactual cases (even if implicitly). Most authors in the area argue that all why-questions ask for contrastive explanations, even if the "foil" to the fact of the situation is not explicitly stated. Questions such as "why did she close the door?" have an infinite number of possible implicit foils, and someone offering an explanation will implicitly rely on one as they explain the event (e.g., why did she close the door rather than leave it open?) One natural choice of foil is the more expected event, relative to the observed event.
(2) Explanations are given by agents after they are selected from a set of possible explanations, according to certain preferences. These preferences include (at least) coherence, simplicity, generality, soundness, and completeness — which are often in tension with one another. Following conversational norms, explanations are often restrained to be relevant to a particular question and only give what is necessary to answering the question. The abnormality of cited causal factors is an important aspect too, even more important than our confidence in the causal links. For example, when explaining the famous Challenger space shuttle explosion, people would not reference the oxygen in the air as an explanation, even though it is more certainly a necessary condition for the explosion than the suspected cause of some faulty seals. Similarly, we typically do not seek explanations at all for unsurprising events.
(3) Probabilities are not as important as causes. An example: if a student wants to know why they got a 50% on a test, they will not be satisfied if told that most people in the class got around a 50%. It would be better to explain why most students got around a 50%, but it would be best to explain why that particular student got a 50% (i.e. what caused the grade). In general, giving statistical statements could be helpful, but it is critical to give the causal factors underlying the statistical generalizations.
(4) Explanations are social: the act of explaining is a transfer of information about an event's causal history between two people. Further, taking a conversational model of explanations helps us extend our understanding of conversation to the act of explanation, e.g. in applying Grice's maxims of communication and analyzing the role of linguistic markers which indicate perspective and manage listener expectations. Core to this view is the concept of theory of mind, since we now view explanations as communicative acts between an explainer and explainee designed to fill in gaps in the explainee's mental model. Implicature (leaving things as implied) is additionally of relevance. The conversational framework can be extended to what is more fundamentally an argumentative framework, where the explainer is arguing for the explainee to draw certain conclusions. The authors suggest this type of interactive format will be valuable in XAI regardless of the medium of communication.
The paper ultimately concludes that researchers in AI should better heed the work on explanations from adjacent fields, particularly philosophy, psychology, and cognitive science. If we leave it to almost exclusively computer scientists to design explainable AI, we risk the result of "the inmates running the asylum" — i.e., a situation where "the very experts who understand decision-making models the best are not in the right position to judge the usefulness of explanations to lay users."
The paper addresses a number of other interesting topics, including the possible levels of explanation (using Aristotle's four causes), the process by which people select or infer foils (e.g. by a notion of similarity in causal histories), and the relevance of folk psychology to AI (including the nature of social attribution of intentionality to AI systems), among others.
Artificial Intelligence
2020
This paper argues that the faithfulness condition for model interpretability is underdefined, reformalizes the notion of faithfulness, and proposes an explanation framework that better satisfies their reformalized faithfulness. On the first front, they clarify that model explanations are best understood as faithful if they attribute the correct causal chain to the model's decision process (described as causal attribution). A commonly desired feature of explanations, they observe, is not represented in this notion of faithfulness: that the causal chain is aligned with what is expected by people (described as social attribution). These concepts are applied in a case study of select-predict methods for text classification, which are composed of a selector module that extracts subsets of a text to pass to a text classifier for a final prediction. They find that existing select-predict methods actually produce selections (which are masks over the text) that can themselves be used to predict the label, relying on the masks alone and not the selected tokens. They describe this phenomenon as a special case of the general "Trojan explanation," where the explanation encodes information that influences model behavior in a way not naturally anticipated by people. The selection masks are faithful explanations, in the sense that we understand the role they play in the model, but when masks are predictive of the label, the select-predict method is not aligned. Specifically, it can be unaligned in two ways: (1) people expect the selections to be summaries that keep relevant information for the task while filtering out irrelevant information, or (2) people expect selections to be evidence in support of a prior decision, without having been a part of the decision. The select-predict methods violate both expectations in the same way, namely by influencing the decision of the prediction module by selecting tokens that favor a certain class.
In response, the authors propose a select-predict-verify approach. They consider a special setting where a model makes a prediction using a full text, and people provide text highlights after the fact that they think should point the model to a counterfactual decision (distinct from the observed prediction). Their approach is to find the minimal selection that is a superset of the human's suggestion and leads the model to predict the human's expected class. This approach better satisfies the faithfulness and alignment conditions, while providing for specific downstream use cases: users can (1) check that when relying on the "correct" evidence, the model would reach the expected conclusion, (2) find what evidence would be needed to correct a model prediction, (3) explore whether people would interpret evidence similarly to the model, when they are uncertain about the true label and want to rely on the model for advice. Lastly, our interpretation of their discussion section is that the authors believe aligned faithfulness is distinct from simulatability because simulatability can be high if people learn how explanations relate to model decisions, even when the relationship does not match initial human expectations, which is a case where explanations are not aligned.
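To make the select-predict-verify setup concrete, here is a rough greedy sketch in Python (the search strategy, the `predict_proba` interface, and the 0.5 threshold are illustrative assumptions of ours, not details from the paper):

```python
def minimal_supporting_selection(predict_proba, tokens, human_highlight, target_class):
    """Greedy illustration of select-predict-verify: grow a selection outward from
    the human's highlighted tokens until the model predicts the target class.
    (A greedy search does not guarantee a truly minimal selection.)"""
    selection = set(human_highlight)                 # indices the human marked as evidence
    remaining = [i for i in range(len(tokens)) if i not in selection]

    def target_prob(sel):
        visible = [tokens[i] for i in sorted(sel)]   # model only sees the selected tokens
        return predict_proba(visible)[target_class]

    while target_prob(selection) < 0.5 and remaining:
        # Add whichever token most increases the target-class probability.
        best = max(remaining, key=lambda i: target_prob(selection | {i}))
        selection.add(best)
        remaining.remove(best)

    return sorted(selection) if target_prob(selection) >= 0.5 else None
```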
arxiv
Evaluation (9)
2019
From Alignment Newsletter #101:
This paper presents an automatic benchmark for feature importance methods (otherwise known as saliency maps) called RemOve And Retrain (ROAR). The benchmark follows the following procedure:
1. Train an image classifier on a dataset (they use ResNet-50s on ImageNet, and get about 77% accuracy)
2. Measure the test-set accuracy at convergence
3. Using the feature importance method, find the most important features in the dataset, and remove them (by greying out the pixels)
4. Train another model on this new dataset, and measure the new test-set accuracy

The difference between the accuracy in (4) and in (2) is the measure of how effective the feature importance method is at finding important features.
The idea behind retraining is that giving the original classifier images where many pixels have been greyed out will obviously result in lower accuracy, as they're out of the training distribution. Retraining solves this problem.
They benchmark a variety of feature importance methods (Gradient heatmap, Guided backprop, Integrated gradients, Classic SmoothGrad, SmoothGrad^2, VarGrad) on their benchmark, and compare to a random baseline, and a Sobel Edge detector (a hard-coded algorithm for finding edges in images). Only SmoothGrad^2 and VarGrad (which are both methods which ensemble other feature importance methods) do better than random. They can't explain why these methods perform better than other methods. They also note that even when removing 90% of the pixels in every image (i.e. the random baseline), the accuracy only drops from 77% to 63%, which shows how correlated pixels in images are.
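For readers who want the procedure spelled out, here is a minimal sketch of the ROAR loop, assuming hypothetical `train_model`, `evaluate`, and `importance_fn` helpers and one importance score per image element (e.g. greyscale images); none of these names come from the paper:

```python
import numpy as np

def roar_accuracy_drop(train_x, train_y, test_x, test_y,
                       importance_fn, train_model, evaluate, fraction=0.5):
    """ROAR sketch: grey out the top `fraction` of pixels ranked by the importance
    method, retrain from scratch, and report the resulting accuracy drop."""
    base_model = train_model(train_x, train_y)          # step 1
    base_acc = evaluate(base_model, test_x, test_y)     # step 2

    def degrade(images):
        degraded = images.copy()
        for img in degraded:
            scores = importance_fn(base_model, img)     # one score per image element assumed
            k = int(fraction * scores.size)
            top = np.argsort(scores, axis=None)[-k:]    # indices of the most important pixels
            img.reshape(-1)[top] = img.mean()           # grey them out
        return degraded

    retrained = train_model(degrade(train_x), train_y)  # steps 3 and 4
    new_acc = evaluate(retrained, degrade(test_x), test_y)
    return base_acc - new_acc                           # larger drop => more informative method
```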
NeurIPS
2020
From Alignment Newsletter #100:
In this paper the authors perform user tests on 5 different model-agnostic interpretability methods: LIME, Anchor, Decision Boundary, Prototype Model, and a Composite model (LIME, Anchor, and Decision Boundary). The use cases they test are a tabular dataset predicting income, and a movie-review dataset predicting sentiment of the review from a single sentence.
Their experimental setup consists of 2 tests: forward prediction and counterfactual prediction. In forward prediction, the user is shown 16 examples of inputs and corresponding outputs and explanations, and then must predict the model’s output on new inputs (without the explanation, which often gives away the answer). In counterfactual prediction, after seeing 16 examples, the user is given an input-output-explanation triple, and then must predict how the output changes for a specific perturbation of the input.
Throughout the results they use a significance threshold of p < 0.05 (they don't use Bonferroni corrections). Their study has responses from 32 different students who'd taken at least 1 computer science course, with some screened out for outliers or low accuracy during training. There are approximately 200 individual predictions for each method/dataset-type combination, and each method/prediction-type combination.
Overall, their results show that only LIME (Local Interpretable Model-agnostic Explanation) helps improve performance with statistical significance on the tabular dataset across both prediction settings, and only the Prototype model in counterfactual prediction across both datasets. No other result was statistically significant. The improvement in accuracy for the statistically significant results is around 10% (from 70% to 80% in the Tabular dataset with LIME, and 63% to 73% for Prototype in counterfactual prediction).
They also showed that users' ratings of the explanation method didn't correlate in a statistically significant way with the improvement the method gave to their predictions.
ACL
2020
This paper proposes a few quantitative metrics for explanation methods, which they hope will be used for method development and selection before final testing via human studies. They give separate metrics for feature extraction, feature attribution, and example-based methods. The metrics themselves are not particularly novel: the feature extraction metrics focus on mutual information between the extracted features and the input or predicted class. With attribution methods, which assign scores to tokens, they test how the scores relate to model outputs under a variety of input ablation procedures (omitting certain features from the input). The metrics for example-based explanations are similar to some of those in the DiCE paper.
There are a few interesting points in the paper. They evaluate several explanation methods using a known non-linear function, so that we know the true effect of features on the function outputs, and find that the Integrated Gradients method makes some mistakes which simpler gradient-based saliency methods do not (for a single function and data point). Commenting on the Remove-and-Retrain (ROAR) procedure of Hooker et al., they suggest that this procedure might better be viewed as data interpretation rather than model interpretation, since the ROAR scores for an explanation method will be highly dependent on whether a task/dataset is still solvable after certain features are removed. And they give an interesting desideratum for feature attribution methods: that predictions should be more precise when a given feature is known, relative to not knowing that feature, and that precision should correlate with the magnitude of the attribution score. They take care to distinguish this desideratum from other metrics which measure how knowing a feature moves predicted probabilities up or down, rather than the precision or confidence in predicted probabilities.
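As one concrete illustration of an ablation-style check in this spirit (a generic example of ours, not one of the paper's exact metrics):

```python
import numpy as np

def ablation_rank_agreement(model, x, attributions, baseline=0.0):
    """Generic ablation check: do features with larger attribution scores cause
    larger output changes when individually replaced by a baseline value?"""
    f_x = model(x)                                   # scalar output for input x
    deltas = []
    for i in range(len(x)):
        x_ablated = x.copy()
        x_ablated[i] = baseline                      # "omit" feature i
        deltas.append(abs(f_x - model(x_ablated)))
    ranks = lambda a: np.argsort(np.argsort(a))      # Spearman correlation via ranks
    return np.corrcoef(ranks(np.abs(attributions)), ranks(deltas))[0, 1]
```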
arxiv
2021
In a large (n=3800) pre-registered study with high reputation MTurkers, the authors explore how two factors, number of features in a model and model transparency, relate to three outcomes: simulatability, deviation, and error detection (to be explained below). To illustrate each aspect of the experiments, consider their experimental protocol:
Users are given eight features of an apartment for sale in New York City, and are asked to guess what a model will predict its selling price to be. The model is a linear model that uses either 2 or 8 of the 8 available features. Users are assigned into conditions using either the 2 or 8 feature model, and then further divided into a transparent condition where they see the model weights during the whole experiment or a blackbox condition where they never see the model weights. For each of the resulting four conditions, users are first asked to guess the model's prediction, and then they are shown the model's prediction and are asked to guess the true selling price of the apartment.
Comparing their guesses of the model's prediction and of the true price allows the authors to derive their three outcomes. Here, simulatability is measured as the user's accuracy at predicting the model output. Deviation is measured as the extent to which the user deviates from the model prediction in making their own guess at the apartment's true selling price. Lastly, error detection is measured for a particular subset of the apartments where the model prediction is overtly incorrect (the apartment features are outlying, in these cases). Users are said to detect the error when they deviate greatly in the direction of the correct value, and otherwise to not have detected the error.
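A toy sketch of how the three outcomes could be computed per trial (the thresholds and the relative-movement rule here are our own simplifications, not the paper's exact definitions):

```python
def trial_outcomes(user_model_guess, model_pred, user_price_guess, true_price,
                   is_outlier, sim_tol=0.1, detect_tol=0.5):
    """Compute simulatability, deviation, and (when applicable) error detection
    for a single apartment shown to a single user."""
    # Simulatability: was the user's guess of the model output close to it?
    simulatability = abs(user_model_guess - model_pred) / abs(model_pred) < sim_tol

    # Deviation: how far the user's own price estimate moves away from the model.
    deviation = abs(user_price_guess - model_pred)

    # Error detection (only for overtly wrong predictions): did the user move
    # most of the way from the model's prediction toward the true price?
    error_detected = None
    if is_outlier:
        moved = (user_price_guess - model_pred) / (true_price - model_pred)
        error_detected = moved > detect_tol
    return simulatability, deviation, error_detected
```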
Several results follow. First, users in the transparent conditions are better able to simulate the model than in the blackbox condition, and in both conditions it is easier to simulate the 2-feature model than the 8-feature one. Interestingly, users deviate from model predictions to the same degree in both transparent and blackbox conditions. And surprisingly, users detect model errors less frequently in the transparent condition than the blackbox condition. In further analysis of this last finding, the authors found that highlighting abnormal features of the apartments in the user interface effectively erased the difference in error detection between conditions, which they describe as "consistent with the idea that transparency can be overwhelming and cause users to overlook unusual cases."
CHI
Methods
Estimating Feature Importance (10)
2020
From Alignment Newsletter #95:
This paper presents a novel method, Neuron Shapley, that uses the Shapley value framework to measure the importance of different neurons in determining an arbitrary metric of the neural net output. (Shapley values have been applied to machine learning before to measure the importance of features to a model's output, but here the authors use them to calculate neuron importance.) Due to several novel approaches and optimizations in calculating these Shapley values, the top k most responsible neurons (k ~ 30) can be feasibly found for large networks such as Inception-v3.
The authors demonstrate that finding these neurons enables the performance of model surgery. Removing the top 30 neurons that contribute to accuracy completely destroys the accuracy, whereas in expectation removing 30 neurons at random from the network barely moves the accuracy at all. Since the method can be applied to an arbitrary metric, this kind of surgery can be performed for other metrics we care about. For example, removing the neurons which are most responsible for vulnerability to adversarial attacks makes the network more robust, and removing the neurons most responsible for the class-accuracy imbalance (a fairness metric) makes the classes much more even, while only reducing the overall accuracy a small amount.
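A heavily simplified Monte Carlo sketch of the underlying Shapley computation (the paper's contribution is largely the truncation and multi-armed-bandit machinery that makes this tractable; `metric` is an assumed callable that evaluates the network with only the given neurons left unablated):

```python
import numpy as np

def neuron_shapley_mc(metric, neurons, num_samples=100, seed=0):
    """Estimate each neuron's Shapley value for `metric` by averaging marginal
    contributions over random ablation orders."""
    rng = np.random.default_rng(seed)
    values = {n: 0.0 for n in neurons}
    for _ in range(num_samples):
        order = list(neurons)
        rng.shuffle(order)
        active = set(neurons)
        prev = metric(active)
        for n in order:                               # ablate neurons one at a time
            active.remove(n)
            cur = metric(active)
            values[n] += (prev - cur) / num_samples   # marginal contribution of n
            prev = cur
    return values   # largest values = neurons most responsible for the metric
```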
NeurIPS
2019
The paper introduces an explanation approach that aims to select parts of an input that can be used to predict a blackbox model's output for the entire data point. The selection is made by their explainer model, which is optimized to trade off between making selections informative of the label and keeping selections brief. These goals are formalized via mutual information. Since this formulation is intractable to optimize for directly, they optimize a variational bound on this objective, which looks like maximizing the likelihood of the blackbox model's predictions under an approximator model, while regularizing the size of the selections and encouraging sparsity. That is, the explainer masks the input and the approximator predicts the blackbox model's output given the masked input. The explainer and approximator are jointly trained; masks given by the explainer are made differentiable by means of the Gumbel-Softmax estimator, which is used to select exactly k (a hyperparameter) elements from the input (a continuous approximation of an n-choose-k sample). Note that the mask elements used after the Gumbel-Softmax are still in the unit interval [0, 1]. Lastly, note that the units of selection will include, for text data: words, groups of words, or sentences; and for images: squares of pixels.
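A rough PyTorch sketch of the joint objective (we combine k independent Gumbel-Softmax draws with an elementwise max, an L2X-style relaxation of n-choose-k; the paper's exact relaxation and regularizers may differ, and `x` is assumed to be a batch of flat feature vectors):

```python
import torch
import torch.nn.functional as F

def explainer_approximator_loss(explainer, approximator, blackbox, x, k=5, tau=0.5):
    """Explainer proposes a soft mask over input units; approximator must predict
    the blackbox's output from the masked input."""
    logits = explainer(x)                                   # (batch, n_units)
    draws = [F.gumbel_softmax(logits, tau=tau) for _ in range(k)]
    mask = torch.stack(draws).max(dim=0).values             # soft top-k mask, values in [0, 1]

    with torch.no_grad():
        target = blackbox(x).argmax(dim=-1)                 # blackbox predictions to mimic

    approx_logits = approximator(x * mask)                  # approximator sees the masked input
    return F.cross_entropy(approx_logits, target)           # maximize likelihood of blackbox output
```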
Their quantitative evaluations include models of biological data, MNIST, and IMDB sentiment analysis data. With the biological data, they identify a simple heuristic based on explanations of model predictions that they use to either accept or reject model predictions. It seems that they check whether explanations for test data "match" explanations from another dataset in particular situations, and if so, they accept the prediction; test accuracy is higher among accepted predictions than among rejected ones.
For MNIST alone, they ask graduate students at CMU with a background in ML to rate explanations on a scale of 0 to 5, where 0 corresponds to "No explanation", the intermediate range to "Insufficient or redundant explanation" and 5 to "Concise explanation." Their method gets an average of 3.53 (next best: SmoothGrad, 3.45).
For IMDB alone, they do a simulation test with MTurkers, where they ask users to predict the model output given the explanation only (using only correctly predicted data). With their method, users achieve 44.7% accuracy (next best, L2X: 35.6%, random is 33%).
For both MNIST and IMDB, they measure "approximator fidelity" (approximator accuracy at predicting blackbox output) and "rationale fidelity" (approximator accuracy using hard masks rather than continuous masks). Using hard rather than continuous input masks does not greatly reduce the approximator's accuracy. They do not find any statistically significant gains in approximator fidelity over the most similar existing method, L2X, but they do tend to find increases in rationale fidelity of between 2 and 10 percentage points in most situations for both datasets.
arxiv
2020
This paper proposes to explain the causal effect of features on model outputs, particularly high-level features that are not easily manipulatable in the input. In cases where the feature of interest, such as the use of adjectives in a text, can be easily omitted without influencing the remaining features in the input, this causal effect can be easily estimated as the difference in model outputs with and without this feature. But when altering or removing features violates the data generating process, e.g. by destroying the grammaticality of a text while trying to remove information about its topic, the causal effect of this kind of high-level feature cannot easily be estimated through a counterfactual data generation scheme.
Lacking access to a generative model that can remove high-level features from text without influencing any other features, the authors propose to compare model outputs using the original data representations and counterfactual data representations. These counterfactual representations are obtained from a counterfactual model which is trained for the task at hand with an adversarial loss term encouraging it to forget the target concept (the one whose effect is being explained). This requires supervision at the instance or even feature level. An example: to calculate the effect of the adjective part of speech on a text classifier, the authors train their counterfactual model to perform the task at hand while ensuring that the model's final representations do not contain any information about whether each word was an adjective or not (using word-level part-of-speech supervision). Then the causal effect of the adjective PoS on the model output is the difference between these two models' outputs. The authors also make use of "control concepts" to check that this adversarial procedure is not removing other information that should be retained for the task at hand, though this requires even more supervision.
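Once the adversarially trained counterfactual model is available, the effect estimate itself is simple; here is a sketch (averaging absolute probability differences over examples is our choice for illustration, not necessarily the paper's exact estimator):

```python
import torch

def concept_effect(task_model, counterfactual_model, inputs):
    """Estimate a concept's causal effect on the model as the average change in
    predicted class probabilities between the original task model and the
    counterfactual model trained to forget the concept."""
    with torch.no_grad():
        p_original = torch.softmax(task_model(inputs), dim=-1)
        p_counterfactual = torch.softmax(counterfactual_model(inputs), dim=-1)
    return (p_original - p_counterfactual).abs().mean().item()
```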
In order to compare against "ground truth" causal effects, the authors evaluate their method on tasks where the causal effect of a feature on a model is easily estimated by generating counterfactual data. For instance, they rely on sentiment analysis data where people's first names can be substituted with one another to change the likely gender or race represented in the example. Here, the ground truth effect on the model output is computed by checking the difference in model outputs for inputs with names substituted for one another. They also evaluate the effect of adjectives as described above, although one confusing thing about this evaluation is that they automatically remove adjectives from sentences in order to generate counterfactual data, but the adjectives themselves are left in the data when training their counterfactual model (only PoS information is eliminated).
Evaluations show that their method produces causal effect estimates close to the ground truth for each dataset, while baseline methods are highly inaccurate in some cases. Their baseline methods identify counterfactual points based on "passive observation" of features' presence/absence rather than causal intervention. They do not compare to any baselines based on perturbing data points, such as LIME. The authors suggest that Iterative Null Space Projection could serve as an alternative to the expensive counterfactual model training process.
arxiv
Interpreting Representations and Weights (5)
2017
This paper proposes a method for translating vector communications between artificial agents into natural language. The motivation here is that, while Deep Communicating Policies (DCPs) are solving multi-agent communication games, it is difficult to analyze the content of agent messages. If we can translate the messages into semantically equivalent natural language statements, that would help us understand how the agents are behaving.
This translation is made possible by collecting data of humans playing the same communication games as the agents. One game is a simple reference game, where one player describes an image to a second player, and the second player must pick which of two images the first player is describing. The second game is a simplified driving game where two cars must pass through an intersection without colliding when they cannot observe one another directly. So the authors collect data of humans playing these games.
Translation is predicated on a particular notion of meaning. The authors choose to use a denotational perspective, suggesting that the meaning of a message is represented by the distribution over the speaker's state that is induced in the mind of the listener. This is instead of the pragmatic perspective, which would define meaning by the actions induced in the listener. They next propose a translation algorithm based on finding the minimum KL divergence between two "meaning distributions" induced by two communications. At a high level, the procedure measures the quality of a translation from a model's vector to some natural language utterance in terms of how frequently the utterance induces the same belief distribution over speaker states that the vector communication would (averaged across states and weighted by how likely it is the vector would be used in each state). Then, the best translation for a given message is the argmax of this quality measure (equivalent to the argmin distance between belief distributions induced by the messages).
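A simplified sketch of the translation criterion (the paper averages over speaker states weighted by how likely the vector message is in each state; here we just compare two induced belief distributions directly, with placeholder belief models):

```python
import numpy as np

def translate(vector_msg, candidate_utterances, belief_from_vector,
              belief_from_utterance, states):
    """Pick the utterance whose induced belief over speaker states is closest
    (in KL divergence) to the belief induced by the agent's vector message."""
    q = np.array([belief_from_vector(vector_msg, s) for s in states])
    q = q / q.sum()
    best_u, best_kl = None, np.inf
    for u in candidate_utterances:
        p = np.array([belief_from_utterance(u, s) for s in states])
        p = p / p.sum()
        support = q > 0
        kl = np.sum(q[support] * np.log(q[support] / np.maximum(p[support], 1e-12)))  # KL(q || p)
        if kl < best_kl:
            best_u, best_kl = u, kl
    return best_u
```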
Experiments show that this procedure helps models and humans communicate during gameplay and solve the games together at above random rates, even close to model-to-model gameplay without translation. This evaluation is automated by rolling out a human game trajectory and substituting in a model for one of the humans, while the "human" side of the game does not actually listen to the model's messages. To simplify the problem, human utterances in these games are treated as categorical over a set of simple words or phrases that are typically used in the games. Interestingly, the authors find that a "direct translation" baseline, which is simply a model of p(vector|utterance) trained on data from states with both human and model communications, does equally well in terms of allowing model-human pairs to complete the games, but this baseline is much less denotationally accurate, as they find that these communications do not lead listeners to form accurate beliefs about the speaker's state.
ACL
2018
The authors propose a method for finding directions in a neural model's latent space that correspond to human concepts, like stripedness, and study the representations their method identifies for visual human concepts using GoogLeNet and Inception V3. The method itself is quite simple. First, the authors collect a small set of images depicting a concept, e.g. they collect images showing stripes or striped things (around 30 images in some experiments). Then, a linear model is trained to discriminate between the hidden states of the model when these images are passed through it and the hidden states obtained by passing randomly chosen images through the model. The randomly chosen images serve as a contrast set to the set representing the concept of interest. Finally, the Concept Activation Vector (CAV) is defined as the normal vector to the decision boundary, which points in the direction of the concept set's activations. Next, they define a score used for quantitative Testing with CAVs (the TCAV score). This score makes use of a directional derivative of a class score obtained for an image with respect to the CAV for a concept. Specifically, for a given class and a given CAV, the TCAV score is the proportion of images predicted as that class with a positive directional derivative with respect to the CAV. Hence the TCAV score ranges from 0 to 1, depending on what proportion of the images have positive directional derivatives. They intend this to capture how much a concept contributes to a model's prediction for a class (e.g., for images classified as a Zebra, how much did stripedness contribute?).
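A compact sketch of the two steps (the logistic-regression classifier and the pre-computed gradients are assumptions for illustration; the paper fits a linear classifier on layer activations and computes directional derivatives within the network):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def compute_cav(concept_activations, random_activations):
    """CAV: unit normal to a linear boundary separating concept-image activations
    from random-image activations at a chosen layer."""
    X = np.vstack([concept_activations, random_activations])
    y = np.concatenate([np.ones(len(concept_activations)),
                        np.zeros(len(random_activations))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    cav = clf.coef_[0]
    return cav / np.linalg.norm(cav)

def tcav_score(class_logit_grads, cav):
    """TCAV score: fraction of class examples whose directional derivative of the
    class logit along the CAV is positive. `class_logit_grads` has one row per
    image, containing the gradient of the class logit w.r.t. the layer activations."""
    directional_derivatives = class_logit_grads @ cav
    return float((directional_derivatives > 0).mean())
```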
Two extensions of the basic CAV approach are used throughout the experiments: rather than just getting one CAV with one set of contrastive random images, they get 500 CAVs using 500 random batches of contrastive images. They omit experimental results when these 500 CAVs do not yield statistically significant TCAV scores, as determined by a t-test with a null hypothesis of TCAV=0.5. Additionally, the contrast sets are sometimes not random images, but are instead selected to better isolate the concept of interest. For example, the contrast set for the stripedness concept set is the union of images representing dot and mesh textures.
Armed with the TCAV score, the authors perform a number of qualitative and quantitative experiments. They rank-order images by their similarity to a CAV, and display the results. They make Deep Dream style visualizations by maximizing an image's hidden states' similarity to selected CAVs. They validate a number of expected associations between classes and concepts, as well as biases in the models' training data. In one experiment, they aim to validate TCAV scores against ground truth model reliance on a particular concept, by finetuning the models on a data subset where they control the presence of an easy-to-detect feature in the data (inpainted class names as text in the images). With these controlled datasets, they obtain models that either do or do not rely on the easy-to-detect feature in the images, as indicated by their accuracy on test sets without the feature. Then, they compare their ground-truth measure of model reliance on the feature with obtained TCAV scores for the feature CAVs. Showing results for two classes, they find a correlation between a model's reliance on the feature and TCAV scores for the feature. Using these same controlled datasets, they also measure how MTurkers judge feature importance based on gradient-based saliency maps of image predictions. They find that the saliency maps often highlight the easy-to-detect feature even when it is not being used by the model, which seems to mislead the respondents into thinking it was important to the model. Lastly, they obtain domain expert feedback on using TCAVs for a model of diabetic retinopathy, suggesting that "TCAV may be useful for helping experts interpret and fix model errors when they disagree with model predictions."
ICML
2020
From Alignment Newsletter #116:
Network dissection is an interpretability technique introduced in 2017, which uses a dataset of images with dense (i.e. pixel) labels of concepts, objects and textures. The method measures the areas of high activation of specific channels in a convolutional neural network, then compares these areas with the labelled areas in the dataset. If there's a high similarity for a particular channel (measured by the intersection divided by the union of the two areas), then we can say this channel is recognizing or responding to this human-interpretable concept.
This paper introduces an extension of this idea, where instead of just using the basic concepts (and matching areas in the dataset), they search through logical combinations of concepts (respectively areas) to try and find a compositional concept which matches the channel's activations. For example, a channel might respond to (water OR river) AND NOT blue. This is still a concept humans can understand (bodies of water which aren't blue), but enables us to explain the behaviour of a larger number of neurons than in the original network dissection method. Their work also extends the method to natural language inference (NLI), and they interpret neurons in the penultimate layer of a BiLSTM-based network trained to know whether a sentence entails, contradicts, or is neutral with respect to another. Here they create their own features based on words, lexical similarity between the two sentences, and part-of-speech tags.
Using their method, they find that channels in image classifiers do learn compositional concepts that seem useful. Some of these concepts are semantically coherent (i.e. the example above), and some seem to have multiple unrelated concepts entangled together (i.e. operating room OR castle OR bathroom). In the NLI network, they see that many neurons seem to learn shallow heuristics based on bias in the dataset - i.e. the appearance of single words (like nobody) which are highly informative about the classification.
Finally, they use their method to create copy-paste adversarial examples (like in Activation Atlas (AN #49)). In the Places365 dataset (where the goal is to classify places), they can crudely add images which appear in compositional concepts aligned with highly contributing neurons, to make that neuron fire more, and hence change the classification. Some of these examples generalise across classifier architectures, implying a bias present in the dataset.
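The core matching score is easy to state in code; here is a sketch (the thresholding choice is ours, and the compositional search simply scores logical formulas over concept masks with the same IoU measure):

```python
import numpy as np

def iou(active_region, concept_region):
    """Network-dissection-style score: intersection over union between a channel's
    thresholded activation map and a labelled concept mask (both boolean arrays)."""
    intersection = np.logical_and(active_region, concept_region).sum()
    union = np.logical_or(active_region, concept_region).sum()
    return intersection / union if union > 0 else 0.0

# A compositional concept is just a logical formula over concept masks, e.g.
# (water OR river) AND NOT blue; the search scores candidate formulas by IoU.
def water_or_river_and_not_blue(water, river, blue):
    return np.logical_and(np.logical_or(water, river), np.logical_not(blue))
```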
arxiv
2021
From Alignment Newsletter #98:
This paper introduces the Loss Change Allocation (LCA) method. The method's purpose is to gain insight and understanding into the training process of deep neural networks. The method allocates the change in overall loss (on the whole training set) across all parameters at each training iteration, and this allocation is iteratively refined until the approximation error is less than 1% overall. This loss change allocation can be either positive or negative; if it's negative, then the parameter is said to have helped training at that iteration, and if it's positive then the parameter hurt training. Given this measurement is per-parameter and per-iteration, it can be aggregated to per-layer LCA, or any other summation over parameters and training iterations.
The authors use the method to gain a number of insights into the training process of several small neural networks (trained on MNIST and CIFAR-10).
First, they validate that learning is very noisy, with on average only half of the parameters helping at each iteration. The distribution is heavier-tailed than a normal distribution, and is fairly symmetrical. However, parameters tend to alternate between helping and hurting, and each parameter only tends to help approximately 50% of the time.
Second, they look at the LCA aggregated per-layer, summed over the entire training process, and show that in the CIFAR ResNet model the first and last layers hurt overall (i.e. have positive LCA). In an attempt to remedy this and understand the causes, the authors try freezing these layers, or reducing their learning rate. The first layer can't be fixed (freezing makes its LCA 0, but later layers' LCA is increased in turn so the overall final loss stays the same). However, for the last layer, freezing or reducing the learning rate increases the overall performance of the network, as the last layer's LCA is decreased more than all the other layers' LCAs are increased. They also hypothesize that by reducing the momentum for the last layer, they can give it fresher information and make it more likely to learn. They find that this does work, though in this setting previous layers’ LCA increases to compensate, leaving overall performance unchanged.
Finally, the authors show that learning seems to be synchronised across layers; layers get local LCA minima at the same training iterations, in a statistically significant way. They show this must be a combination of parameter motion and the gradient, as neither on their own explains this phenomenon.
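A first-order sketch of the allocation at a single training step (the paper iteratively refines this approximation, e.g. with finer path integration, until the total error is under 1%; the whole-training-set gradient function `grad_fn` is an assumed interface):

```python
def loss_change_allocation(grad_fn, theta_before, theta_after):
    """Allocate one step's change in training loss across parameters as
    gradient * parameter-change (first-order approximation)."""
    g = grad_fn(theta_before)          # gradient of the whole-train-set loss
    delta = theta_after - theta_before # parameter movement during the step
    lca = g * delta                    # per-parameter allocation of the loss change
    # Negative entries "helped" (reduced the loss); positive entries "hurt".
    return lca
```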
arxiv
Generating Counterfactuals and Recourse Procedures (4)
2020
This paper introduces a method for generating Diverse Counterfactual Explanations (DiCE) of binary classification models of tabular data. The primary motivation is recourse, i.e. giving a person feasible steps for achieving a desired outcome when a model has made an unsatisfactory decision about them. Use cases include offering recourse for credit risk assessment and loan approval decisions. They run experiments with datasets for these tasks.
Given an original data point and a desired model output, their goal is to identify a set of data points that are (1) valid, meaning the model outputs the desired class, (2) proximate, meaning close to the original data point, (3) diverse, meaning they represent a variety of paths to achieving the desired outcome, (4) sparse, meaning few changes from the original data point are proposed, and (5) feasible, meaning they follow causal laws of the data generating process. To solve this problem, they formulate the first three objectives (validity, proximity, diversity) in differentiable terms, and perform a gradient-based optimization to identify the set of counterfactual points. Sparsity is encouraged in a post-processing step where, for a candidate counterfactual point obtained from the optimization, changes from the original point are greedily selected until the counterfactual achieves the desired model output. The causal feasibility is enforced in another simple filtering step, where candidates are filtered out based on violations of simple user-specified causal principles (e.g., if education increases, age must also increase; education cannot decrease).
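A PyTorch sketch of the differentiable part of the objective (the loss weights, the cross-entropy validity term, and the exact kernel details are simplifications of ours; the paper uses a determinantal-point-process-style determinant of a 1/(1+distance) similarity kernel for diversity):

```python
import torch

def dice_style_loss(model, counterfactuals, original, desired_class,
                    lambda_proximity=0.5, lambda_diversity=1.0):
    """Score a set of candidate counterfactuals on validity, proximity, and diversity."""
    logits = model(counterfactuals)
    targets = torch.full((counterfactuals.shape[0],), desired_class, dtype=torch.long)

    # (1) Validity: push every counterfactual toward the desired class.
    validity = torch.nn.functional.cross_entropy(logits, targets)

    # (2) Proximity: stay close to the original data point.
    proximity = (counterfactuals - original).abs().mean()

    # (3) Diversity: determinant of a similarity kernel over the set; a more
    #     spread-out set of counterfactuals gives a larger determinant.
    distances = torch.cdist(counterfactuals, counterfactuals)
    kernel = 1.0 / (1.0 + distances)
    diversity = torch.det(kernel + 1e-4 * torch.eye(counterfactuals.shape[0],
                                                    dtype=kernel.dtype))

    return validity + lambda_proximity * proximity - lambda_diversity * diversity
```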
For neural models of the COMPAS recidivism dataset, an income-prediction dataset, and the credit risk assessment and lending approval datasets, they find that their approach successfully identifies sets of counterfactuals that reasonably satisfy automatic metrics for validity, diversity, and proximity, and they present qualitative examples. They also aim to measure how these kinds of explanations, consisting of one original datapoint and a set of counterfactuals, can communicate a model's local decision boundary. As proxies for humans trying to reason about a local decision boundary given an explanation, they fit a simple nearest neighbor model to one explanation at a time and evaluate how that model classifies data in a radius around the original data point. That is, they sample datapoints in a sphere around the original data point, and compute the accuracy of a 1-nearest-neighbor model that has look-up access to only the data points in the explanation. In this set-up, they find the simple 1-NN model achieves up to a 44 F1 score with k=4 counterfactuals. They suggest that examples from their method can "approximate the local decision boundary at least as well as local explanation methods like LIME."
ACM FAT
Explanation by Examples, Exemplars, and Prototypes (4)
2019
The goal of this paper is to get image recognition models to imitate one way that humans can classify images, which is by comparing parts of a new image to prototypical parts of known images. The authors suggest that a model that can do this is "interpretable, in the sense that it has a transparent reasoning process when making predictions." They propose a model for this purpose with the following structure: A CNN maps an image to a representation of shape H×W×D, which can be thought of as a set of H·W vectors in ℝ^D. For every class, there are 10 vectors in ℝ^D that are prototype vectors for that class. For each prototype, a prototype activation is obtained by a function of the distance between that prototype and the nearest vector given by the CNN output, such that nearer vectors yield larger activations. A logit for each class is computed by a weighted sum of the prototype activations. By enforcing that prototype vectors correspond to some vector representation obtained from an actual training data point, they attain a model that makes predictions based on similarities in representations between a current data point and training data. To interpret how a model predicts a given class, the authors interpret the activations between an image's CNN output and the 10 prototype vectors for that class. The model is trained end-to-end with a step that constrains prototype vectors to be equal to representations from training data points.
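A minimal sketch of the prototype-similarity computation (channels-last feature maps and the dimensions are assumed for illustration, and the projection step that ties prototypes to actual training patches is omitted):

```python
import torch
import torch.nn as nn

class PrototypeLayer(nn.Module):
    """Compare a CNN feature map of shape (H, W, D) to per-class prototype
    vectors; each prototype's activation is a decreasing function of its
    distance to the nearest of the H*W feature vectors, and class logits are
    a weighted sum of the activations."""

    def __init__(self, n_classes=200, protos_per_class=10, d=128, eps=1e-4):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_classes * protos_per_class, d))
        self.last_layer = nn.Linear(n_classes * protos_per_class, n_classes, bias=False)
        self.eps = eps

    def forward(self, feature_map):             # (batch, H, W, D)
        b, h, w, d = feature_map.shape
        feats = feature_map.reshape(b * h * w, d)
        sq_dists = torch.cdist(feats, self.prototypes) ** 2           # (b*H*W, P)
        min_dists = sq_dists.reshape(b, h * w, -1).min(dim=1).values  # (b, P)
        # Closer feature vectors -> larger activation.
        activations = torch.log((min_dists + 1) / (min_dists + self.eps))
        return self.last_layer(activations), activations
```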
The model interpretations are visualized by localizing the part of the query image that highly activates each prototype, as well as the region in the prototype's training image that it represents. For a given prototype, heat maps over the query image are generated by taking the activation score of that prototype with all H·W vectors in an image's CNN output, then upsampling the resulting activation grid to match the original image size. Finally, a bounding box is obtained by selecting the smallest rectangle containing the top 5% of upsampled activation scores (at the pixel level).
With models trained for fine-grained image classification using the CUB dataset, the authors present an abundance of qualitative analysis of the model, including examples of classified data points and analysis of the latent space structure. The common pattern in the presented figures is that the image parts that most activate a given prototype reflect similar characteristics of the image: a bird with red feathers on its belly activates a certain prototype, and when bounding boxes are overlaid on the original image and the image from which the prototype comes, the bounding boxes cover the red bellies in both images. Another prototype might capture the bird's feet, another the bird's eye, etc., each showing the bird in the training data where the prototypical part representation comes from. The authors note that, across models, a maximum drop of 3.5 percentage points in accuracy is observed between the prototype model and a blackbox model trained with the same CNN feature extractor. This accuracy gap can be mostly closed by ensembling several prototype models (though this does increase the parameter count and number of prototypes per class).
NeurIPS
Finding Influential Training Data (2)
2020
From Alignment Newsletter #97:
This paper presents the TrackIn method for tracking the influence of training datapoints on the loss on a test datapoint. The purpose of the method is to discover influential training points for decisions made on the testing set. This is defined (loosely) for a training point x and test point z as the total change in loss on z caused by training on x. They present several approximations and methods for calculating this quantity efficiently, allowing them to scale their method to ResNet 50 models trained on ImageNet.
The standard method of evaluation for these kinds of methods is finding mislabelled examples in the training dataset. Mislabelled examples are likely to have a strong positive influence on their own loss (strong as they're outliers, and positive as they'll reduce their own loss). Sorting the training dataset in decreasing order of this self-influence, we should hence expect to see more mislabelled examples at the beginning of the list. We can measure what proportion of mislabelled examples is present in each initial segment of the list. The authors perform this experiment on CIFAR, first training a model to convergence, then mislabelling 10% of the training set as the next highest predicted class, and then retraining a new model on which TrackIn is run. When compared to the two previous methods from the literature (Influence Functions and Representer Points), TrackIn recovers more than 80% of the mislabelled data in the first 20% of the ranking, whereas the other methods recover less than 50% at the same point. For all segments TrackIn does significantly better.
They demonstrate the method on a variety of domains, including NLP tasks and vision tasks. The influential examples found seem reasonable, but there's no quantification of these results.
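A rough sketch of the checkpoint-based approximation described above, assuming a list of saved checkpoints (models) with their learning rates and single-example tensors; the paper also describes cheaper variants, e.g. using only last-layer gradients:

```python
import torch

def influence(checkpoints, lrs, loss_fn, x_train, y_train, x_test, y_test):
    """Approximate the influence of one training example on one test example
    as the sum over checkpoints of (learning rate) x (dot product of the two
    examples' loss gradients). Self-influence (train == test) is what is used
    to rank examples when hunting for mislabelled data."""
    total = 0.0
    for model, lr in zip(checkpoints, lrs):
        params = list(model.parameters())
        g_train = torch.autograd.grad(
            loss_fn(model(x_train.unsqueeze(0)), y_train.unsqueeze(0)), params)
        g_test = torch.autograd.grad(
            loss_fn(model(x_test.unsqueeze(0)), y_test.unsqueeze(0)), params)
        total += lr * sum((a * b).sum() for a, b in zip(g_train, g_test)).item()
    return total
```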
arxiv
Natural Language Explanations (8)
2020
The authors develop a virtual cooking game as a testbed for a proposed method for robot-human collaboration. The core of their approach is that, in addition to online planning of its own actions, the robot maintains a mental model of the human's current plan. Whenever the robot thinks the human's plan deviates from its own by a certain amount, the robot sends a message to the human including: (a) its own goal, (b) the action it will next take, (c) and the outcome of that action, as well as (d) its expectation of the human's goal, (e) the action it thinks the person will take, and (f) the outcome of that action. The message comes in the form of schematic natural language, with variable/action names substituted in. The plans themselves are given by parsing a Spatial, Temporal, and Causal And-or Graph (STC-AoG), which represents goals, subgoals, and atomic actions for achieving subgoals. The authors present an algorithm for inferring human plans that "uses observed user actions and communication history to infer human mental state."
They recruit study participants from their university subject pool for a test of the robot and cooking game. They consider three conditions: one with no communication, a "heuristic" condition with communication every 9.3 seconds (based on a pre-study's frequency of people asking for help), and one with mental-model-based communication. Participants completed their first task in about 75 seconds on average in the mental-model condition, versus 150 in the control (p<.05 on difference; 125s in the heuristic condition, p=.12). On a 7 point "helpfulness" scale, the mental-model condition was rated around 5.5 on average, versus 4 in the control and 3.5 in the heuristic (p<.05 for comparisons). In participants' second round of the game, times converged across conditions, presumably due to familiarity with the game. The authors note that disparities in plans may arise from differing subgoals, or misunderstandings about action preconditions or effects (on the human's end).
arxiv
2020
This paper assesses extractive explanations for models of textual data under a simulatability perspective, and they present explanation generation in terms of communication between an explainer and a listener (Layperson, as the paper puts it). The purpose of the explanation is to simply encode the model's output. An extractive explanation is a set of words pulled out of an input, and the procedures for generating these explanations follow a general format: rank words, then pull out the top k. What they are evaluating is the success of a procedure at encoding a label into a bag of words from a particular input such that a listener can extract the label from the code.
The ranking methods they consider include a classification model's attention weights (using various kinds of attention), a gradient-based saliency ranking, and a standard word omission procedure. The listener is either a human or a trained BoW model. They also consider jointly training the BoW listener along with an explanation model, E(x, ŷ, h), which extracts words from the input conditioned on the classifier's output and its final hidden state. Their automatic evaluation is to compute the listener BoW model's accuracy at predicting a classifier's output (Communication Success Rate, CSR). Their human evaluation is to use people as explainers, listeners, or both. With IMDB sentiment and SNLI natural language inference data, they evaluate CSR with all four combinations of human and machine listeners and speakers. They find that jointly trained machine explainers and listeners complete the task with 99%+ accuracy for both datasets. There is little to no statistical difference among the various attention-based top-k methods with human listeners: on sentiment analysis, CSR ranges from 87.5% to 93.25% (random: 50%); on NLI, it ranges from 70.5% to 74.5% (random: 50% — no neutral label). Humans successfully communicate the label 86.5% of the time on NLI.
arxiv
2020
The key question the paper addresses is how to evaluate natural language explanations generated by a model. Past works have done so primarily by training a model for particular tasks, using human explanations for data points as supervision for explanation generation, then comparing explanations generated for the model decisions to the “ground-truth” explanations of the data point labels provided by humans. The authors argue that this is problematic because it suggests that explanations are evaluated according to their plausibility rather than faithfulness to the model’s internal reasoning. In effect, past works rated methods highly when they gave explanations of their behavior that sounded good, even if they did not actually reflect the true reasons that led the model to its decision. To resolve this shortcoming, they present a procedure for automatically measuring how well explanations would let an “observer” predict a model’s behavior (in a similar spirit to other work on model simulatability). For this procedure to capture the meaning in explanations and avoid rewarding trivial explanations that just restate the model behavior (e.g. “I picked answer A because it was the best choice”), a causal inference method is used to control for the influence of explanation “triviality” on the observer. In their experiments, the observer is a model proxy for a human.
Using two existing text datasets with human explanations, e-SNLI and CoS-E, they evaluate existing methods and newly presented approaches, finding that some methods may not produce helpful explanations on average, while the most successful explanations come from a model that first generates hypothetical explanations for every answer choice, then makes predictions based on the input and all hypothetical explanations (and selects the final explanation based on the prediction). Further, since a metric for explanation quality is proposed, they also carry out experiments where models are optimized for this metric directly (or, rather, a proxy for this metric). These experiments are interpreted as multi-agent communication games played in natural language, and they find that in some settings this can improve the explanations' effect on model simulatability.
Findings of EMNLP
Developing More Easily Interpreted Models (6)
2018
This paper proposes an algorithm that minimizes the number of user studies needed to identify a model that is both accurate and interpretable according to some study-based criterion for interpretability. In their terminology, they query a human "prior" on model interpretability, p(M), via a single user study, and their algorithm solves for the MAP solution, max_{M ∈ 𝓜} p(X|M)p(M), by using sequential user studies to search over models in 𝓜.
The procedure is as follows: First, they obtain a set of high likelihood models, i.e. models that explain the data well. Based on the view that model users might want a model to meet a minimum accuracy threshold, but not care much about exceeding this threshold, they define a likelihood function that sharply increases once a model meets an accuracy threshold. In practice, they have no preference between models as long as they meet the accuracy threshold. Second, they perform a user study that yields a single human-interpretability score (HIS) for a given model and subset of data. They use the domain-general notion of simulatability in their studies, computed here as the mean response time (RT), which is the time it takes a user to predict a model output for a given input when relying on a summary/explanation of the model (elaborated on in the next paragraph). For higher HIS scores to represent more interpretable models, they compute the final HIS as the difference between a maximum allowed response time and the actual response time. The actual optimization occurs by an upper-confidence-bound based search over models, with UCBs given by a Gaussian Process over 𝓜. The kernel on models is the RBF kernel over feature importance estimates obtained for each model (by gradient-based feature importance for NNs, and some other procedure for decision trees). Starting with an initial model from the set of acceptably accurate models, the next user study is conducted using the model with the highest UCB estimate for its HIS.
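A sketch of what one iteration of this search could look like, assuming each candidate model is summarized by a feature-importance vector and that HIS scores from the user studies run so far are available (the kernel hyperparameters and exploration weight are illustrative):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def pick_next_model(feature_importances, studied_idx, his_scores, beta=2.0):
    """Fit a GP (RBF kernel over feature-importance vectors) to the HIS scores
    measured so far, then return the index of the unstudied model with the
    highest upper confidence bound to use in the next user study."""
    X = np.asarray(feature_importances)        # (n_models, n_features)
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
    gp.fit(X[studied_idx], np.asarray(his_scores))
    mean, std = gp.predict(X, return_std=True)
    ucb = mean + beta * std
    ucb[studied_idx] = -np.inf                 # don't re-study measured models
    return int(np.argmax(ucb))
```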
The summary/explanation of a model is, for decision trees, given to users as a literal visualization of the tree. For neural networks, the summary/explanation is given by a locally estimated decision tree for each study data point. Similar to LIME, they sample from a perturbation around a data point and fit a decision tree to the NN's outputs on that perturbation distribution.
In their study, they run this algorithm for 10 iterations (ending with the 11th model), using graduate ML students and four tabular datasets (one synthetic). Before using graduate students, they ran a study with MTurkers, but found that the results were too noisy. They train decision trees for the three simpler datasets and use a neural network for the most complex dataset. They also compare to models obtained by optimizing for four easily computable heuristics for interpretability (in decision trees): number of nodes, mean path length, number of used features, and mean number of features in a path.
They find that: (1) while optimizing for a given interpretability heuristic (like number of decision tree nodes) often produces models that perform poorly under another interpretability heuristic, optimizing for their response-time HIS score produces models that tend to do well across interpretability heuristics, (2) under each of the four interpretability heuristics, their optimization algorithm finds more interpretable models more quickly than a random search over models, (3) their optimization does seem to increase HIS scores over time, which is the actual objective in the optimization. Regarding this last point, note that they "did not see statistically significant differences in our results," though their experiments may have been underpowered since they ran with 16 subjects divided across conditions.
NeurIPS
2018
The paper introduces an approach called CORELS that finds rule lists for data with a categorical feature space, where the rule lists are guaranteed to be optimal under the learning objective. Here, a rule list is a list of if-else statements of the form "if feature J == category C, predict Y, else ..." ending in a default prediction, and it serves as a classifier. The objective function is the empirical risk, with regularization on the length of the rule list (i.e. number of rules). The regularization penalty is directly interpretable as a trade-off between accuracy and model size: a penalty of c means that each added rule must improve training accuracy by at least c to be worth including. The model is identified through a proposed branch-and-bound algorithm, which relies on a number of key observations that drastically reduce the size of the space of rule lists that needs to be searched.
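As a toy illustration of the objective (the rules below are hypothetical, not learned from any real dataset), evaluating a rule list and its regularized risk looks like this:

```python
def rule_list_predict(x, rules, default):
    """Walk the if-else chain and return the first matching rule's prediction."""
    for condition, prediction in rules:
        if all(x.get(feat) == val for feat, val in condition.items()):
            return prediction
    return default

def regularized_risk(rules, default, data, lam=0.01):
    """Misclassification rate plus lam per rule: a rule is only worth adding
    if it buys at least `lam` of training accuracy."""
    errors = sum(rule_list_predict(x, rules, default) != y for x, y in data)
    return errors / len(data) + lam * len(rules)

# Hypothetical two-rule list over categorical features, with default prediction 0.
rules = [({"priors": "2-3", "age": "18-22"}, 1), ({"priors": ">3"}, 1)]
data = [({"priors": "0", "age": "23-30"}, 0), ({"priors": ">3", "age": "23-30"}, 1)]
print(regularized_risk(rules, default=0, data=data))   # 0.0 + 0.01 * 2
```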
Experiments involve several publicly available datasets for high-stakes tasks, including recidivism prediction and weapon possession in stop-and-frisk searches. Across tasks, the data points include 3-7 categorical attributes and up to 28 binary features.
The learned rule lists are 4 or 5 rules long, meaning it is very easy to read the entire rule list and see how it will handle every data point. The authors observe that on the recidivism data, their approach achieves equal accuracy to a proprietary, blackbox "prediction tool" (COMPAS) used for recidivism prediction in some places in the US legal system.
Lastly, the authors note that the search algorithm may struggle with very high dimensional data where many possibly relevant features are highly correlated. They also suggest that, if desired, predicted probabilities can be obtained by taking the empirical probability of the predicted outcome for a given rule, and they remind the reader that their approach is not to be used naively for causal inference.
JMLR
2020
The authors propose to use a generalized additive model (GAM) for a learning-to-rank task and make a few arguments that this kind of model is more interpretable than blackbox models for the task. A GAM takes the form f(x_i) = ∑_d f_d(x_{id}), where a feature-specific function is applied to each feature of a data point separately. The model trades the ability to capture feature interactions for ease of understanding a particular feature's contribution to the final output: since each f_d is univariate, it can simply be plotted across the range of its input feature. In a learning-to-rank task, a given model input consists of a set of data points {x_i}_{i=1}^{n} along with some general context information for the set, q_i. The authors extend the standard GAM to operate in the ranking setting by weighting the contribution of each feature with a weight obtained by another GAM on q_i, as such: f(x_i) = ∑_d g_d(q_i) f_d(x_{id}), where g_d is the dth element of a vector-valued GAM g. Each f_d is a separate neural network, as is g.
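A minimal sketch of such a context-weighted neural GAM; the sub-network sizes and the Softplus constraint keeping the context weights nonnegative are assumptions for illustration, not necessarily the paper's exact architecture:

```python
import torch
import torch.nn as nn

class NeuralRankingGAM(nn.Module):
    """Each item feature gets its own small network f_d, a context network g
    produces one weight per feature, and an item's score is
    sum_d g_d(q) * f_d(x_d)."""

    def __init__(self, n_item_features, n_context_features, hidden=16):
        super().__init__()
        self.feature_nets = nn.ModuleList(
            nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(n_item_features))
        self.context_net = nn.Sequential(
            nn.Linear(n_context_features, hidden), nn.ReLU(),
            nn.Linear(hidden, n_item_features), nn.Softplus())

    def forward(self, items, context):
        # items: (n_items, n_item_features); context: (n_context_features,)
        contributions = torch.cat(
            [net(items[:, d:d + 1]) for d, net in enumerate(self.feature_nets)], dim=1)
        weights = self.context_net(context)           # (n_item_features,)
        return (weights * contributions).sum(dim=1)   # one score per item
```

Because each f_d depends on a single feature, its learned shape can be plotted directly, which is the interpretability argument made above.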
They compare their model with an existing tree-based GAM, and they compare across choice of loss function, between MSE and the ranking loss NDCG. They run experiments with three tabular datasets. In terms of NDCG performance, their neural GAM with a ranking loss performs the best, by between 0.3 and 4 points.
To evaluate interpretability, they consider three approaches. First, they check that their individual feature functions f_d represent feature importance in the same sense as a standard feature-shuffling approach (measured by the change in NDCG from shuffling a feature column in the data). Specifically, they compute the difference between the 5th and 95th percentile f_d values as the "range" of the feature function, then check the correlation between these ranges and the feature importance values obtained by shuffling. They find that using the ranking loss is important to ensure that there is any correlation between the range and feature importance at all (but they do not compare tree and neural GAMs). Second, they plot individual feature functions for the tree GAM and neural GAM. With both models, you can see how a feature contributes to the overall model output. Third, they plot a heatmap of g_d values across a one-hot representation of a categorical feature, to see how, in this case, a region/country context feature yields weights for several x features; they observe some interesting structure in which some regions yield similar weights for certain features.
In two final experiments, they show how distilling the submodels into piece-wise linear models after training results in a 20x speed-up to inference at a small (1 point) cost to the objective. And they find that building a blackbox neural net on top of GAM features yields better NDCG by about 1 point over simply training a neural net from scratch. Their neural GAM performs about 2 points worse than the neural net trained from scratch.
arxiv
2020
This paper proposes to use neural module networks (NMNs) for purposes of increasing model interpretability. An NMN works by first parsing an input into a program, which is a series of functions to compute, then executing this program. For example, when a model must decide if a caption matches an image, an NMN might parse the caption into the program "1. find dogs, 2. filter for black dogs, 3. check if the numbers of entities from (1) and (2) are equal." These steps are executed by neural modules, and existing architectures and training procedures allow NMNs to be trained end-to-end with supervision only at their final output.
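As a schematic, non-neural illustration of what such a program computes over detected image regions (in an actual NMN each step is a differentiable module operating on attention maps, and the program is predicted from the caption):

```python
# Hypothetical program for "the number of black dogs equals the number of dogs".
def find(regions, concept):       # attend to regions matching a concept
    return [r for r in regions if concept in r["labels"]]

def filter_by(regions, concept):  # restrict an existing attention to a sub-concept
    return [r for r in regions if concept in r["labels"]]

def count(regions):
    return len(regions)

regions = [{"labels": {"dog", "black"}}, {"labels": {"dog", "brown"}}]
dogs = find(regions, "dog")
black_dogs = filter_by(dogs, "black")
print(count(dogs) == count(black_dogs))   # False: the caption doesn't match
```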
NMNs could improve model interpretability over standard blackbox models by virtue of their modules executing human-interpretable functions, like finding, filtering, counting, comparing, etc. But it's a known result that training NMNs end-to-end leads to module "responsibilities" being diffusely distributed across several modules, with some modules performing unintended functions. The authors confirm this finding for a visual+textual reasoning task (similar to the caption-image matching described above) and a purely textual reasoning task.
Principally, the authors study how to improve NMN "faithfulness," which they define as the property that a model's modules perform the roles they are intended for. They do so by providing programs to models (either gold programs or heuristically obtained programs) and collecting labels from people for what the intermediate module outputs should be. To measure faithfulness, they measure the discrepancy between the module outputs and these ground-truth outputs, for the provided programs.
They identify a few ways to improve faithfulness, usually at the expense of model accuracy. On visual+textual reasoning: by specifying modules exactly (i.e. simply summing the module inputs for a 'sum' module instead of fitting free parameters), they improve faithfulness slightly at the cost of 3 points of accuracy. Using inflexible, few-parameter modules further improves faithfulness over manual specification, at a smaller cost to accuracy. On textual reasoning, they find ways to decrease faithfulness. By training without two helpful kinds of modules (sorting and comparison modules), and thereby offloading these responsibilities onto other modules, they lower model faithfulness. And they observe that adding blackbox modules which can perform multiple "steps" at once can decrease overall model faithfulness.
For both reasoning tasks, they find that training on gold labels for module outputs greatly improves faithfulness. Interestingly, in one setting, this is actually pretraining on another dataset that has labels for module outputs rather than the dataset they evaluate with. The authors are hopeful that with careful architecture design and additional supervision, NMNs can provide more interpretability than standard models with a low cost to performance.
ACL
Robust and Adversarial Explanations (6)
2020
This paper shows that explanations can be generated for a blackbox bail recommendation system that lead law school students to be willing to deploy the model in practice, even when that model actually relies on race and gender as features and the students previously stated these features were unacceptable to use. Bail recommendation is essentially a prediction problem — by law, a defendant is released on bail only if a judge believes they will not flee or commit another crime — and there is historical data on outcomes for defendants who were released on bail. In this context, an "explanation" is a (simple) model of a blackbox model that predicts whether a defendant will flee or commit another crime. They use two-level decision sets obtained from MUSE as the explanation models, because they are highly intuitively interpretable. These are sets of if-then rules that can be used by a person to classify instances (see the paper for 7 "interpretability" factors MUSE incorporates). The students unanimously identified race and gender as unacceptable factors to use in a bail recommendation system, and mostly agreed that prior incarcerations and failures to appear before court were two features that should be relied on. So, the authors designed a blackbox model that relied on the prohibited features and made no use of the desired features — in fact, the model was secretly a two-level decision set that the authors fit, which achieved 83.28% accuracy on test data from a dataset of 32K examples in total. The authors then fit 3 explanation models, relying on (1) neither prohibited nor desired features, (2) both prohibited and desired features, (3) desired features but not prohibited features. The explanation models all approximate the blackbox model well, obtaining test accuracies upwards of 97%.
The user study design is to show participants one of the explanations (i.e., decision sets) for the blackbox model (not shown), then give them a binary question: "Based on this explanation, would you trust the underlying model enough to deploy it?" Users were explicitly instructed beforehand that the explanations were correlational and not causal. To check that the explanation model's accuracy at approximating the blackbox model (fidelity) will not be the source of variation in the question outcome, they confirm that each explanation obtains similar accuracy, ranging from 97.3% to 98.9%. Some users are shown the actual "blackbox" model (which is a known decision set) as the "explanation," which uses the prohibited but not the desired features (hence rounding out the 2x2 condition design for using prohibited and/or desired features).
The rates at which users would deploy the model (which relies on race/gender but not the desired features) varied with which features appeared in the explanation model they were shown.
Thus, they conclude that explanations with very high fidelity can be constructed that lead users to be willing to deploy models that actually rely on features they do not want to be relied on, and exclude features they think should be included.
AIES
2020
Under the view of an explanation as a simple model approximating a complex model, this paper learns explanations that are robust to distribution shifts in the input space. In this view, distribution shift is particularly important to consider when explaining blackbox models because many of the model problems we wish to diagnose with explanations are related to distribution shift, like identifying model reliance on spurious features. They note that a consequence of optimizing for stability under distribution shifts is that they gain some robustness against adversarial inputs, and the approach can even help with explanation identifiability (i.e. identifiability of the simple model).
They formally introduce the set of distribution shifts they consider, which include shifts over a subset of the features, and they present an approximate objective to optimize for: the loss of the model under the worst case distribution shift in their possible shifts — this is the connection between robustness against distribution shifts and adversarial inputs. The approach is given for two kinds of explanation models, linear models and decision sets.
Experiments cover blackbox models of several tabular datasets, and they report the explanation fidelity (accuracy at predicting blackbox model outputs), correctness (similarity to the blackbox model, when the blackbox model is actually in the same model family as the explanation), and stability (similarity between a blackbox and explanation when the explanation is trained only on data from a shifted distribution). Under each of these metrics, they find that their approach far outperforms other explanation techniques, including simple models with standard training procedures as well as LIME and SHAP. Whereas the accuracy of a "global" explanation constructed from LIME (i.e. a set of LIME models covering the input space) drops by 14-19 percentage points under distribution shift, their analogous set of linear models drops only by 0-3.3 points. Likewise, using a distance metric on models in the same family, they find that their approach produces explanation models that are more similar to "blackbox" models (in the same family as the explanations) when they train explanations on either perturbed data points or data points from a shifted data distribution.
ICML
2020
The authors propose a simple search procedure for revealing inconsistent explanations generated across data points by a model that produces natural language explanations along with its label predictions for a task. For example, they find pairs of inputs that lead the model to produce the explanation "Snowboarding is done outside" for one input and "Snowboarding is not done outside" for the other input. This is a problem if you consider explanations to be beliefs actually held by a model and would like for model beliefs to be consistent across data point predictions, absent updates to the model. The method itself is the following procedure: (1) For a data point (x,y), get the model explanation e = explain(x,y). (2) Using templates/schema, generate a set of conflicting explanations that are inconsistent with e. (3) Use a "reverse" model of p(x|y,e) to generate proposal inputs. (4) Pass the proposal inputs through the model, and check if any of the resulting explanations are also members of the set of conflicting explanations (conflicting with the original explanation for the real data point). If one of these explanations is in the set, we've found a proposed input that leads the model to generate an explanation that is inconsistent with the explanation for the original input. The authors conduct experiments with the e-SNLI dataset (see "e-SNLI" paper in Natural Language Explanations), which includes human explanations for the SNLI dataset. In SNLI, a data point consists of a premise and a hypothesis pair, which must be classified into {neutral, entailment, contradiction} according to the relationship between them. Here, their "reverse" model generates proposal hypotheses only, since the premises are supposed to be taken at face value and different premises could naturally yield inconsistent explanations. Altogether, they use their search procedure with a model trained on e-SNLI and identify about 450 inconsistent explanations, starting with 9824 test points. They suggest that given the simplicity of their method and starkness of the inconsistencies identified, a success rate of 4.5% is far too high.
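A sketch of that search loop, with `model`, `reverse_model`, and `negate` as placeholder callables standing in for the trained predict-and-explain model, the learned p(x|y,e) proposal model, and the template-based negation step:

```python
def find_inconsistent_explanations(model, reverse_model, negate, dataset):
    """For each (premise, hypothesis) pair: get the model's explanation, build
    the set of template-negated conflicting explanations, propose new
    hypotheses from those conflicts via the reverse model, and flag any
    proposal whose resulting explanation lands in the conflict set."""
    found = []
    for premise, hypothesis in dataset:
        _, explanation = model(premise, hypothesis)
        conflicts = set(negate(explanation))  # e.g. "X is done outside" -> "X is not done outside"
        for conflicting in conflicts:
            for new_hypothesis in reverse_model(premise, conflicting):
                _, new_explanation = model(premise, new_hypothesis)
                if new_explanation in conflicts:
                    found.append((premise, hypothesis, new_hypothesis,
                                  explanation, new_explanation))
    return found
```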
Note that the model the authors evaluate is actually of the form y = f(e), e = g(x). This is a workable model only because in SNLI, the form of the explanation (as opposed to its semantics) gives the label away around 97% of the time. For instance, "P implies Q" is almost always an explanation for the "entailment" label. So wherever y appears to be conditioned on above, technically the model does not directly condition on y.
ACL Short Paper
Unit Testing (1)
Explaining RL Agents (8)
2020
This paper presents a series of formal definitions of what an explanation is in the context of structural causal models of an RL agent, then proposes a procedure for generating explanations of agent behavior. The authors' goal is to develop a procedure for explaining agents' actions themselves, rather than give explanations of why a state counts as evidence favoring some action. The definitions require some technical context, but roughly speaking: A structural causal model of an agent is a graph representing causal relationships between state, action, and reward nodes, with equations specifying each relationship in the graph. They define an action influence model as a causal graph plus a set of structural equations, with structural equations for each unique variable value and action pair (meaning multiple equations per variable value). Next, they say that (1) a complete explanation is the complete causal chain from an action to any future reward it leads to, (2) a minimally complete explanation is the set of parent nodes to an action, parent nodes to resulting rewards, and the rewards (so complete minus the nodes that aren't parents to rewards), (3) a counterfactual instantiation for a counterfactual action B is the condition under which the model would select action B and the condition resulting from this selection given the SCM, and, lastly, (4) a minimally complete contrastive explanation is an explanation which "extracts the actual causal chain for the taken action A, and the counterfactual causal chain for the B, and finds the differences."
They give an example minimally complete contrastive explanation for why a Starcraft-playing agent chooses to not build barracks (from a formal explanation plugged into a natural language template): "Because it is more desirable to do action Build Supply Depot to have more Supply Depots as the goal is to have more Destroyed Units and Destroyed buildings."
How do they generate these explanations? They learn models of the structural equations in their action influence model, conditioned on user-specified causal graphs, by fitting models to observed gameplay by an agent. With learned structural models, they give an algorithm for predicting the action an agent will take in a given state. From here, they can get explanations in the above forms. They validate the learned structural models by checking that they can predict what agents will do. Prediction accuracies range from 68.2 to 94.7 across six games, including Starcraft and OpenAI Gym environments.
Explanations are evaluated with a human subject experiment. They test two hypotheses: that receiving explanations will improve users' mental models of the agents, as measured by their ability to predict what the agent will do in a given state, and that explanations will improve trust, as measured by subjective reports on a Likert-scale. There are four conditions: (1) explanations come from their full explanation system, (2) they come from their system with more granular "atomic" actions, (3) explanations are based only on relevant variables, from prior work, given in the form "Action A is likely to increase relevant variable P" and (4) no explanations. They conduct experiments on Mechanical Turk with 120 users: after a training phase where participants learn what Starcraft-playing agents are doing, they enter a learning phase where they see 5 videos and after each are allowed to ask as many questions about the agent behavior as they'd like (in the form why/why-not action X). Next, they predict what the agent will do in 8 given situations. Lastly, users complete the trust battery, rating explanations based on whether they are complete, sufficient, satisfying, and understandable.
They find that given their explanation system, users are better able to predict agent behavior than in the "no explanation" or "relevant variables explanation" conditions. The improvement over the relevant variables condition is equivalent to getting one more action prediction correct out of 16 data points. Their results for the effect on trust are not statistically significant in all cases, but across the measured dimensions of trust their system improves ratings by between 0.3 and 1.0 points on their 5 point Likert scale.
AAAI
2018
From the paper's conclusion:
"We proposed a method for a reinforcement learning (RL) agent to generate explanations for its actions and strategies. The explanations are based on the expected consequences of its policy. These consequences were obtained through simulation according to a (learned) state transition model. Since state features and numerical rewards do not lend themselves easily for an explanation that is informative to humans, we developed a framework that translates states and actions into user-interpretable concepts and outcomes.
We also proposed a method for converting the foil, –or policy of interest to the user–, of a contrastive ’why’-question about actions into a policy. This policy follows locally the user’s query but gradually transgresses back towards the original learned policy. This policy favors the actions that are of interest to the user such that the agent tries to perform them as best as possible. How much these actions are favored compared to the originally learned action can be set with a single parameter.
Through running simulations for a given number steps of both the policy derived from the user’s question and the actually learned policy, we were able to obtain expected consequences of each. From here, we were able to construct contrastive explanations: explanations addressing the consequences of the learned policy and what would be different if the derived policy would have been followed.
An online survey pilot study was conducted to explore which of several explanations are most preferred by human users. Results indicate that users prefer explanations about policies rather than about single actions."
IJCAI XAI workshop
2019
With RL agents trained on Atari games, the authors aim to produce counterfactual states for a given state that an agent is in, which are defined as the closest states that result in a different action under the policy. This is done by learning a generative model of states conditioned on latent state representations and the policy network's distribution over actions. Then, a gradient-based search for a representation is performed to yield a different action under the policy, and a counterfactual state is generated from this representation. The authors argue that the policy model's latent space is too high dimensional for generation out of this space to produce coherent images. Hence, they learn a Wasserstein autoencoder on the policy model's latent space, and perform the search in this lower-dimensional space. Another training trick means that the state representations actually used for generation don't encode any information about a preferred action, unlike those in the policy network, so that the generator will meaningfully rely on the action distribution it is given. The overall generation procedure is as follows: Given a state and an agent, they pass the state through the policy network and then through the autoencoder to get a low-dimensional representation, then perform a gradient-based search in that space for the closest representation by L2 distance that yields a user-specified counterfactual action when decoded back into the policy model's latent space and transformed into a distribution over actions. A counterfactual state is generated conditioned on this new counterfactual distribution over actions and a representation of the original state.
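A sketch of the gradient-based search in the autoencoder's latent space, with `encoder`, `decoder`, and `action_head` as placeholder modules (mapping the policy's latent state into the low-dimensional space, back out of it, and then to an action distribution) and an arbitrary weight on the distance penalty:

```python
import torch

def counterfactual_latent(policy_latent, encoder, decoder, action_head,
                          target_action, steps=200, lr=0.05, dist_weight=0.1):
    """Starting from the encoding of the current state's policy latent, take
    gradient steps on a nearby code until decoding it and re-running the
    action head favors the user-specified counterfactual action, while an L2
    penalty keeps the code close to the original."""
    z0 = encoder(policy_latent).detach()
    z = z0.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = action_head(decoder(z))   # assumed shape: (n_actions,)
        action_loss = torch.nn.functional.cross_entropy(
            logits.unsqueeze(0), torch.tensor([target_action]))
        (action_loss + dist_weight * (z - z0).pow(2).sum()).backward()
        opt.step()
    return z.detach()   # passed to the state generator along with the original state
```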
The generations are evaluated by humans for two properties: realism and induced subjective understanding of the observed agent. After 30 human subjects (students and local community members) play Space Invaders for 5 minutes, they are asked to rate the realism of 30 images randomly chosen from a set including real gameplay images, counterfactual generations, and images from a heavily ablated version of their model without the autoencoder. On a scale of 1 to 6, real states received a 4.97 on average, counterfactual states a 4.0, and the ablated model's generations a 1.93. For the subjective user understanding test, participants were first shown a replay of an agent playing the game, then shown 10 pairs of states and counterfactual states (and associated actions for each), with counterfactual states selected to have large deviations from the original state. Users were asked to rate their "understanding of the agent" on a 1-6 scale before and after seeing these states. They found that 15 users' reported understandings improved, 8 declined, and 7 were constant (with a one-sided Wilcoxon signed-rank test for improvement: p=0.098).
IJCAI XAI Workshop
2019
The paper proposes a method for generating states with certain properties under a policy that are intended to be helpful for analyzing the policy. In particular, they identify states with large Q-values for certain actions (like hard braking by a simulated self-driving car), states with a large difference between best and worst Q-values (clear-cut situations), or states with low Q-values across all actions (hopeless situations). They note that the immediate approach to doing this, for continuous states like those in the Atari games they experiment with, is activation maximization of a Q-value (or a function of Q-values) with respect to the input image, but they find that in practice this produces meaningless images outside of the natural state distribution, even when a variety of tricks are used. In response, they encode states in a low-dimensional space with a VAE and perform the activation maximization by gradient ascent in this embedding space. Interestingly, they search for the parameters of a distribution over embeddings, (μ, σ), rather than just a single embedding; later, they find that the results of the search allow them to generate samples using the VAE decoder. The VAE objective has a reconstruction loss (to generate realistic images) and a penalty on the reconstruction resulting in a different action from the original training image. They find that it is necessary to focus the reconstruction error on regions "important" to the agent, which means they weight the L2 reconstruction loss by a measure of pixel saliency obtained by applying a gradient-based saliency method to the policy at a given state. The generator is trained with trajectories from a fixed agent.
They provide a great deal of qualitative analysis using their generated states. A few highlights: In Seaquest, where the player must resurface from below water when an oxygen tank is low, they suggest that an agent does not understand that it must resurface when low on oxygen, after optimizing states for the Q-value of resurfacing. They note that while "it would be possible to identify this flawed behavior by analyzing the 10,000 frames of training data for our generator, it is significantly easier to review a handful of samples from our method." The generator can also yield examples not seen during training. With agents trained as simulated self-driving cars in an environment built by the authors, they find evidence that a policy lacks the ability to avoid pedestrians: with a policy trained using "reasonable pedestrians" that never crossed while there was traffic, they observe that among states maximizing the Q-value of braking, states with pedestrians in the road are conspicuously absent. This policy shortcoming is then verified in a test environment where pedestrians cross while there is oncoming traffic, and they find that the agent will run over pedestrians.
arxiv
2020
From Alignment Newsletter #101:
This paper presents an analysis of the use of saliency maps in deep vision-based reinforcement learning on ATARI. They consider several types of saliency methods, all of which produce heatmaps on the input image. They show that all (46 claims across 11 papers) uses of saliency maps in deep RL literature interpret them as representing the agent's "focus", 87% use the saliency map to generate a claim about the agent's behaviour or reasoning, but only 7% validate their claims with additional or more direct evidence.
They go on to present a framework to turn subjective and under-defined claims about agent behaviour generated with saliency maps into falsifiable claims. This framework effectively makes the claim more specific and targeted at specific semantic concepts in the game's state space. Using a fully parameterized version of the ATARI environment, they can alter the game's state in ways which preserve meaning (i.e. the new state is still a valid game state). This allows them to perform interventions in a rigorous way, and falsify the claims made in their framework.
Using their framework, they perform 3 experimental case studies on popular claims about agent behaviour backed up by saliency maps, and show that all of them are false (or at least stated more generally than they should be). For example, in the game Breakout, agents tend to build tunnels through the bricks to get a high score. Saliency maps show that the agent attends to these tunnels in natural games. However, shifting the position of the tunnel and/or the agent's paddle and/or the ball all remove the saliency on the tunnel's location. Even flipping the whole screen vertically (which still results in a valid game state) removes the saliency on the tunnel's location. This shows that the agent doesn’t understand the concept of tunnels generally or robustly, which is often what is claimed.
ICLR
2020
From Alignment Newsletter #128:
This work presents an interface for interpreting the vision of a reinforcement learning agent trained with PPO on the CoinRun game. This game is procedurally generated, which means the levels are different in every episode of playing. The interface primarily uses attribution from a hidden layer to the output of the value function. This interface is used in several ways.
First, they use the interface to dissect failed trajectories of the policy (it fails in 1 out of 200 levels). They're able to understand why the failures occurred using their interface: for example, in one case the view of the agent at the top of its jump means it can't see any platforms below it, so doesn't move to the right fast enough to reach the platform it was jumping for, leading it to miss the platform and fail the level. Second, they use the interface to discover "hallucinations", where the value function mistakes one element of the environment for another, causing its value to drop or rise significantly. Often these hallucinations only last a single time-step, so they don't affect performance.
Finally, they use the attributions specifically to hand-edit the weights of the model to make it "blind" to buzzsaws (one of the hazards) by zeroing the feature which recognises them. After doing this, they show that the edited agent fails a lot more from buzzsaw failures but no more from other types of failures, which gives a quantitative justification for their interpretation of the feature as buzzsaw-recognising.
From using this interface, they propose the diversity hypothesis: Interpretable features tend to arise (at a given level of abstraction) if and only if the training distribution is diverse enough (at that level of abstraction). This is based on the fact that interpretable features arise more when the agent is trained on a wider variety of levels. There also seems to be a qualitative link to generalisation - a wider distribution of training levels leads to better interpretability (measured qualitatively) and better generalisation (measured quantitatively).
Distill
2021
From Alignment Newsletter #141:
A common challenge when understanding the world is that it is very hard to infer causal structure from only observational data. Luckily, we aren’t limited to observational data in the case of AI systems: we can intervene on either the environment the agent is acting in, or the agent itself, and see what happens. In this paper, the authors present an “agent debugger” that helps with this, which has all the features you’d normally expect in a debugger: you can set breakpoints, step forward or backward in the execution trace, and set or monitor variables.
Let’s consider an example where an agent is trained to go to a high reward apple. However, during training the location of the apple is correlated with the floor type (grass or sand). Suppose we now get an agent that does well in the training environment. How can we tell if the agent looks for the apple and goes there, rather than looking at the floor type and going to the location where the apple was during training?
We can’t distinguish between these possibilities with just observational data. However, with the agent debugger, we can simulate what the agent would do in the case where the floor type and apple location are different from how they were in training, which can then answer our question.
We can go further: using the data collected from simulations using the agent debugger, we can also build a causal model that explains how the agent makes decisions. We do have to identify the features of interest (i.e. the nodes in the causal graph), but the probability tables can be computed automatically from the data from the agent debugger. The resulting causal model can then be thought of as an “explanation” for the behavior of the agent.
arxiv
Interpretability in Practice (2)
2020
The authors introduce a browser-based GUI for exploring NLP model behavior, intended to enable researchers to answer questions like: (1) why did a model make this prediction? (2) on what data points does the model perform poorly? and (3) what happens to behavior under controlled changes to the model input? The interface supports a variety of models, including classification, sequence-to-sequence, and structured prediction models. Features include mechanisms for (1) exploring your dataset, (2) finding interesting data points and outliers, (3) explaining local model behavior through LIME and salience maps, (4) generating new data points by backtranslation, word substitutions, and adversarial attacks, (5) side-by-side comparison of two models, and (6) computing metrics on selections of datapoints or automatically-selected slices of the data. In case studies with sentiment analysis classifiers, coreference models, and text generation, they identify several model pathologies and possible causes of the behavior. For instance, with an errant text generation from T5, they do nearest neighbor lookups based on decoder embeddings and find that a number of similar points used a certain phrase structure, which may have biased the model toward (incorrectly) repeating that phrase structure in a new setting.
A few design principles guided the system development, including flexibility, extensibility, modularity, agnosticism to deep learning framework, and ease of use. The end product is one which the authors hope will be easy for researchers to interact with and build a better understanding of their models. In comparison to tools like AllenNLP Interpret, the authors note that their preference for a framework-agnostic GUI means that they are more easily able to provide analysis through methods that handle arbitrary functions (like LIME) rather than methods that require full access to model internals, like Integrated Gradients.
EMNLP
Additional Papers
We provide some additional papers here that we did not summarize above, including very recent papers, highly focused papers, and others. These are organized by the same topic areas as above.
Theory and Opinion (12)
Evaluation (10)
Methods: Estimating Feature Importance (16)
Methods: Interpreting Representations and Weights (7)
Methods: Generating Counterfactuals and Recourse Procedures (8)
Methods: Explanation by Examples, Exemplars, and Prototypes (4)
Methods: Finding Influential Training Data (4)
Methods: Natural Language Explanations (9)
Methods: Developing More Easily Interpreted Models (3)
Methods: Robust and Adversarial Explanations (5)
Explaining RL Agents (2)
Datasets and Data Collection (1)
Interpretability in Practice (2)
Conclusion
We hope this post can serve as a useful resource and help start important conversations about model interpretability and AI Safety. As mentioned, please let us know if you noticed any mistakes or think we missed anything that could improve the post.