Paper session 3: Explanation

Different “Intelligibility” for Different Folks

ABSTRACT. Many arguments have concluded that our autonomous technologies must be intelligible, interpretable, or explainable, even if that property comes at a performance cost. In this paper, we consider the reasons why some property like these might be valuable, we conclude that there is not simply one kind of ‘intelligibility’, but rather different types for different individuals and uses. In particular, different interests and goals require different types of intelligibility (or explanations, or other related notion). We thus provide a typography of ‘intelligibility’ that distinguishes various notions, and draw methodological conclusions about how autonomous technologies should be designed and deployed in different ways, depending on whose intelligibility is required.

Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods

ABSTRACT. As machine learning black boxes are increasingly being deployed in domains such as healthcare and criminal justice, there is growing emphasis on building tools and techniques for explaining these black boxes in an interpretable manner. Such explanations are being leveraged by domain experts to diagnose systematic errors and underlying biases of black boxes. In this paper, we demonstrate that post hoc explanations techniques that rely on input perturbations, such as LIME and SHAP, are not reliable. Specifically, we propose a novel scaffolding technique that effectively hides the biases of a given classifier by allowing an adversarial entity to craft an arbitrary desired explanation. Our approach can be used to scaffold any biased classifier in such a way that its predictions on the input data distribution still remain biased, but the post hoc explanations of the scaffolded classifier look innocuous. Using extensive evaluation with multiple real-world datasets (including COMPAS), we demonstrate how extremely biased (racist) classifiers crafted by our framework can easily fool popular explanation techniques such as LIME and SHAP into generating innocuous explanations which do not reflect the underlying biases.

Human Comprehension of Fairness in Machine Learning

ABSTRACT. Bias in machine learning has manifested injustice in several areas, such as medicine, hiring, and criminal justice. In response, computer scientists have developed myriad definitions of fairness to correct this bias in fielded algorithms. While some definitions are based on established legal and ethical norms, others are largely mathematical. It is unclear whether the general public agrees with these fairness definitions, and perhaps more importantly, whether they understand these definitions. We take initial steps toward bridging this gap between ML researchers and the public, by addressing the question: does a non-technical audience understand a basic definition of ML fairness? We develop a metric to measure comprehension of one such definition–demographic parity. We validate this metric using online surveys, and study the relationship between comprehension and sentiment, demographics, and the application at hand.

Good Explanation for Algorithmic Transparency

ABSTRACT. Machine learning algorithms have gained widespread usage across a variety of domains, both in providing predictions to expert users and recommending decisions to everyday users. However, these AI systems are often black boxes, and end-users are rarely provided with an explanation. The critical need for explanation by AI systems has led to calls for algorithmic transparency, including the “right to explanation” in the EU General Data Protection Regulation (GDPR). These initiatives presuppose that we know what constitutes a meaningful or good explanation, but there has actually been surprisingly little research on this question in the context of AI systems. In this paper, we (1) develop a generalizable framework grounded in philosophy, psychology, and interpretable machine learning to investigate and define characteristics of good explanation, and (2) conduct a large-scale lab experiment to measure the impact of different factors on people’s perceptions of understanding, usage intention, and trust of AI systems. The framework and study together provide a concrete guide for managers on how to present algorithmic prediction rationales to end-users to foster trust and adoption, and elements of explanation and transparency to be considered by AI researchers and engineers in designing, developing, and deploying transparent or explainable algorithms.

“How do I fool you?”: Manipulating User Trust via Misleading Black Box Explanations

ABSTRACT. As machine learning black boxes are increasingly being deployed in critical domains such as healthcare and criminal justice, there has been a growing emphasis on developing techniques for explaining these black boxes in a human interpretable manner. It has recently become apparent that a high-fidelity explanation of a black box ML model may not accurately reflect the biases in the black box. As a consequence, explanations have the potential to mislead human users into trusting a problematic black box. In this work, we rigorously explore the notion of misleading explanations and how they influence user trust in black-box models. More specifically, we propose a novel theoretical framework for understanding and generating misleading explanations, and carry out a user study with domain experts to demonstrate how these explanations can be used to mislead users. Our work is the first to empirically establish how user trust in black box models can be manipulated via misleading explanations.