
An Overview of Visualization Techniques for Explainable Machine Learning

Katherine Goode - ISU Graphics Group - April 10, 2020

1 / 56

Machine Learning

Machine learning models may provide magical predictions,...

2 / 56

Black Box Models

...but for many machine learning models, explaining how the predictions are produced is not an easy task.

3 / 56

The Importance of Explainability

4 / 56

Literature on Explainability

General trends I've noticed:

  • Many recent papers

  • Often machine learning and computer science perspectives

  • Lots of European authors

    • General Data Protection Regulation (GDPR) implemented in 2018
    • Goodman and Flaxman (2016): "It is reasonable to suppose that any adequate explanation would, at a minimum, provide an account of how input features relate to predictions, allowing one to answer questions such as: Is the model more or less likely to recommend a loan if the applicant is a minority?"

Key resources for this talk:

5 / 56

The Plan...

Setting the Stage

  • Definitions and Philosophical Aspects

Methods

  • Model Agnostic

  • Random Forest Specific

  • Neural Network Specific

Concluding Thoughts

  • Additional Methods and Resources

  • A Cautionary Conclusion

6 / 56

Definitions and Philosophical Aspects

7 / 56

Explainability versus Interpretability

There are no agreed-upon definitions...

Interpretable Machine Learning (Molnar 2020)

  • "I will use both the terms interpretable and explainable interchangeably"
  • "I will use “explanation” for explanations of individual predictions."

Methods for Interpreting and Understanding Deep Neural Networks (Montavon, Samek, and Muller 2017)

  • "post-hoc interpretability, i.e. a trained model is given and our goal is to understand what the model predicts (e.g. categories) in terms what is readily interpretable (e.g. the input variables)"
  • "Post-hoc interpretability should be contrasted to incorporating interpretability directly into the structure of the model..."
  • "...when using the word “understanding”, we refer to a functional understanding of the model, in contrast to a lower-level mechanistic or algorithmic understanding of it."
  • also distinguish between interpretation and explanation

The Mythos of Model Interpretability (Lipton 2017)

  • Paper dedicated to the philosophical discussion of what interpretability is in machine learning

Explaining Explanations: An Overview of Interpretability of Machine Learning (Gilpin et al. 2019)

  • "We take the stance that interpretability alone is insufficient. In order for humans to trust black-box methods, we need explainability – models that are able to summarize the reasons for neural network behavior, gain the trust of users, or produce insights about the causes of their decisions"
  • Implies that you need both interpretability and explainability?
8 / 56

Explainability versus Interpretability

My definitions (based on a conversation with Nick Street (University of Iowa))...

Interpretability = the ability to directly use the parameters of a model to understand the mechanism of how the model makes predictions

  • a linear model coefficient: indicates the expected change in the response variable for a one-unit change in the predictor variable (see the quick example below)


$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p$$
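
As a quick illustration (my own example, not from the talk's bike case study), the fitted coefficients of a linear model can be read off directly:

# Fit a linear model to the built-in mtcars data
lm_fit <- lm(mpg ~ wt + hp, data = mtcars)
# Each coefficient is the estimated change in mpg for a one-unit change
# in that predictor, holding the other predictor fixed
coef(lm_fit)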

Explainability = the ability to use the model in an indirect manner to understand the relationships in the data captured by the model

  • LIME: model agnostic method that uses a surrogate model
Figure from LIME paper (Ribeiro 2016)

9 / 56

Should we explain black-box models?

Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead by Cynthia Rudin:

  • Debunks the “accuracy-interpretability trade-off” myth

  • "Explanations must be wrong. They cannot have perfect fidelity with respect to the original model. If the explanation was completely faithful to what the original model computes, the explanation would equal the original model..."

  • "...it is possible that the explanation leaves out so much information that it makes no sense."

  • Rudin has worked on developing machine learning models with direct interpretability

10 / 56

Model Agnostic Methods

11 / 56

Overview of Model Agnostic Methods

Advantages

  • Can be applied to any model

  • Convenient if comparing various types of predictive models


Disadvantages

  • Must work with any model
From Interpretable Machine Learning (Molnar)

12 / 56

Washington D.C. Bike Rentals

Example data in Interpretable Machine Learning - can be accessed here

# Load the bike data (a data frame called bike)
load("data/bike.RData")

# Fit a random forest to predict the count of bike rentals (cnt)
library(dplyr)
bike_mod = randomForest::randomForest(x = bike %>% dplyr::select(-cnt), y = bike$cnt)
13 / 56

Model Agnostic Methods:

Prediction Visualizations

14 / 56

Partial Dependence Plots Friedman (2001)

Purpose: Visualize marginal relationship between one (or two) predictors and model predictions

Estimated partial dependence function:

$$\hat{f}_{x_{int}}(x_{int}) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}\left(x_{int}, \, x^{(i)}_{other}\right)$$

  • $\hat{f}$ = machine learning model fit using the predictor variables
  • $x_{int}$ = value of the predictor of interest
  • $x^{(i)}_{other}$ = vector of training data values of the other predictors in the model for observation $i$
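
To make the estimator concrete, a by-hand sketch (my own, assuming the bike data and bike_mod random forest introduced on the upcoming slides):

# Estimated partial dependence at temp = 20: set temp to 20 for
# every observation, predict, and average over the training data
library(dplyr)
mean(predict(bike_mod, newdata = mutate(bike, temp = 20)))
# Repeating this over a grid of temp values traces out the PDP curve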

15 / 56

Partial Dependence Plots in iml

# Load iml (and ggplot2 for labs)
library(iml)
library(ggplot2)

# Create a "predictor" object that
# holds the model and the data
bike_pred = Predictor$new(
  model = bike_mod,
  data = bike
)

# Compute the partial dependence
# function for humidity and temperature
pdp = FeatureEffect$new(
  predictor = bike_pred,
  feature = c("hum", "temp"),
  method = "pdp"
)

# Create the partial dependence plot
pdp$plot() +
  viridis::scale_fill_viridis(option = "D") +
  labs(x = "Humidity",
       y = "Temperature",
       fill = "Prediction")

Partial dependence plot with two variables

16 / 56

Interactive PDPs Krause, Perer, and Ng (2016)

17 / 56

Individual Conditional Expectation Goldstein et al. (2013)

Purpose: Similar to partial dependence plots, but consider each observation separately instead of taking an average.

Estimated individual conditional expectation function:

$$\hat{f}^{(i)}_{x_{int}}(x_{int}) = \hat{f}\left(x_{int}, \, x^{(i)}_{other}\right)$$

  • $\hat{f}$ = machine learning model fit using the predictor variables
  • $x_{int}$ = value of the predictor of interest
  • $x^{(i)}_{other}$ = vector of training data values of the other predictors in the model for observation $i$
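
In the same by-hand style as the PDP sketch (again assuming bike and bike_mod), the ICE values at a fixed temperature are just the individual predictions before averaging:

# ICE values at temp = 20: one prediction per observation (no averaging)
library(dplyr)
ice_at_20 <- predict(bike_mod, newdata = mutate(bike, temp = 20))
# Averaging the ICE values recovers the PDP value at temp = 20
mean(ice_at_20)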

18 / 56

ICE Plots in iml

# Compute the ICE function
ice = FeatureEffect$new(
  predictor = bike_pred,
  feature = "temp",
  method = "ice"
)

# Create the plot
plot(ice)

ICE plot for temperature

19 / 56

Centered ICE Plots

"Sometimes it can be hard to tell whether the ICE curves differ between individuals because they start at different predictions. A simple solution is to center the curves at a certain point in the feature and display only the difference in the prediction to this point." - Molnar

# Center the ICE function for temperature at the
# minimum temperature and include the PDP
ice_centered = FeatureEffect$new(
  predictor = bike_pred,
  feature = "temp",
  center.at = min(bike$temp),
  method = "pdp+ice"
)

# Create the plot
plot(ice_centered)

20 / 56

Addressing Correlation and Interactions

Accumulated Local Effects (ALE) Plots
Apley and Zhu (2016)

  • Similar to PDPs: consider how a feature influences predictions on average
  • In contrast to PDPs: faster to create and account for correlation

Feature Interaction Plots

21 / 56

ALE and Interaction Plots in iml

Code for the plots on the previous slide

Accumulated Local Effects (ALE) Plot

# Compute the ALEs
ale = FeatureEffect$new(
  predictor = bike_pred,
  feature = c("hum", "temp"),
  method = "ale",
  grid.size = 40
)

# Plot the ALEs
plot(ale) +
  scale_x_continuous("Relative Humidity") +
  scale_y_continuous("Temperature") +
  viridis::scale_fill_viridis(option = "D") +
  labs(fill = "ALE")

Feature Interaction Plot

# Compute the interaction metrics
int = Interaction$new(
  predictor = bike_pred,
  grid.size = 100,
  feature = "season"
)

# Plot the interaction metrics
plot(int) +
  scale_x_continuous("2-way interaction strength")
22 / 56

Parallel Coordinate Plots

Provide a nice overview of the predictions across all of the features at once

PCP plot with bike data (made with ggpcp)

23 / 56

Parallel Coordinate Plots in ggpcp

Code for the plot on the previous slide

# Determine the order of the features using the random forest
# importance values (bike_vi is computed on a later slide)
bike_ft_ordered = bike_vi %>%
  arrange(desc(IncNodePurity)) %>%
  pull(var)

# Create the PCP
bike %>%
  mutate(rf_pred = predict(bike_mod)) %>%
  ggplot(aes(color = rf_pred)) +
  ggpcp::geom_pcp(aes(vars = dplyr::vars(all_of(bike_ft_ordered))), alpha = 0.4) +
  viridis::scale_color_viridis(option = "D") +
  labs(x = "Features ordered by feature importance (left to right)",
       y = "Standardized Feature Value",
       color = "Random Forest Prediction") +
  theme(legend.position = "bottom") +
  guides(color = guide_colourbar(barwidth = 15))
24 / 56

Interactive Parallel Coordinate Plots Beckett (2018)

R package rfviz for interactive parallel coordinate plots with random forest models, but the approach could be extended to other machine learning models.

25 / 56

Model Agnostic Methods:

Feature Importance

26 / 56

Permutation Feature Importance

Background

Concept

  • Measure feature importance by seeing how much the prediction error is affected when a feature is permuted

    • important feature: one that affects the prediction error when changed

    • non-important feature: one that does not affect the prediction error when changed
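
A by-hand sketch of this idea for a single feature (my own, assuming the bike data and bike_mod random forest from earlier):

# Baseline prediction error (MAE) on the training data
baseline_mae <- mean(abs(bike$cnt - predict(bike_mod, newdata = bike)))
# Permute temp and recompute the prediction error
bike_perm <- bike
bike_perm$temp <- sample(bike_perm$temp)
permuted_mae <- mean(abs(bike$cnt - predict(bike_mod, newdata = bike_perm)))
# Importance of temp: how much the error grows when temp is scrambled
permuted_mae / baseline_mae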

27 / 56

Permutation Feature Importance in iml

Permutation feature importance of bike data random forest

# Create the predictor
# (seemingly FeatureImp requires y)
bike_pred = Predictor$new(
  model = bike_mod,
  data = bike,
  y = bike$cnt
)

# Compute the feature importance values
bike_imp = FeatureImp$new(
  predictor = bike_pred,
  loss = "mae"
)

# Plot the feature importance values
plot(bike_imp)

Point = median permutation importance
Bars = 5th and 95th permutation importance quantiles

28 / 56

Permutation FI with p-values Altmann et al. (2010)

  • Permutation based feature importance method that returns p-values

  • Example from the paper comparing Gini importance values to their permutation feature importance method with p-values
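
One implementation of the Altmann et al. approach is the importance_pvalues() function in the ranger package; a rough sketch (my own, assuming the bike data is loaded):

library(ranger)
# Fit a random forest that stores permutation importance values
bike_ranger <- ranger(cnt ~ ., data = bike, importance = "permutation")
# Compute importance p-values by repeatedly permuting the response
importance_pvalues(bike_ranger, method = "altmann",
                   formula = cnt ~ ., data = bike)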

29 / 56

More Feature Importance Casalicchio, Molnar, and Bischl (2019)

Three additional measures for feature importance:

Individual Conditional Importance (ICI)

  • "local" permutation feature importance metric
  • similar to ICE plots but "visualize the expected (conditional) feature importance instead of the expected (conditional) prediction"

Partial Importance (PI)

  • aggregate of ICI values

Shapley Feature Importance (SFIMP)

  • based on Shapley values

ICI and PI available in the featureImportance R package

30 / 56

Model Agnostic Methods:

Surrogate Models

31 / 56

Global Surrogate Models

Idea: Use an interpretable model to explain a black-box model

Procedure:

  1. Train a black-box model

  2. Obtain predictions from black-box model on a set of data (training data or other)

  3. Fit an interpretable model (linear regression model, tree, etc)

black-box predictions ~ predictor variables

Caution: How do we know if the global surrogate is a good enough approximation of the complex model?
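
A minimal sketch of this procedure (my own example, assuming the bike data and bike_mod random forest from earlier), with a quick fidelity check related to the caution above:

library(rpart)
# 2. Obtain the black-box predictions on the training data
surrogate_data <- bike[names(bike) != "cnt"]
surrogate_data$rf_pred <- predict(bike_mod, newdata = bike)
# 3. Fit an interpretable model (a regression tree) to those predictions
surrogate_tree <- rpart(rf_pred ~ ., data = surrogate_data)
# One fidelity check: R-squared of the surrogate against the black-box predictions
cor(predict(surrogate_tree), surrogate_data$rf_pred)^2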

32 / 56

Using a Tree as the Global Surrogate

Using a classification tree as the global surrogate for the random forest model fit to the sine data

33 / 56

Local Surrogate Model: LIME Ribeiro, Singh, and Guestrin (2016)

LIME = Local Interpretable Model-Agnostic Explanations

  • Consider one prediction of interest
  • Use a surrogate model to explain the black-box model in a "local" region about a point of interest

34 / 56

LIME in R

lime

  • written by Thomas Pedersen
  • package for implementing LIME

limeaid

  • written by me 😄
  • package for visually understanding and assessing LIME
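
A rough sketch of how lime might be applied to the bike data (my own example; it assumes a model type that lime supports out of the box, here a caret random forest, rather than the randomForest fit used earlier):

library(lime)
library(caret)
# Train a model type that lime supports directly
bike_caret <- train(cnt ~ ., data = bike, method = "rf")
# Build the explainer on the training features
explainer <- lime(bike[names(bike) != "cnt"], bike_caret)
# Fit a local surrogate for one observation using 5 features
explanation <- explain(bike[285, names(bike) != "cnt"], explainer, n_features = 5)
# Visualize the feature weights from the local surrogate
plot_features(explanation)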

35 / 56

Model Agnostic Methods:

Game Theory Based Method

36 / 56

Shapley Values Štrumbelj and Kononenko (2014)

Idea: Use game theory to determine contributions of predictor variables to one prediction of interest

Game Theory Connection: Shapley values are "a method for assigning payouts to players depending on their contribution to the total payout."

How the game theory terms map to machine learning:

  • collaborative game = the prediction task of the machine learning model for a single observation
  • players = the predictor variables
  • payout = the contribution of a predictor variable to the prediction
  • gain = the actual prediction minus the average prediction over all instances

37 / 56

Shapley Values in iml

Interpretation: "The value of the $j$-th feature contributed $\phi_j$ to the prediction of this particular instance compared to the average prediction for the dataset."

# Select the observation of interest and prepare the data
x_int = bike[names(bike) != "cnt"][285, ]

# Compute prediction values
avg_pred = mean(predict(bike_mod))
actual_pred = predict(bike_mod, newdata = x_int)
diff_pred = actual_pred - avg_pred

# Compute the Shapley values
predictor = Predictor$new(
  model = bike_mod,
  data = bike[names(bike) != "cnt"]
)
shapley = Shapley$new(
  predictor = predictor,
  x.interest = x_int
)

# Create the plot
plot(shapley) +
  scale_y_continuous("Feature value contribution") +
  ggtitle(sprintf("Actual prediction: %.0f\nAverage prediction: %.0f\nDifference: %.0f",
                  actual_pred, avg_pred, diff_pred))

Shapley values for one observation from the bike rental random forest

38 / 56

Random Forest Specific Techniques

39 / 56

Quick intro to random forests

Idea: Aggregation of many trees (bootstrap data and randomly select predictors for each tree)

40 / 56

Feature Importance Plot

Mean decrease in impurity (gini importance): measures the average improvement in node purity for a predictor variable

# Extract the importance values
bike_rfimp <- bike_mod$importance

# Put the feature importance values in a data frame
bike_vi <- data.frame(var = rownames(bike_rfimp), bike_rfimp) %>%
  arrange(IncNodePurity)

# Create a feature importance plot
bike_vi %>%
  mutate(var = factor(x = var, levels = bike_vi$var)) %>%
  ggplot(aes(x = var, y = IncNodePurity)) +
  geom_col() +
  coord_flip() +
  labs(x = "Feature")

Bike random forest feature importance plot

41 / 56

Visualizing Sets of Trees Simon Urbanek (2008)

Cut points from all trees for two predictor variables

42 / 56

Visualizing Sets of Trees Simon Urbanek (2008)

Trace plots of all trees in a random forest

43 / 56

ggRandomForests Ehrlinger (2015)

R package for visually exploring random forests fit using randomForestSRC or randomForest

library(ggRandomForests)

Out-of-bag errors versus number of trees

plot(gg_error(bike_mod)) + theme_gray()

Variable importance plot

plot(gg_vimp(bike_mod)) + theme_gray()

44 / 56

rfviz Beckett (2018)

Previously mentioned...R package for interacting with parallel coordinate plots for random forests

# Prepare data
rfprep <- rfviz::rf_prep(x = bike[names(bike) != "cnt"], y = bike$cnt)
# View plots
rfviz::rf_viz(rfprep, input = TRUE, imp = TRUE, cmd = TRUE, hl_color = 'black')

45 / 56

Forest Floor Visualizations Welling et al. (2016)

  • Method that creates plots similar to partial dependence plots

  • From the paper:

    "We suggest to first use feature contributions, a method to decompose trees by splitting features, and then subsequently perform projections. The advantages of forest floor over partial dependence plots is that interactions are not masked by averaging."

  • R package: forestFloor

    • I struggled to get it to work...

Forest floor plots (figure from the paper)

46 / 56

Neural Network Specific Techniques

47 / 56

Quick intro to neural networks

Idea: Combination of many non-linear regression models

Image source

48 / 56

Feature Visualization overview article by Olah, Mordvintsev, and Schubert (2017)

Idea: Determine values of predictor variables that maximize activation functions at a specific "location" in the neural network

Formula Version: For a node $j$ in the network, find

$$x^* = \arg\max_{x} f_j\left(w_j^\top x\right)$$

  • $x$ = values of the predictor variables
  • $w_j$ = estimated weights at node $j$
  • $f_j$ = activation function used at node $j$
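
A very rough gradient-ascent sketch of this idea using the R tensorflow/keras interface (my own, not from the talk; model, layer_name, and unit_idx are hypothetical placeholders for a trained network and a chosen unit):

library(keras)
library(tensorflow)
# model, layer_name, and unit_idx are hypothetical placeholders
# Sub-model that returns the activations of the layer of interest
activation_model <- keras_model(
  inputs = model$input,
  outputs = get_layer(model, layer_name)$output
)
# Start from random noise and repeatedly nudge the input in the direction
# that increases the chosen unit's mean activation (gradient ascent)
x <- tf$Variable(tf$random$uniform(shape(1, 224, 224, 3)))
for (step in 1:100) {
  with(tf$GradientTape() %as% tape, {
    activation <- tf$reduce_mean(activation_model(x)[, , , unit_idx])
  })
  grads <- tape$gradient(activation, x)
  x$assign_add(0.1 * grads)
}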

Image from Olah, Mordvintsev, and Schubert (2017)

49 / 56

Saliency Maps Simonyan, Vedaldi, and Zisserman (2014)

Purpose: To identify the features that are important for making a prediction for a single observation

Concept: Makes use of back-propagation algorithm to determine gradient values associated with a predictor variable which indicate how much a predictor variable influences the prediction

In practice:

  • Commonly used with convolutional neural networks to identify important pixels in an image

  • Many algorithms for creating saliency maps
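
In the same hedged spirit, a sketch of a vanilla gradient saliency map using the R tensorflow interface (my own; model, input_image, and class_idx are hypothetical placeholders):

library(tensorflow)
# model, input_image, and class_idx are hypothetical placeholders
# Input image as a tensor with shape (1, height, width, channels)
input <- tf$convert_to_tensor(input_image)
with(tf$GradientTape() %as% tape, {
  tape$watch(input)
  preds <- model(input)
  class_score <- preds[, class_idx]
})
# Gradient of the class score with respect to the input pixels
grads <- tape$gradient(class_score, input)
# Collapse the color channels so there is one saliency value per pixel
saliency <- tf$reduce_max(tf$abs(grads), axis = -1L)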

50 / 56

Grand Tours Li, Zhao, and Scheidegger (2020)

Idea: Make use of the Grand Tour to visualize behaviors of neural networks

Image from Simonyan, Vedaldi, and Zisserman (2014)

51 / 56

Additional Methods and Resources

52 / 56

More Methods

Model Agnostic Methods

Example-Based Explanations

  • Counterfactual examples

  • Adversarial examples

  • Prototypes and criticisms

  • Influential instances

Model Specific

General Model Viz


Many more....

53 / 56

Overviews

Additional resources providing overviews of explainable machine learning

Gilpin et al. (2019)

  • Explaining Explanations: An Overview of Interpretability of Machine Learning

Mohseni, Zarei, and Ragan (2019)

  • A Multidisciplinary Survey and Framework for Design and Evaluation of Explainable AI Systems

Ming (2017)

  • A Survey on Visualization for Explainable Classifiers

Guidotti et al. (2018)

  • A Survey Of Methods For Explaining Black Box Models
54 / 56

A Cautionary Conclusion

55 / 56

Some thoughts on EML

Review of method types

  • model agnostic versus model specific

  • global versus local explanations

  • static versus interactive

  • models versus metrics

Good News

  • many methods to try out

  • lots of research opportunities

  • opportunity for creating useful visualizations

Cautions

  • this is a relatively new field

  • unsure which are the most trusted methods

  • a seemingly simple method may not be so simple

  • model based methods

    • add an additional layer of complexity to an already complex situation

    • almost seems naive to expect a simple model to capture the complex relationship in a black-box model

56 / 56
