This journal visualizes and compares the results from the different LIME implementations.

# Load libraries
library(gretchenalbrecht)
library(lime)
library(randomForest)
#library(seriation)
library(tidyverse)

# Load in the training and testing data 
hamby173and252_train <- read.csv("../../../data/hamby173and252_train.csv")
hamby224_test <- read.csv("../../../data/hamby224_test.csv")

# Read in the lime explanations
hamby224_test_explain <- readRDS("../../../data/hamby224_test_explain.rds")
hamby224_sensitivity_joined <- readRDS("../../../data/hamby224_sensitivity_joined.rds")

# Obtain features used when fitting the rtrees random forest
rf_features <- rownames(bulletr::rtrees$importance)

When we began looking at the explanations from lime with the default settings, we did not think that they made sense. This led me to try applying LIME with a handful of input values. I also have tried running LIME with the tree based bins. Additionally, since LIME is based on random permutations, I was curious to know how consistent the results were. This led me to try running each implementation a handful of times. I have attempted to evaluate these results by both considering the accuracy and consistency of the results.

Random Forest

Bar Charts of Bins

Heike made a plot with the structure of the one below when we began with the default settings of lime. This one has been created from the lime explanations with 2 quantile bins. It includes all three of the chosen features for each case in the testing dataset. For both sets, the cutoff of 0.275 < ccf occurs the most frequently.

# Plot of fequency for each bin division for 2 quantile bins
hamby224_test_explain %>%
  filter(situation == "2 quantile") %>%
  ggplot(aes(x = feature_desc)) +
  geom_bar() +
  coord_flip() + 
  facet_grid(set ~ .)

One Implementation

Accuracy

The figures in this section are created from the implementations of LIME on set 1 from the training data for the input options of 2 to 6 quantile bins, 2 to 6 equally spaced bins, 2 to 6 tree based bins, kernel density estimation, and normal distribution approximation. Each set of input values was only run once.

# Read in the lime comparison data (the comparisons are run in the firearm examiner
# paper since they will be discussed in the paper)
hamby224_lime_comparisons <- readRDS("../../../data/hamby224_lime_comparisons.rds")

Complex versus Simple Model Predictions

The plot below compares the predictions from the “simple model” (the ridge regression model) and the “complex model” (the random forest rtrees) on the x-axis from the lime implementations with bin estimation (bin_continuous == TRUE). The simple model predictions are on the y-axis, and the complex model predictions are on the x-axis. The plot is faceted by number of bins and the type of bin. The points are colored by the \(R^2\) value from the fit of the simple model. The lines are linear regression lines fit to the data points within a facet. We would hope that there is a linear relationship between these two variables. None of the cases show strong linear trends, but some are more linear than others. The quantile bins show that the simple model never makes a prediction over 0.6, whereas the random forest model can have predictions of up to 1. The equally spaced bins do have probabilities that exceed 0.6, but only with 3 and 6 bins. The tree based bins do the best job with predictions above 0.6 for all number of bins. The tree bins also seem to group the simple model predictions into two categories (probably based on the samesource variable). I noticed that within the facets, the points are in mostly horizontal strips, and the number of strips is about the same as the number of bins from the lime implementation.

# Plot of simple model versus complex model predictions
hamby224_lime_comparisons %>%
  filter(!is.na(rfscore), set == "Set 1", bin_continuous == TRUE) %>%
  ggplot(aes(x = rfscore, y = model_prediction)) + 
  geom_point(aes(color = model_r2)) + 
  facet_grid(bin_situation ~ nbins) + 
  theme_bw() + 
  stat_smooth(method = "lm", se = FALSE, size = 0.5) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(x = "Complex Model Prediction (RF Score)", y = "Simple Model Prediction",
       color = "Simple Model R2")

The plot below shows the absolute value of the difference between the complex model prediction and the simple model prediction versus the complex model prediction. The points are faceted by number of bins and bin type. Again, the points are colored by the \(R^2\) values from the simple model. All cases show a v-shaped trend. The low part of the v occurs around a random forest score of about 0.25. That is, the simple model is most accurately portraying the complex model predictions around the random forest score of 0.25. It gets worse near the extremes. However, it is interesting to note that with the equally spaced bins and the tree based bins, the absolute difference decreases near 1. This is not the case with the quantile bins. It appears that the equally spaced bins and tree based bins are able to make slightly better predictions than the quantile bins when there is a high probability of a match.

# Plot of difference in predictions versus rfscore
hamby224_lime_comparisons %>%
  filter(!is.na(rfscore), set == "Set 1", bin_continuous == TRUE) %>%
  ggplot(aes(x = rfscore, y = abs(diff))) + 
  geom_point(aes(color = model_r2)) + 
  facet_grid(bin_situation ~ nbins) + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Complex Model Prediction (RF Score)", 
       y = "Difference Between Complex and Simple Model Predictions",
       color = "Simple Model R2")

At some point, I got the idea in my head that it would be interesting to look at the “residuals” (difference in complex and simple model predictions) by feature. The plot below shows one example with the data from 3 equally spaced bins. There are clear trends in the plots, but I am not quite sure what to make of this or how to use it. This may be something to return to.

# Plots of model "residuals"
hamby224_lime_comparisons %>%
  filter(quantile_bins == FALSE, bin_method == "equally_spaced", nbins == "3", set == "Set 1") %>%
  select(rf_features, diff, -case) %>%
  gather(key = feature, value = feature_value, rf_features) %>%
  select(feature, feature_value, diff) %>%
  ggplot(aes(x = feature_value, y = diff)) + 
  geom_point() + 
  facet_wrap(~ feature, scales = "free") + 
  geom_hline(yintercept = 0, color = "blue") + 
  labs(x = "Feature Value", y = "Residual")

Comparing Input Values by MSE and \(R^2\)

In order to assess which lime implementation is doing the best job of capturing the predictions from the random forest model, I decided to calculate the mean squared error for each of the input situations. I defined the mean squared error as \[\frac{\sum_{i=1}^n (\hat{p}_{\mbox{simple},i}-\hat{p}_{\mbox{complex},i})^2}{n}\] where \(n\) is the number of observations in the testing dataset (within a set). Additionally, I was curious to compare the fits of the ridge regression models across input values, so I calculated the average \(R^2\) for each input situation.

# Summarizing lime results
hamby224_lime_results <- hamby224_lime_comparisons %>%
  filter(!is.na(rfscore)) %>%
  group_by(situation, bin_situation, bin_continuous, quantile_bins, nbins, use_density, 
           bin_method, response, set) %>%
  summarise(mse = (sum(diff^2)) / length(diff),
            ave_r2 = mean(model_r2)) %>%
  arrange(set) %>%
  ungroup()

The plot below shows the mean squared errors for each of the input situations faceted by set. In both cases, the mean squared error with 3 equally spaced bins is low, and the MSE for the tree binning cases are also low.

# Plot of MSEs
hamby224_lime_results %>%
  mutate(bins = ifelse(nbins == 4 & bin_continuous == FALSE, "other", nbins)) %>%
  ggplot(aes(x = bin_situation, y = mse, color = bin_situation)) +
  geom_point() + 
  facet_grid(set ~ bins, scales = "free_x", space = "free_x") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        legend.position = "none") + 
  labs(x = "Bin Type", y = "MSE", color = "") + 
  scale_color_gretchenalbrecht(palette = "winged_spill", discrete = TRUE)

The plot below shows the average \(R^2\) value for each of the input situations faceted by set. The results are very similar for each set, and the highest \(R^2\) values occur for 2 quantile bins. The \(R^2\) values for 3 equally spaced bins are in the middle, and the \(R^2\) values for all of the tree based bins are the lowest.

# Plot of R^2s
hamby224_lime_results %>%
  mutate(bins = ifelse(nbins == 4 & bin_continuous == FALSE, "other", nbins)) %>%
  ggplot(aes(x = bin_situation, y = ave_r2, color = bin_situation)) +
  geom_point() + 
  facet_grid(set ~ bins, scales = "free_x", space = "free_x") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") + 
  labs(x = "Bin Type", y = "Average R2", color = "") + 
  scale_color_gretchenalbrecht(palette = "winged_spill", discrete = TRUE)

Consistency

I wanted to look at the consistency of features chosen by lime across the number of bins within different bin types. I would like the features chosen to not depend on the number of bins used. Otherwise, it will be difficult to know how many bins to use. Additionally, it would be nice if the features chosen differed across cases, so that the explanations are based on the locality of the observation and not general across all locations. The code below creates a dataset that identifies the order in which the features are chosen by lime.

# Creates a dataset that creates columns of the chosen features by lime from first to third
chosen_features <- hamby224_test_explain %>%
  filter(!(bin_situation %in% c("kernel density", "normal approximation"))) %>%
  filter(!is.na(rfscore)) %>%
  select(set, situation, bin_situation, nbins, case, samesource, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(situation, case, desc(feature_weight_abs)) %>%
  mutate(explainer = rep(c("first", "second", "third"), length(situation) / 3)) %>%
  select(-feature_weight, -feature_weight_abs) %>%
  spread(explainer, feature)

Fleiss’s Kappa

In order to access consistency, I have decided to compute Fleiss’ kappa for each of the bin types. Fleiss’ kappa is often used to assess the consistency across raters for any number of raters and with nominal ratings. It calculates the degree of agreement in classification over that which would be expected by chance. A description of how it is computed can be found on the wikipedia site: https://en.wikipedia.org/wiki/Fleiss%27_kappa

I computed kappa for each bin situation separately for the top three features chosen. These values are plotted below for each set. For both the first and second features chosen by lime, the values of kappa suggest that quantile bins are the most consistent across all the number of bins followed by the tree based bins. The values of kappa for the quantile bins are slight to fair agreement. For the third chosen variable, the tree based bins have the highest value of kappa, but all the kappa values for the third chosen feature are low.

# Computes and plots the values of kappa
chosen_features %>%
  ungroup() %>%
  select(-situation) %>%
  gather(key = chosen, value = feature, first:third) %>%
  spread(nbins, feature) %>%
  select(-case) %>%
  rename(bins2 = "2", bins3 = "3", bins4 = "4", bins5 = "5", bins6 = "6") %>%
  group_by(set, bin_situation, chosen) %>%
  summarise(kappa = 
              irr::kappam.fleiss(matrix(c(bins2, bins3, bins4, bins5, bins6), 
                                        ncol = 5))$value) %>%
  arrange(chosen, set, bin_situation) %>%
  ggplot(aes(x = bin_situation, y = kappa, color = chosen, group = chosen)) +
  geom_point() + 
  geom_line() +
  facet_grid( ~ set) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_color_gretchenalbrecht(palette = "pink_cloud", discrete = TRUE) +
  labs(x = "Bin Type", y = "Kappa", color = "Feature")

Tile Plots of Features Chosen

I also attempted to plot the features chosen by lime in order to visually assess the consistency. This plots are included below.

# Plots the top feature for each case for each bin situation
chosen_features %>%
  group_by(bin_situation, case) %>%
  mutate(nlevels = length(levels(factor(first)))) %>%
  ungroup() %>%
  arrange(bin_situation, set, samesource, desc(nlevels), case) %>%
  mutate(order = rep(rep(1:length(unique(case)), each = 5, 4))) %>%
  ggplot(aes(x = situation, y = order, fill = first)) + 
  geom_tile() + 
  facet_grid(set + samesource ~ bin_situation, scales = "free", space = "free_y") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) + 
  labs(x = "Bin Type", y = "", fill = "Feature", title = "First Feature")

# Plots the second feature for each case for each bin situation
chosen_features %>%
  group_by(bin_situation, case) %>%
  mutate(nlevels = length(levels(factor(second)))) %>%
  ungroup() %>%
  arrange(bin_situation, set, samesource, desc(nlevels), case) %>%
  mutate(order = rep(rep(1:length(unique(case)), each = 5, 4))) %>%
  ggplot(aes(x = situation, y = order, fill = second)) + 
  geom_tile() + 
  facet_grid(set + samesource ~ bin_situation, scales = "free", space = "free_y") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) + 
  labs(x = "Bin Type", y = "", fill = "Feature", title = "Second Feature")

# Plots the third feature for each case for each bin situation
chosen_features %>%
  group_by(bin_situation, case) %>%
  mutate(nlevels = length(levels(factor(third)))) %>%
  ungroup() %>%
  arrange(bin_situation, set, samesource, desc(nlevels), case) %>%
  mutate(order = rep(rep(1:length(unique(case)), each = 5, 4))) %>%
  ggplot(aes(x = situation, y = order, fill = third)) + 
  geom_tile() + 
  facet_grid(set + samesource ~ bin_situation, scales = "free", space = "free_y") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) + 
  labs(x = "Bin Type", y = "", fill = "Feature", title = "Third Feature")

Attempts at Arranging the Tile Plots

I tried three different arrangements…

# Arranged by number of levels
chosen_features %>%
  group_by(bin_situation, case) %>%
  mutate(nlevels = length(levels(factor(first)))) %>%
  ungroup() %>%
  arrange(bin_situation, set, samesource, desc(nlevels), case) %>%
  mutate(order = rep(rep(1:length(unique(case)), each = 5, 4))) %>%
  ggplot(aes(x = situation, y = order, fill = first)) + 
  geom_tile() + 
  facet_grid(set + samesource ~ bin_situation, scales = "free", space = "free_y") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) + 
  labs(x = "Bin Type", y = "", fill = "Feature", title = "First Feature")

# Arranged by case and then number of levels
chosen_features %>%
  group_by(case) %>%
  mutate(nlevels = length(levels(factor(first)))) %>%
  ungroup() %>%
  arrange(bin_situation, set, samesource, desc(nlevels), case) %>%
  mutate(order = rep(rep(1:length(unique(case)), each = 5, 4))) %>%
  ggplot(aes(x = situation, y = order, fill = first)) + 
  geom_tile() + 
  facet_grid(set + samesource ~ bin_situation, scales = "free", space = "free_y") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) + 
  labs(x = "Bin Type", y = "", fill = "Feature", title = "First Feature")

# No arrangement
chosen_features %>%
  ggplot(aes(x = situation, y = case, fill = first)) + 
  geom_tile() + 
  facet_grid(set + samesource ~ bin_situation, scales = "free", space = "free_y") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) + 
  labs(x = "Bin Type", y = "", fill = "Feature", title = "First Feature")

Now I am trying to use the function seriate from the seriation package to order the cases in the plots. However, something is going wrong. It ends up adding a bunch of NAs to the dataframe after I order the cases. I haven’t figured out what is wrong yet.

order <- chosen_features %>% 
  filter(set == "Set 1", bin_situation == "samesource tree") %>%
  arrange(set, case) %>%
  select(case, nbins, first) %>%
  spread(key = nbins, value = first) %>%
  select(-case) %>%
  cluster::daisy() %>%
  seriate(method = "MDS") %>%
  get_order()

chosen_features %>% 
  filter(set == "Set 1") %>%
  mutate(case = factor(case)) %>%
  ggplot(aes(x = situation, y = case, fill = first)) + 
  geom_tile() + 
  facet_grid(set + samesource ~ bin_situation, scales = "free", space = "free_y") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) + 
  labs(x = "Bin Type", y = "", fill = "Feature", title = "First Feature")

set1 <- chosen_features %>% 
  filter(set == "Set 1") %>%
  mutate(case = factor(case))

set12 <- chosen_features %>%
  filter(set == "Set 1") %>%
  mutate(case = factor(case, levels = order))

order == 272
unique(set1$case)
unique(set12$case)

sum(is.na(set1$case))
sum(is.na(set12$case))

data.frame(set1$case, set12$case) %>% View()

chosen_features %>%
  filter(set == "Set 1") %>%
  mutate(case = factor(case, levels = order))
  ggplot(aes(x = situation, y = case, fill = first)) + 
  geom_tile() + 
  facet_grid(set + samesource ~ bin_situation, scales = "free", space = "free_y") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) + 
  labs(x = "Bin Type", y = "", fill = "Feature", title = "First Feature")

Plot of Number of Features Chosen

I also created the plot below to visually assess the proportions of cases within a bin situation and level of feature chosen by lime (first, second, or third) that have 1 to 5 chosen features. The plots are separated by set.

# Plots the proportions of number of top features chosen across number of bins for a case
# within a binning method
chosen_features %>%
  group_by(set, case, bin_situation) %>%
  summarise(first = length(levels(factor(first))),
            second = length(levels(factor(second))),
            third = length(levels(factor(third)))) %>%
  gather(chosen, nfeatures, first:third) %>%
  ggplot(aes(x = bin_situation, fill = factor(nfeatures))) + 
  geom_bar(position = "stack") + 
  facet_grid(set ~ chosen) + 
  labs(x = "Bin Type", y = "Count", fill = "Number of Features") + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "bottom") + 
  scale_fill_gretchenalbrecht(palette = "winter_light", discrete = TRUE)

Multiple Implementations

Accuracy

In order to assess how variable the LIME results are, I ran each of the input situations 10 times. I wanted to look at the variability of the MSE for each situation, and I wanted to see how consistent the choice of selected features is by LIME.

# Code for organizing the sensitivity analysis results
hamby224_sensitivity_results <- hamby224_sensitivity_joined %>%
  group_by(set, rep, situation, bin_situation, nbins) %>%
  summarise(mse = sum((diff)^2) / length(diff),
            ave_r2 = mean(model_r2))

Comparing MSE and \(R^2\)

The plot below shows boxplots of the mean squared errors computed for each of the input situations for each set. There does not appear to be much variation for most of the cases. The largest amount of variation occurs with 4 and 5 equally spaced bins. There is moderate variability for 3 equally spaced bins. The lowest MSEs occur for the samesource tree based bins followed by the rfscore tree based bins for most cases.

# Plots the MSEs from the sensativity analysis
hamby224_sensitivity_results %>%
  mutate(bins = ifelse(bin_situation %in% c("kernel density", "normal approximation"),
                       "other", as.character(nbins))) %>%
  ggplot(aes(x = bin_situation, y = mse)) + 
  geom_boxplot(aes(color = bin_situation)) +
  facet_grid(set ~ bins, scales = "free_x", space = "free_x") + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "bottom") +
  labs(x = "Bin Type", y = "MSE", color = "") + 
  scale_color_gretchenalbrecht(palette = "winged_spill", discrete = TRUE)

The plot blow shows boxplots of the average \(R^2\) values from the 10 models for all input options and for each set. There is minimal variability in the average \(R^2\) value for all cases, and the relationships between \(R^2\) values and input cases agree with the single implementation seen below.

# Plots the average R^2 value from the sensativity analysis 
hamby224_sensitivity_results %>%
  mutate(bins = ifelse(bin_situation %in% c("kernel density", "normal approximation"),
                       "other", as.character(nbins))) %>%
  ggplot(aes(x = bin_situation, y = ave_r2)) + 
  geom_boxplot(aes(color = bin_situation)) +
  facet_grid(set ~ bins, scales = "free_x", space = "free_x") + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "bottom") +
  labs(x = "Bin Type", y = "MSE", color = "") + 
  scale_color_gretchenalbrecht(palette = "winged_spill", discrete = TRUE)

Consistency

Plots of Number of Features Chosen

To assess the consistency of the results, I determined the number of different features chosen by LIME as the top feature within a test case across the 10 replicates. I then computed the average number of different top features chosen by LIME within a input situations for each set. The plot below contains these values shown as the dots. The error bars represent one standard deviation above and below the mean. Note that the number of levels cannot go below 1, but the error bars go below 1. I have decided to ignore this for now, but I would need to come up with a better representation if I used this for something.

There are a handful of situations with a mean near one and very little variation. However, the largest mean is around 2, which is not very high. This seems to suggest that LIME is relatively consistent across runs.

# Determine and plot the average and variability of the number levels for each case 
# and input option
hamby224_consistency <- hamby224_sensitivity_joined %>%
  select(set, situation, bin_situation, nbins, case, rep, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(set, situation, case, rep, desc(feature_weight_abs)) %>%
  group_by(set, situation, nbins, case, rep) %>%
  slice(1) %>%
  ungroup() %>%
  group_by(set, situation, bin_situation, nbins, case) %>%
  summarise(nlevels = length(levels(factor(feature))),
            count = n()) %>%
  ungroup() %>%
  group_by(set, situation, bin_situation, nbins) %>%
  summarise(mean_nlevels = mean(nlevels),
            sd_nlevels = sd(nlevels),
            upper = mean(nlevels) + sd(nlevels),
            lower = mean(nlevels) - sd(nlevels),
            max = max(nlevels),
            min = min(nlevels)) %>%
  mutate(bins = ifelse(bin_situation %in% c("kernel density", "normal approximation"), 
                       "other", as.character(nbins)))

hamby224_consistency %>%
  ggplot(aes(x = situation, y = mean_nlevels)) + 
  geom_errorbar(aes(ymin = lower, ymax = upper), color = "grey80") +
  geom_point(aes(color = bin_situation)) + 
  facet_grid(set ~ bins, scales = "free_x", space = "free_x") + 
  labs(x = "Bin Type", y = "Mean Number of Features", color = "") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "bottom") + 
  scale_color_gretchenalbrecht(palette = "winged_spill", discrete = TRUE)

The bar chart below shows the proportion of different number of first, second, and third features chosen across all test cases within a set for each input situation. This shows that the most different features chosen was seven. However, many of the cases only have one or two different features chosen, and usually, a larger proportion of the cases have only one top feature chosen across the 10 reps.

# Determine the order that the features are chosen within a case and 
# set of input options
sensitivity_chosen <- hamby224_sensitivity_joined %>%
  select(set, situation, bin_situation, nbins, case, rep, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(set, situation, case, rep, desc(feature_weight_abs)) %>%
  mutate(explainer = rep(c("first", "second", "third"), length(situation) / 3)) %>%
  select(-feature_weight, -feature_weight_abs) %>%
  spread(explainer, feature)

# Determine and plot the number of features chosen across the 10 reps within 
# a set, input options, and feature level
sensitivity_chosen %>%
  group_by(set, situation, case) %>%
  summarise(first = length(levels(factor(first))),
            second = length(levels(factor(second))),
            third = length(levels(factor(third)))) %>%
  gather(chosen, nfeatures, first:third) %>%
  ggplot(aes(x = situation, fill = factor(nfeatures))) + 
  geom_bar(position = "stack") + 
  facet_grid(set ~ chosen) + 
  labs(x = "Bin Type", y = "Count", fill = "Number of Features") + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "bottom") + 
  scale_fill_gretchenalbrecht(palette = "flood_tide", discrete = TRUE, reverse = FALSE)

Fleiss’s Kappa

I computed Fleiss’s kappa for each sampling technique within a set across the 10 reps. These values are plotted below. Right now, the quantile methods and rfscore tree based bins seem to be the most consistent. However, I am not happy with some of the kappa values for cases where the next set of plots would suggest a high kappa, but the value is very low. I wonder if this a flaw in the Fleiss’s kappa and would suggest that I use a different metric.

sensitivity_chosen %>%
  mutate(bins = ifelse(bin_situation %in% c("kernel density", "normal approximation"), 
                       "other", 
                       nbins)) %>%
  gather(key = chosen, value = feature, first:third) %>%
  spread(rep, feature) %>%
  rename(rep1 = "1", rep2 = "2", rep3 = "3", rep4 = "4", rep5 = "5", 
         rep6 = "6", rep7 = "7", rep8 = "8", rep9 = "9", rep10 = "10") %>%
  group_by(set, situation, bin_situation, bins, chosen) %>%
  summarise(kappa = 
              irr::kappam.fleiss(matrix(c(rep1, rep2, rep3, rep4, rep5,
                                          rep6, rep7, rep8, rep9, rep10), 
                                        ncol = 10))$value,
            nunique = length(unique(c(rep1, rep2, rep3, rep4, rep5, 
                                      rep6, rep7, rep8, rep9, rep10)))) %>%
  mutate(kappa = ifelse(is.nan(kappa), 1, kappa)) %>%
  ggplot(aes(x = bin_situation, y = kappa, color = chosen, group = chosen)) + 
  geom_point() + 
  geom_line(aes(linetype = chosen)) +
  facet_grid(set ~ bins, scales = "free_x", space = "free_x") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_color_gretchenalbrecht(palette = "pink_cloud", discrete = TRUE) +
  labs(x = "Sampling Technique", y = "Kappa", color = "Feature", linetype = "Feature")

Tile Plots of Features Chosen

The plots below need some work, but they contain some interesting information. Right now, I am only concentrating on set 1. We can see that with some lime settings, all cases have the same “best” variable chosen, and other settings have multiple important variables that depend on the case. We can also see that different variables are selected as important depending on the number of bins. It is also interesting to note that when comparing the quantile bins to the equally spaced bins, this plot shows that the equally spaced bins tend to choose the same first variable for all cases. On the other hand, the quantile bins choose different first variables for the cases.

The fact that different features get chosen as the top feature for different estimation techniques makes sense. You would expect that the number of bins would determine which features are better suited for predicting whether or not a comparison is a match. For example, if you look back at the feature distribution plots, you can see that ccf is the obvious choice for determining between matches and non-matches if you have to choose two equally spaced bins. This is the feature that lime most frequently chooses. You can make similar arguments for other cases as well. Even though different inputs select different variables, I would expect that some variables will be better at prediction than others. I think this is what leads some input values to better LIME explanations and lower MSEs than others.

In the meeting with Heike, we talked about how this plot suggests that LIME is not doing a very good job of providing local explanations. Regardless of the case, it is often suggesting only one of two variables are important, and the variable importance depends on the number of bins. In this plot, I would like consistency in the x-direction, but I would like variability in the y-direction to better understand the variables that are important for a specific case.

hamby224_sensitivity_joined %>%
  filter(set == "Set 1") %>%
  droplevels() %>%
  select(situation, case, rep, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(situation, case, rep, desc(feature_weight_abs)) %>%
  group_by(situation, case, rep) %>%
  slice(1) %>%
  ungroup() %>%
  group_by(situation, case) %>%
  mutate(nlevels = length(levels(factor(feature)))) %>%
  ungroup() %>%
  arrange(situation, desc(nlevels), case, feature) %>%
  mutate(order = rep(rep(1:length(levels(factor(case))), each = 10), 22)) %>%
  ggplot(aes(x = order, fill = feature)) + 
  geom_bar(position = "stack", width = 1) + 
  coord_flip() + 
  facet_wrap(situation ~ ., ncol = 5, scales = "free_y") +
  theme(legend.position = "bottom",
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) +
  labs(x = "Cases", y = "Rep", fill = "Feature",
       title = "First Feature")

hamby224_sensitivity_joined %>%
  filter(set == "Set 1") %>%
  droplevels() %>%
  select(situation, case, rep, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(situation, case, rep, desc(feature_weight_abs)) %>%
  group_by(situation, case, rep) %>%
  slice(2) %>%
  ungroup() %>%
  group_by(situation, case) %>%
  mutate(nlevels = length(levels(factor(feature)))) %>%
  ungroup() %>%
  arrange(situation, desc(nlevels), case, feature) %>%
  mutate(order = rep(rep(1:length(levels(factor(case))), each = 10), 22)) %>%
  ggplot(aes(x = order, fill = feature)) + 
  geom_bar(position = "stack", width = 1) + 
  coord_flip() + 
  facet_wrap(situation ~ ., ncol = 5, scales = "free_y") +
  theme(legend.position = "bottom",
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) +
  labs(x = "Cases", y = "Rep", fill = "Feature",
       title = "Second Feature")

hamby224_sensitivity_joined %>%
  filter(set == "Set 1") %>%
  droplevels() %>%
  select(situation, case, rep, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(situation, case, rep, desc(feature_weight_abs)) %>%
  group_by(situation, case, rep) %>%
  slice(3) %>%
  ungroup() %>%
  group_by(situation, case) %>%
  mutate(nlevels = length(levels(factor(feature)))) %>%
  ungroup() %>%
  arrange(situation, desc(nlevels), case, feature) %>%
  mutate(order = rep(rep(1:length(levels(factor(case))), each = 10), 22)) %>%
  ggplot(aes(x = order, fill = feature)) + 
  geom_bar(position = "stack", width = 1) + 
  coord_flip() + 
  facet_wrap(situation ~ ., ncol = 5, scales = "free_y") +
  theme(legend.position = "bottom",
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) +
  labs(x = "Cases", y = "Rep", fill = "Feature",
       hamby224_sensitivity_joined %>%
  filter(set == "Set 1") %>%
  droplevels() %>%
  select(situation, case, rep, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(situation, case, rep, desc(feature_weight_abs)) %>%
  group_by(situation, case, rep) %>%
  slice(3) %>%
  ungroup() %>%
  group_by(situation, case) %>%
  mutate(nlevels = length(levels(factor(feature)))) %>%
  ungroup() %>%
  arrange(situation, desc(nlevels), case, feature) %>%
  mutate(order = rep(rep(1:length(levels(factor(case))), each = 10), 22)) %>%
  ggplot(aes(x = order, fill = feature)) + 
  geom_bar(position = "stack", width = 1) + 
  coord_flip() + 
  facet_wrap(situation ~ ., ncol = 5, scales = "free_y") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) +
  labs(x = "Cases", y = "Rep", fill = "Feature",
       title = "Third Feature"))

I was thinking a bit more about the relationship between the top features chosen and the type of binning used. I realized it would be helpful to view the relationship between the features and the random forest score. The plots below show the testing data with random forest score versus each of the features. The points are colored by the samesource variable.

hamby224_test %>% 
  select(rf_features, rfscore, samesource) %>%
  gather(key = feature, value = value, 1:9) %>%
  select(feature, value, rfscore, samesource) %>%
  ggplot(aes(x = value, y = rfscore, color = samesource)) + 
  geom_point() + 
  facet_wrap( ~ feature, scales = "free") +
  labs(x = "Variable Value", y = "Random Forest Score", color = "Same Source?",
       title = "Hamby 224 Testing Data") +
  theme_bw() +
  theme(legend.position = "bottom")

Rankings

In order to decide on which LIME settings to use for the firearm examiners’ paper, I decided to create a table that ranks each of the 12 input settings by MSE from the single implementation, the average MSE across the 10 implementations, the average number of top features chosen across the 10 implementations, and the average \(R^2\) value from the single implementation. The first table below is for set 1, and the second table below is for set 11.

While using these to help me decide which input settings to use, I had a handful of thoughts:

Since I am still working on making sense of the variation results, and it would be nice to get the first paper out, I am thinking of basing the decision of input values entirely off of only the assessment measures from the one case.
Even with needing more time to think about how to understand the meaning of the results from the 10 reps, the mean MSE from these has very similar results to the MSEs from the one rep, and it does not appear that the top chosen features vary too much between the reps. This makes me think that it would be okay to base my decision off of the results from the single rep.
Both Heike and I do not think that the \(R^2\) is that helpful of a measure for determining whether the LIME explanations are good, so I am not going to put much weight on it.
Additionally, I do not want to consider either the normal approximation or kernel density estimate since I have not figured out a good way to visualize these in the app. Right now, the feature plots from these explanations are not very helpful. I would need to adjust the intercept or something.

Based on the MSEs from the single implementation of LIME, I will use 3 equally spaced bins for both the set 1 and set 11 explanations in the firearm examiner paper since it leads to the lowest MSE for both sets.

joined_results <- hamby224_lime_results %>%
  select(set, situation, mse, ave_r2) %>%
  full_join(hamby224_sensitivity_results %>%
              group_by(set, situation) %>%
              summarise(mean_sensitivity_mse = mean(mse), 
                        sd_sensitivity_mse = sd(mse)),
            by = c("set", "situation")) %>%
  full_join(hamby224_consistency %>% select(set, situation, mean_nlevels, sd_nlevels), 
            by = c("set", "situation")) %>%
  group_by(set) %>%
  mutate(mse_order = dense_rank(mse),
         ave_r2_order = dense_rank(desc(ave_r2)),
         mean_sensitivity_mse_order = dense_rank(mean_sensitivity_mse),
         mean_nlevels_order = dense_rank(mean_nlevels))

joined_results %>%
  ungroup() %>%
  filter(set == "Set 1") %>%
  select(situation, mse_order, mean_sensitivity_mse_order, mean_nlevels_order, ave_r2_order) %>%
  arrange(mse_order, mean_sensitivity_mse_order, mean_nlevels_order, ave_r2_order) %>%
  DT::datatable(options = list(pageLength = 12, paging = FALSE, searching = FALSE),
                rownames = FALSE,
                colnames = c("Situation", "MSE", "Mean MSE", "Mean Number of Top Features", 
                             "Mean R2"),
                caption = "Results from Set 1")

joined_results %>%
  ungroup() %>%
  filter(set == "Set 11") %>%
  select(situation, mse_order, mean_sensitivity_mse_order, mean_nlevels_order, ave_r2_order) %>%
  arrange(mse_order, mean_sensitivity_mse_order, mean_nlevels_order, ave_r2_order) %>%
  DT::datatable(options = list(pageLength = 12, paging = FALSE, searching = FALSE),
                rownames = FALSE,
                colnames = c("Situation", "MSE", "Mean MSE", "Mean Number of Top Features", 
                             "Mean R2"),
                caption = "Results from Set 11")

Simpler Random Forest

Model

# Fit a random forest model with the top two most important features in the model
if(!file.exists("../../../data/rfmodel.rds")) {
  
  # Fit the random forest  
  rfmodel <- randomForest(factor(samesource) ~ rough_cor + ccf, data = hamby173and252_train)

  # Save the random forest
  saveRDS(rfmodel, "../../../data/rfmodel.rds")
  
} else {
  
  # Load the rf model
  rfmodel <- readRDS("../../../data/rfmodel.rds")
}

# Create a dataset with the test data input to the random forest model 
# and the predictions from the random forest model
hamby224_test_subset <- hamby224_test %>%
  na.omit() %>%
  select(case:land2, rough_cor, ccf, samesource) %>%
  mutate(rfpred = predict(rfmodel, 
                           newdata = hamby224_test %>% select(rough_cor, ccf) %>% na.omit()),
         rfscore = predict(rfmodel, 
                           newdata = hamby224_test %>% select(rough_cor, ccf) %>% na.omit(),
                           type = "prob")[,2])

Applying LIME

The functions lime and explain from the lime package are applied below to the random forest model.

# Create or load the lime and explain objects for the main effects logisitic regression model
if(!file.exists("../../../data/lime_explain_sub.rds")) {
  
  # Set a seed
  set.seed(20190315)
  
  # Apply lime
  lime_sub <- lime(x = hamby173and252_train %>% select(rough_cor, ccf), 
                   model = as_classifier(rfmodel))
  
  # Apply explain
  explain_sub <- explain(x = hamby224_test %>%
                             select(rough_cor, ccf) %>% 
                             na.omit(), 
                           explainer = lime_sub, 
                           n_labels = 1,
                           n_features = 2)
  
  # Join the lime and explain objects in a list
  lime_explain_sub <- list(lime = lime_sub, explain = explain_sub)
  
  # Save the lime and explain objects
  saveRDS(lime_explain_sub, "../../../data/lime_explain_sub.rds")
  
} else {
  
  # Load the lime and explain objects
  lime_explain_sub <- readRDS("../../../data/lime_explain_sub.rds")
 
}

create_bin_df <- function(bin_cut) {
 data.frame(quantile = c("0%", "25%", "50%", "75%", "100%"), 
            bin_cut) 
}

bin_cuts <- purrr::map_df(lime_explain_sub$lime$bin_cuts, 
                          create_bin_df,
                          .id = "feature") %>%
  spread(key = feature, value = bin_cut)

bin_cuts

##   quantile        ccf   rough_cor
## 1       0% 0.01406705 -0.86977947
## 2     100% 0.98110386  0.96354513
## 3      25% 0.21271265 -0.18439396
## 4      50% 0.27499511 -0.05654847
## 5      75% 0.34783793  0.05021862

Visualizations

(case1 <- lime_explain_sub$explain %>% filter(case == 1))

## # A tibble: 2 x 13
##   model_type case  label label_prob model_r2 model_intercept model_prediction
##   <chr>      <chr> <chr>      <dbl>    <dbl>           <dbl>            <dbl>
## 1 classific… 1     FALSE          1    0.339           0.956            0.705
## 2 classific… 1     FALSE          1    0.339           0.956            0.705
## # … with 6 more variables: feature <chr>, feature_value <dbl>,
## #   feature_weight <dbl>, feature_desc <chr>, data <list>, prediction <list>

ggplot() +
  geom_point(data = hamby173and252_train %>% filter(samesource == FALSE),
             aes(x = rough_cor, y = ccf),
             color = "darkgrey",
             alpha = 0.2) +
  geom_point(data = hamby173and252_train %>% filter(samesource == TRUE),
             aes(x = rough_cor, y = ccf),
             color = "darkorange") + 
  geom_vline(xintercept = bin_cuts$rough_cor) + 
  geom_hline(yintercept = bin_cuts$ccf) +
  theme_minimal()

ggplot() + 
  geom_point(data = hamby224_test_subset, 
             aes(x = rough_cor, y = ccf, color = rfscore),
             alpha = 0.5) + 
  scale_color_gradient2(low = "darkgrey", high = "darkorange", midpoint = 0.5) +
  geom_point(data = lime_explain_sub$explain %>% 
               filter(case == 1) %>%
               select(feature, feature_value) %>%
               spread(key = feature, value = feature_value),
             aes(x = rough_cor, y = ccf)) +
  geom_vline(xintercept = bin_cuts$rough_cor) + 
  geom_hline(yintercept = bin_cuts$ccf) + 
  theme_minimal()

Logistic Regression

# Load the model
logistic_mains <- readRDS("../../../data/logistic_mains.rds")

# Read in the LIME explanations for the logistic regression
lime_explain_mains <- readRDS("../../../data/lime_explain_mains.rds")

# Read in the LIME comparisons for the logistic regression
logistic_lime_comparisons <- readRDS("../../../data/logistic_lime_comparisons.rds")

# Read in the combined test data and explanations
logistic_test_explain <- readRDS("../../../data/logistic_test_explain.rds")

Visualizing Default LIME

plot_features(lime_explain_mains$explain[1:12,])

coefs <- logistic_mains$finalModel$coefficients %>% round(2)

The logistic regression model is

\(\log\left(\frac{\hat{p}}{1-\hat{p}}\right)=\) -10.02 + 1.53 ccf + -0.06 rough_cor + 0.78 D + -0.31 sd_D + 1.32 matches + -1.07 mismatches + -0.08 cms + 0.82 non_cms + -0.23 sum_peaks

where \(\hat{p}=P(\mbox{samesource = TRUE})\).

I started trying to compare the results from LIME to the logistic regression model, but I realized that I need to standardize my test data before I use the model to make predictions.

m = function(x) coefs[2:10] * x
m(hamby224_test[1,] %>% select(rf_features))

##         ccf   rough_cor           D          sd_D   matches mismatches
## 1 0.4137459 -0.01622533 0.001089166 -0.0006480818 0.7790299  -8.149276
##           cms non_cms  sum_peaks
## 1 -0.04721394 3.12262 -0.3365381

hamby224_test[1:4,] %>% select(rf_features)

##         ccf rough_cor           D        sd_D   matches mismatches       cms
## 1 0.2704221 0.2704221 0.001396367 0.002090586 0.5901742   7.616146 0.5901742
## 2 0.2704221 0.2704221 0.001396367 0.002090586 0.5901742   7.616146 0.5901742
## 3 0.2788290 0.2788290 0.001505819 0.002059285 0.5320479   7.895640 0.5320479
## 4 0.3146965 0.3146965 0.001533943 0.002235877 0.5304097   8.213552 0.5304097
##    non_cms sum_peaks
## 1 3.808073  1.463209
## 2 3.808073  1.463209
## 3 4.462753  1.224632
## 4 4.449008  2.340695

coefs

## (Intercept)         ccf   rough_cor           D        sd_D     matches 
##      -10.02        1.53       -0.06        0.78       -0.31        1.32 
##  mismatches         cms     non_cms   sum_peaks 
##       -1.07       -0.08        0.82       -0.23

hamby224_test[1:4,] %>% select(rf_features) * coefs[2:10]

##           ccf   rough_cor             D          sd_D     matches mismatches
## 1  0.41374589  0.35695724 -3.211644e-04 -0.0006480818  0.48394284   5.940594
## 2 -0.01622533 -0.28935170  2.136441e-03  0.0027595739 -0.13574006  -2.361005
## 3  0.21748662 -0.02230632 -9.034913e-05 -0.0022034349  0.81403330  10.422245
## 4 -0.09755591  0.25805112  1.196476e-03 -0.0001788702 -0.03182458  -8.788501
##           cms    non_cms  sum_peaks
## 1 -0.04721394 -0.2284844 -1.5656340
## 2  0.48394284  2.9702970 -0.1170567
## 3 -0.12237102 -1.3834535  1.0041978
## 4  0.81152686  5.8726899 -0.5383599

mains_test_explain <- hamby224_test %>%
  mutate(case = as.character(case)) %>%
  full_join(lime_explain_mains$explain, by = "case") %>%
  mutate(case = factor(case),
         feature = factor(feature))

Below is a plot of the three chosen features for each of the cases faceted by set. The features are ordered by order they were chosen by lime (ordered by magnitude of the coefficients).

mains_test_explain %>%
  na.omit() %>%
  arrange(case, desc(abs(feature_weight))) %>%
  mutate(feature_order = rep(c("first", "second", "third"), n() / 3)) %>%
  select(case, set, feature, feature_order) %>%
  spread(feature_order, feature) %>%
  arrange(set, first, second, third) %>%
  mutate(order = 1:n()) %>%
  gather(feature_order, feature, 3:5) %>%
  ggplot(aes(x = feature_order, y = order, fill = feature)) + 
  geom_tile() + 
  facet_grid(set ~ ., scales = "free_y") +
  theme_bw() +
  gretchenalbrecht::scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  labs(x = "Order of Features Chosen", y = "Case", fill = "Feature")

pca_results <- princomp(hamby224_test %>% select(rf_features) %>% na.omit())
pca_results$loadings

## 
## Loadings:
##            Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## ccf                                            0.703                0.707
## rough_cor                                      0.703               -0.707
## D                                                    -0.456  0.890       
## sd_D                                                 -0.890 -0.456       
## matches     0.453  0.457        -0.331 -0.685                            
## mismatches -0.223 -0.210  0.530 -0.777  0.147                            
## cms         0.453  0.507  0.358         0.627                            
## non_cms    -0.156         0.762  0.525 -0.334                            
## sum_peaks   0.717 -0.694                                                 
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.111  0.111  0.111  0.111  0.111  0.111  0.111  0.111  0.111
## Cumulative Var  0.111  0.222  0.333  0.444  0.556  0.667  0.778  0.889  1.000

z1 <- pca_results$scores[,1]
z2 <- pca_results$scores[,2]
ggplot(data.frame(z1, z2), aes(x = z1, y = z2)) + 
  geom_point()

biplot(pca_results)

One Implementation

Accuracy

Comparing Input Values by MSE and \(R^2\)

# Summarizing lime results
logistic_lime_results <- logistic_lime_comparisons %>%
  filter(!is.na(rfscore)) %>%
  group_by(situation, bin_situation, bin_continuous, quantile_bins, nbins, use_density, 
           bin_method, response, set) %>%
  summarise(mse = (sum(diff^2)) / length(diff),
            ave_r2 = mean(model_r2)) %>%
  arrange(set) %>%
  ungroup()

The plot below shows the mean squared errors for each of the input situations faceted by set. There is no one bin type that stands out as performing the best across all or most number of bins. This is unlike the results from the random forest model where the tree based bins usually performed the best.

# Plot of MSEs
logistic_lime_results %>%
  mutate(bins = ifelse(nbins == 4 & bin_continuous == FALSE, "other", nbins)) %>%
  ggplot(aes(x = bin_situation, y = mse, color = bin_situation)) +
  geom_point() + 
  facet_grid(set ~ bins, scales = "free_x", space = "free_x") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        legend.position = "none") + 
  labs(x = "Bin Type", y = "MSE", color = "") + 
  scale_color_gretchenalbrecht(palette = "winged_spill", discrete = TRUE)

The plot below shows the average \(R^2\) value for each of the input situations faceted by set. The results are very similar for each set, and the highest \(R^2\) values occur for 2 quantile bins. With the logistic regression, the equally spaced bins or the quantile bins seem to perform the best in terms of \(R^2\) across the different number of bins. The rfscore tree based bins are not as good as they were with the random forest model. However, all of the \(R^2\) values are still very low for all cases.

# Plot of R^2s
logistic_lime_results %>%
  mutate(bins = ifelse(nbins == 4 & bin_continuous == FALSE, "other", nbins)) %>%
  ggplot(aes(x = bin_situation, y = ave_r2, color = bin_situation)) +
  geom_point() + 
  facet_grid(set ~ bins, scales = "free_x", space = "free_x") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") + 
  labs(x = "Bin Type", y = "Average R2", color = "") + 
  scale_color_gretchenalbrecht(palette = "winged_spill", discrete = TRUE)

Consistency

# Creates a dataset that creates columns of the chosen features by lime from first to third
logistic_chosen_features <- logistic_test_explain %>%
  filter(!(bin_situation %in% c("kernel density", "normal approximation"))) %>%
  filter(!is.na(rfscore)) %>%
  select(set, situation, bin_situation, nbins, case, samesource, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(situation, case, desc(feature_weight_abs)) %>%
  mutate(explainer = rep(c("first", "second", "third"), length(situation) / 3)) %>%
  select(-feature_weight, -feature_weight_abs) %>%
  spread(explainer, feature)

Fleiss’s Kappa

I computed kappa for each bin situation separately for the top three features chosen. These values are plotted below for each set. For the first feature chosen by lime, the values of kappa suggest that quantile bins are the most consistent across all the number of bins followed by the samesource tree based bins.

# Computes and plots the values of kappa
logistic_chosen_features %>%
  ungroup() %>%
  select(-situation) %>%
  gather(key = chosen, value = feature, first:third) %>%
  spread(nbins, feature) %>%
  select(-case) %>%
  rename(bins2 = "2", bins3 = "3", bins4 = "4", bins5 = "5", bins6 = "6") %>%
  group_by(set, bin_situation, chosen) %>%
  summarise(kappa = 
              irr::kappam.fleiss(matrix(c(bins2, bins3, bins4, bins5, bins6), 
                                        ncol = 5))$value) %>%
  arrange(chosen, set, bin_situation) %>%
  ggplot(aes(x = bin_situation, y = kappa, color = chosen, group = chosen)) +
  geom_point() + 
  geom_line() +
  facet_grid( ~ set) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_color_gretchenalbrecht(palette = "pink_cloud", discrete = TRUE) +
  labs(x = "Bin Type", y = "Kappa", color = "Feature")

Tile Plots of Features Chosen

I also attempted to plot the features chosen by lime in order to visually assess the consistency. This plots are included below. The results appear to be different from the random forest results, but it is still possible to se vertical stripes that suggest a global interpretation as opposed to a local interpretation. However, the samesource bins do have some horizontal stipes that show what we would like to see. It is also interesting to note that the vertical bins are the most pronounced with the rfscore tree. This does not suprise me considering that these explanations are for a logistic regression and not the random forest model.

# Plots the top feature for each case for each bin situation
logistic_chosen_features %>%
  group_by(bin_situation, case) %>%
  mutate(nlevels = length(levels(factor(first)))) %>%
  ungroup() %>%
  arrange(bin_situation, set, samesource, desc(nlevels), case) %>%
  mutate(order = rep(rep(1:length(unique(case)), each = 5, 4))) %>%
  ggplot(aes(x = nbins, y = order, fill = first)) + 
  geom_tile() + 
  facet_grid(set + samesource ~ bin_situation, scales = "free", space = "free_y") + 
  theme_bw() + 
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) + 
  labs(x = "Number of Bins", y = "Test Case", fill = "Feature", title = "First Feature")

# Plots the second feature for each case for each bin situation
logistic_chosen_features %>%
  group_by(bin_situation, case) %>%
  mutate(nlevels = length(levels(factor(second)))) %>%
  ungroup() %>%
  arrange(bin_situation, set, samesource, desc(nlevels), case) %>%
  mutate(order = rep(rep(1:length(unique(case)), each = 5, 4))) %>%
  ggplot(aes(x = nbins, y = order, fill = second)) + 
  geom_tile() + 
  facet_grid(set + samesource ~ bin_situation, scales = "free", space = "free_y") + 
  theme_bw() + 
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) + 
  labs(x = "Number of Bins", y = "Test Case", fill = "Feature", title = "Second Feature")

# Plots the third feature for each case for each bin situation
logistic_chosen_features %>%
  group_by(bin_situation, case) %>%
  mutate(nlevels = length(levels(factor(third)))) %>%
  ungroup() %>%
  arrange(bin_situation, set, samesource, desc(nlevels), case) %>%
  mutate(order = rep(rep(1:length(unique(case)), each = 5, 4))) %>%
  ggplot(aes(x = nbins, y = order, fill = third)) + 
  geom_tile() + 
  facet_grid(set + samesource ~ bin_situation, scales = "free", space = "free_y") + 
  theme_bw() + 
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) + 
  labs(x = "Number of Bins", y = "Test Case", fill = "Feature", title = "Third Feature")

Session Info

sessionInfo()

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] forcats_0.5.0          stringr_1.4.0          dplyr_1.0.2           
##  [4] purrr_0.3.4            readr_1.3.1            tidyr_1.1.2           
##  [7] tibble_3.0.3           ggplot2_3.3.2.9000     tidyverse_1.3.0       
## [10] randomForest_4.6-14    lime_0.5.1             gretchenalbrecht_0.1.0
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_1.4-1        ellipsis_0.3.1          class_7.3-17           
##   [4] fs_1.5.0                rstudioapi_0.11         farver_2.0.3           
##   [7] DT_0.15                 prodlim_2019.11.13      fansi_0.4.1            
##  [10] lubridate_1.7.9         xml2_1.3.2              codetools_0.2-16       
##  [13] splines_4.0.2           robustbase_0.93-6       knitr_1.29             
##  [16] shinythemes_1.1.2       jsonlite_1.7.1          pROC_1.16.2            
##  [19] caret_6.0-86            broom_0.7.0             dbplyr_1.4.4           
##  [22] shiny_1.5.0             compiler_4.0.2          httr_1.4.2             
##  [25] backports_1.1.10        assertthat_0.2.1        Matrix_1.2-18          
##  [28] fastmap_1.0.1           lazyeval_0.2.2          cli_2.0.2              
##  [31] later_1.1.0.1           htmltools_0.5.0         tools_4.0.2            
##  [34] gtable_0.3.0            glue_1.4.2              reshape2_1.4.4         
##  [37] Rcpp_1.0.5              cellranger_1.1.0        vctrs_0.3.4            
##  [40] nlme_3.1-148            iterators_1.0.12        crosstalk_1.1.0.1      
##  [43] timeDate_3043.102       gower_0.2.2             xfun_0.17              
##  [46] irr_0.84.1              rvest_0.3.6             mime_0.9               
##  [49] lpSolve_5.6.15          miniUI_0.1.1.1          lifecycle_0.2.0        
##  [52] DEoptimR_1.0-8          MASS_7.3-51.6           zoo_1.8-8              
##  [55] scales_1.1.1            ipred_0.9-9             hms_0.5.3              
##  [58] promises_1.1.1          yaml_2.2.1              curl_4.3               
##  [61] rpart_4.1-15            stringi_1.5.3           foreach_1.5.0          
##  [64] TTR_0.24.2              manipulateWidget_0.10.1 lava_1.6.7             
##  [67] shape_1.4.4             rlang_0.4.7             pkgconfig_2.0.3        
##  [70] rgl_0.100.54            evaluate_0.14           lattice_0.20-41        
##  [73] bulletr_0.1.0.9000      recipes_0.1.13          htmlwidgets_1.5.1      
##  [76] labeling_0.3            tidyselect_1.1.0        plyr_1.8.6             
##  [79] magrittr_1.5            R6_2.4.1                generics_0.0.2         
##  [82] DBI_1.1.0               pillar_1.4.6            haven_2.3.1            
##  [85] withr_2.2.0             mgcv_1.8-31             xts_0.12.1             
##  [88] nnet_7.3-14             survival_3.1-12         modelr_0.1.8           
##  [91] crayon_1.3.4            utf8_1.1.4              plotly_4.9.2.1         
##  [94] rmarkdown_2.3           grid_4.0.2              readxl_1.3.1           
##  [97] data.table_1.13.0       blob_1.2.1              ModelMetrics_1.2.2.2   
## [100] reprex_0.3.0            digest_0.6.25           webshot_0.5.2          
## [103] xtable_1.8-4            httpuv_1.5.4            stats4_4.0.2           
## [106] munsell_0.5.0           glmnet_4.0-2            viridisLite_0.3.0      
## [109] smoother_1.1

Assessing Lime

Katherine Goode

September 29, 2020

Random Forest

Bar Charts of Bins

One Implementation

Accuracy

Consistency

Multiple Implementations

Accuracy

Consistency

Rankings

Simpler Random Forest

Model

Applying LIME

Visualizations

Logistic Regression

Visualizing Default LIME

One Implementation

Accuracy

Consistency

Session Info