This journal explains the objectives and ideas for the research project, documented in this series of journals, on applying LIME to the Hamby bullet data.
Objectives
Overall
- Understand how LIME works
- Consider ways to improve LIME
- Apply LIME (or some improved version of LIME) to interpret the random forest model fit to the Hamby bullet data
Ideas
LIME diagnostics
- add in feature selection methods to LIME input options
- think of a way to compute consistency across the top two features
- Siggi suggests refitting the RF model to the perturbations and then continuing with LIME using the RF predictions from the new model; this may help to determine whether the problems are due to the sampling procedure or to LIME itself
- he also suggested looking into SMOTE for dealing with the class imbalance when sampling
- compare the simple models based on different numbers of bins using an F-test
- include a penalty for the number of parameters when choosing bins
- look at the AUC after binning
- compute a likelihood ratio (probability of TRUE / probability of FALSE) from the LIME ridge regression
- try visualizing the features from the test data using dimension reduction and coloring them by the variables that lime suggests are important
- could try fitting a regression with interactions and see if LIME does a good job of explaining a model that is already interpretable
- come up with a test to compare between global and local explanations
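The F-test idea above can be sketched quickly. This is a minimal illustration with a toy signal standing in for the black-box predictions (the data, bin counts, and equal-width binning are all assumptions for the sketch, not the project's actual setup); 4 equal-width bins are nested inside 8, so a standard nested-model F-test applies:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stand-in for black-box predictions: a smooth nonlinear signal plus noise
x = rng.uniform(0, 1, 300)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 300)

def binned_design(x, n_bins):
    # indicator (one-hot) columns for equal-width bins on [0, 1]
    edges = np.linspace(0, 1, n_bins + 1)
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    X = np.zeros((x.size, n_bins))
    X[np.arange(x.size), idx] = 1.0
    return X

def rss(X, y):
    # residual sum of squares of a least-squares fit
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

# 4 equal-width bins are nested inside 8, so the nested-model F-test applies
rss4 = rss(binned_design(x, 4), y)
rss8 = rss(binned_design(x, 8), y)
df1, df2 = 8 - 4, x.size - 8
F = ((rss4 - rss8) / df1) / (rss8 / df2)
print(round(F, 1))
```

A large F statistic here says the finer binning explains the black-box output significantly better; comparing non-nested bin counts would instead need the penalized criterion mentioned above.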
Understanding LIME
- Run a simulation to understand whether LIME is working
  - could implement a couple of local linear dependencies and piece these together
  - could include interactions in the model
  - does lime find the local models?
- try fitting a LASSO logistic model and a leave-one-out approach (for multicollinearity)
- try the reticulate R package to apply the Python version of lime
- look into literature on binning methods
- think about why R^2 would be better for some binning methods
- read the new paper on Anchor
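The simulation idea above can be prototyped without the lime package at all. Below is a hand-rolled LIME-style step (perturb around an instance, weight by proximity, fit a weighted ridge) applied to a toy black box with a known local linear dependence; the black box, kernel width, and ridge penalty are all assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy black box with known local structure: near x0 = (0, 0), only feature 0 matters
def black_box(X):
    return np.where(X[:, 0] > 1, X[:, 1], X[:, 0])

x0 = np.zeros(2)

# LIME-style step: perturb around x0, weight by proximity, fit a weighted ridge
n, width, lam = 2000, 0.5, 1e-3
Z = x0 + rng.normal(0, 1, (n, 2))                 # perturbations
y = black_box(Z)                                  # black-box predictions
w = np.exp(-np.sum((Z - x0) ** 2, axis=1) / width ** 2)  # proximity kernel
A = np.hstack([np.ones((n, 1)), Z])               # intercept + two features
Aw = A * w[:, None]
beta = np.linalg.solve(A.T @ Aw + lam * np.eye(3), A.T @ (w * y))
local_coefs = beta[1:]
print(local_coefs)  # expect a large coefficient on feature 0, near zero on feature 1
```

If the recovered coefficients match the known local dependence, the machinery is working; repeating this with interactions, or with pieced-together local models, probes the cases listed above.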
Possible Improvements to LIME
- determine the best number of bins to use for each variable
- try out the subsampling idea
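One way to pick the number of bins per variable, combining the bin-selection idea with the penalty on the number of parameters, is a BIC-style criterion. A minimal sketch on toy data (the data and the equal-width binning are assumptions; the real project would score the black-box predictions for each feature):

```python
import numpy as np

rng = np.random.default_rng(2)

# toy variable with a nonlinear relationship to the black-box output
x = rng.uniform(0, 1, 400)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 400)

def bic_for_bins(x, y, n_bins):
    # fit a piecewise-constant (binned) model, score with a BIC-style penalty
    edges = np.linspace(0, 1, n_bins + 1)
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    X = np.zeros((x.size, n_bins))
    X[np.arange(x.size), idx] = 1.0
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    n = x.size
    return n * np.log(rss / n) + n_bins * np.log(n)

candidates = range(2, 21)
best = min(candidates, key=lambda b: bic_for_bins(x, y, b))
print(best)
```

The log(n) multiplier on the bin count is the penalty for extra parameters; swapping it for 2 (AIC) or a cross-validated score are obvious variants.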
Concerns
The following are some of the concerns that we have with the current state of the LIME algorithm.
- I'm nervous about the fact that the results can change due to the perturbations. Is there a way to check for consistency? Does this only happen if you have correlated variables, or can it also happen with uncorrelated variables?
- When you have a large number of predictions to assess, would it be a good idea to focus in on the ones that have the best fitting linear model or produce the most consistent results?
- What can be done to improve the linear regression model fit? Maybe adjusting the number of bins or the kernel width would help with this.
- We think that the model explainer needs to be close enough to the complex model that it is trying to explain in order to do a good job of providing explanations. For example, the binned regression works okay with neural networks, but it does not work well with a random forest. Maybe a tree explainer or a logistic regression explainer would work better with a random forest.
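The last concern can be probed with a small sketch. Here a piecewise-constant function stands in for a random forest (an assumption made to keep the example dependency-free; it mimics the axis-aligned steps of tree predictions), and a single-split "tree" surrogate is compared against a linear surrogate:

```python
import numpy as np

rng = np.random.default_rng(3)

# stand-in for a random forest: piecewise-constant, axis-aligned step output
def forest_like(X):
    return (X[:, 0] > 0.5).astype(float)

X = rng.uniform(0, 1, (500, 2))
y = forest_like(X)

def r2(pred, y):
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

# linear surrogate
A = np.hstack([np.ones((500, 1)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
r2_linear = r2(A @ beta, y)

# depth-1 tree surrogate: the best single axis-aligned split
best_r2 = -np.inf
for j in range(X.shape[1]):
    for t in np.quantile(X[:, j], np.linspace(0.05, 0.95, 19)):
        left = X[:, j] <= t
        pred = np.where(left, y[left].mean(), y[~left].mean())
        best_r2 = max(best_r2, r2(pred, y))

print(round(r2_linear, 2), round(best_r2, 2))
```

The single split tracks the step almost exactly while the linear fit cannot, which supports the intuition that the explainer family should match the shape of the complex model's decision surface.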
General Thoughts
- LIME is kind of like a jackknife technique