This journal explains the objectives and ideas for the research project on applying LIME to the Hamby bullet data that is detailed in this series of journals.

Objectives

Overall

  • Understand how LIME works
  • Consider ways to improvement LIME
  • Apply LIME (or some improved version of LIME) to interpret the random forest model fit to the Hamby bullet data

Ideas

LIME diagnostics

  • add in feature selection methods to LIME input options
  • think of a way to compute consistency across top two features
  • Siggi suggests refitting the RF model to the perturbations and then continuing with LIME with the RF predictions from the new model - this may help to understand if the problems are due to the sampling procedure or LIME itself
  • he also suggested looking into SMOTE for dealing the imbalance in the classes with sampling
  • compare the simple models based on different number of bins using an F-test
  • include a penalty for the number of parameters when choosing bins
  • look at the AUC after binning
  • compute a likelihood ratio prob TRUE / prob FALSE from the LIME ridge regression
  • try visualizing the features from the test data using dimension reduction and coloring them by variables suggested to be important by lime
  • could try fitting a regression with interactions and see if LIME does a good job of explaining a model that is already interpretable
  • come up with a test to compare between global and local explanations

Understanding LIME

  • Run a simulation to understand if LIME is working
    • could implement a couple of local linear dependencies
    • piece this together
    • could include interactions in the model
    • does lime find the local models?
  • try fitting LASSO logistic model and leave one out approach (for multicollinarity)
  • try reticulate R package to apply python version of lime
  • look into literature on binning methods
  • think about why R^2 would be better for some binning methods
  • read new paper on Anchor

Possible Improvements to LIME

  • determine the best number of bins to use for each variable
  • try out subsampling idea

Concerns

The following are some of the concerns that we have with the current state of the LIME algorithm.

  • I’m nervous about the fact that the results can change due to the permutations. Is there a way to check for consistency? Does this only happen if you have correlated variables, or can it also happen with uncorrelated variables?
  • When you have a large number of predictions to assess, would it be a good idea to focus in on the ones that have the best fitting linear model or produce the most consistent results?
  • What can be done to improve the linear regression model fit? Maybe adjusting the number of bins or the kernel width would help with this.
  • We think that the model explainer needs to be close enough to the complex model that it is trying to explain in order to do a good job of providing explanations. For example, the binned regression works okay with neural networks, but it does not work well with a random forest. Maybe a tree explainer or a logistic regression explainer would work better with a random forest.

General Thoughts

  • LIME is kind of like a jackknife technique