This journal explains the objectives and ideas for the research project, documented in this series of journals, on applying LIME to the Hamby bullet data.
Objectives
Overall
- Understand how LIME works
- Consider ways to improve LIME
- Apply LIME (or some improved version of LIME) to interpret the random forest model fit to the Hamby bullet data
Ideas
LIME diagnostics
- add in feature selection methods to LIME input options
- think of a way to compute consistency across the top two features
- Siggi suggests refitting the RF model to the perturbations and then continuing with LIME using the RF predictions from the new model; this may help to determine whether the problems are due to the sampling procedure or to LIME itself
- he also suggested looking into SMOTE for dealing with the class imbalance when sampling
- compare the simple models based on different numbers of bins using an F-test
- include a penalty for the number of parameters when choosing bins
- look at the AUC after binning
- compute a likelihood ratio (probability of TRUE / probability of FALSE) from the LIME ridge regression
- try visualizing the features from the test data using dimension reduction and coloring them by the variables that lime suggests are important
- could try fitting a regression with interactions and see if LIME does a good job of explaining a model that is already interpretable
- come up with a test to compare between global and local explanations
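The F-test idea above can be sketched quickly. This is a minimal illustration with a toy signal standing in for the black-box predictions (the data, bin counts, and equal-width binning are all assumptions for the sketch, not the project's actual setup); 4 equal-width bins are nested inside 8, so a standard nested-model F-test applies:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stand-in for black-box predictions: a smooth nonlinear signal plus noise
x = rng.uniform(0, 1, 300)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 300)

def binned_design(x, n_bins):
    # indicator (one-hot) columns for equal-width bins on [0, 1]
    edges = np.linspace(0, 1, n_bins + 1)
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    X = np.zeros((x.size, n_bins))
    X[np.arange(x.size), idx] = 1.0
    return X

def rss(X, y):
    # residual sum of squares of a least-squares fit
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

# 4 equal-width bins are nested inside 8, so the nested-model F-test applies
rss4 = rss(binned_design(x, 4), y)
rss8 = rss(binned_design(x, 8), y)
df1, df2 = 8 - 4, x.size - 8
F = ((rss4 - rss8) / df1) / (rss8 / df2)
print(round(F, 1))
```

A large F statistic here says the finer binning explains the black-box output significantly better; comparing non-nested bin counts would instead need the penalized criterion mentioned above.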
Understanding LIME
- Run a simulation to understand whether LIME is working
  - could implement a couple of local linear dependencies and piece these together
  - could include interactions in the model
  - does lime find the local models?
- try fitting a LASSO logistic model and a leave-one-out approach (for multicollinearity)
- try the reticulate R package to apply the Python version of lime
- look into literature on binning methods
- think about why R^2 would be better for some binning methods
- read the new paper on Anchor
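The simulation idea above can be prototyped without the lime package at all. Below is a hand-rolled LIME-style step (perturb around an instance, weight by proximity, fit a weighted ridge) applied to a toy black box with a known local linear dependence; the black box, kernel width, and ridge penalty are all assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy black box with known local structure: near x0 = (0, 0), only feature 0 matters
def black_box(X):
    return np.where(X[:, 0] > 1, X[:, 1], X[:, 0])

x0 = np.zeros(2)

# LIME-style step: perturb around x0, weight by proximity, fit a weighted ridge
n, width, lam = 2000, 0.5, 1e-3
Z = x0 + rng.normal(0, 1, (n, 2))                 # perturbations
y = black_box(Z)                                  # black-box predictions
w = np.exp(-np.sum((Z - x0) ** 2, axis=1) / width ** 2)  # proximity kernel
A = np.hstack([np.ones((n, 1)), Z])               # intercept + two features
Aw = A * w[:, None]
beta = np.linalg.solve(A.T @ Aw + lam * np.eye(3), A.T @ (w * y))
local_coefs = beta[1:]
print(local_coefs)  # expect a large coefficient on feature 0, near zero on feature 1
```

If the recovered coefficients match the known local dependence, the machinery is working; repeating this with interactions, or with pieced-together local models, probes the cases listed above.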
Possible Improvements to LIME
- determine the best number of bins to use for each variable
- try out the subsampling idea
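One way to pick the number of bins per variable, combining the bin-selection idea with the penalty on the number of parameters, is a BIC-style criterion. A minimal sketch on toy data (the data and the equal-width binning are assumptions; the real project would score the black-box predictions for each feature):

```python
import numpy as np

rng = np.random.default_rng(2)

# toy variable with a nonlinear relationship to the black-box output
x = rng.uniform(0, 1, 400)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 400)

def bic_for_bins(x, y, n_bins):
    # fit a piecewise-constant (binned) model, score with a BIC-style penalty
    edges = np.linspace(0, 1, n_bins + 1)
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    X = np.zeros((x.size, n_bins))
    X[np.arange(x.size), idx] = 1.0
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    n = x.size
    return n * np.log(rss / n) + n_bins * np.log(n)

candidates = range(2, 21)
best = min(candidates, key=lambda b: bic_for_bins(x, y, b))
print(best)
```

The log(n) multiplier on the bin count is the penalty for extra parameters; swapping it for 2 (AIC) or a cross-validated score are obvious variants.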
Concerns
The following are some of the concerns that we have with the current state of the LIME algorithm.
- I'm nervous about the fact that the results can change due to the perturbations. Is there a way to check for consistency? Does this only happen if you have correlated variables, or can it also happen with uncorrelated variables?
- When you have a large number of predictions to assess, would it be a good idea to focus in on the ones that have the best fitting linear model or produce the most consistent results?
- What can be done to improve the linear regression model fit? Maybe adjusting the number of bins or the kernel width would help with this.
- We think that the model explainer needs to be close enough to the complex model that it is trying to explain in order to do a good job of providing explanations. For example, the binned regression works okay with neural networks, but it does not work well with a random forest. Maybe a tree explainer or a logistic regression explainer would work better with a random forest.
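The last concern can be probed with a small sketch. Here a piecewise-constant function stands in for a random forest (an assumption made to keep the example dependency-free; it mimics the axis-aligned steps of tree predictions), and a single-split "tree" surrogate is compared against a linear surrogate:

```python
import numpy as np

rng = np.random.default_rng(3)

# stand-in for a random forest: piecewise-constant, axis-aligned step output
def forest_like(X):
    return (X[:, 0] > 0.5).astype(float)

X = rng.uniform(0, 1, (500, 2))
y = forest_like(X)

def r2(pred, y):
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

# linear surrogate
A = np.hstack([np.ones((500, 1)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
r2_linear = r2(A @ beta, y)

# depth-1 tree surrogate: the best single axis-aligned split
best_r2 = -np.inf
for j in range(X.shape[1]):
    for t in np.quantile(X[:, j], np.linspace(0.05, 0.95, 19)):
        left = X[:, j] <= t
        pred = np.where(left, y[left].mean(), y[~left].mean())
        best_r2 = max(best_r2, r2(pred, y))

print(round(r2_linear, 2), round(best_r2, 2))
```

The single split tracks the step almost exactly while the linear fit cannot, which supports the intuition that the explainer family should match the shape of the complex model's decision surface.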
General Thoughts
- LIME is kind of like a jackknife technique