This journal describes the Hamby datasets and the models fit to them. It documents the specification and cleaning of the training and testing datasets and includes some initial visualizations of the random forest model. The models considered are the random forest model rtrees and several logistic regression models.

# Load libraries
library(bulletxtrctr)
library(cowplot)
library(ggConvexHull) #devtools::install_github("cmartin/ggConvexHull")
library(GGally)
library(ggpcp)
library(glmnet)
library(plotly)
library(randomForest)
library(tidyverse)

Hamby Data

Description

The Hamby data is based on several test sets of bullets from the study described in the paper “The Identification of Bullets Fired from 10 Consecutively Rifled 9mm Ruger Pistol Barrels: A Research Project Involving 507 Participants from 20 Countries” by James E. Hamby et al. In this study, sets of bullets from both “known” and “unknown” gun barrels were sent to firearm examiners around the world. The examiners were asked to use the known bullets to identify which barrels the unknown bullets came from.

The test sets were created using 1 pistol and 10 barrels. Each test set contains a total of 35 bullets, which are made up of 20 known bullets and 15 unknown bullets. The 20 known bullets were created by firing two bullets from each of the 10 barrels. These are referred to as known bullets, because when they were sent to the firearm examiners, the barrel number that each bullet was fired from was listed with the bullet. The 15 unknown bullets were created by firing 15 bullets in some manner from the 10 barrels such that at least one unknown bullet came from each barrel and no more than three unknown bullets came from the same barrel. These are referred to as the unknown bullets, because when they were sent to the firearm examiners, the barrel number that the bullet was fired from was not listed with the bullet. A total of 240 test sets were created for the study.

CSAFE has access to test sets 44, 173, and 252. Each bullet was scanned 6 times using a high-powered microscope to obtain an image of each of its 6 lands. The scans for test sets 173 and 252 were done by NIST, and the scans for test set 44 were done by CSAFE. The data from these images were processed to obtain a signature associated with each land. The paper “Automatic Matching of Bullet Land Impressions” by Hare, Hofmann, and Carriquiry (https://arxiv.org/abs/1601.05788) describes in more detail how the signatures were obtained.

CSAFE aggregated the signatures from test sets 173 and 252 and kept the signatures from test set 44 separate. Within these two groups, pairs of lands were evaluated to determine how similar the signatures from the two lands were. This was done by computing a set of variables chosen to capture how alike the two signatures were. Some of these variables are described in Hare, Hofmann, and Carriquiry. The vignette at https://github.com/heike/bulletxtrctr/blob/master/vignettes/features.Rmd includes some additional descriptions. Note that it was not possible to evaluate all pairs of lands due to tank rash on some of the lands. The dataset created from the comparisons of test sets 173 and 252 will be referred to as hamby173and252 throughout these journals. Hare, Hofmann, and Carriquiry used hamby173and252 as a training dataset to fit the random forest model rtrees. The dataset created from the comparisons of test set 44 will be referred to as hamby44 and is not used in this research project.

Both the training and testing data require a vector of the names of the features used by the rtrees random forest included in the bulletxtrctr library. I created such a vector for use in this journal; it is printed below and contains nine features.

# Obtain features used when fitting the rtrees random forest
rf_features <- rownames(bulletxtrctr::rtrees$importance)
rf_features
## [1] "ccf"        "rough_cor"  "D"          "sd_D"       "matches"   
## [6] "mismatches" "cms"        "non_cms"    "sum_peaks"

At some point, I would like to write down definitions of all the features.

  • ccf:
  • cms:
  • D:
  • matches:
  • mismatches:
  • non_cms:
  • rough_cor:
  • sd_D:
  • sum_peaks:

Training Data

This was originally part of a journal entry that I wrote in my ‘Case Studies with LIME’ repository. I took the code that I used to clean the training dataset from that entry and updated it here. It should still produce essentially the same dataset (with some possible changes to the level names of some of the barrels due to a ‘factor’ issue). The dataset saved from this journal is the one that I am using for this research project.

Raw Data

The dataset loaded in below is the original Hamby 173 and 252 dataset that Heike gave to me. Note that when the hamby173and252 dataset is read in, the studies called “Cary” are excluded. The data file contains rows based on bullet scans from a different study; these rows are no longer included since Heike found the study they came from to be poorly executed.

# Load in the Hamby 173 and 252 dataset
hamby173and252_raw <- read.csv("../../../data/raw/features-hamby173and252.csv") %>%
  filter(study1 != "Cary", study2 != "Cary") %>%
  mutate(study1 = factor(study1), study2 = factor(study2))

Number of Rows

If we include symmetric comparisons, each set of test bullets should result in a dataset with \[(35 \mbox{ bullets} \times 6 \mbox{ lands})^2=44100 \mbox{ rows},\] where a row would contain information on a pair of lands. If we do not include the symmetric comparisons, then the dataset should have \[\frac{(44100 \mbox{ rows} - (35 \mbox{ bullets} \times 6))}{2} + (35 \mbox{ bullets} \times 6) = 22155 \mbox{ rows}.\] However, when I looked at the dimensions of the datasets, neither of these seems to be the case. See the R code and output below. Note that hamby173 is currently incorrectly labeled as hamby44. Both test sets have fewer than, but close to, 22,155 rows. This suggests that these datasets do not include symmetric comparisons. When I checked with Heike, she confirmed that this is the case. The table also shows that there are comparisons across hamby173 and hamby252. The missing observations will be explored more in the next section.
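The row-count arithmetic above can be verified directly in R:

```r
# Check the expected row counts for one test set
n_lands <- 35 * 6                      # 210 land signatures per test set
with_symmetric <- n_lands^2            # all ordered pairs of lands
# Drop one copy of each off-diagonal pair, keep the diagonal
without_symmetric <- (with_symmetric - n_lands) / 2 + n_lands
c(with_symmetric, without_symmetric)
## [1] 44100 22155
```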

# Summary of the number of observations in the Hamby173and252 database
table(hamby173and252_raw$study1, hamby173and252_raw$study2)
##           
##            Hamby252 Hamby44
##   Hamby252    20910   16862
##   Hamby44     25573   21321

Missing Observations

The plot below considers the number of observations within each barrel and bullet comparison from all known cases in the Hamby 173 and 252 data. We can see that the observations on the lower diagonals are missing in all cases, which confirms that the symmetric comparisons were not included in the data. Additionally, a handful of cases have fewer than 36 observations. For the comparisons within the Hamby 173 or Hamby 252 study, the cells on the diagonals have fewer than 36, because none of the repeats from the symmetric comparisons of lands are included. The cells above the diagonal with fewer than 36 observations are missing some observations due to tank rash. For the comparisons across studies, the cases with fewer than expected are also due to tank rash. For some reason, in the comparisons between bullet 1 from Hamby 173 and bullet 1 from Hamby 252, the cells are colored gray even though they have 36 observations. I am not sure why this is…

# Create the plot to look at number of comparisons within the known bullets
countplot <- hamby173and252_raw %>%
  filter(barrel1 %in% c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"),
         barrel2 %in% c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10")) %>%
  group_by(study1, study2, barrel1, barrel2, bullet1, bullet2) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = barrel1, y = barrel2)) +
  geom_tile(aes(fill = count)) + 
  facet_grid(study1 + bullet1 ~ study2 + bullet2, scales = "free") +
  theme_minimal() + 
  scale_fill_distiller(palette = "GnBu", direction = 1) + 
  theme(legend.position = "bottom")

# Make the plot interactive
ggplotly(countplot, width = 800, height = 700) %>%
  shiny::div(align = "center")

Data Cleaning

The code below cleans the training data. The cleaning involves:

  • correcting the study 173 labels
  • adjusting the bullet and barrel values for the unknowns
  • renaming the match variable as samesource
  • selecting the desired variables

# Determine the letters associated with the unknown bullets
all_barrel_labels <-
  unique(c(
    as.character(hamby173and252_raw$barrel1),
    as.character(hamby173and252_raw$barrel2)
  ))
letters <- all_barrel_labels[!(all_barrel_labels %in% 1:10)]
letters
##  [1] "A" "B" "C" "E" "G" "H" "I" "L" "M" "N" "R" "U" "V" "W" "X" "D" "F" "J" "Q"
## [20] "S" "Y" "Z"
#letters <- levels(hamby173and252_raw$barrel1)[11:length(levels(hamby173and252_raw$barrel1))]

# Cleaning the testing data
hamby173and252_train_cleaning <- 
  hamby173and252_raw %>%
  mutate(study1 = fct_recode(study1, "Hamby173" = "Hamby44"),
         study2 = fct_recode(study2, "Hamby173" = "Hamby44")) %>%
  mutate(across(where(is.factor), as.character)) %>%
  mutate(bullet1 = ifelse(barrel1 %in% letters, barrel1, bullet1),
         barrel1 = ifelse(barrel1 %in% letters, "Unknown", barrel1),
         bullet2 = ifelse(barrel2 %in% letters, barrel2, bullet2),
         barrel2 = ifelse(barrel2 %in% letters, "Unknown", barrel2),
         land1 = as.character(land1),
         land2 = as.character(land2),
         rfscore = predict(bulletxtrctr::rtrees, hamby173and252_raw %>%
                             select(rf_features), 
                           type = "prob")[,2]) %>%
  rename(samesource = match) %>%
  select(study1, barrel1, bullet1, land1, study2, barrel2, bullet2, land2,
         rf_features, samesource, rfscore)

Removing Bullets with Tank Rash

I discovered that the number of rtrees predictions does not match the number of rows in my training data. I need to ask Heike about this.

# Compare the dimensions of my training data and the number of predictions from rtrees
dim(hamby173and252_train_cleaning)
## [1] 84666    19
dim(predict(bulletxtrctr::rtrees, type = "prob"))
## [1] 83028     2

When I talked to Heike about this, she told me that this training dataset probably still contains the four land impressions that had tank rash and were removed before rtrees was fit. When I looked in Eric’s paper, it said that the four removed lands were the following:

  • barrel 6 bullet 2-1
  • barrel 9 bullet 2-4
  • unknown bullet B-2
  • unknown bullet Q-4

The code below removes any comparisons from the testing data that include one of these lands.

# Remove the comparisons involving the four lands with tank rash
hamby173and252_train <- hamby173and252_train_cleaning %>%
  mutate(bbl1 = paste(barrel1, bullet1, land1, sep = ":"),
         bbl2 = paste(barrel2, bullet2, land2, sep = ":")) %>%
  filter(!(bbl1 %in% c("6:2:1", "9:2:4", "Unknown:B:2", "Unknown:Q:4") | 
           bbl2 %in% c("6:2:1", "9:2:4", "Unknown:B:2", "Unknown:Q:4"))) %>%
  select(-bbl1, -bbl2)

When I look at the dimensions of my training data and the number of predictions from rtrees now, they agree!

# Compare the dimensions of my further cleaned training data and the
# number of predictions from rtrees
dim(hamby173and252_train)
## [1] 83028    19
dim(predict(bulletxtrctr::rtrees, type = "prob"))
## [1] 83028     2

It is possible that the labels that Eric was using are different from the labels in my training dataset. To check whether I removed the correct data points, Heike suggested that I look at parallel coordinate plots to see if the removed observations stand out as unusual compared to the rest of the data. The code below prepares the training data, before observations were removed, for plotting.

# Create dataset for creating the parallel coordinate plot
par_data <- hamby173and252_train_cleaning %>%
  mutate(bbl1 = paste(barrel1, bullet1, land1, sep = ":"),
         bbl2 = paste(barrel2, bullet2, land2, sep = ":")) %>%
  mutate(category = factor(ifelse(!(bbl1 %in% c("6:2:1", "9:2:4", "Unknown:B:2",
                                                "Unknown:Q:4") | 
                                      bbl2 %in% c("6:2:1", "9:2:4", "Unknown:B:2",
                                                  "Unknown:Q:4")),
                           "keep", "remove"))) %>%
  arrange(category) %>%
  mutate(samesource_colors = ifelse(samesource == TRUE, rgb(1, 0, 0, alpha = 0.05), 
                                    rgb(0, 0, 0, alpha = 0.001)),
         samesource_colors2 = ifelse(samesource == TRUE, rgb(1, 0, 0, alpha = 0.001), 
                                    rgb(0, 0, 0, alpha = 0.01)),
         tankrash_colors = ifelse(category == "keep", rgb(0, 0, 1, alpha = 0.05),
                                  rgb(0, 0, 0, alpha = 0.01)))

The first plot shows the observations that were kept in blue and the observations that were removed in black. These observations do not stand out to me, so it is not clear whether these were the ones that were supposed to be removed. The second plot shows the observations from matches plotted in red, and the third plot shows the observations from non-matches plotted in black.

MASS::parcoord(par_data %>% select(ccf:sum_peaks), col = par_data$tankrash_colors)

MASS::parcoord(par_data %>% select(ccf:sum_peaks), col = par_data$samesource_colors)

MASS::parcoord(par_data %>% select(ccf:sum_peaks), col = par_data$samesource_colors2)

Saving the Data

The cleaned data is saved and used as the training dataset for the rest of this research project.

# Save the datasets and response variables as .csv files
write.csv(hamby173and252_train, "../../../data/hamby173and252_train.csv", row.names = FALSE)

Testing Data

Raw Data

The original data files given to me by Heike are loaded in below. For now, we are working with only sets 1 and 11 from the Hamby 224 study. She may provide me with more in the future.

# Load in the Hamby 224 datasets
hamby224_set1 <- readRDS("../../../data/raw/h224-set1-features.rds")
hamby224_set11 <- readRDS("../../../data/raw/h224-set11-features.rds")

Data Cleaning

The code below cleans the data from both sets 1 and 11. This involves:

  • selecting the desired variables
  • renaming the bullet and land variables
  • creating study and set variables
  • re-coding the bullet and land names

# Clean the Hamby 224 set 1 data
hamby224_set1_cleaned <- hamby224_set1 %>%
  select(-bullet_score, -land1, -land2, -aligned, -striae, -features) %>%
  rename(bullet1 = bulletA,
         bullet2 = bulletB, 
         land1 = landA,
         land2 = landB) %>%
  mutate(study = factor("Hamby 224"), 
         set = factor("Set 1"),
         bullet1 = recode(factor(bullet1), 
                          "1" = "Known 1", "2" = "Known 2", "Q" = "Questioned"),
         bullet2 = recode(factor(bullet2), 
                          "1" = "Known 1", "2" = "Known 2", "Q" = "Questioned"),
         land1 = recode(factor(land1), 
                        "1" = "Land 1", "2" = "Land 2", "3" = "Land 3", 
                        "4" = "Land 4", "5" = "Land 5", "6" = "Land 6"),
         land2 = recode(factor(land2), 
                        "1" = "Land 1", "2" = "Land 2", "3" = "Land 3", 
                        "4" = "Land 4", "5" = "Land 5", "6" = "Land 6")) %>%
  select(study, set, bullet1:land2, rf_features, rfscore, samesource)

# Clean the Hamby 224 set 11 data
hamby224_set11_cleaned <- hamby224_set11 %>%
  select(-bullet_score, -land1, -land2, -aligned, -striae, -features) %>%
  rename(bullet1 = bulletA,
         bullet2 = bulletB, 
         land1 = landA,
         land2 = landB) %>%
  mutate(study = factor("Hamby 224"), 
         set = factor("Set 11"),
         bullet1 = recode(factor(bullet1), 
                          "Bullet 1" = "Known 1", "Bullet 2" = "Known 2", 
                          "Bullet I" = "Questioned"),
         bullet2 = recode(factor(bullet2), 
                          "Bullet 1" = "Known 1", "Bullet 2" = "Known 2", 
                          "Bullet I" = "Questioned")) %>%
  select(study, set, bullet1:land2, rf_features, rfscore, samesource)

The cleaned data from sets 1 and 11 are combined below into the testing dataset. Rows are added for the missing comparisons from the Hamby 224 study, and some additional cleaning is done.

# Create a dataset with all combinations of lands and bullets comparisons for each set
combinations <- data.frame(set = factor(rep(c("Set 1", "Set 11"), each = 324)),
                    expand.grid(land1 = factor(c("Land 1", "Land 2", "Land 3", 
                                                 "Land 4", "Land 5", "Land 6")),
                                land2 = factor(c("Land 1", "Land 2", "Land 3", 
                                                 "Land 4", "Land 5", "Land 6")),
                                bullet1 = factor(c("Known 1", "Known 2", "Questioned")),
                                bullet2 = factor(c("Known 1", "Known 2", "Questioned"))))

# Join the two cleaned Hamby 224 sets into one testing set
hamby224_test <- suppressWarnings(bind_rows(hamby224_set1_cleaned,
                                            hamby224_set11_cleaned)) %>%
  mutate(set = factor(set),
         bullet1 = factor(bullet1),
         bullet2 = factor(bullet2),
         land1 = factor(land1),
         land2 = factor(land2)) %>%
  right_join(combinations, by = c("set", "land1", "land2", "bullet1", "bullet2")) %>%
  filter(!(bullet1 == "Questioned" & bullet2 == "Known 1"),
         !(bullet1 == "Questioned" & bullet2 == "Known 2"),
         !(bullet1 == "Known 2" & bullet2 == "Known 1")) %>%
  arrange(rfscore) %>%
  mutate(case = factor(1:length(study))) %>%
  select(case, study, set, bullet1, land1, bullet2, land2, ccf:sum_peaks, samesource, rfscore)

Saving the Data

The testing data file is saved below.

# Save the test data as a .csv file
write.csv(hamby224_test, "../../../data/hamby224_test.csv", row.names = FALSE)

Plots of Features

Distributions by samesource

I created the following two plots of the training data as suggested by Heike. The histograms below show the distributions of the features used in the random forest rtrees. The histograms are filled by the samesource variable, which is the ground truth of whether or not the comparison is from the same barrel and land. The default histograms make it hard to compare the distributions of the matches and non-matches, since there are many more comparisons with samesource == FALSE.

# Create plots of the feature distributions colored by samesource for the training data
hamby173and252_train %>% 
  select(rf_features, samesource) %>%
  gather(key = feature, value = value, 1:9) %>%
  select(feature, value, samesource) %>%
  ggplot(aes(x = value, fill = samesource)) + 
  geom_histogram(bins = 30) + 
  facet_wrap( ~ feature, scales = "free") +
  labs(x = "Variable Value", y = "Frequency", fill = "Same Source?",
       title = "Hamby 173 and 252 Training Data") +
  theme_bw() +
  theme(legend.position = "bottom")

By setting position = "fill" in the geom_histogram function, it is easier to compare the matches and non-matches. These plots could be used in the future to hand select the bins for lime. Additionally, a logistic regression fit to this data could be used to determine the LC10, LC50, and LC90 values, which could serve as the bins for lime.
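As a sketch of that idea, the LC values for a single feature can be obtained by fitting a one-feature logistic regression and inverting the fitted logit. The helper name lc_values and the use of ccf here are illustrative assumptions, not part of the journal's pipeline:

```r
# Illustrative sketch: fit a one-feature logistic regression of samesource on a
# chosen feature, then invert logit(p) = b0 + b1 * x at p = 0.10, 0.50, 0.90 to
# get LC10, LC50, and LC90 for that feature. Feature choice (ccf) is hypothetical.
lc_values <- function(data, feature = "ccf") {
  fit <- glm(reformulate(feature, response = "samesource"),
             data = data, family = binomial())
  b <- coef(fit)
  p <- c(LC10 = 0.10, LC50 = 0.50, LC90 = 0.90)
  # Solve logit(p) = b0 + b1 * x for x
  (qlogis(p) - b[[1]]) / b[[2]]
}

# lc_values(hamby173and252_train)  # candidate bin boundaries for lime
```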

# Create plots of the feature distributions colored by samesource for the training data
# using the position = "fill" option
hamby173and252_train %>% 
  select(rf_features, samesource) %>%
  gather(key = feature, value = value, 1:9) %>%
  select(feature, value, samesource) %>%
  ggplot(aes(x = value, fill = samesource)) + 
  geom_histogram(position = "fill", bins = 30) + 
  facet_wrap( ~ feature, scales = "free") +
  labs(x = "Variable Value", y = "Proportion", fill = "Same Source?",
       title = "Hamby 173 and 252 Training Data") +
  theme_bw() +
  theme(legend.position = "bottom")

The plots below have the same structure, but they are created with the Hamby 224 testing data. I chose not to separate the testing data by sets, but this could be done later if necessary.

# Create plots of the feature distributions colored by samesource for the testing data
hamby224_test %>% 
  select(rf_features, samesource) %>%
  gather(key = feature, value = value, 1:9) %>%
  select(feature, value, samesource) %>%
  ggplot(aes(x = value, fill = samesource)) + 
  geom_histogram(bins = 15) + 
  facet_wrap( ~ feature, scales = "free") +
  labs(x = "Variable Value", y = "Frequency", fill = "Same Source?",
       title = "Hamby 224 Testing Data") +
  theme_bw() +
  theme(legend.position = "bottom")

# Create plots of the feature distributions colored by samesource for the testing 
# data using the position = "fill" option
hamby224_test %>% 
  select(rf_features, samesource) %>%
  gather(key = feature, value = value, 1:9) %>%
  select(feature, value, samesource) %>%
  ggplot(aes(x = value, fill = samesource)) + 
  geom_histogram(position = "fill", bins = 15) + 
  facet_wrap( ~ feature, scales = "free") +
  labs(x = "Variable Value", y = "Proportion", fill = "Same Source?",
       title = "Hamby 224 Testing Data") +
  theme_bw() +
  theme(legend.position = "bottom")

Feature Interactions

hamby224_test %>%
  select(rf_features, samesource) %>%
  ggpairs(mapping = aes(color = samesource), 
          columns = rf_features)

Correlations

I made these plots to look at the correlation between features in the training data within the TRUE and FALSE cases of samesource. The features are highly correlated for the match comparisons, and the variables are clearly more correlated for the match comparisons than for the non-match comparisons. However, some pairs of variables still have relatively high correlations for the non-match comparisons.

# Create a correlation heatmap of the match comparisons in the training data
cor_match <- hamby173and252_train %>%
  select(rf_features, samesource) %>%
  filter(samesource == TRUE) %>%
  select(-samesource) %>%
  cor() %>%
  reshape2::melt() %>%
  mutate(Var1 = factor(Var1, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms")),
         Var2 = factor(Var2, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms"))) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  scale_fill_gradient2(limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") + 
  labs(x = "", y = "", fill = "Correlation",
       title = "Match Comparisons")

# Create a correlation heatmap of the non-match comparisons in the training data
cor_nonmatch <- hamby173and252_train %>%
  select(rf_features, samesource) %>%
  filter(samesource == FALSE) %>%
  select(-samesource) %>%
  cor() %>%
  reshape2::melt() %>%
  mutate(Var1 = factor(Var1, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms")),
         Var2 = factor(Var2, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms"))) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  scale_fill_gradient2(limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  labs(x = "", y = "", fill = "Correlation",
       title = "Non-Match Comparisons")

# Create a title for the panel of plots
cor_title <- ggdraw() +
  draw_label("Correlation of Feature Variables in the Training Data", 
             fontface = "bold", 
             size = 16,
             x = 0.5,
             hjust = 0.5)

# Create a joined legend for the panel of plots
cor_legend <- get_legend(cor_match + theme(legend.position = "right"))

# Create the panel of plots
plot_grid(cor_title, 
          plot_grid(cor_match, cor_nonmatch, cor_legend, ncol = 3, rel_widths = c(2, 2, 0.5)),
          ncol = 1,
          rel_heights = c(0.25, 3))

The plots below show the correlations for the testing data. The patterns in the plots look very similar to those of the training data.

# Create a correlation heatmap of the match comparisons in the testing data
cor_match_test <- hamby224_test %>%
  select(rf_features, samesource) %>%
  filter(samesource == TRUE) %>%
  select(-samesource) %>%
  cor() %>%
  reshape2::melt() %>%
  mutate(Var1 = factor(Var1, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms")),
         Var2 = factor(Var2, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms"))) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  scale_fill_gradient2(limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") + 
  labs(x = "", y = "", fill = "Correlation",
       title = "Match Comparisons")

# Create a correlation heatmap of the non-match comparisons in the testing data
cor_nonmatch_test <- hamby224_test %>%
  select(rf_features, samesource) %>%
  filter(samesource == FALSE) %>%
  select(-samesource) %>%
  cor() %>%
  reshape2::melt() %>%
  mutate(Var1 = factor(Var1, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms")),
         Var2 = factor(Var2, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms"))) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  scale_fill_gradient2(limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  labs(x = "", y = "", fill = "Correlation",
       title = "Non-Match Comparisons")

# Create a title for the panel of plots
cor_title_test <- ggdraw() +
  draw_label("Correlation of Feature Variables in the Testing Data", 
             fontface = "bold", 
             size = 16,
             x = 0.5,
             hjust = 0.5)

# Create a joined legend for the panel of plots
cor_legend_test <- get_legend(cor_match_test + theme(legend.position = "right"))

# Create the panel of plots
plot_grid(cor_title_test, 
          plot_grid(cor_match_test, cor_nonmatch_test, cor_legend_test, 
                    ncol = 3, rel_widths = c(2, 2, 0.5)),
          ncol = 1,
          rel_heights = c(0.25, 3))

Random Forest: rtrees

Description

The random forest model that was fit to the Hamby training dataset is discussed in the paper “Automatic Matching of Bullet Land Impressions” by Hare, Hofmann, and Carriquiry. The model is available through the GitHub version of the bulletxtrctr package at https://github.com/heike/bulletxtrctr.

The random forest model is stored in the bulletxtrctr package as a data object called rtrees. The paper does not include many details about the model fitting process, so I looked at some of the model attributes, which are printed below. These values told me that the model was fit using the predictive features ccf, rough_cor, D, sd_D, matches, mismatches, cms, non_cms, and sum_peaks, with 300 trees and the parameter mtry set to 3, which is the “number of variables randomly sampled as candidates at each split” as explained in the randomForest package documentation.

# Importance values from the random forest
rtrees$importance
##            MeanDecreaseGini
## ccf               567.53791
## rough_cor         666.62447
## D                 164.76236
## sd_D              101.73478
## matches           305.13778
## mismatches        261.84430
## cms                94.96370
## non_cms            96.75017
## sum_peaks         120.49836
# The number of trees used in the random forest
rtrees$ntree
## [1] 300
# "Number of variables randomly sampled as candidates at each split"
rtrees$mtry
## [1] 3

Visualizations

Random Forest Scores

Facet Version

The plot below is a prototype of the one that will be used in the app for exploring the lime explanations from the bullet matching data. It uses the data from set 1 of the testing dataset.

# Heatmap of rfscore for each comparison in set 1
hamby224_test %>%
  filter(set == "Set 1") %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2,
             text = paste('Bullets Compared: ', bullet1, "-", land1, 
                          "vs", bullet2, "-", land2,
                          '\nRandom Forest Score: ', 
                          ifelse(is.na(rfscore), "Missing due to tank rash", rfscore)))) +
  geom_tile(aes(fill = rfscore)) +
  facet_grid(bullet2 ~ bullet1, scales = "free") +
  theme_minimal() +
  scale_fill_gradient2(low = "darkgrey", high = "darkorange", midpoint = 0.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) +
  labs(x = "", y = "", fill = "RF Score")

Individual Plots Version

I also wrote some code to create the above plot using individual plots instead of facets. This is included below. (It is a bit outdated compared to my newer version of the plot.)

# Code for creating the heatmap plots using individual plots
p1 <- hamby224_test %>%
  filter(set == "Set 1", bullet1 == "Known 1", bullet2 == "Known 1") %>%
  select(case, bullet1, bullet2, land1, land2, rfscore) %>%
  distinct() %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2)) +
  geom_tile(aes(fill = rfscore)) +
  theme_minimal() +
  scale_fill_gradient2(low = "grey", high = "orange", midpoint = 0.5) +
  labs(x = "Land 1", y = "Land 2", fill = "RF Score") + 
  theme(legend.position = "none")

p2 <- hamby224_test %>%
  filter(set == "Set 1", bullet1 == "Known 1", bullet2 == "Known 2") %>%
  select(case, bullet1, bullet2, land1, land2, rfscore) %>%
  distinct() %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2)) +
  geom_tile(aes(fill = rfscore)) +
  theme_minimal() +
  scale_fill_gradient2(low = "grey", high = "orange", midpoint = 0.5) +
  labs(x = "Land 1", y = "Land 2", fill = "RF Score") + 
  theme(legend.position = "none")

p3 <- hamby224_test %>%
  filter(set == "Set 1", bullet1 == "Known 1", bullet2 == "Questioned") %>%
  select(case, bullet1, bullet2, land1, land2, rfscore) %>%
  distinct() %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2)) +
  geom_tile(aes(fill = rfscore)) +
  #facet_grid(bullet1 ~ bullet2, scales = "free") +
  theme_minimal() +
  scale_fill_gradient2(low = "grey", high = "orange", midpoint = 0.5) +
  labs(x = "Land 1", y = "Land 2", fill = "RF Score") + 
  theme(legend.position = "none")

p4 <- ggplot() + geom_blank() + theme_classic()

p5 <- hamby224_test %>%
  filter(set == "Set 1", bullet1 == "Known 2", bullet2 == "Known 2") %>%
  select(case, bullet1, bullet2, land1, land2, rfscore) %>%
  distinct() %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2)) +
  geom_tile(aes(fill = rfscore)) +
  #facet_grid(bullet1 ~ bullet2, scales = "free") +
  theme_minimal() +
  scale_fill_gradient2(low = "grey", high = "orange", midpoint = 0.5) +
  labs(x = "Land 1", y = "Land 2", fill = "RF Score") + 
  theme(legend.position = "none")

p6 <- hamby224_test %>%
  filter(set == "Set 1", bullet1 == "Known 2", bullet2 == "Questioned") %>%
  select(case, bullet1, bullet2, land1, land2, rfscore) %>%
  distinct() %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2)) +
  geom_tile(aes(fill = rfscore)) +
  #facet_grid(bullet1 ~ bullet2, scales = "free") +
  theme_minimal() +
  scale_fill_gradient2(low = "grey", high = "orange", midpoint = 0.5) +
  labs(x = "Land 1", y = "Land 2", fill = "RF Score") + 
  theme(legend.position = "none")

p7 <- ggplot() + geom_blank() + theme_classic()

p8 <- ggplot() + geom_blank() + theme_classic()

p9 <- hamby224_test %>%
  filter(set == "Set 1", bullet1 == "Questioned", bullet2 == "Questioned") %>%
  select(case, bullet1, bullet2, land1, land2, rfscore) %>%
  distinct() %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2)) +
  geom_tile(aes(fill = rfscore)) +
  #facet_grid(bullet1 ~ bullet2, scales = "free") +
  theme_minimal() +
  scale_fill_gradient2(low = "grey", high = "orange", midpoint = 0.5, limits = c(0,1)) +
  labs(x = "Land 1", y = "Land 2", fill = "RF Score")

style(subplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, 
              nrows = 3, 
              titleX = TRUE, 
              titleY = TRUE, 
              margin = 0.03),
      hoverinfo = "skip",
      traces = 7)
Parallel Coordinate Plots
# Create a dataframe with the random forest features ordered by importance
rf_features_ordered <- data.frame(feature = rownames(bulletxtrctr::rtrees$importance), 
                                  MeanDecreaseGini = bulletxtrctr::rtrees$importance) %>%
  arrange(desc(MeanDecreaseGini))
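
As a quick check on this ordering, the importance plot built into randomForest can be compared against rf_features_ordered. A minimal sketch, assuming the rtrees model is available as above:

# Plot the variable importance (Mean Decrease in Gini) directly from the
# model; the ordering should agree with rf_features_ordered
randomForest::varImpPlot(bulletxtrctr::rtrees, main = "rtrees variable importance")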

Below is a parallel coordinate plot of the training dataset. It is faceted by samesource and colored by rfscore. The features are sorted (descending from left to right) by their variable importance scores from the rtrees model. It is interesting to see that the matches with low rfscores have feature values similar to those of the nonmatches.

hamby173and252_train %>%
  na.omit() %>%
  ggplot(aes(color = rfscore)) + 
  geom_pcp(aes(vars = vars(as.vector(rf_features_ordered$feature))), alpha = 0.1) + 
  facet_grid(. ~ samesource) + 
  scale_color_gradient2(low = "darkgrey", high = "darkorange", midpoint = 0.5) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(x = "Feature (sorted by rtrees variable importance)", 
       y = "Standardized Feature Values") 

Below is a parallel coordinate plot of the testing dataset. It is faceted by samesource and colored by rfscore. There is a clear distinction between the trends of the feature values from the nonmatches and the matches. It is interesting to consider the cases in Set 1 that are matches but were predicted to be nonmatches by the model. These are the cases colored in gray, and their feature values are very different from those of the observations the model predicted to be matches; instead, they lie closer to the feature values of the nonmatches.

hamby224_test %>%
  na.omit() %>%
  ggplot(aes(color = rfscore)) + 
  geom_pcp(aes(vars = vars(as.vector(rf_features_ordered$feature))), alpha = 0.1) + 
  facet_grid(. ~ samesource) + 
  scale_color_gradient2(low = "darkgrey", high = "darkorange", midpoint = 0.5) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(x = "Feature (sorted by rtrees variable importance)", 
       y = "Standardized Feature Values") 

Partial Dependence Plots

I tried using the package pdp, but I got the following error.

library(pdp)
partial(object = rtrees, pred.var = "rough_cor", plot = TRUE, plot.engine = "ggplot2")
## Error in eval(stats::getCall(object)$data): object 'CCFs_withlands' not found
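
The error suggests that pdp tries to recover the training data from the model's original call, which references an object (CCFs_withlands) that is not available in this session. A possible workaround, which I have not verified, is to supply the data explicitly through the train argument of partial.

# Pass the training data to partial directly so that pdp does not need to
# reconstruct it from the model call
partial(object = rtrees, 
        pred.var = "rough_cor", 
        train = hamby173and252_train %>% select(rf_features), 
        plot = TRUE, 
        plot.engine = "ggplot2")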

Then I tried creating some plots myself.

# Specify the number of values to simulate for each feature
n <- 100000

# Create a prediction grid
set.seed(20191108)
simulated_data <- hamby173and252_train %>%
  select(rf_features) %>%
  gather(variable, value) %>%
  group_by(variable) %>%
  summarise(min = min(value),
            max = max(value)) %>%
  rowwise() %>%
  mutate(samples = list(runif(n, min, max))) %>%
  select(variable, samples) %>%
  unnest() %>%
  mutate(case = rep(1:n, 9)) %>%
  spread(variable, samples) %>%
  select(-case)

simulated_data$rfpred <- predict(rtrees, newdata = simulated_data, type = "prob")[,2]
simulated_data %>%
  gather(key = variable, value = value, rf_features) %>%
  ggplot(aes(x = value, y = rfpred, color = rfpred)) + 
  geom_point() + 
  facet_wrap(. ~ variable, scales = "free_x") + 
  scale_color_viridis_c("rfpred", option = "A", limits = c(0, 1))

ggplot() + 
  stat_summary_2d(data = simulated_data,
                  mapping = aes(x = ccf, y = rough_cor, z = rfpred), 
                  bins = 75, 
                  fun = mean) + 
  scale_fill_viridis_c("rfpred", option = "A", limits = c(0, 1)) + 
  geom_convexhull(data = hamby173and252_train, 
                  mapping = aes(x = ccf, y = rough_cor), 
                  alpha = 0, 
                  color = "black")

ggplot() + 
  stat_summary_2d(data = simulated_data,
                  mapping = aes(x = ccf, y = matches, z = rfpred), 
                  bins = 75, 
                  fun = mean) + 
  scale_fill_viridis_c("rfpred", option = "A", limits = c(0, 1)) + 
  geom_convexhull(data = hamby173and252_train, 
                  mapping = aes(x = ccf, y = matches), 
                  alpha = 0, 
                  color = "black")

Biplot
pca_set1 <- princomp(x = hamby224_test %>% 
                       filter(set == "Set 1") %>% 
                       select(rf_features) %>% 
                       na.omit(), 
                     cor = TRUE)
biplot(pca_set1, col = hamby224_test %>% 
                       filter(set == "Set 1") %>% 
                       pull(rfscore))

Logistic Regressions

I have been thinking a lot about how difficult it is to determine whether LIME is providing reasonable explanations, because we cannot know what the random forest is doing. That is the point of LIME. Heike suggested that we fit a logistic regression model to the training bullet data. We can then apply LIME to the logistic regression and determine whether the explanations match what we would expect based on the interpretable coefficients of the logistic regression. She also suggested including interactions in the logistic regression model. This section includes code for fitting several logistic regression models.

Main Effects Only

The code below fits a logistic regression model to the training data using the train function from the caret package. The nine features used to fit rtrees are included as main effects only. The features are standardized prior to fitting by specifying preProcess = "scale" in train.

# Fit or load the logistic regression model with only main effects
if(!file.exists("../../../data/logistic_mains.rds")) {
  
  # Set a seed
  set.seed(20190226)
  
  # Fit the model
  logistic_mains <- caret::train(as.factor(samesource) ~ ccf + rough_cor + D + sd_D + 
                                   matches + mismatches + cms + non_cms + sum_peaks,
                                 preProcess = "scale",
                                 data = hamby173and252_train, 
                                 method = "glm", 
                                 family = "binomial")

  # Save the model
  saveRDS(logistic_mains, "../../../data/logistic_mains.rds")

} else{
  
  # Load the model
  logistic_mains <- readRDS("../../../data/logistic_mains.rds")
  
}

The output from the model is shown below. The features of ccf and matches stand out as the most important features in the model.

# Summary of the model
summary(logistic_mains)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6617  -0.0759  -0.0416  -0.0239   4.8101  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.02313    0.41486 -24.160  < 2e-16 ***
## ccf           1.52843    0.05079  30.093  < 2e-16 ***
## rough_cor    -0.06387    0.06440  -0.992 0.321335    
## D             0.78418    0.07803  10.049  < 2e-16 ***
## sd_D         -0.30758    0.07974  -3.857 0.000115 ***
## matches       1.32252    0.09455  13.987  < 2e-16 ***
## mismatches   -1.07490    0.08229 -13.062  < 2e-16 ***
## cms          -0.07989    0.06169  -1.295 0.195318    
## non_cms       0.81650    0.09165   8.909  < 2e-16 ***
## sum_peaks    -0.22886    0.07115  -3.216 0.001298 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12728.0  on 83027  degrees of freedom
## Residual deviance:  4231.3  on 83018  degrees of freedom
## AIC: 4251.3
## 
## Number of Fisher Scoring iterations: 9
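
To make the ranking of the features easier to see, the coefficient table can be sorted by the absolute value of the z-statistics. A minimal sketch, assuming logistic_mains is fit as above:

# Order the features by |z value|; ccf and matches should appear at the top
summary(logistic_mains)$coefficients %>%
  as.data.frame() %>%
  rownames_to_column("feature") %>%
  filter(feature != "(Intercept)") %>%
  arrange(desc(abs(`z value`)))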

All Two-Way Interactions

The code below fits a logistic regression model to the training data using the train function from the caret package. The nine features used to fit rtrees are included as main effects along with all two-way interactions. The features are standardized prior to fitting by specifying preProcess = "scale" in train.

# Fit or load the logistic regression model with all two way interactions
if(!file.exists("../../../data/logistic_inters2.rds")) {
  
  # Set a seed
  set.seed(20190226)
  
  # Fit the model
  logistic_inters2 <- caret::train(as.factor(samesource) ~
                                       (ccf + rough_cor + D + sd_D + matches + 
                                          mismatches + cms + non_cms + sum_peaks)^2,
                                     preProcess = "scale",
                                     data = hamby173and252_train,
                                     method = "glm", 
                                     family = "binomial")
  
  # Save the model
  saveRDS(logistic_inters2, "../../../data/logistic_inters2.rds")

} else{
  
  # Load the model
  logistic_inters2 <- readRDS("../../../data/logistic_inters2.rds")
  
}

The output from this model is not shown.

# Summary of the model
summary(logistic_inters2)

LASSO Model

# Fit or load the lasso logistic regression model
if(!file.exists("../../../data/logistic_lasso.rds")) {
  
  # Set a seed
  set.seed(20190312)

  # LASSO logistic regression model
  logistic_lasso <- cv.glmnet(x = hamby173and252_train %>% select(rf_features) %>% as.matrix(), 
                              y = hamby173and252_train %>% pull(samesource) %>% as.numeric(),
                              alpha = 1,
                              family = "binomial")

  # Tell R to run the upcoming code in parallel
  # plan(multiprocess)
  
  # LASSO logistic regression model with possible pairwise interactions
  # logistic_lasso <- 
  #   hierNet.logistic.path(x = hamby173and252_train %>% select(rf_features) %>% as.matrix(ncol = 9), 
  #                         y = hamby173and252_train %>% pull(samesource) %>% as.numeric(), 
  #                         lamlist = logistic_lasso$lambda)
  
  # Save the model
  saveRDS(logistic_lasso, "../../../data/logistic_lasso.rds")
  
} else {
  
  # Load the model
  logistic_lasso <- readRDS("../../../data/logistic_lasso.rds")
  
}
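
The LASSO model is not summarized above. The coefficients retained at the cross-validated choice of lambda can be inspected as follows (a sketch, assuming logistic_lasso was fit with cv.glmnet as above):

# Coefficients at the lambda one standard error above the minimum
# cross-validated deviance; features shrunk to zero are dropped
coef(logistic_lasso, s = "lambda.1se")

# Cross-validation curve of deviance versus log(lambda)
plot(logistic_lasso)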

Comparing the Models

The results as computed by caret on the training data are shown below for the two logistic regression models. The accuracy shows that the model with two-way interactions performs better on the training data.

# Results from the models
data.frame(model = c("mains", "inters2")) %>%
  bind_cols(bind_rows(logistic_mains$results, logistic_inters2$results))
##     model parameter  Accuracy     Kappa   AccuracySD    KappaSD
## 1   mains      none 0.9948668 0.7884070 0.0003055954 0.01402478
## 2 inters2      none 0.9954664 0.8207418 0.0002894706 0.01134129

The AICs from the models are shown below. The model with two-way interactions also performs better based on AIC.

AIC(logistic_mains$finalModel)
## [1] 4251.283
AIC(logistic_inters2$finalModel)
## [1] 3589.68
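
Since the main effects model is nested within the two-way interaction model, a likelihood ratio test offers another comparison. A minimal sketch, assuming both caret fits from above:

# Likelihood ratio test between the nested glm fits extracted from caret
anova(logistic_mains$finalModel, logistic_inters2$finalModel, test = "Chisq")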

I also wanted to know how the models perform on the test data. Each model is used to make class predictions on the test data, and the MSE is computed for each model. Since the predictions are binary, the MSE here is equivalent to the misclassification rate. These are shown in the table below. The model with main effects only has the smaller MSE; I suspect the model with interactions is overfitting the training data.

data.frame(obs = hamby224_test %>% 
             pull(samesource) %>% na.omit(),
           mains = predict(logistic_mains, hamby224_test %>% select(rf_features)),
           inters2 = predict(logistic_inters2, hamby224_test %>% select(rf_features))) %>%
  mutate(mains_sqerrs = (obs - as.logical(mains))^2,
         inters2_sqerrs = (obs - as.logical(inters2))^2) %>%
  summarise_at(vars(mains_sqerrs, inters2_sqerrs), 
               function(x) sum(x) / length(x)) %>%
  rename(mains_mse = mains_sqerrs, inters2_mse = inters2_sqerrs)
##    mains_mse inters2_mse
## 1 0.01098901  0.02197802

Session Info

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] pdp_0.7.0           forcats_0.5.0       stringr_1.4.0      
##  [4] dplyr_1.0.2         purrr_0.3.4         readr_1.3.1        
##  [7] tidyr_1.1.2         tibble_3.0.3        tidyverse_1.3.0    
## [10] randomForest_4.6-14 plotly_4.9.2.1      glmnet_4.0-2       
## [13] Matrix_1.2-18       ggpcp_0.1.0         GGally_2.0.0       
## [16] ggplot2_3.3.2.9000  ggConvexHull_0.1.0  cowplot_1.0.0      
## [19] bulletxtrctr_0.2.0 
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_1.4-1        class_7.3-17            ellipsis_0.3.1         
##   [4] fs_1.5.0                rstudioapi_0.11         farver_2.0.3           
##   [7] x3ptools_0.0.2.9000     prodlim_2019.11.13      fansi_0.4.1            
##  [10] mvtnorm_1.1-1           lubridate_1.7.9         xml2_1.3.2             
##  [13] codetools_0.2-16        splines_4.0.2           knitr_1.29             
##  [16] readbitmap_0.1.5        jsonlite_1.7.1          pROC_1.16.2            
##  [19] Cairo_1.5-12.2          caret_6.0-86            broom_0.7.0            
##  [22] dbplyr_1.4.4            png_0.1-7               shiny_1.5.0            
##  [25] compiler_4.0.2          httr_1.4.2              backports_1.1.10       
##  [28] assertthat_0.2.1        bmp_0.3                 fastmap_1.0.1          
##  [31] lazyeval_0.2.2          cli_2.0.2               later_1.1.0.1          
##  [34] htmltools_0.5.0         tools_4.0.2             igraph_1.2.5           
##  [37] gtable_0.3.0            glue_1.4.2              reshape2_1.4.4         
##  [40] Rcpp_1.0.5              cellranger_1.1.0        grooveFinder_0.0.1     
##  [43] vctrs_0.3.4             nlme_3.1-148            iterators_1.0.12       
##  [46] crosstalk_1.1.0.1       gbRd_0.4-11             timeDate_3043.102      
##  [49] gower_0.2.2             xfun_0.17               rvest_0.3.6            
##  [52] mime_0.9                miniUI_0.1.1.1          lifecycle_0.2.0        
##  [55] MASS_7.3-51.6           zoo_1.8-8               scales_1.1.1           
##  [58] ipred_0.9-9             hms_0.5.3               promises_1.1.1         
##  [61] RColorBrewer_1.1-2      yaml_2.2.1              curl_4.3               
##  [64] gridExtra_2.3           rpart_4.1-15            reshape_0.8.8          
##  [67] stringi_1.5.3           imager_0.42.3           foreach_1.5.0          
##  [70] TTR_0.24.2              tiff_0.1-5              bibtex_0.4.2.2         
##  [73] manipulateWidget_0.10.1 lava_1.6.7              shape_1.4.4            
##  [76] Rdpack_1.0.0            rlang_0.4.7             pkgconfig_2.0.3        
##  [79] rgl_0.100.54            evaluate_0.14           lattice_0.20-41        
##  [82] recipes_0.1.13          labeling_0.3            htmlwidgets_1.5.1      
##  [85] tidyselect_1.1.0        plyr_1.8.6              magrittr_1.5           
##  [88] R6_2.4.1                generics_0.0.2          DBI_1.1.0              
##  [91] pillar_1.4.6            haven_2.3.1             withr_2.2.0            
##  [94] xts_0.12.1              nnet_7.3-14             survival_3.1-12        
##  [97] modelr_0.1.8            crayon_1.3.4            rmarkdown_2.3          
## [100] jpeg_0.1-8.1            locfit_1.5-9.4          grid_4.0.2             
## [103] readxl_1.3.1            data.table_1.13.0       blob_1.2.1             
## [106] ModelMetrics_1.2.2.2    reprex_0.3.0            digest_0.6.25          
## [109] webshot_0.5.2           xtable_1.8-4            httpuv_1.5.4           
## [112] stats4_4.0.2            munsell_0.5.0           bulletcp_1.0.0         
## [115] viridisLite_0.3.0       smoother_1.1