rtrees Training Data

This journal contains work to identify the data used to train rtrees (contained in both bulletr and bulletxtrctr). We recently discovered discrepancies between rtrees and the training data we had been using (hamby-comparisons.csv, previously features-hamby173and252.csv). Unfortunately, the original work by Eric is not well documented, so it is difficult to know for certain which dataset was used to train the model. However, Eric recently pointed us to some code that he believes was used to train rtrees. Heike was able to rerun part of the code to extract data from the CSAFE database created by the first part of the code (scans to features). The new data (CCFs_withlands.csv) contains the same number of rows as the number of predictions associated with rtrees. This journal gathers all of our work on determining whether CCFs_withlands is the data (or very similar to the data) used to train rtrees.
library(cowplot)
library(dplyr)
library(ggplot2)
library(purrr)
library(randomForest)
library(stringr)
library(tidyr)
This section contains code for loading and performing initial comparisons of the raw datasets.
Load the data provided by Heike and CSAFE that we had been using and believed to be the training data for rtrees:
hamby_comparisons_raw <- read.csv("../../../data/raw/hamby-comparisons.csv")
Load the data extracted (on 2020-09-23) from the CSAFE database; this data corresponds to the R script Eric believes was used to train rtrees:
CCFs_withlands_raw <- read.csv("../../../data/raw/CCFs_withlands.csv")
Obtain the features used to train rtrees for use throughout this journal:
rtrees_features <- rownames(bulletxtrctr::rtrees$importance)
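For readers without bulletxtrctr installed, an equivalent hard-coded vector can be used. The nine feature names below are inferred from the feature summaries printed later in this journal, not extracted from the model object itself:

```r
# Hard-coded equivalent of rtrees_features (names inferred from the
# feature summaries in this journal, not from the model object)
rtrees_features <- c("ccf", "rough_cor", "D", "sd_D", "matches",
                     "mismatches", "cms", "non_cms", "sum_peaks")
length(rtrees_features)  # 9
```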
Dimensions of the raw datasets:
dim(hamby_comparisons_raw)
## [1] 84666 17
dim(CCFs_withlands_raw)
## [1] 83028 26
Note that CCFs_withlands has the same number of rows as the number of predictions associated with rtrees:
length(bulletxtrctr::rtrees$predicted)
## [1] 83028
Comparing summaries of the distributions of the features used to train rtrees from the two datasets. Note that the main differences are seen with the distributions of D (distance) and sd_D (standard deviation of distance):
summary(hamby_comparisons_raw %>% select(all_of(rtrees_features)))
## ccf rough_cor D sd_D
## Min. :0.01407 Min. :-0.86978 Min. : 0.05055 Min. :0.780
## 1st Qu.:0.21288 1st Qu.:-0.18680 1st Qu.: 2.02500 1st Qu.:2.349
## Median :0.27538 Median :-0.05761 Median : 2.77395 Median :2.728
## Mean :0.28940 Mean :-0.07392 Mean : 3.31462 Mean :2.815
## 3rd Qu.:0.34831 3rd Qu.: 0.04942 3rd Qu.: 4.10920 3rd Qu.:3.217
## Max. :0.98110 Max. : 0.96355 Max. :20.79681 Max. :7.263
## matches mismatches cms non_cms
## Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 1.206 1st Qu.: 8.027 1st Qu.: 0.5915 1st Qu.: 3.678
## Median : 2.228 Median : 9.494 Median : 1.1378 Median : 4.855
## Mean : 2.444 Mean : 9.756 Mean : 1.3562 Mean : 5.351
## 3rd Qu.: 3.279 3rd Qu.:11.228 3rd Qu.: 1.7112 3rd Qu.: 6.531
## Max. :18.656 Max. :33.684 Max. :15.2131 Max. :33.684
## sum_peaks
## Min. : 0.000
## 1st Qu.: 1.634
## Median : 2.762
## Mean : 3.027
## 3rd Qu.: 4.102
## Max. :25.990
summary(CCFs_withlands_raw %>% select(all_of(rtrees_features)))
## ccf rough_cor D sd_D
## Min. :-0.0127 Min. :-0.89155 Min. :0.0003513 Min. :0.001250
## 1st Qu.: 0.2124 1st Qu.:-0.20245 1st Qu.:0.0022353 1st Qu.:0.003698
## Median : 0.2750 Median :-0.06423 Median :0.0026339 Median :0.004307
## Mean : 0.2890 Mean :-0.08400 Mean :0.0027578 Mean :0.004430
## 3rd Qu.: 0.3481 3rd Qu.: 0.04686 3rd Qu.:0.0032182 3rd Qu.:0.005088
## Max. : 0.9811 Max. : 0.96355 Max. :0.0072105 Max. :0.011110
## matches mismatches cms non_cms
## Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 1.182 1st Qu.: 7.170 1st Qu.: 0.5893 1st Qu.: 3.218
## Median : 2.197 Median : 8.225 Median : 1.1348 Median : 4.192
## Mean : 2.403 Mean : 8.244 Mean : 1.3432 Mean : 4.564
## 3rd Qu.: 3.249 3rd Qu.: 9.289 3rd Qu.: 1.7082 3rd Qu.: 5.555
## Max. :18.656 Max. :14.947 Max. :15.2131 Max. :14.161
## sum_peaks
## Min. : 0.000
## 1st Qu.: 1.566
## Median : 2.709
## Mean : 2.982
## 3rd Qu.: 4.062
## Max. :23.775
This result still holds when the observations in hamby_comparisons known to have tank rash (flag == TRUE) are removed. This seems to suggest that the units of D and sd_D were changed between CCFs_withlands and hamby_comparisons:
summary(hamby_comparisons_raw %>% filter(flag == FALSE) %>% select(all_of(rtrees_features)))
## ccf rough_cor D sd_D
## Min. :0.01407 Min. :-0.86978 Min. : 0.05055 Min. :0.780
## 1st Qu.:0.21297 1st Qu.:-0.18430 1st Qu.: 2.02162 1st Qu.:2.347
## Median :0.27541 Median :-0.05671 Median : 2.76574 Median :2.724
## Mean :0.28940 Mean :-0.07166 Mean : 3.29264 Mean :2.808
## 3rd Qu.:0.34822 3rd Qu.: 0.04999 3rd Qu.: 4.07658 3rd Qu.:3.208
## Max. :0.98110 Max. : 0.96355 Max. :20.79681 Max. :7.263
## matches mismatches cms non_cms
## Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 1.209 1st Qu.: 8.027 1st Qu.: 0.5915 1st Qu.: 3.675
## Median : 2.230 Median : 9.494 Median : 1.1388 Median : 4.848
## Mean : 2.448 Mean : 9.756 Mean : 1.3578 Mean : 5.345
## 3rd Qu.: 3.282 3rd Qu.:11.228 3rd Qu.: 1.7112 3rd Qu.: 6.525
## Max. :18.656 Max. :33.684 Max. :15.2131 Max. :33.684
## sum_peaks
## Min. : 0.000
## 1st Qu.: 1.638
## Median : 2.764
## Mean : 3.030
## 3rd Qu.: 4.105
## Max. :25.990
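One rough check on the units hypothesis is to compare the ratios of the medians of D and sd_D across the two datasets (values hard-coded here from the summaries printed above). The ratios are on the order of hundreds to a thousand, consistent with a change of measurement scale, although they are not identical:

```r
# Median values of D and sd_D copied from the summaries above
median_D_hamby <- 2.77395
median_D_ccfs  <- 0.0026339
median_sd_D_hamby <- 2.728
median_sd_D_ccfs  <- 0.004307

round(median_D_hamby / median_D_ccfs)        # about 1053
round(median_sd_D_hamby / median_sd_D_ccfs)  # about 633
```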
Comparing visualizations of the distributions of the features used to train rtrees from the two datasets to again see the differences in distributions of D and sd_D:
bind_rows(
hamby_comparisons_raw %>%
filter(flag == FALSE) %>%
select(all_of(rtrees_features)) %>%
pivot_longer(cols = everything()) %>%
mutate(dataset = "hamby_comparisons"),
CCFs_withlands_raw %>% select(all_of(rtrees_features)) %>%
pivot_longer(cols = everything()) %>%
mutate(dataset = "CCFs_withlands")
) %>%
ggplot(aes(x = value, fill = dataset)) +
geom_histogram(position = "dodge") +
facet_wrap(. ~ name, scales = "free") +
theme(legend.position = "bottom")
There are discrepancies in the naming conventions of the two datasets. This section contains the code that identifies the differences and cleans the data.
The following variables are contained in the raw versions of the data. Note that they have different variables and different names. For our analysis, we only need the variables identifying the lands, the features used to train rtrees, the ground truth for whether or not the bullets are a match, and variables flagging bullets with tank rash.
names(hamby_comparisons_raw)
## [1] "land_id1" "land_id2" "ccf" "rough_cor"
## [5] "lag" "abs_lag" "D" "sd_D"
## [9] "matches" "mismatches" "cms" "overlap"
## [13] "non_cms" "sum_peaks" "signature_length" "same_source"
## [17] "flag"
names(CCFs_withlands_raw)
## [1] "compare_id" "profile1_id" "profile2_id" "ccf"
## [5] "rough_cor" "lag" "D" "sd_D"
## [9] "signature_length" "overlap" "matches" "mismatches"
## [13] "cms" "non_cms" "sum_peaks" "land_id.x"
## [17] "land_id.y" "match" "study.x" "barrel.x"
## [21] "bullet.x" "land.x" "study.y" "barrel.y"
## [25] "bullet.y" "land.y"
The code below splits the land ID variables in hamby_comparisons_raw into separate columns for study, barrel, bullet, and land, and selects only the variables of interest.
hamby_comparisons_select <-
hamby_comparisons_raw %>%
separate(land_id1, c("study1", "barrel1", "bullet1", "land1")) %>%
separate(land_id2, c("study2", "barrel2", "bullet2", "land2")) %>%
select(
study1,
barrel1,
bullet1,
land1,
study2,
barrel2,
bullet2,
land2,
all_of(rtrees_features),
same_source,
flag
)
The code below renames some of the variables in CCFs_withlands_raw, selects only the variables of interest, and converts all land labels to characters.
CCFs_withlands_select <-
CCFs_withlands_raw %>%
rename(
"study1" = "study.x",
"barrel1" = "barrel.x",
"bullet1" = "bullet.x",
"land1" = "land.x",
"study2" = "study.y",
"barrel2" = "barrel.y",
"bullet2" = "bullet.y",
"land2" = "land.y",
"same_source"= "match"
) %>%
select(
study1,
barrel1,
bullet1,
land1,
study2,
barrel2,
bullet2,
land2,
all_of(rtrees_features),
same_source
) %>%
mutate(across(c(barrel1, bullet1, land1, barrel2, bullet2, land2), as.character))
Check to make sure the structures of the two datasets agree:
hamby_comparisons_select %>% str()
## 'data.frame': 84666 obs. of 19 variables:
## $ study1 : chr "Hamby173" "Hamby173" "Hamby173" "Hamby173" ...
## $ barrel1 : chr "Br10" "Br10" "Br10" "Br10" ...
## $ bullet1 : chr "B1" "B1" "B1" "B1" ...
## $ land1 : chr "L1" "L1" "L1" "L1" ...
## $ study2 : chr "Hamby173" "Hamby173" "Hamby173" "Hamby173" ...
## $ barrel2 : chr "Br10" "Br10" "Br10" "Br10" ...
## $ bullet2 : chr "B1" "B1" "B1" "B1" ...
## $ land2 : chr "L2" "L3" "L4" "L5" ...
## $ ccf : num 0.187 0.256 0.233 0.349 0.207 ...
## $ rough_cor : num -0.0326 0.1787 0.0808 0.1682 0.0381 ...
## $ D : num 2.93 1.43 2.49 2.29 2.66 ...
## $ sd_D : num 2.64 1.88 2.57 2.68 2.57 ...
## $ matches : num 3.21 3.37 2.37 2.66 1.13 ...
## $ mismatches : num 7.48 9.54 12.43 9.56 10.69 ...
## $ cms : num 1.6 1.68 2.37 1.59 1.13 ...
## $ non_cms : num 3.74 2.8 9.47 5.31 5.63 ...
## $ sum_peaks : num 4.21 3.63 2.3 3.92 0.45 ...
## $ same_source: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ flag : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
CCFs_withlands_select %>% str()
## 'data.frame': 83028 obs. of 18 variables:
## $ study1 : chr "Hamby252" "Hamby252" "Hamby252" "Hamby252" ...
## $ barrel1 : chr "10" "10" "10" "10" ...
## $ bullet1 : chr "1" "1" "1" "1" ...
## $ land1 : chr "1" "1" "1" "1" ...
## $ study2 : chr "Hamby252" "Hamby252" "Hamby252" "Hamby252" ...
## $ barrel2 : chr "10" "10" "10" "10" ...
## $ bullet2 : chr "1" "1" "1" "1" ...
## $ land2 : chr "2" "3" "4" "5" ...
## $ ccf : num 0.173 0.422 0.175 0.424 0.294 ...
## $ rough_cor : num -0.58411 -0.41022 -0.16741 0.27833 -0.00905 ...
## $ D : num 0.00341 0.00332 0.00208 0.00187 0.00214 ...
## $ sd_D : num 0.00478 0.00479 0.00359 0.00329 0.00348 ...
## $ matches : num 1.09 1.22 2.19 4.09 5.41 ...
## $ mismatches : num 6.62 9.15 5.42 7 6 ...
## $ cms : num 0.547 0.612 1.644 2.338 3.249 ...
## $ non_cms : num 4.07 8.13 4.43 2.8 1.5 ...
## $ sum_peaks : num 0.303 1.864 1.654 4.414 5.142 ...
## $ same_source: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
The function below extracts the labels from the two datasets for a given variable name var and creates a plot to show the discrepancies between hamby_comparisons_select and CCFs_withlands_select.
plot_labels <- function(var, ccf_data) {
# Extract the labels
hc_labels <-
unique(c(
hamby_comparisons_select %>% pull(paste0(var, "1")),
hamby_comparisons_select %>% pull(paste0(var, "2"))
))
ccfwl_labels <-
unique(c(
ccf_data %>% pull(paste0(var, "1")),
ccf_data %>% pull(paste0(var, "2"))
))
# Plot the labels
data.frame(data = c(
rep("hamby comparisons", length(hc_labels)),
rep("CCFs with lands", length(ccfwl_labels))
),
labels = c(hc_labels, ccfwl_labels)) %>%
ggplot(aes(x = labels, y = data)) +
geom_tile() +
labs(x = "", y = "", title = paste("Comparing labels in variable:", var))
}
The plots below show that the following changes need to be made to CCFs_withlands_select in order to match hamby_comparisons_select:

- the study label Hamby44 needs to be changed to Hamby173
- lettered barrels need to be changed to BrUnk, and Br needs to be prepended to the other barrels
- the barrel letters need to become the bullet label, and B needs to be prepended to all bullets
- L needs to be prepended to all lands
plot_labels(var = "study", ccf_data = CCFs_withlands_select)
plot_labels(var = "barrel", ccf_data = CCFs_withlands_select)
plot_labels(var = "bullet", ccf_data = CCFs_withlands_select)
plot_labels(var = "land", ccf_data = CCFs_withlands_select)
Determine the letters in the CCFs_withlands_select unknown barrels:
all_barrel_labels <-
unique(c(
as.character(CCFs_withlands_select$barrel1),
as.character(CCFs_withlands_select$barrel2)
))
letters <- all_barrel_labels[!(all_barrel_labels %in% 1:10)]
letters
## [1] "B" "C" "D" "E" "F" "H" "J" "L" "M" "Q" "S" "U" "X" "Y" "Z" "A" "G" "I" "N"
## [20] "R" "V" "W"
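As a quick sanity check (using the 22 letters printed above, hard-coded here), the unknown-barrel labels cover the alphabet except for four letters:

```r
# The 22 letters observed in the unknown-barrel labels above
unk_letters <- c("B", "C", "D", "E", "F", "H", "J", "L", "M", "Q", "S",
                 "U", "X", "Y", "Z", "A", "G", "I", "N", "R", "V", "W")
setdiff(LETTERS, unk_letters)  # "K" "O" "P" "T"
```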
Clean CCFs_withlands_select so the labels match hamby_comparisons_select:
CCFs_withlands_labelled <-
CCFs_withlands_select %>%
mutate(
study1 = ifelse(study1 == "Hamby44", "Hamby173", study1),
study2 = ifelse(study2 == "Hamby44", "Hamby173", study2),
bullet1 = ifelse(barrel1 %in% letters, barrel1, bullet1),
barrel1 = ifelse(barrel1 %in% letters, "Unk", barrel1),
bullet2 = ifelse(barrel2 %in% letters, barrel2, bullet2),
barrel2 = ifelse(barrel2 %in% letters, "Unk", barrel2)
) %>%
mutate(
barrel1 = paste0("Br", barrel1),
bullet1 = paste0("B", bullet1),
land1 = paste0("L", land1),
barrel2 = paste0("Br", barrel2),
bullet2 = paste0("B", bullet2),
land2 = paste0("L", land2)
)
The plots below show that the labels are now in agreement:
plot_labels(var = "study", ccf_data = CCFs_withlands_labelled)
plot_labels(var = "barrel", ccf_data = CCFs_withlands_labelled)
plot_labels(var = "bullet", ccf_data = CCFs_withlands_labelled)
plot_labels(var = "land", ccf_data = CCFs_withlands_labelled)
Create the land ids for both datasets:
hamby_comparisons <-
hamby_comparisons_select %>%
mutate(
land_id1 = paste(study1, barrel1, bullet1, land1, sep = "-"),
land_id2 = paste(study2, barrel2, bullet2, land2, sep = "-")
) %>%
select(land_id1, land_id2, all_of(rtrees_features), same_source, flag)
CCFs_withlands <-
CCFs_withlands_labelled %>%
mutate(
land_id1 = paste(study1, barrel1, bullet1, land1, sep = "-"),
land_id2 = paste(study2, barrel2, bullet2, land2, sep = "-")
) %>%
select(land_id1, land_id2, all_of(rtrees_features), same_source)
The structures of the cleaned data:
hamby_comparisons %>% str()
## 'data.frame': 84666 obs. of 13 variables:
## $ land_id1 : chr "Hamby173-Br10-B1-L1" "Hamby173-Br10-B1-L1" "Hamby173-Br10-B1-L1" "Hamby173-Br10-B1-L1" ...
## $ land_id2 : chr "Hamby173-Br10-B1-L2" "Hamby173-Br10-B1-L3" "Hamby173-Br10-B1-L4" "Hamby173-Br10-B1-L5" ...
## $ ccf : num 0.187 0.256 0.233 0.349 0.207 ...
## $ rough_cor : num -0.0326 0.1787 0.0808 0.1682 0.0381 ...
## $ D : num 2.93 1.43 2.49 2.29 2.66 ...
## $ sd_D : num 2.64 1.88 2.57 2.68 2.57 ...
## $ matches : num 3.21 3.37 2.37 2.66 1.13 ...
## $ mismatches : num 7.48 9.54 12.43 9.56 10.69 ...
## $ cms : num 1.6 1.68 2.37 1.59 1.13 ...
## $ non_cms : num 3.74 2.8 9.47 5.31 5.63 ...
## $ sum_peaks : num 4.21 3.63 2.3 3.92 0.45 ...
## $ same_source: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ flag : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
CCFs_withlands %>% str()
## 'data.frame': 83028 obs. of 12 variables:
## $ land_id1 : chr "Hamby252-Br10-B1-L1" "Hamby252-Br10-B1-L1" "Hamby252-Br10-B1-L1" "Hamby252-Br10-B1-L1" ...
## $ land_id2 : chr "Hamby252-Br10-B1-L2" "Hamby252-Br10-B1-L3" "Hamby252-Br10-B1-L4" "Hamby252-Br10-B1-L5" ...
## $ ccf : num 0.173 0.422 0.175 0.424 0.294 ...
## $ rough_cor : num -0.58411 -0.41022 -0.16741 0.27833 -0.00905 ...
## $ D : num 0.00341 0.00332 0.00208 0.00187 0.00214 ...
## $ sd_D : num 0.00478 0.00479 0.00359 0.00329 0.00348 ...
## $ matches : num 1.09 1.22 2.19 4.09 5.41 ...
## $ mismatches : num 6.62 9.15 5.42 7 6 ...
## $ cms : num 0.547 0.612 1.644 2.338 3.249 ...
## $ non_cms : num 4.07 8.13 4.43 2.8 1.5 ...
## $ sum_peaks : num 0.303 1.864 1.654 4.414 5.142 ...
## $ same_source: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
Save the cleaned data:
write.csv(x = hamby_comparisons, file = "../../../data/hamby_comparisons_clean.csv", row.names = FALSE)
write.csv(x = CCFs_withlands, file = "../../../data/CCFs_withlands_clean.csv", row.names = FALSE)
This section considers the differences in bullets included in the two datasets. The code below extracts the unique land IDs from the two datasets:
hamby_comparisons_ids = unique(c(hamby_comparisons$land_id1, hamby_comparisons$land_id2))
CCFs_withlands_ids = unique(c(CCFs_withlands$land_id1, CCFs_withlands$land_id2))
Below are the numbers of unique land IDs contained in each dataset. There are fewer in CCFs_withlands than in hamby_comparisons:
length(hamby_comparisons_ids)
## [1] 412
length(CCFs_withlands_ids)
## [1] 408
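These counts also explain the row counts of the two datasets: both match the number of unordered pairs of their lands, which suggests each dataset contains every pairwise land-to-land comparison exactly once (a check using the counts above):

```r
# Number of unordered pairs of lands in each dataset
choose(412, 2)  # 84666, the number of rows in hamby_comparisons
choose(408, 2)  # 83028, the number of rows in CCFs_withlands (and of rtrees predictions)
```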
Identify the land IDs in CCFs_withlands but not in hamby_comparisons (Heike identified that Hamby173-Br3-B2-L1 has damage on the bullet, but she is not sure why Hamby173-BrUnk-BM-L4 was left out):
CCFs_withlands_ids[!(CCFs_withlands_ids %in% hamby_comparisons_ids)]
## [1] "Hamby173-Br3-B2-L1" "Hamby173-BrUnk-BM-L4"
Identify the land IDs in hamby_comparisons but not in CCFs_withlands (Heike confirmed that bullet E is known to have issues):
hamby_comparisons_ids[!(hamby_comparisons_ids %in% CCFs_withlands_ids)]
## [1] "Hamby173-BrUnk-BE-L1" "Hamby173-BrUnk-BE-L2" "Hamby173-BrUnk-BE-L3"
## [4] "Hamby173-BrUnk-BE-L4" "Hamby173-BrUnk-BE-L5" "Hamby173-BrUnk-BE-L6"
One of the aspects that led to the realization that the hamby-comparisons data is not the training data for rtrees is that the number of observations in the data does not agree with the number of predictions in the rtrees model. The number of observations used to train rtrees is the following:
length(bulletxtrctr::rtrees$predicted)
## [1] 83028
Here are the dimensions of the hamby_comparisons data, which do not agree with the number of observations used to train rtrees:
hamby_comparisons %>% dim()
## [1] 84666 13
Even with the observations known to have tank rash removed (flag == TRUE), the dimensions of the hamby_comparisons data do not agree with rtrees:
hamby_comparisons %>% filter(flag == FALSE) %>% dim()
## [1] 84255 13
The code below removes observations based on Eric Hare’s code (available here), which he thinks may be the code used to train rtrees:
bullet_train_eric_filter <-
hamby_comparisons %>%
separate(col = "land_id1", into = c("study1", "barrel1", "bullet1", "land1")) %>%
separate(col = "land_id2", into = c("study2", "barrel2", "bullet2", "land2")) %>%
filter(study1 != "Hamby173" | bullet1 != "BE") %>%
filter(study2 != "Hamby173" | bullet2 != "BE")
# Eric's code for reference:
# filter(study.y != "Hamby44" | barrel.y != "E") %>%
# filter(study.x != "Hamby44" | barrel.x != "E") %>%
Again, the dimensions do not agree with the number of observations in rtrees:
dim(bullet_train_eric_filter)
## [1] 82215 19
length(bulletxtrctr::rtrees$predicted) == dim(bullet_train_eric_filter)[1]
## [1] FALSE
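The 82215 rows are still consistent with an all-pairs structure: if hamby_comparisons contains every unordered pair of its 412 lands, removing all comparisons involving the six bullet E lands should leave choose(406, 2) rows, which matches the dimensions above (a sketch under that all-pairs assumption):

```r
# 412 lands minus the six bullet E lands
choose(412 - 6, 2)  # 82215
```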
Hare, Hofmann, and Carriquiry (2017) describe removing four land impressions that were flagged for quality assessment. These lands should correspond to Hamby 252. However, it is possible to obtain the same number of observations as used to train rtrees by filtering out observations in hamby_comparisons using these barrel, bullet, and land combinations without specifying the study (that is, removing observations from both Hamby 173 and Hamby 252).
hamby_comparisons_filtered <-
hamby_comparisons %>%
mutate(bbl1 = str_remove(land_id1, pattern = "Hamby252-|Hamby173-"),
bbl2 = str_remove(land_id2, pattern = "Hamby252-|Hamby173-")) %>%
filter(!(bbl1 %in% c("Br6-B2-L1", "Br9-B2-L4", "BrUnk-BB-L2", "BrUnk-BQ-L4") |
bbl2 %in% c("Br6-B2-L1", "Br9-B2-L4", "BrUnk-BB-L2", "BrUnk-BQ-L4"))) %>%
select(-bbl1, -bbl2)
hamby_comparisons_filtered %>% dim()
## [1] 83028 13
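If hamby_comparisons contains every unordered pair of its 412 lands, removing all comparisons that involve four specific lands should leave choose(408, 2) rows, which matches the filtered dimensions above (a sketch under that all-pairs assumption):

```r
# 412 lands minus the four flagged lands
choose(412 - 4, 2)  # 83028
```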
Save the cleaned and filtered data:
write.csv(x = hamby_comparisons_filtered, file = "../../../data/hamby_comparisons_filtered.csv", row.names = FALSE)
While it is possible to filter hamby_comparisons down to the same number of observations as rtrees, how the bullets were removed is only a guess. On the other hand, CCFs_withlands already has the same number of rows as rtrees:
CCFs_withlands %>% dim()
## [1] 83028 12
Note that if the filter from Eric Hare's R script (which he thinks may be the code used to train rtrees) is applied to CCFs_withlands, the number of rows does not change:
CCFs_withlands %>%
separate(col = "land_id1", into = c("study1", "barrel1", "bullet1", "land1")) %>%
separate(col = "land_id2", into = c("study2", "barrel2", "bullet2", "land2")) %>%
filter(study1 != "Hamby173" | bullet1 != "BE") %>%
filter(study2 != "Hamby173" | bullet2 != "BE") %>%
dim()
## [1] 83028 18
It is not possible to directly compare the internal predictions from rtrees (predict(bulletxtrctr::rtrees, type = "prob")[,2]) to those obtained by using rtrees to make predictions on the hamby_comparisons and CCFs_withlands data, because the internal predictions returned by the predict function from randomForest are the out-of-bag (OOB) predictions, while predictions on a new dataset are made using all of the trees in the forest. From the documentation for randomForest:::predict.randomForest:
newdata: a data frame or matrix containing new data. (Note: If not given, the out-of-bag prediction in object is returned.)
This was determined by Heike based on the following code. She pointed out, “Have a look at the code below - we would expect a line of identity in the last plot. Sorting values also does not help (that would help in case the observations got re-ordered internally in some way).”
# Simulate toy data with three predictors and a binary response
x1 <- rnorm(1000)
x2 <- rnorm(1000)
x3 <- rnorm(1000)
z <- rbinom(1000, 1, 0.4)
dframe <- data.frame(x1, x2, x3, z)
# Fit a random forest to the toy data
rf1 <- randomForest(factor(z) ~ x1 + x2 + x3, data = dframe)
# Predictions with newdata supplied: computed using all trees in the forest
dframe$predict <- predict(rf1, newdata = dframe, type = "prob")[,2]
# Predictions without newdata: the out-of-bag (OOB) predictions
dframe$rfscore <- predict(rf1, type = "prob")[,2]
dframe %>%
ggplot(aes(x = predict, y = rfscore)) + geom_point()
However, we may still gain some insights by comparing the distributions of the predictions. The code below extracts the predictions on the training data from rtrees and uses rtrees to make predictions on hamby_comparisons and CCFs_withlands:
rtrees_pred = predict(bulletxtrctr::rtrees, type = "prob")[,2]
hc_pred = predict(bulletxtrctr::rtrees, hamby_comparisons %>% select(all_of(rtrees_features)), type = "prob")[,2]
ccfwl_pred = predict(bulletxtrctr::rtrees, CCFs_withlands %>% select(all_of(rtrees_features)), type = "prob")[,2]
all_pred <-
data.frame(
dataset = c(
rep("Internal (OOB)", length(rtrees_pred)),
rep("Hamby comparisons", length(hc_pred)),
rep("CCFs with lands", length(ccfwl_pred))
),
pred = c(rtrees_pred, hc_pred, ccfwl_pred)
)
Plot showing the distributions of three sets of predictions from rtrees: (light blue) the internal predictions from rtrees, (blue) predictions obtained by applying rtrees to the observations in the CCFs_withlands data, and (purple) predictions obtained by applying rtrees to the observations in the hamby_comparisons data. Note that even though the internal predictions from rtrees are OOB and those computed using CCFs_withlands are not, the distributions are very similar. On the other hand, the predictions from hamby_comparisons are very different. This provides evidence that rtrees was trained on CCFs_withlands, and the change in units of D and sd_D in hamby_comparisons is apparent from the much different distribution of predictions from rtrees:
all_pred %>%
ggplot(aes(x = pred, fill = dataset, color = dataset)) +
geom_density(alpha = 0.5, size = 1) +
scale_fill_manual(values = c("blue", "purple", "lightblue")) +
scale_color_manual(values = c("blue", "purple", "lightblue")) +
theme_bw() +
theme(legend.position = "bottom") +
labs(x = "rtrees predictions",
y = "Density",
fill = "Data predictions computed on",
color = "Data predictions computed on")
The four plots below show direct comparisons of predictions between the internal rtrees predictions and those computed on hamby_comparisons and CCFs_withlands using rtrees. There are two versions of the plots for each of the possible training datasets. One plots the predictions without adjusting their order in any way. The other sorts the predictions from lowest to highest (within a dataset) in case that helps reveal a direct relationship. None of the figures show a one-to-one relationship, but notice that the plot titled "Comparing internal to CCFs with lands (not sorted)" looks very similar to the one created by Heike with the example data and random forest. This is additional evidence that CCFs_withlands was used to train rtrees: even without sorting the predictions, the relationship between the OOB and new predictions is very similar to an example where we know the same training data was used.
plot_grid(
all_pred %>%
filter(dataset != "CCFs with lands") %>%
group_by(dataset) %>%
mutate(case = 1:n()) %>%
pivot_wider(names_from = "dataset", values_from = "pred") %>%
ggplot(aes(x = `Hamby comparisons`, y = `Internal (OOB)`)) +
geom_point() +
theme(aspect.ratio = 1) +
labs(title = "Comparing internal to Hamby Comparisons (not sorted)"),
all_pred %>%
filter(dataset != "CCFs with lands") %>%
group_by(dataset) %>%
arrange(pred) %>%
mutate(case = 1:n()) %>%
pivot_wider(names_from = "dataset", values_from = "pred") %>%
ggplot(aes(x = `Hamby comparisons`, y = `Internal (OOB)`)) +
geom_point() +
theme(aspect.ratio = 1) +
labs(title = "Comparing internal to Hamby Comparisons (sorted)"),
all_pred %>%
filter(dataset != "Hamby comparisons") %>%
group_by(dataset) %>%
mutate(case = 1:n()) %>%
pivot_wider(names_from = "dataset", values_from = "pred") %>%
ggplot(aes(x = `CCFs with lands`, y = `Internal (OOB)`)) +
geom_point() +
theme(aspect.ratio = 1) +
labs(title = "Comparing internal to CCFs with lands (not sorted)"),
all_pred %>%
filter(dataset != "Hamby comparisons") %>%
group_by(dataset) %>%
arrange(pred) %>%
mutate(case = 1:n()) %>%
pivot_wider(names_from = "dataset", values_from = "pred") %>%
ggplot(aes(x = `CCFs with lands`, y = `Internal (OOB)`)) +
geom_point() +
theme(aspect.ratio = 1) +
labs(title = "Comparing internal to CCFs with lands (sorted)")
)
In this section, we retrain various random forest models using both the Hamby comparison and CCFs with lands data. We then compare the results from the retrained models to rtrees to check for similarity. The results show that the models retrained on the CCFs with lands data are very similar to rtrees, while the models trained on the Hamby comparison data are very different. Note that all models trained on the Hamby comparison data in this section use the observations in the filtered version of the dataset (hamby_comparisons_filtered), both so that the number of observations matches rtrees and because this was the data we were previously using in the LIME diagnostics paper.
Unfortunately, the seed used to train rtrees was lost, and changes in the way random number generation is done in R have made it impossible to reproduce rtrees exactly. However, Heike suggested two seeds that she believes were used around the time rtrees was trained: 20140501 and 20170222. Note that 20170222 is also the seed used in Eric's code.
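The retrained models below can only match rtrees exactly if the seed, the data, and R's random number generator all agree. As a minimal sketch of that assumption (with toy data, not the bullet data), randomForest is fully reproducible when refit with the same seed:

```r
library(randomForest)

# Toy data: two predictors and a binary response
set.seed(1)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- factor(d$x1 + d$x2 + rnorm(200) > 0)

# Refitting with the same seed reproduces the forest exactly
set.seed(20170222)
rf_a <- randomForest(y ~ x1 + x2, data = d, ntree = 50)
set.seed(20170222)
rf_b <- randomForest(y ~ x1 + x2, data = d, ntree = 50)
identical(rf_a$votes, rf_b$votes)  # TRUE
```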
Below, four random forest models are trained using the seeds suggested by Heike. The code used to fit the models is based on Eric's code, which is believed to be the code used to train the original rtrees. The first two random forests are fit using CCFs_withlands, and the third and fourth are fit using hamby_comparisons_filtered:
rtrees2_file = "../../../data/rfs_rtrees2.rds"
if (!file.exists(rtrees2_file)) {
set.seed(20140501)
rtrees2 <-
randomForest(
factor(same_source) ~ .,
data = CCFs_withlands %>% select(all_of(rtrees_features), same_source),
ntree = 300
)
saveRDS(object = rtrees2, file = rtrees2_file)
}
rtrees3_file = "../../../data/rfs_rtrees3.rds"
if (!file.exists(rtrees3_file)) {
set.seed(20170222)
rtrees3 <-
randomForest(
factor(same_source) ~ .,
data = CCFs_withlands %>% select(all_of(rtrees_features), same_source),
ntree = 300
)
saveRDS(object = rtrees3, file = rtrees3_file)
}
rtrees4_file = "../../../data/rfs_rtrees4.rds"
if (!file.exists(rtrees4_file)) {
set.seed(20140501)
rtrees4 <-
randomForest(
factor(same_source) ~ .,
data = hamby_comparisons_filtered %>% select(all_of(rtrees_features), same_source),
ntree = 300
)
saveRDS(object = rtrees4, file = rtrees4_file)
}
rtrees5_file = "../../../data/rfs_rtrees5.rds"
if (!file.exists(rtrees5_file)) {
set.seed(20170222)
rtrees5 <-
randomForest(
factor(same_source) ~ .,
data = hamby_comparisons_filtered %>% select(all_of(rtrees_features), same_source),
ntree = 300
)
saveRDS(object = rtrees5, file = rtrees5_file)
}
Load the saved versions of the models:
rtrees2 <- readRDS(rtrees2_file)
rtrees3 <- readRDS(rtrees3_file)
rtrees4 <- readRDS(rtrees4_file)
rtrees5 <- readRDS(rtrees5_file)
Extract the votes from each of the newly trained random forest models and rtrees and prepare the dataframe for plotting:
rtrees_votes <-
data.frame(
rtrees = bulletxtrctr::rtrees$votes,
rtrees2 = rtrees2$votes,
rtrees3 = rtrees3$votes,
rtrees4 = rtrees4$votes,
rtrees5 = rtrees5$votes
) %>%
mutate(obs = 1:n()) %>%
pivot_longer(cols = -obs) %>%
separate(name, c("model", "label")) %>%
pivot_wider(names_from = model, values_from = value)
Plots comparing the votes from rtrees to those from the newly trained models. The top row contains the comparisons with the random forests trained on CCFs_withlands. Note that the first two plots, which compare the votes from the models retrained on CCFs_withlands to rtrees, look very similar to the third plot, which compares the votes from the two retrained models to each other. This suggests that, even though the three models were trained using different seeds, the training data is the same. The bottom row is created using the votes from the random forests trained on hamby_comparisons_filtered; there, the two plots comparing the votes to rtrees look very different from the third plot comparing the votes from the two retrained models. This suggests that hamby_comparisons_filtered was not the training data for rtrees, or at least that the observations are not ordered the same way in the dataset.
plot_grid(
rtrees_votes %>%
ggplot(aes(
x = rtrees, y = rtrees2, color = label
)) +
geom_point(shape = 1) +
labs(x = "rtrees votes",
y = "rtrees2 votes",
title = "Retrained RF on CCFs with lands data \nversus rtrees; Seed 20140501"),
rtrees_votes %>%
ggplot(aes(
x = rtrees, y = rtrees3, color = label
)) +
geom_point(shape = 1) +
labs(x = "rtrees votes",
y = "rtrees3 votes",
title = "Retrained RF on CCFs with lands data \nversus rtrees; Seed 20170222"),
rtrees_votes %>%
ggplot(aes(
x = rtrees2, y = rtrees3, color = label
)) +
geom_point(shape = 1) +
labs(x = "rtrees2 votes",
y = "rtrees3 votes",
title = "Both retrained RFs on CCFs with \nlands data"),
rtrees_votes %>%
ggplot(aes(
x = rtrees, y = rtrees4, color = label
)) +
geom_point(shape = 1) +
labs(x = "rtrees votes",
y = "rtrees4 votes",
title = "Retrained RF on Hamby comparison data \nversus rtrees; Seed 20140501"),
rtrees_votes %>%
ggplot(aes(
x = rtrees, y = rtrees5, color = label
)) +
geom_point(shape = 1) +
labs(x = "rtrees votes",
y = "rtrees5 votes",
title = "Retrained RF on Hamby comparison data \nversus rtrees; Seed 20170222"),
rtrees_votes %>%
ggplot(aes(
x = rtrees4, y = rtrees5, color = label
)) +
geom_point(shape = 1) +
labs(x = "rtrees4 votes",
y = "rtrees5 votes",
title = "Both retrained RFs on Hamby \ncomparison data"),
nrow = 2
)