rtrees Training Data

This journal contains work to identify the data used to train rtrees (contained in both bulletr and bulletxtrctr). We recently discovered discrepancies between rtrees and the training data we had been using (hamby-comparisons.csv, previously features-hamby173and252.csv). Unfortunately, the original work by Eric is not well documented, so it is difficult to know for certain which dataset was used to train the model. However, Eric recently pointed us to some code that he believes was used to train rtrees. Heike was able to rerun part of the code to extract data from the CSAFE database created by the first part of the code (scans to features). The new data (CCFs_withlands.csv) contains the same number of rows as the number of predictions associated with rtrees. This journal gathers all of our work on determining whether CCFs_withlands is the data (or very similar to the data) used to train rtrees.
library(cowplot)
library(dplyr)
library(ggplot2)
library(purrr)
library(randomForest)
library(stringr)
library(tidyr)
This section contains code for loading and performing initial comparisons of the raw datasets.
Load the data provided by Heike and CSAFE that we had been using and believed to be the training data for rtrees:
hamby_comparisons_raw <- read.csv("../../../data/raw/hamby-comparisons.csv")
Load the data extracted (on 2020-09-23) from the CSAFE database; this data corresponds to the R script Eric believes was used to train rtrees:
CCFs_withlands_raw <- read.csv("../../../data/raw/CCFs_withlands.csv")
Obtain the features used to train rtrees for use throughout this journal:
rtrees_features <- rownames(bulletxtrctr::rtrees$importance)
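For readers without bulletxtrctr installed, an equivalent hard-coded vector can be used. The nine feature names below are inferred from the feature summaries printed later in this journal, not extracted from the model object itself:

```r
# Hard-coded equivalent of rtrees_features (names inferred from the
# feature summaries in this journal, not from the model object)
rtrees_features <- c("ccf", "rough_cor", "D", "sd_D", "matches",
                     "mismatches", "cms", "non_cms", "sum_peaks")
length(rtrees_features)  # 9
```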
Dimensions of the raw datasets:
dim(hamby_comparisons_raw)
## [1] 84666 17
dim(CCFs_withlands_raw)
## [1] 83028 26
Note that CCFs_withlands has the same number of rows as the number of predictions associated with rtrees:
length(bulletxtrctr::rtrees$predicted)
## [1] 83028
Comparing summaries of the distributions of the features used to train rtrees from the two datasets. Note that the main differences are seen with the distributions of D (distance) and sd_D (standard deviation of distance):
summary(hamby_comparisons_raw %>% select(all_of(rtrees_features)))
## ccf rough_cor D sd_D
## Min. :0.01407 Min. :-0.86978 Min. : 0.05055 Min. :0.780
## 1st Qu.:0.21288 1st Qu.:-0.18680 1st Qu.: 2.02500 1st Qu.:2.349
## Median :0.27538 Median :-0.05761 Median : 2.77395 Median :2.728
## Mean :0.28940 Mean :-0.07392 Mean : 3.31462 Mean :2.815
## 3rd Qu.:0.34831 3rd Qu.: 0.04942 3rd Qu.: 4.10920 3rd Qu.:3.217
## Max. :0.98110 Max. : 0.96355 Max. :20.79681 Max. :7.263
## matches mismatches cms non_cms
## Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 1.206 1st Qu.: 8.027 1st Qu.: 0.5915 1st Qu.: 3.678
## Median : 2.228 Median : 9.494 Median : 1.1378 Median : 4.855
## Mean : 2.444 Mean : 9.756 Mean : 1.3562 Mean : 5.351
## 3rd Qu.: 3.279 3rd Qu.:11.228 3rd Qu.: 1.7112 3rd Qu.: 6.531
## Max. :18.656 Max. :33.684 Max. :15.2131 Max. :33.684
## sum_peaks
## Min. : 0.000
## 1st Qu.: 1.634
## Median : 2.762
## Mean : 3.027
## 3rd Qu.: 4.102
## Max. :25.990
summary(CCFs_withlands_raw %>% select(all_of(rtrees_features)))
## ccf rough_cor D sd_D
## Min. :-0.0127 Min. :-0.89155 Min. :0.0003513 Min. :0.001250
## 1st Qu.: 0.2124 1st Qu.:-0.20245 1st Qu.:0.0022353 1st Qu.:0.003698
## Median : 0.2750 Median :-0.06423 Median :0.0026339 Median :0.004307
## Mean : 0.2890 Mean :-0.08400 Mean :0.0027578 Mean :0.004430
## 3rd Qu.: 0.3481 3rd Qu.: 0.04686 3rd Qu.:0.0032182 3rd Qu.:0.005088
## Max. : 0.9811 Max. : 0.96355 Max. :0.0072105 Max. :0.011110
## matches mismatches cms non_cms
## Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 1.182 1st Qu.: 7.170 1st Qu.: 0.5893 1st Qu.: 3.218
## Median : 2.197 Median : 8.225 Median : 1.1348 Median : 4.192
## Mean : 2.403 Mean : 8.244 Mean : 1.3432 Mean : 4.564
## 3rd Qu.: 3.249 3rd Qu.: 9.289 3rd Qu.: 1.7082 3rd Qu.: 5.555
## Max. :18.656 Max. :14.947 Max. :15.2131 Max. :14.161
## sum_peaks
## Min. : 0.000
## 1st Qu.: 1.566
## Median : 2.709
## Mean : 2.982
## 3rd Qu.: 4.062
## Max. :23.775
This result still holds when the observations in hamby_comparisons known to have tank rash (flag == TRUE) are removed. This seems to suggest that the units of D and sd_D were changed between CCFs_withlands and hamby_comparisons:
summary(hamby_comparisons_raw %>% filter(flag == FALSE) %>% select(all_of(rtrees_features)))
## ccf rough_cor D sd_D
## Min. :0.01407 Min. :-0.86978 Min. : 0.05055 Min. :0.780
## 1st Qu.:0.21297 1st Qu.:-0.18430 1st Qu.: 2.02162 1st Qu.:2.347
## Median :0.27541 Median :-0.05671 Median : 2.76574 Median :2.724
## Mean :0.28940 Mean :-0.07166 Mean : 3.29264 Mean :2.808
## 3rd Qu.:0.34822 3rd Qu.: 0.04999 3rd Qu.: 4.07658 3rd Qu.:3.208
## Max. :0.98110 Max. : 0.96355 Max. :20.79681 Max. :7.263
## matches mismatches cms non_cms
## Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 1.209 1st Qu.: 8.027 1st Qu.: 0.5915 1st Qu.: 3.675
## Median : 2.230 Median : 9.494 Median : 1.1388 Median : 4.848
## Mean : 2.448 Mean : 9.756 Mean : 1.3578 Mean : 5.345
## 3rd Qu.: 3.282 3rd Qu.:11.228 3rd Qu.: 1.7112 3rd Qu.: 6.525
## Max. :18.656 Max. :33.684 Max. :15.2131 Max. :33.684
## sum_peaks
## Min. : 0.000
## 1st Qu.: 1.638
## Median : 2.764
## Mean : 3.030
## 3rd Qu.: 4.105
## Max. :25.990
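One rough check on the units hypothesis is to compare the ratios of the medians of D and sd_D across the two datasets (values hard-coded here from the summaries printed above). The ratios are on the order of hundreds to a thousand, consistent with a change of measurement scale, although they are not identical:

```r
# Median values of D and sd_D copied from the summaries above
median_D_hamby <- 2.77395
median_D_ccfs  <- 0.0026339
median_sd_D_hamby <- 2.728
median_sd_D_ccfs  <- 0.004307

round(median_D_hamby / median_D_ccfs)        # about 1053
round(median_sd_D_hamby / median_sd_D_ccfs)  # about 633
```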
Comparing visualizations of the distributions of the features used to train rtrees from the two datasets to again see the differences in distributions of D and sd_D:
bind_rows(
hamby_comparisons_raw %>%
filter(flag == FALSE) %>%
select(all_of(rtrees_features)) %>%
pivot_longer(cols = everything()) %>%
mutate(dataset = "hamby_comparisons"),
CCFs_withlands_raw %>% select(all_of(rtrees_features)) %>%
pivot_longer(cols = everything()) %>%
mutate(dataset = "CCFs_withlands")
) %>%
ggplot(aes(x = value, fill = dataset)) +
geom_histogram(position = "dodge") +
facet_wrap(. ~ name, scales = "free") +
theme(legend.position = "bottom")
There are discrepancies in the naming conventions of the two datasets. This section contains the code that identifies the differences and cleans the data.
The following variables are contained in the raw versions of the data. Note that they have different variables and different names. For our analysis, we only need the variables identifying the lands, the features used to train rtrees, the ground truth for whether or not the bullets are a match, and variables flagging bullets with tank rash.
names(hamby_comparisons_raw)
## [1] "land_id1" "land_id2" "ccf" "rough_cor"
## [5] "lag" "abs_lag" "D" "sd_D"
## [9] "matches" "mismatches" "cms" "overlap"
## [13] "non_cms" "sum_peaks" "signature_length" "same_source"
## [17] "flag"
names(CCFs_withlands_raw)
## [1] "compare_id" "profile1_id" "profile2_id" "ccf"
## [5] "rough_cor" "lag" "D" "sd_D"
## [9] "signature_length" "overlap" "matches" "mismatches"
## [13] "cms" "non_cms" "sum_peaks" "land_id.x"
## [17] "land_id.y" "match" "study.x" "barrel.x"
## [21] "bullet.x" "land.x" "study.y" "barrel.y"
## [25] "bullet.y" "land.y"
The code below splits the land ID variables in hamby_comparisons_raw into separate columns for study, barrel, bullet, and land, and selects only the variables of interest.
hamby_comparisons_select <-
hamby_comparisons_raw %>%
separate(land_id1, c("study1", "barrel1", "bullet1", "land1")) %>%
separate(land_id2, c("study2", "barrel2", "bullet2", "land2")) %>%
select(
study1,
barrel1,
bullet1,
land1,
study2,
barrel2,
bullet2,
land2,
all_of(rtrees_features),
same_source,
flag
)
The code below renames some of the variables in CCFs_withlands_raw, selects only the variables of interest, and converts all land labels to characters.
CCFs_withlands_select <-
CCFs_withlands_raw %>%
rename(
"study1" = "study.x",
"barrel1" = "barrel.x",
"bullet1" = "bullet.x",
"land1" = "land.x",
"study2" = "study.y",
"barrel2" = "barrel.y",
"bullet2" = "bullet.y",
"land2" = "land.y",
"same_source"= "match"
) %>%
select(
study1,
barrel1,
bullet1,
land1,
study2,
barrel2,
bullet2,
land2,
all_of(rtrees_features),
same_source
) %>%
mutate(across(c(barrel1, bullet1, land1, barrel2, bullet2, land2), as.character))
Check to make sure the structures of the two datasets agree:
hamby_comparisons_select %>% str()
## 'data.frame': 84666 obs. of 19 variables:
## $ study1 : chr "Hamby173" "Hamby173" "Hamby173" "Hamby173" ...
## $ barrel1 : chr "Br10" "Br10" "Br10" "Br10" ...
## $ bullet1 : chr "B1" "B1" "B1" "B1" ...
## $ land1 : chr "L1" "L1" "L1" "L1" ...
## $ study2 : chr "Hamby173" "Hamby173" "Hamby173" "Hamby173" ...
## $ barrel2 : chr "Br10" "Br10" "Br10" "Br10" ...
## $ bullet2 : chr "B1" "B1" "B1" "B1" ...
## $ land2 : chr "L2" "L3" "L4" "L5" ...
## $ ccf : num 0.187 0.256 0.233 0.349 0.207 ...
## $ rough_cor : num -0.0326 0.1787 0.0808 0.1682 0.0381 ...
## $ D : num 2.93 1.43 2.49 2.29 2.66 ...
## $ sd_D : num 2.64 1.88 2.57 2.68 2.57 ...
## $ matches : num 3.21 3.37 2.37 2.66 1.13 ...
## $ mismatches : num 7.48 9.54 12.43 9.56 10.69 ...
## $ cms : num 1.6 1.68 2.37 1.59 1.13 ...
## $ non_cms : num 3.74 2.8 9.47 5.31 5.63 ...
## $ sum_peaks : num 4.21 3.63 2.3 3.92 0.45 ...
## $ same_source: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ flag : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
CCFs_withlands_select %>% str()
## 'data.frame': 83028 obs. of 18 variables:
## $ study1 : chr "Hamby252" "Hamby252" "Hamby252" "Hamby252" ...
## $ barrel1 : chr "10" "10" "10" "10" ...
## $ bullet1 : chr "1" "1" "1" "1" ...
## $ land1 : chr "1" "1" "1" "1" ...
## $ study2 : chr "Hamby252" "Hamby252" "Hamby252" "Hamby252" ...
## $ barrel2 : chr "10" "10" "10" "10" ...
## $ bullet2 : chr "1" "1" "1" "1" ...
## $ land2 : chr "2" "3" "4" "5" ...
## $ ccf : num 0.173 0.422 0.175 0.424 0.294 ...
## $ rough_cor : num -0.58411 -0.41022 -0.16741 0.27833 -0.00905 ...
## $ D : num 0.00341 0.00332 0.00208 0.00187 0.00214 ...
## $ sd_D : num 0.00478 0.00479 0.00359 0.00329 0.00348 ...
## $ matches : num 1.09 1.22 2.19 4.09 5.41 ...
## $ mismatches : num 6.62 9.15 5.42 7 6 ...
## $ cms : num 0.547 0.612 1.644 2.338 3.249 ...
## $ non_cms : num 4.07 8.13 4.43 2.8 1.5 ...
## $ sum_peaks : num 0.303 1.864 1.654 4.414 5.142 ...
## $ same_source: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
The function below extracts the labels from the two datasets for a given variable name var and creates a plot to show the discrepancies between hamby_comparisons_select and CCFs_withlands_select.
plot_labels <- function(var, ccf_data) {
# Extract the labels
hc_labels <-
unique(c(
hamby_comparisons_select %>% pull(paste0(var, "1")),
hamby_comparisons_select %>% pull(paste0(var, "2"))
))
ccfwl_labels <-
unique(c(
ccf_data %>% pull(paste0(var, "1")),
ccf_data %>% pull(paste0(var, "2"))
))
# Plot the labels
data.frame(data = c(
rep("hamby comparisons", length(hc_labels)),
rep("CCFs with lands", length(ccfwl_labels))
),
labels = c(hc_labels, ccfwl_labels)) %>%
ggplot(aes(x = labels, y = data)) +
geom_tile() +
labs(x = "", y = "", title = paste("Comparing labels in variable:", var))
}
The plots below show that the following changes need to be made to CCFs_withlands_select in order to match hamby_comparisons_select:

- the study label Hamby44 needs to be changed to Hamby173
- lettered barrels need to be changed to BrUnk, and Br needs to be prepended to the other barrels
- the barrel letters need to become the bullet label, and B needs to be prepended to all bullets
- L needs to be prepended to all lands
plot_labels(var = "study", ccf_data = CCFs_withlands_select)
plot_labels(var = "barrel", ccf_data = CCFs_withlands_select)
plot_labels(var = "bullet", ccf_data = CCFs_withlands_select)
plot_labels(var = "land", ccf_data = CCFs_withlands_select)
Determine the letters in the CCFs_withlands_select unknown barrels:
all_barrel_labels <-
unique(c(
as.character(CCFs_withlands_select$barrel1),
as.character(CCFs_withlands_select$barrel2)
))
letters <- all_barrel_labels[!(all_barrel_labels %in% 1:10)]
letters
## [1] "B" "C" "D" "E" "F" "H" "J" "L" "M" "Q" "S" "U" "X" "Y" "Z" "A" "G" "I" "N"
## [20] "R" "V" "W"
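As a quick sanity check (using the 22 letters printed above, hard-coded here), the unknown-barrel labels cover the alphabet except for four letters:

```r
# The 22 letters observed in the unknown-barrel labels above
unk_letters <- c("B", "C", "D", "E", "F", "H", "J", "L", "M", "Q", "S",
                 "U", "X", "Y", "Z", "A", "G", "I", "N", "R", "V", "W")
setdiff(LETTERS, unk_letters)  # "K" "O" "P" "T"
```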
Clean CCFs_withlands_select so the labels match hamby_comparisons_select:
CCFs_withlands_labelled <-
CCFs_withlands_select %>%
mutate(
study1 = ifelse(study1 == "Hamby44", "Hamby173", study1),
study2 = ifelse(study2 == "Hamby44", "Hamby173", study2),
bullet1 = ifelse(barrel1 %in% letters, barrel1, bullet1),
barrel1 = ifelse(barrel1 %in% letters, "Unk", barrel1),
bullet2 = ifelse(barrel2 %in% letters, barrel2, bullet2),
barrel2 = ifelse(barrel2 %in% letters, "Unk", barrel2)
) %>%
mutate(
barrel1 = paste0("Br", barrel1),
bullet1 = paste0("B", bullet1),
land1 = paste0("L", land1),
barrel2 = paste0("Br", barrel2),
bullet2 = paste0("B", bullet2),
land2 = paste0("L", land2)
)
The plots below show that the labels are now in agreement:
plot_labels(var = "study", ccf_data = CCFs_withlands_labelled)
plot_labels(var = "barrel", ccf_data = CCFs_withlands_labelled)
plot_labels(var = "bullet", ccf_data = CCFs_withlands_labelled)
plot_labels(var = "land", ccf_data = CCFs_withlands_labelled)
Create the land ids for both datasets:
hamby_comparisons <-
hamby_comparisons_select %>%
mutate(
land_id1 = paste(study1, barrel1, bullet1, land1, sep = "-"),
land_id2 = paste(study2, barrel2, bullet2, land2, sep = "-")
) %>%
select(land_id1, land_id2, all_of(rtrees_features), same_source, flag)
CCFs_withlands <-
CCFs_withlands_labelled %>%
mutate(
land_id1 = paste(study1, barrel1, bullet1, land1, sep = "-"),
land_id2 = paste(study2, barrel2, bullet2, land2, sep = "-")
) %>%
select(land_id1, land_id2, all_of(rtrees_features), same_source)
The structures of the cleaned data:
hamby_comparisons %>% str()
## 'data.frame': 84666 obs. of 13 variables:
## $ land_id1 : chr "Hamby173-Br10-B1-L1" "Hamby173-Br10-B1-L1" "Hamby173-Br10-B1-L1" "Hamby173-Br10-B1-L1" ...
## $ land_id2 : chr "Hamby173-Br10-B1-L2" "Hamby173-Br10-B1-L3" "Hamby173-Br10-B1-L4" "Hamby173-Br10-B1-L5" ...
## $ ccf : num 0.187 0.256 0.233 0.349 0.207 ...
## $ rough_cor : num -0.0326 0.1787 0.0808 0.1682 0.0381 ...
## $ D : num 2.93 1.43 2.49 2.29 2.66 ...
## $ sd_D : num 2.64 1.88 2.57 2.68 2.57 ...
## $ matches : num 3.21 3.37 2.37 2.66 1.13 ...
## $ mismatches : num 7.48 9.54 12.43 9.56 10.69 ...
## $ cms : num 1.6 1.68 2.37 1.59 1.13 ...
## $ non_cms : num 3.74 2.8 9.47 5.31 5.63 ...
## $ sum_peaks : num 4.21 3.63 2.3 3.92 0.45 ...
## $ same_source: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ flag : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
CCFs_withlands %>% str()
## 'data.frame': 83028 obs. of 12 variables:
## $ land_id1 : chr "Hamby252-Br10-B1-L1" "Hamby252-Br10-B1-L1" "Hamby252-Br10-B1-L1" "Hamby252-Br10-B1-L1" ...
## $ land_id2 : chr "Hamby252-Br10-B1-L2" "Hamby252-Br10-B1-L3" "Hamby252-Br10-B1-L4" "Hamby252-Br10-B1-L5" ...
## $ ccf : num 0.173 0.422 0.175 0.424 0.294 ...
## $ rough_cor : num -0.58411 -0.41022 -0.16741 0.27833 -0.00905 ...
## $ D : num 0.00341 0.00332 0.00208 0.00187 0.00214 ...
## $ sd_D : num 0.00478 0.00479 0.00359 0.00329 0.00348 ...
## $ matches : num 1.09 1.22 2.19 4.09 5.41 ...
## $ mismatches : num 6.62 9.15 5.42 7 6 ...
## $ cms : num 0.547 0.612 1.644 2.338 3.249 ...
## $ non_cms : num 4.07 8.13 4.43 2.8 1.5 ...
## $ sum_peaks : num 0.303 1.864 1.654 4.414 5.142 ...
## $ same_source: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
Save the cleaned data:
write.csv(x = hamby_comparisons, file = "../../../data/hamby_comparisons_clean.csv", row.names = FALSE)
write.csv(x = CCFs_withlands, file = "../../../data/CCFs_withlands_clean.csv", row.names = FALSE)
This section considers the differences in bullets included in the two datasets. The code below extracts the unique land IDs from the two datasets:
hamby_comparisons_ids = unique(c(hamby_comparisons$land_id1, hamby_comparisons$land_id2))
CCFs_withlands_ids = unique(c(CCFs_withlands$land_id1, CCFs_withlands$land_id2))
Below are the numbers of unique land IDs contained in each dataset. There are fewer in CCFs_withlands than in hamby_comparisons:
length(hamby_comparisons_ids)
## [1] 412
length(CCFs_withlands_ids)
## [1] 408
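These counts also explain the row counts of the two datasets: both match the number of unordered pairs of their lands, which suggests each dataset contains every pairwise land-to-land comparison exactly once (a check using the counts above):

```r
# Number of unordered pairs of lands in each dataset
choose(412, 2)  # 84666, the number of rows in hamby_comparisons
choose(408, 2)  # 83028, the number of rows in CCFs_withlands (and of rtrees predictions)
```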
Identify the land IDs in CCFs_withlands but not in hamby_comparisons (Heike identified that Hamby173-Br3-B2-L1 has damage on the bullet, but she is not sure why Hamby173-BrUnk-BM-L4 was left out):
CCFs_withlands_ids[!(CCFs_withlands_ids %in% hamby_comparisons_ids)]
## [1] "Hamby173-Br3-B2-L1" "Hamby173-BrUnk-BM-L4"
Identify the land IDs in hamby_comparisons but not in CCFs_withlands (Heike confirmed that bullet E is known to have issues):
hamby_comparisons_ids[!(hamby_comparisons_ids %in% CCFs_withlands_ids)]
## [1] "Hamby173-BrUnk-BE-L1" "Hamby173-BrUnk-BE-L2" "Hamby173-BrUnk-BE-L3"
## [4] "Hamby173-BrUnk-BE-L4" "Hamby173-BrUnk-BE-L5" "Hamby173-BrUnk-BE-L6"
One of the aspects that led to the realization that the hamby-comparisons data is not the training data for rtrees is that the number of observations in the data does not agree with the number of predictions in the rtrees model. The number of observations used to train rtrees is the following:
length(bulletxtrctr::rtrees$predicted)
## [1] 83028
Here are the dimensions of the hamby_comparisons data, which do not agree with the number of observations used to train rtrees:
hamby_comparisons %>% dim()
## [1] 84666 13
Even with the observations known to have tank rash removed (flag == TRUE), the dimensions of the hamby_comparisons data do not agree with rtrees:
hamby_comparisons %>% filter(flag == FALSE) %>% dim()
## [1] 84255 13
The code below removes observations based on Eric Hare’s code (available here), which he thinks may be the code used to train rtrees:
bullet_train_eric_filter <-
hamby_comparisons %>%
separate(col = "land_id1", into = c("study1", "barrel1", "bullet1", "land1")) %>%
separate(col = "land_id2", into = c("study2", "barrel2", "bullet2", "land2")) %>%
filter(study1 != "Hamby173" | bullet1 != "BE") %>%
filter(study2 != "Hamby173" | bullet2 != "BE")
# Eric's code for reference:
# filter(study.y != "Hamby44" | barrel.y != "E") %>%
# filter(study.x != "Hamby44" | barrel.x != "E") %>%
Again, the dimensions do not agree with the number of observations in rtrees:
dim(bullet_train_eric_filter)
## [1] 82215 19
length(bulletxtrctr::rtrees$predicted) == dim(bullet_train_eric_filter)[1]
## [1] FALSE
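The 82215 rows are still consistent with an all-pairs structure: if hamby_comparisons contains every unordered pair of its 412 lands, removing all comparisons involving the six bullet E lands should leave choose(406, 2) rows, which matches the dimensions above (a sketch under that all-pairs assumption):

```r
# 412 lands minus the six bullet E lands
choose(412 - 6, 2)  # 82215
```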
Hare, Hofmann, and Carriquiry (2017) describe removing four land impressions that were flagged for quality assessment. These lands should correspond to Hamby 252. However, it is possible to obtain the same number of observations as used to train rtrees by filtering out observations in hamby_comparisons using these barrel, bullet, and land combinations without specifying the study (that is, removing observations from both Hamby 173 and Hamby 252).
hamby_comparisons_filtered <-
hamby_comparisons %>%
mutate(bbl1 = str_remove(land_id1, pattern = "Hamby252-|Hamby173-"),
bbl2 = str_remove(land_id2, pattern = "Hamby252-|Hamby173-")) %>%
filter(!(bbl1 %in% c("Br6-B2-L1", "Br9-B2-L4", "BrUnk-BB-L2", "BrUnk-BQ-L4") |
bbl2 %in% c("Br6-B2-L1", "Br9-B2-L4", "BrUnk-BB-L2", "BrUnk-BQ-L4"))) %>%
select(-bbl1, -bbl2)
hamby_comparisons_filtered %>% dim()
## [1] 83028 13
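If hamby_comparisons contains every unordered pair of its 412 lands, removing all comparisons that involve four specific lands should leave choose(408, 2) rows, which matches the filtered dimensions above (a sketch under that all-pairs assumption):

```r
# 412 lands minus the four flagged lands
choose(412 - 4, 2)  # 83028
```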
Save the cleaned and filtered data:
write.csv(x = hamby_comparisons_filtered, file = "../../../data/hamby_comparisons_filtered.csv", row.names = FALSE)
While it is possible to filter hamby_comparisons down to the same number of observations as rtrees, how the bullets were removed is only a guess. On the other hand, CCFs_withlands already has the same number of rows as rtrees:
CCFs_withlands %>% dim()
## [1] 83028 12
Note that if the filter from Eric Hare's R script (which he thinks may be the code used to train rtrees) is applied to CCFs_withlands, the number of rows does not change:
CCFs_withlands %>%
separate(col = "land_id1", into = c("study1", "barrel1", "bullet1", "land1")) %>%
separate(col = "land_id2", into = c("study2", "barrel2", "bullet2", "land2")) %>%
filter(study1 != "Hamby173" | bullet1 != "BE") %>%
filter(study2 != "Hamby173" | bullet2 != "BE") %>%
dim()
## [1] 83028 18
It is not possible to directly compare the internal predictions from rtrees (predict(bulletxtrctr::rtrees, type = "prob")[,2]) to those obtained by using rtrees to make predictions on the hamby_comparisons and CCFs_withlands data, because the internal predictions returned by the predict function from randomForest are the out-of-bag (OOB) predictions, while predictions on a new dataset are made using all of the trees in the forest. From the documentation for randomForest:::predict.randomForest:
newdata: a data frame or matrix containing new data. (Note: If not given, the out-of-bag prediction in object is returned.)
This was determined by Heike based on the following code. She pointed out, “Have a look at the code below - we would expect a line of identity in the last plot. Sorting values also does not help (that would help in case the observations got re-ordered internally in some way).”
# Simulate toy data with three predictors and a binary response
x1 <- rnorm(1000)
x2 <- rnorm(1000)
x3 <- rnorm(1000)
z <- rbinom(1000, 1, 0.4)
dframe <- data.frame(x1, x2, x3, z)
# Fit a random forest to the toy data
rf1 <- randomForest(factor(z) ~ x1 + x2 + x3, data = dframe)
# Predictions with newdata supplied: computed using all trees in the forest
dframe$predict <- predict(rf1, newdata = dframe, type = "prob")[,2]
# Predictions without newdata: the out-of-bag (OOB) predictions
dframe$rfscore <- predict(rf1, type = "prob")[,2]
dframe %>%
ggplot(aes(x = predict, y = rfscore)) + geom_point()
However, we may still gain some insights by comparing the distributions of the predictions. The code below extracts the predictions on the training data from rtrees and uses rtrees to make predictions on hamby_comparisons and CCFs_withlands:
rtrees_pred = predict(bulletxtrctr::rtrees, type = "prob")[,2]
hc_pred = predict(bulletxtrctr::rtrees, hamby_comparisons %>% select(all_of(rtrees_features)), type = "prob")[,2]
ccfwl_pred = predict(bulletxtrctr::rtrees, CCFs_withlands %>% select(all_of(rtrees_features)), type = "prob")[,2]
all_pred <-
data.frame(
dataset = c(
rep("Internal (OOB)", length(rtrees_pred)),
rep("Hamby comparisons", length(hc_pred)),
rep("CCFs with lands", length(ccfwl_pred))
),
pred = c(rtrees_pred, hc_pred, ccfwl_pred)
)
Plot showing the distributions of three sets of predictions from rtrees: (light blue) the internal predictions from rtrees, (blue) predictions obtained by applying rtrees to the observations in the CCFs_withlands data, and (purple) predictions obtained by applying rtrees to the observations in the hamby_comparisons data. Note that even though the internal predictions from rtrees are OOB and those computed using CCFs_withlands are not, the distributions are very similar. On the other hand, the predictions from hamby_comparisons are very different. This provides evidence that rtrees was trained on CCFs_withlands, and the change in units of D and sd_D in hamby_comparisons is apparent from the much different distribution of predictions from rtrees:
all_pred %>%
ggplot(aes(x = pred, fill = dataset, color = dataset)) +
geom_density(alpha = 0.5, size = 1) +
scale_fill_manual(values = c("blue", "purple", "lightblue")) +
scale_color_manual(values = c("blue", "purple", "lightblue")) +
theme_bw() +
theme(legend.position = "bottom") +
labs(x = "rtrees predictions",
y = "Density",
fill = "Data predictions computed on",
color = "Data predictions computed on")
The four plots below show direct comparisons of predictions between the internal rtrees predictions and those computed on hamby_comparisons and CCFs_withlands using rtrees. There are two versions of the plots for each of the possible training datasets. One plots the predictions without adjusting their order in any way. The other sorts the predictions from lowest to highest (within a dataset) in case that helps reveal a direct relationship. None of the figures show a one-to-one relationship, but notice that the plot titled "Comparing internal to CCFs with lands (not sorted)" looks very similar to the one created by Heike with the example data and random forest. This is additional evidence that CCFs_withlands was used to train rtrees: even without sorting the predictions, the relationship between the OOB and new predictions is very similar to an example where we know the same training data was used.
plot_grid(
all_pred %>%
filter(dataset != "CCFs with lands") %>%
group_by(dataset) %>%
mutate(case = 1:n()) %>%
pivot_wider(names_from = "dataset", values_from = "pred") %>%
ggplot(aes(x = `Hamby comparisons`, y = `Internal (OOB)`)) +
geom_point() +
theme(aspect.ratio = 1) +
labs(title = "Comparing internal to Hamby Comparisons (not sorted)"),
all_pred %>%
filter(dataset != "CCFs with lands") %>%
group_by(dataset) %>%
arrange(pred) %>%
mutate(case = 1:n()) %>%
pivot_wider(names_from = "dataset", values_from = "pred") %>%
ggplot(aes(x = `Hamby comparisons`, y = `Internal (OOB)`)) +
geom_point() +
theme(aspect.ratio = 1) +
labs(title = "Comparing internal to Hamby Comparisons (sorted)"),
all_pred %>%
filter(dataset != "Hamby comparisons") %>%
group_by(dataset) %>%
mutate(case = 1:n()) %>%
pivot_wider(names_from = "dataset", values_from = "pred") %>%
ggplot(aes(x = `CCFs with lands`, y = `Internal (OOB)`)) +
geom_point() +
theme(aspect.ratio = 1) +
labs(title = "Comparing internal to CCFs with lands (not sorted)"),
all_pred %>%
filter(dataset != "Hamby comparisons") %>%
group_by(dataset) %>%
arrange(pred) %>%
mutate(case = 1:n()) %>%
pivot_wider(names_from = "dataset", values_from = "pred") %>%
ggplot(aes(x = `CCFs with lands`, y = `Internal (OOB)`)) +
geom_point() +
theme(aspect.ratio = 1) +
labs(title = "Comparing internal to CCFs with lands (sorted)")
)
In this section, we retrain various random forest models using both the Hamby comparison and CCFs with lands data. We then compare the results from the retrained models to rtrees to check for similarity. The results show that the models retrained on the CCFs with lands data are very similar to rtrees, while the models trained on the Hamby comparison data are very different. Note that all models trained on the Hamby comparison data in this section use the observations in the filtered version of the dataset (hamby_comparisons_filtered), both so that the number of observations matches rtrees and because this was the data we were previously using in the LIME diagnostics paper.
Unfortunately, the seed used to train rtrees was lost, and changes in the way random number generation is done in R have made it impossible to reproduce rtrees exactly. However, Heike suggested two seeds that she believes were used around the time rtrees was trained: 20140501 and 20170222. Note that 20170222 is also the seed used in Eric's code.
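The retrained models below can only match rtrees exactly if the seed, the data, and R's random number generator all agree. As a minimal sketch of that assumption (with toy data, not the bullet data), randomForest is fully reproducible when refit with the same seed:

```r
library(randomForest)

# Toy data: two predictors and a binary response
set.seed(1)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- factor(d$x1 + d$x2 + rnorm(200) > 0)

# Refitting with the same seed reproduces the forest exactly
set.seed(20170222)
rf_a <- randomForest(y ~ x1 + x2, data = d, ntree = 50)
set.seed(20170222)
rf_b <- randomForest(y ~ x1 + x2, data = d, ntree = 50)
identical(rf_a$votes, rf_b$votes)  # TRUE
```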
Below, four random forest models are trained using the seeds suggested by Heike. The code used to fit the models is based on Eric's code, which is believed to be the code used to train the original rtrees. The first two random forests are fit using CCFs_withlands, and the third and fourth are fit using hamby_comparisons_filtered:
rtrees2_file = "../../../data/rfs_rtrees2.rds"
if (!file.exists(rtrees2_file)) {
set.seed(20140501)
rtrees2 <-
randomForest(
factor(same_source) ~ .,
data = CCFs_withlands %>% select(all_of(rtrees_features), same_source),
ntree = 300
)
saveRDS(object = rtrees2, file = rtrees2_file)
}
rtrees3_file = "../../../data/rfs_rtrees3.rds"
if (!file.exists(rtrees3_file)) {
set.seed(20170222)
rtrees3 <-
randomForest(
factor(same_source) ~ .,
data = CCFs_withlands %>% select(all_of(rtrees_features), same_source),
ntree = 300
)
saveRDS(object = rtrees3, file = rtrees3_file)
}
rtrees4_file = "../../../data/rfs_rtrees4.rds"
if (!file.exists(rtrees4_file)) {
set.seed(20140501)
rtrees4 <-
randomForest(
factor(same_source) ~ .,
data = hamby_comparisons_filtered %>% select(all_of(rtrees_features), same_source),
ntree = 300
)
saveRDS(object = rtrees4, file = rtrees4_file)
}
rtrees5_file = "../../../data/rfs_rtrees5.rds"
if (!file.exists(rtrees5_file)) {
set.seed(20170222)
rtrees5 <-
randomForest(
factor(same_source) ~ .,
data = hamby_comparisons_filtered %>% select(all_of(rtrees_features), same_source),
ntree = 300
)
saveRDS(object = rtrees5, file = rtrees5_file)
}
Load the saved versions of the models:
rtrees2 <- readRDS(rtrees2_file)
rtrees3 <- readRDS(rtrees3_file)
rtrees4 <- readRDS(rtrees4_file)
rtrees5 <- readRDS(rtrees5_file)
Extract the votes from each of the newly trained random forest models and rtrees and prepare the dataframe for plotting:
rtrees_votes <-
data.frame(
rtrees = bulletxtrctr::rtrees$votes,
rtrees2 = rtrees2$votes,
rtrees3 = rtrees3$votes,
rtrees4 = rtrees4$votes,
rtrees5 = rtrees5$votes
) %>%
mutate(obs = 1:n()) %>%
pivot_longer(cols = -obs) %>%
separate(name, c("model", "label")) %>%
pivot_wider(names_from = model, values_from = value)
Plots comparing the votes from rtrees to those from the newly trained models. The top row contains the comparisons with the random forests trained on CCFs_withlands. Note that the first two plots, which compare the votes from the models retrained on CCFs_withlands to rtrees, look very similar to the third plot, which compares the votes from the two retrained models to each other. This suggests that, even though the three models were trained using different seeds, the training data is the same. The bottom row is created using the votes from the random forests trained on hamby_comparisons_filtered; there, the two plots comparing the votes to rtrees look very different from the third plot comparing the votes from the two retrained models. This suggests that hamby_comparisons_filtered was not the training data for rtrees, or at least that the observations are not ordered the same way in the dataset.
plot_grid(
rtrees_votes %>%
ggplot(aes(
x = rtrees, y = rtrees2, color = label
)) +
geom_point(shape = 1) +
labs(x = "rtrees votes",
y = "rtrees2 votes",
title = "Retrained RF on CCFs with lands data \nversus rtrees; Seed 20140501"),
rtrees_votes %>%
ggplot(aes(
x = rtrees, y = rtrees3, color = label
)) +
geom_point(shape = 1) +
labs(x = "rtrees votes",
y = "rtrees3 votes",
title = "Retrained RF on CCFs with lands data \nversus rtrees; Seed 20170222"),
rtrees_votes %>%
ggplot(aes(
x = rtrees2, y = rtrees3, color = label
)) +
geom_point(shape = 1) +
labs(x = "rtrees2 votes",
y = "rtrees3 votes",
title = "Both retrained RFs on CCFs with \nlands data"),
rtrees_votes %>%
ggplot(aes(
x = rtrees, y = rtrees4, color = label
)) +
geom_point(shape = 1) +
labs(x = "rtrees votes",
y = "rtrees4 votes",
title = "Retrained RF on Hamby comparison data \nversus rtrees; Seed 20140501"),
rtrees_votes %>%
ggplot(aes(
x = rtrees, y = rtrees5, color = label
)) +
geom_point(shape = 1) +
labs(x = "rtrees votes",
y = "rtrees5 votes",
title = "Retrained RF on Hamby comparison data \nversus rtrees; Seed 20170222"),
rtrees_votes %>%
ggplot(aes(
x = rtrees4, y = rtrees5, color = label
)) +
geom_point(shape = 1) +
labs(x = "rtrees4 votes",
y = "rtrees5 votes",
title = "Both retrained RFs on Hamby \ncomparison data"),
nrow = 2
)