class: center, middle, inverse, title-slide

.title[
# Tracing Trees
]
.subtitle[
## .light_green[Visualizing Variability in the Architecture of Random Forest Trees Using Extensions of Trace Plots]
]
.author[
### Katherine Goode
]
.author[
### Presented at ISU Graphics Group
]
.date[
### April 1, 2021

.small[Code used to create slides available here]
]

---

<style type="text/css">
.tiny{font-size: 30%}
.small{font-size: 50%}
.medium{font-size: 75%}
.left-code {
  color: #777;
  width: 39%;
  height: 92%;
  float: left;
}
.right-plot {
  width: 59%;
  float: right;
  padding-left: 2%;
}
.scroll-output {
  height: 90%;
  overflow-y: scroll;
}
.content-box {
  box-sizing: border-box;
  border-radius: 15px;
  margin: 0 0 15px;
  overflow: hidden;
  padding: 0px 20px 0px 20px;
  width: 100%;
  background-color: #c7cfb7;
}
</style>

# Overview

- Background
  - Random forests
  - Trace plots

<br>

- TreeTracer: Trace Plots in R

<br>

- Extending Trace Plots

<br>

- Patterns in the Forest

<br>

- Limitations and Ideas for Future Work

---
class: inverse, center, middle

# Background

---

## Random Forests

<img src="figures/rf-diagram.png" width="2580" style="display: block; margin: auto;" />

---

## Common Tree Visualization

From [Urbanek (2008)](https://link.springer.com/chapter/10.1007/978-3-540-33037-0_11):

<img src="figures/tree.png" width="100%" style="display: block; margin: auto;" />

---

## Trace Plots (one tree)

From [Urbanek (2008)](https://link.springer.com/chapter/10.1007/978-3-540-33037-0_11):

<img src="figures/trace.png" width="100%" style="display: block; margin: auto;" />

---

## Trace Plots (forest of trees)

From [Urbanek (2008)](https://link.springer.com/chapter/10.1007/978-3-540-33037-0_11):

<img src="figures/trace-plot.png" width="85%" style="display: block; margin: auto;" />

---

## Example: Predicting Penguin Species

```r
# Load the Palmer penguins data and extract features
library(dplyr)  # provides %>% and select()
penguins <- na.omit(palmerpenguins::penguins)
penguins_feat <-
  penguins %>%
  select(bill_depth_mm, bill_length_mm, flipper_length_mm, body_mass_g)
```

<img src="slides_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

<img src="slides_files/figure-html/unnamed-chunk-8-1.png" width="90%" style="display: block; margin: auto;" />

---

```r
# Fit a random forest
set.seed(71)
penguins_rf <-
  randomForest::randomForest(
    species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
    data = penguins,
    ntree = 50
  )
```

```r
# Print the confusion matrix
penguins_rf$confusion
```

```
##           Adelie Chinstrap Gentoo class.error
## Adelie       142         3      1 0.027397260
## Chinstrap      4        64      0 0.058823529
## Gentoo         0         1    118 0.008403361
```

---

<br>

<img src="slides_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

<img src="slides_files/figure-html/unnamed-chunk-12-1.png" width="65%" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# TreeTracer: Trace Plots in R

---

## TreeTracer R Package

.content-box[
Functions for:

- Creating trace plots from random forests (with some extensions)
- Extracting tree data into a data frame for trace plots
- Computing distances between trees
]

<br>

GitHub repo: [https://github.com/goodekat/TreeTracer](https://github.com/goodekat/TreeTracer)

<br>

```r
# To install the package from GitHub
# Use with caution -- very much still in development
remotes::install_github("goodekat/TreeTracer")
```

---

## Individual Tree Data Frame

### randomForest

```r
rf_tree1 <- randomForest::getTree(rfobj = penguins_rf, k = 1)
```
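---

## Sketch: Reading a `getTree()` Matrix

As a rough illustration of what the tree matrix above contains, the sketch below fits a small forest to the built-in `iris` data (an assumption made only to keep the example self-contained; the penguin forest would be inspected the same way) and looks at the six columns documented for `randomForest::getTree()`. The names `iris_rf` and `tree1` are hypothetical.

```r
# Illustrative only: iris stands in for the penguins data
set.seed(71)
iris_rf <- randomForest::randomForest(Species ~ ., data = iris, ntree = 5)

# One row per node; columns give the daughters, the split, and the node status
tree1 <- randomForest::getTree(rfobj = iris_rf, k = 1)
colnames(tree1)

# Terminal nodes have status -1; every other node is a split
n_leaves <- sum(tree1[, "status"] == -1)
n_splits <- sum(tree1[, "status"] != -1)
c(nodes = nrow(tree1), splits = n_splits, leaves = n_leaves)
```

Because every split has exactly two daughters, the number of nodes should equal twice the number of leaves minus one.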
---

## Individual Tree Data Frame

### TreeTracer

```r
tt_tree1 <- TreeTracer::get_tree_data(rf = penguins_rf, k = 1)
```
---

## Trace Plot Data Frame

```r
tree1_trace <-
* TreeTracer::get_trace_data(
    tree_data = tt_tree1,
    rf = penguins_rf,
    train = penguins_feat
  )
```
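---

## Sketch: From Tree Matrix to Trace Coordinates

A trace plot positions each split by its variable and its depth in the tree. The sketch below is not TreeTracer's implementation; it only illustrates how depths could be derived, assuming nodes are stored in the `randomForest::getTree()` column layout with parents listed before their daughters. `toy_tree`, `node_depths()`, and `trace_df` are hypothetical names, and the toy values are made up.

```r
# Hypothetical toy tree in the column layout of randomForest::getTree()
toy_tree <- matrix(
  c(2, 3, 1, 45.0,  1, 0,
    4, 5, 2, 17.5,  1, 0,
    0, 0, 0,  0.0, -1, 3,
    0, 0, 0,  0.0, -1, 1,
    0, 0, 0,  0.0, -1, 2),
  ncol = 6, byrow = TRUE,
  dimnames = list(NULL, c("left daughter", "right daughter", "split var",
                          "split point", "status", "prediction"))
)

# Walk the rows in order, passing each node's depth on to its daughters
# (assumes a parent always appears in an earlier row than its daughters)
node_depths <- function(tree) {
  depth <- rep(NA_integer_, nrow(tree))
  depth[1] <- 1L
  for (node in seq_len(nrow(tree))) {
    kids <- tree[node, c("left daughter", "right daughter")]
    kids <- kids[kids > 0]
    depth[kids] <- depth[node] + 1L
  }
  depth
}

# Keep only the split nodes: one row per point drawn in a trace plot
splits <- toy_tree[, "status"] == 1
trace_df <- data.frame(
  split_var   = toy_tree[splits, "split var"],
  split_point = toy_tree[splits, "split point"],
  depth       = node_depths(toy_tree)[splits]
)
trace_df
```

Each row of `trace_df` corresponds to one split; connecting a split to its daughters' splits would give the line segments of the trace.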
---

## Trace Plot Function (one tree)

.left-code[
```r
penguin_trace_tree1 <-
* TreeTracer::trace_plot(
    rf = penguins_rf,
    train = penguins_feat,
    tree_ids = 1,
    alpha = 1
  )
```
]

.right-plot[
<img src="slides_files/figure-html/unnamed-chunk-21-1.png" width="90%" style="display: block; margin: auto;" />
]

---

## Trace Plot Function (multiple trees)

.left-code[
```r
ntrees <- penguins_rf$ntree
penguin_trace <-
* TreeTracer::trace_plot(
    rf = penguins_rf,
    train = penguins_feat,
    tree_ids = 1:ntrees,
    alpha = 0.4
  )
```
]

.right-plot[
<img src="slides_files/figure-html/unnamed-chunk-23-1.png" width="90%" style="display: block; margin: auto;" />
]

---
class: inverse, center, middle

# Extending Trace Plots

---

## Coloring Trees

.left-code[
```r
penguin_trace_col <-
  trace_plot(
    rf = penguins_rf,
    train = penguins_feat,
    tree_ids = 1:ntrees,
    alpha = 0.4,
*   tree_color = "#9dad7f"
  )
```
]

.right-plot[
<img src="slides_files/figure-html/unnamed-chunk-25-1.png" width="90%" style="display: block; margin: auto;" />
]

---

## Display a Representative Tree

.left-code[
```r
penguin_trace_rep <-
  trace_plot(
    rf = penguins_rf,
    train = penguins_feat,
    tree_ids = 1:ntrees,
    alpha = 0.4,
    tree_color = "#9dad7f",
*   rep_tree =
*     get_tree_data(
*       rf = penguins_rf,
*       k = 12
*     ),
*   rep_tree_size = 1.5,
*   rep_tree_alpha = 0.9,
*   rep_tree_color = "#557174"
  ) +
  labs(
    title = "Highlighting Tree 12"
  )
```
]

.right-plot[
<img src="slides_files/figure-html/unnamed-chunk-27-1.png" width="90%" style="display: block; margin: auto;" />
]

---

## Color by ID

.left-code[
```r
penguin_trace_by_id <-
  trace_plot(
    rf = penguins_rf,
    train = penguins_feat,
    tree_ids = 1:6,
    alpha = 0.9,
*   color_by_id = TRUE
  ) +
  scale_color_manual(
    values = c(
      "#c7cfb7",
      "#9dad7f",
      "#557174",
      "#D67236",
      "#F1BB7B",
      "#916a89"
    ))
```
]

.right-plot[
<img src="slides_files/figure-html/unnamed-chunk-29-1.png" width="90%" style="display: block; margin: auto;" />
]

---

## Facet by ID

.left-code[
```r
penguin_trace_facet <-
  trace_plot(
    rf = penguins_rf,
    train = penguins_feat,
    tree_ids = 1:6,
    alpha = 0.9,
    color_by_id = TRUE,
*   facet_by_id = TRUE
  ) +
  scale_color_manual(
    values = c(
      "#c7cfb7",
      "#9dad7f",
      "#557174",
      "#D67236",
      "#F1BB7B",
      "#916a89"
    ))
```
]

.right-plot[
<img src="slides_files/figure-html/unnamed-chunk-31-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Maximum Depth

.left-code[
```r
penguin_trace_max <-
  trace_plot(
    rf = penguins_rf,
    train = penguins_feat,
    tree_ids = 1:ntrees,
    alpha = 0.4,
*   max_depth = 3
  )
```
]

.right-plot[
<img src="slides_files/figure-html/unnamed-chunk-33-1.png" width="100%" style="display: block; margin: auto;" />
]

---

<img src="figures/bullet-rf.png" width="23%" style="display: block; margin: auto;" />

---

<img src="figures/bullet-rf-small.png" width="960" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Patterns in the Forest

---

## Two Approaches to Finding Patterns

### (1) Clusters of Trees

- Are there clusters of trees within a forest?
- Clusters would indicate whether the forest relies on similar or different decision paths

### (2) Representative Tree

- Can we extract a tree that represents the forest?
- Can we extract a tree representative of each cluster?

.pull-left[
### Examples of Previous Work

.medium[
- [Sies and Van Mechelen (2020)](https://link.springer.com/article/10.1007/s00357-019-09350-4)
- [Weinberg and Last (2019)](https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0186-3)
- [Weinberg and Last (2017)](https://sciendo.com/article/10.1515/amcs-2017-0051)
- [Banerjee, Ding, and Noone (2011)](https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.4492)
- [Miglio and Soffritti (2004)](https://www.sciencedirect.com/science/article/abs/pii/S016794730300063X?via%3Dihub)
- [Shannon and Banks (1999)](https://onlinelibrary.wiley.com/doi/epdf/10.1002/%28SICI%291097-0258%2819990330%2918%3A6%3C727%3A%3AAID-SIM61%3E3.0.CO%3B2-2)
- [Chipman, George, and McCulloch (1998)](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.42.2598)
]]

.pull-right[<br>.content-box[.center[
**Trace plots would be a great visualization tool for both approaches!**
]]]

---

## Visualizing Clusters

.pull-left[
Example of representative trees from clusters within a random forest from [Chipman, George, and McCulloch (1998)](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.42.2598):

<img src="figures/chipman.png" width="120%" style="display: block; margin: auto;" />
]

.pull-right[
Same trees in a trace plot:

<img src="slides_files/figure-html/unnamed-chunk-37-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Visualizing Representative Trees with Variability

Two scenarios:

.pull-left[
<img src="slides_files/figure-html/unnamed-chunk-39-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="slides_files/figure-html/unnamed-chunk-40-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Process for Clustering Trees

1. Start with a distance metric to compare similarities between trees
  - Several implemented in TreeTracer
2. Obtain a distance matrix
3. Apply a clustering method:
  - Hierarchical clustering
  - K-nearest neighbors
  - Multi-dimensional scaling
4. Visualize clusters using trace plots

---

## Step 1: Compute Distances

### Strategies to compare trees

Figures from [Sies and Van Mechelen (2020)](https://link.springer.com/article/10.1007/s00357-019-09350-4)

.left-code[
.center[**Comparing Predictions**]

<img src="figures/predictions.png" width="100%" style="display: block; margin: auto;" />
]

.right-plot[
.center[**Comparing Topology**]

<img src="figures/topology.png" width="120%" style="display: block; margin: auto;" />
]

---

### Current Metrics Implemented in TreeTracer

.pull-left[
**[Chipman, George, and McCulloch (1998)](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.42.2598)**

```r
# Fit metric: Compare
# predictions from two trees
fit_metric <-
  compute_fit_metric(
    rf = penguins_rf,
    data = penguins_feat
  )
```

```r
# Partition metric: Determine
# whether two observations fall
# in the same leaf in two trees
tree_preds <-
  get_tree_preds(
    data = penguins_feat,
    rf = penguins_rf
  )
partition_metric <-
  compute_partition_metric(
    rf = penguins_rf,
    tree_preds = tree_preds
  )
```
]

.pull-right[
**[Banerjee, Ding, and Noone (2011)](https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.4492)**

```r
# Covariate metric: Determine
# the agreement in predictors
# used by the trees
cov_metric <-
  compute_covariate_metric(
    rf = penguins_rf
  )
cov_metric3 <-
  compute_covariate_metric(
    rf = penguins_rf,
    max_depth = 3
  )
```
]

---

### Metric Details

<img src="figures/metrics.png" width="120%" style="display: block; margin: auto;" />

<br>
<br>
<br>
<br>

<!-- Let: -->

<!-- - `\(T_1\)` and `\(T_2\)` be two trees trained using `\((y_i,\textbf{x}_i)\)` for `\(i=1,...,n\)` -->
<!-- - `\(\textbf{x}_i=(x_{i1},...,x_{ik})\)` be a vector of `\(k\)` covariates for observation `\(i\)` -->

<!-- <br> -->

<!-- **Fit metric:** -->

<!-- `$$d\left(T_1,T_2\right)=\frac{1}{n}\sum_{i=1}^n m\left(\hat{y}_{i1},\hat{y}_{i2}\right)$$` -->

<!-- where: -->

<!-- - `\(\hat{y}_{ij}\)` is a fitted value for tree `\(j\)` -->
<!-- - `\(m\)` is a metric such as -->
<!--   - for a regression tree -->
<!-- `$$m\left(\hat{y}_1,\hat{y}_2\right)=\left(\hat{y}_1-\hat{y}_2\right)^2$$` -->
<!--   - for a classification tree -->
<!-- `$$m\left(\hat{y}_1,\hat{y}_2\right)=\begin{cases} 1 & \mbox{if} \ \ \hat{y}_1\neq\hat{y}_2 \\ 0 & \mbox{o.w.} \end{cases}$$` -->

<!-- **Partition metric:** -->

<!-- \begin{equation} -->
<!-- d\left(T_1, T_2\right)=\frac{\sum_{i>k}\left|I_1(i,k)-I_2(i,k)\right|}{n\choose2} -->
<!-- \end{equation} -->

<!-- where: -->

<!-- .medium[ -->
<!-- `$$I_1(i,k) =\begin{cases} 1 & \mbox{if } T_1 \mbox{ places observations } i \mbox{ and } k \mbox{ in the same terminal node} \\ 0 & \mbox{o.w.} \end{cases}$$` -->
<!-- ] -->

<!-- Note: The metric is scaled to lie between 0 and 1. -->

<!-- <br> -->

<!-- **Covariate metric** -->

<!-- `$$d(T_1, T_2)=\frac{\mbox{# of covariate mismatches between } T_1 \mbox{ and } T_2}{k}$$` -->

---

## Step 2: Obtain Distance Matrix

<img src="slides_files/figure-html/unnamed-chunk-50-1.png" width="75%" style="display: block; margin: auto;" />

---

## Step 3: Apply Clustering Method

<img src="slides_files/figure-html/unnamed-chunk-51-1.png" width="75%" style="display: block; margin: auto;" />

---

## Step 4: Visualize Clusters

### Covariate Metric

- Coordinate 1 explains variability between trees that use 3 or 4 variables for splits

<br>

<img src="slides_files/figure-html/unnamed-chunk-54-1.png" width="100%" style="display: block; margin: auto;" />

---

### Covariate Metric (max depth 3)

- Coordinate 1 separates trees by whether they split on body mass in the first three levels
- Coordinate 2 separates trees by whether they split on flipper length in the first three levels

<img src="slides_files/figure-html/unnamed-chunk-56-1.png" width="100%" style="display: block; margin: auto;" />

---

### Fit Metric

The trace plot does not make clear why these trees are outliers under the fit metric

- Perhaps a different visualization would be more helpful in this situation
- Alternatively, could focus on understanding the variability within the large cluster

<img src="slides_files/figure-html/unnamed-chunk-58-1.png" width="100%" style="display: block; margin: auto;" />

---

### Partition Metric

<img src="slides_files/figure-html/unnamed-chunk-60-1.png" width="45%" style="display: block; margin: auto;" />

<img src="slides_files/figure-html/unnamed-chunk-61-1.png" style="display: block; margin: auto;" />

---

<br>

<img src="slides_files/figure-html/unnamed-chunk-62-1.png" width="45%" style="display: block; margin: auto;" />

<img src="slides_files/figure-html/unnamed-chunk-63-1.png" style="display: block; margin: auto;" />

---

<br>
<br>

<img src="slides_files/figure-html/unnamed-chunk-64-1.png" style="display: block; margin: auto;" />

<br>

<img src="slides_files/figure-html/unnamed-chunk-65-1.png" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Limitations and Ideas for Future Work

---

## Limitations

### Cognitive load

- Too much information to extract understanding?

<br>

### Overplotting issues

- Too many trees hide the trends

<br>

### True trends?

- Are we actually able to identify realistic similarities and differences between trees?

---

## Ideas for Future Work

.pull-left[
### New Metrics

- Metric that compares two traces for similarities
- Compare the regions that are used to make a prediction

<br>

### Linking to other plots

- Sectioned scatterplots
- Visualizations of interactions created by splits
- Parallel coordinate plots with split points overlaid
]

.pull-right[
### Computing Representative Trees

- Implement developed methods
- Consider new methods
- Visualize representative tree in context of variability

<br>

### Other

- How to choose a maximum depth? (perhaps based on predictive accuracy)
- How to account for categorical variables?
]

---
class: inverse, center, middle

# Thank you!

<img src="figures/penguin.png" width="178" style="display: block; margin: auto;" />
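---

## Appendix Sketch: Clustering Trees by Fit Distance

The first three steps of the clustering process described earlier can be sketched in base R. This is a minimal illustration, not TreeTracer's code: it assumes the classification version of the fit metric (the proportion of observations on which two trees predict different classes), and `preds`, `fit_distance()`, and the toy predictions are hypothetical.

```r
# Hypothetical class predictions: rows are observations, columns are trees
preds <- cbind(
  tree1 = c("A", "A", "B", "B", "C", "C"),
  tree2 = c("A", "A", "B", "B", "C", "B"),
  tree3 = c("B", "C", "A", "A", "C", "C"),
  tree4 = c("B", "C", "A", "A", "B", "C")
)

# Steps 1-2: proportion of observations on which two trees disagree,
# collected into a symmetric distance matrix
fit_distance <- function(preds) {
  trees <- colnames(preds)
  d <- matrix(0, length(trees), length(trees), dimnames = list(trees, trees))
  for (i in trees) {
    for (j in trees) {
      d[i, j] <- mean(preds[, i] != preds[, j])
    }
  }
  d
}
d <- fit_distance(preds)

# Step 3: hierarchical clustering on the distance matrix, cut into two groups
hc <- stats::hclust(stats::as.dist(d), method = "average")
clusters <- stats::cutree(hc, k = 2)
clusters
```

For step 4, the cluster labels in `clusters` could be used to color or facet the traces; `stats::cmdscale()` on the same distance matrix would give multi-dimensional scaling coordinates instead of discrete groups.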