class: center, middle, inverse, title-slide .title[ #
.light_blue[Tracing Trees] ] .subtitle[ ## .ice_blue[Visualizing Random Forest Tree Variability with Trace Plots] ] .author[ ###
] .author[ ### Katherine Goode (5573) ] .author[ ###
kjgoode@sandia.gov
] .date[ ### July 11, 2022
.tiny[Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia LLC, a wholly owned subsidiary of Honeywell International Inc. for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. SAND 2022-12697 O.] ] --- <script> var slideshow = remark.create({slideNumberFormat : function (current, total) { return current + '/' + (this.getSlideByName("mylastslide").getSlideIndex() + 1); }, highlightStyle: "github", highlightLines: true, countIncrementalSlides: false}); </script> <style type="text/css"> .tiny{font-size: 30%} .small{font-size: 50%} .smallmedium{font-size: 65%} .medium{font-size: 75%} .left-code { color: #777; width: 39%; height: 92%; float: left; } .right-plot { width: 59%; float: right; padding-left: 2%; } .scroll-output { height: 90%; overflow-y: scroll; } .content-box { box-sizing: border-box; border-radius: 15px; margin: 0 0 15px; overflow: hidden; padding: 0px 20px 0px 20px; width: 100%; background-color: #c7cfb7; } .pull-left-v2 { width: 60%; height: 92%; float: left; } .pull-right-v2 { width: 35%; float: right; padding-left: 1%; } .pull-left-v3 { width: 75%; height: 92%; float: left; } .pull-right-v3 { width: 24%; float: right; padding-left: 1%; } .pull-left-v4 { width: 39%; height: 92%; float: left; } .pull-right-v4 { width: 60%; float: right; padding-left: 1%; } </style> ## Introduction to Katherine .pull-left-v3[ Research and development statistician in 5573 **Education** - BA in mathematics from [Lawrence University](https://www.lawrence.edu/academics/college/mathematics) - MS in statistics from [University of Wisconsin - Madison](https://stat.wisc.edu/) - PhD in statistics from [Iowa State University](https://www.stat.iastate.edu/) **Sandia Journey** - Dec 2019: Intern (mentored by Daniel Ries, 5574) - Sep 2021: Post-doc (mentored by J. Derek Tucker, 5573) - Dec 2021: FTE (mentored by J. 
Derek Tucker, 5573) **Research Interests** - Explainable machine learning - Data visualization - Model assessment Personal website: [goodekat.github.io](https://goodekat.github.io) Personal GitHub: [github.com/goodekat](https://github.com/goodekat) ] .pull-right-v3[ <img src="fig-static/kat-mina.JPEG" width="90%" style="display: block; margin: auto;" /> <br> <img src="fig-static/kat-kayak.JPG" width="90%" style="display: block; margin: auto;" /> ] --- ## Overview <br> - **Background**: Trace Plots <br> - **Methods**: Extending Trace Plots - .medium_grey[TreeTracer]: Implementation and Structural Augmentations in R - .medium_grey[Tree Summaries]: Identifying Representative Trees <br> - **Music Example**: Application with "larger" random forest <br> - **Conclusions**: Pros, Cons, and Possible Research Directions <br> .center[*Credits: Joint work with Heike Hofmann (Professor at Iowa State University)*] --- class: inverse, center, middle # .dark_grey[Background:] Trace Plots --- ## Common Tree Visualization <br> <img src="fig-static/demo-tree.png" width="100%" style="display: block; margin: auto;" /> .right[.small[Image source: Urbanek (2008)]] --- ## Visual Comparisons of Multiple Trees .pull-left[ **Issues with "traditional" visuals**: - Direct visual comparison is difficult - Inefficient use of space - Identifying patterns is cognitively demanding (a figure classification task .small[French, Ekstrom, and Price (1963)]) <br> <img src="fig-static/demo-traditional-trees.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="fig-static/demo-icicles.png" width="90%" style="display: block; margin: auto;" /> .right[.small[Image source: Kuznetsova (2014)]] ] --- ## Trace Plots (one tree) [.small[Urbanek (2008)]](https://link.springer.com/chapter/10.1007/978-3-540-33037-0_11) <img src="fig-static/demo-one-trace.png" width="100%" style="display: block; margin: auto;" /> .right[.small[Image source: Urbanek (2008)]] --- ## Trace Plots (ensemble of trees)
[.small[Urbanek (2008)]](https://link.springer.com/chapter/10.1007/978-3-540-33037-0_11) Designed to compare (1) variables used for splitting, (2) location of split points, and (3) hierarchical structure <img src="fig-static/demo-trace-plot.png" width="75%" style="display: block; margin: auto;" /> .right[.small[Image source: Urbanek (2008)]] --- ## Limitations of Trace Plots Example: - **Objective**: Were two bullets fired from the same gun? - **Model**: Random forest (300 trees) .small[Hare, Hofmann, and Carriquiry (2017)] - **Response variable**: Same gun? - **Predictor variables**: 9 characteristics comparing two signatures, such as the cross-correlation function (CCF) .pull-left[ <img src="fig-static/bullet-bullets.png" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="fig-static/bullet-signatures.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Limitations of Trace Plots .pull-left[ <img src="fig-static/bullet-trace-plot-full.png" width="40%" style="display: block; margin: auto;" /> ] .pull-right[ **Info gained** - Deep trees (max node depth of 39) - Certain variables more commonly used for the first split - All variables commonly used between node depths of 3 and 30 <br> **Difficult to extract patterns when...** - Many trees in a forest - Deep trees - Large number of predictors ] --- class: inverse, center, middle # .dark_grey[Methods:] Extending Trace Plots --- ## Overview **Objective**: Extend trace plots to improve the ability to find patterns in random forest architecture | | Intentions | | --- | :------ | | **Who** <br> <br> | Data analysts <br> <br> | | **What** <br> <br> | - Visualization of random forest architecture <br> - .red[One tool in the toolbox for explaining random forests] <br> <br> | | **When/Where** <br> <br> | - After model training <br> - Model assessment <br> - Model "explanation" <br> <br> | | **Why** <br> <br> | - Help understand how variables are used <br> - Compare variability in split locations at different
node depths <br> - Identify patterns to explore further <br> <br> | | **How** <br> <br> | Using `TreeTracer` R package <br> <br> | --- ## Approaches .pull-left[ .red[Structural Augmentations] - Highlight patterns - Lessen cognitive load <img src="fig-static/demo-structure.png" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ .medium_grey[Tree Summaries] - Identify summary trees - Re-purpose trace plots for highlighting summary trees <img src="fig-static/demo-summary.png" width="55%" style="display: block; margin: auto;" /> ] --- ## Example: Palmer Penguins - **Data**: 342 penguins from Palmer Archipelago in Antarctica - **Three species**: Adelie, Chinstrap, and Gentoo - **Four body measurements**: Bill length, bill depth, flipper length, body mass - **Random Forest**: Predict species using 50 trees .pull-left-v2[ <img src="slides_files/figure-html/unnamed-chunk-14-1.png" width="80%" style="display: block; margin: auto;" /> .small[ | | Adelie| Chinstrap| Gentoo| Class Error| |:---------|------:|---------:|------:|-----------:| |Adelie | 146| 4| 1| 0.03| |Chinstrap | 4| 64| 0| 0.06| |Gentoo | 0| 1| 122| 0.01| ] ] .pull-right-v2[ <br> <img src="slides_files/figure-html/unnamed-chunk-16-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Implementation of trace plots (and extensions) .pull-left[ **Overview** - R package `TreeTracer` - First readily available implementation in R - GitHub repo: [https://github.com/goodekat/TreeTracer](https://github.com/goodekat/TreeTracer) <br> **Functions** - Create trace plots from `randomForest` R package - Structural augmentations - Compute distances between trees ] .pull-right[ <img src="fig-static/penguin-trace.png" width="90%" style="display: block; margin: auto;" /> ] --- ## Extensions: .red[Structural Augmentations] **Ordering of split variables**: Provides different perspectives <img src="slides_files/figure-html/unnamed-chunk-18-1.png" width="75%" style="display: block; margin: auto;" /> 
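A minimal base-R sketch of how an ordering like the ones above could be derived from trace data in long format (the `traces` columns here are hypothetical, used only for illustration; TreeTracer's internal data structure may differ):

```r
# Hypothetical long-format trace data: one row per split across three small trees
traces <- data.frame(
  tree = c(1, 1, 1, 2, 2, 3, 3, 3),
  depth = c(1, 2, 2, 1, 2, 1, 2, 3),
  split_var = c("flipper_length", "bill_length", "bill_depth",
                "bill_length", "flipper_length",
                "flipper_length", "body_mass", "bill_length")
)

# One possible ordering: sort split variables by how often
# they are used for the first split (node depth 1)
first_splits <- traces$split_var[traces$depth == 1]
ordering <- names(sort(table(first_splits), decreasing = TRUE))
ordering  # "flipper_length" leads, used at depth 1 in two of three trees
```

Other orderings (e.g., by overall frequency of use, or by mean node depth) follow the same pattern with a different summary of `traces`.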
--- ## Extensions: .red[Structural Augmentations] **Subsets of trees**: Lessen cognitive load <img src="fig-static/penguin-trace-subsets.png" width="100%" style="display: block; margin: auto;" /> --- ## Extensions: .red[Structural Augmentations] **Facets**: Separate trees using facets **Use of color and line size**: Highlight individual or groups of trees <img src="slides_files/figure-html/unnamed-chunk-20-1.png" width="80%" style="display: block; margin: auto;" /> --- ## Extensions: .red[Structural Augmentations] **Maximum node depth**: Focus on upper node depths where global structures may exist (e.g., considering the "canopy") <img src="fig-static/bullet-trace-plot-canopy.png" width="80%" style="display: block; margin: auto;" /> --- ## Extensions: .medium_grey[Tree Summaries] **Background (summarizing tree ensembles)** .pull-left[ .red[Representative tree] .small[(Shannon and Banks, 1999; Banerjee, Ding, and Noone, 2012; Weinberg and Last, 2019)] - Identify a tree that is representative of the forest - One approach: Find tree that has smallest average distance to all other trees <img src="fig-static/penguin-ave-dists.png" width="90%" style="display: block; margin: auto;" /> ] .pull-right[ .red[Clusters of trees] .small[(Chipman, George, and McCulloch, 1998; Sies and Mechelen, 2020)] - Compute distances between trees - Identify clusters via MDS, K-means, etc. 
<br> <img src="fig-static/penguin-rep-tree.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Extensions: .medium_grey[Tree Summaries] **Benefits of trace plots:** Example of representative trees from clusters within a tree ensemble .small[(Chipman, George, and McCulloch, 1998; Sies and Mechelen, 2020)] .pull-left[ <img src="fig-static/demo-chipman.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="fig-static/demo-rep-trees.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Extensions: .medium_grey[Tree Summaries] **Benefits of trace plots:** Two scenarios of visualizing representative trees with variability .pull-left[ <img src="fig-static/demo-var-small.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="fig-static/demo-var-large.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Extensions: .medium_grey[Tree Summaries] **Background (distances between trees)**: Various metrics proposed .small[(Chipman, George, and McCulloch, 1998; Shannon and Banks, 1999; Miglio and Soffritti, 2004; Banerjee, Ding, and Noone, 2012; Sies and Mechelen, 2020)] .pull-left-v4[ .center[.red[Comparing Predictions]] <img src="fig-static/demo-predictions.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right-v4[ .center[.red[Comparing Topology]] <img src="fig-static/demo-topology.png" width="100%" style="display: block; margin: auto;" /> ] .right[.small[Image source: Sies and Mechelen (2020)]] --- ## Extensions: .medium_grey[Tree Summaries] **Example Distance Metrics** .medium[ .red[Covariate metric:] Compares split variables from two trees .small[(Banerjee, Ding, and Noone, 2012)] .medium[$$d_{CM}(T_1, T_2)=\frac{\mbox{Number of covariate mismatches between } T_1 \mbox{ and } T_2}{k}.$$] <br> .red[Fit metric:] Compares predictions from two trees .small[(Chipman, George, and McCulloch, 1998)] .medium[ `$$d_{FM}\left(T_1,T_2\right)=\frac{1}{n}\sum_{i=1}^n 
m\left(\hat{y}_{i1},\hat{y}_{i2}\right)$$` ] <br> .red[Partition metric:] Compares how observations are divided between leaves .small[(Chipman, George, and McCulloch, 1998)] .medium[ `$$d_{PM}\left(T_1, T_2\right)=\frac{\sum_{i>j}\left|I_1(i,j)-I_2(i,j)\right|}{{n\choose2}}$$` ] .medium[ `$$I_t(i,j) =\begin{cases} 1 & \mbox{if } T_t \mbox{ places observations } i \mbox{ and } j \mbox{ in the same terminal node} \\ 0 & \mbox{o.w.} \end{cases}$$` ] ] .medium[ .grey[ Details: .medium[ .pull-left[ - Observation: `\(i\)` with `\(i\in\{1,...,n\}\)` or `\(j\)` with `\(j\in\{1,...,n\}\)` - Response: `\(y_i\)` - Predictor variables: `\(\textbf{x}_{i}=(x_{i1},...,x_{ik})\)` - Fitted value: `\(\hat{y}_{it}\)` - Trees: `\(T_t\)` with `\(t\in\{1,2\}\)` ] .pull-right[ - Metric: `\(m\)` - Regression: `\(m\left(\hat{y}_{i1},\hat{y}_{i2}\right)=\left(\hat{y}_{i1}-\hat{y}_{i2}\right)^2\)` - Classification: `\(m\left(\hat{y}_{i1},\hat{y}_{i2}\right)=\begin{cases} 1 & \mbox{if} \ \ \hat{y}_{i1}\not=\hat{y}_{i2} \\ 0 & \mbox{o.w.} \end{cases}\)` ] ] ] ] --- ## Extensions: .medium_grey[Tree Summaries] **Penguins Example:** .red[Clusters] identified using *multidimensional scaling* with fit metric and .red[representative trees] from clusters based on smallest average fit metric distance to all other trees in cluster .pull-left[ <img src="fig-static/penguin-mds.png" width="75%" style="display: block; margin: auto;" /> <img src="fig-static/penguin-clusters.png" width="75%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="fig-static/penguin-rep-cluster-trees.png" width="65%" style="display: block; margin: auto;" /> ] --- ## Extensions: .medium_grey[Tree Summaries] **Example 1**: Visualizing representative trees with a trace plot .pull-left[ <img src="fig-static/penguin-rep-cluster-trees.png" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="fig-static/penguin-rep-trees-trace-plot.png" width="100%" style="display: block; margin: auto;" /> ] --- ## 
Extensions: .medium_grey[Tree Summaries] **Example 2**: Incorporating variability within a cluster <img src="fig-static/penguin-rep-trees-with-var.png" width="100%" style="display: block; margin: auto;" /> --- class: inverse, middle, center # .dark_grey[Music Example:] Application with "larger" random forest --- ## Music Example .pull-left[ **Objective/Response**: - Predict song genre of 40 songs **Features** - 70 numeric variables - Extracted from WAV files (Cook and Swayne, 2007) - Ex: left and right channel frequencies **Model** - Random forest (`randomForest` R package) - Default tuning parameters (e.g., 500 trees) - Out-of-bag class errors: - Classical = 0.15 - New wave = 0.67 - Rock = 0.24 ] .pull-right[ <br> <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:center;"> Genre </th> <th style="text-align:center;"> Artist </th> <th style="text-align:center;"> Number of Songs </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Classical </td> <td style="text-align:center;"> Beethoven </td> <td style="text-align:center;"> 6 </td> </tr> <tr> <td style="text-align:center;"> Classical </td> <td style="text-align:center;"> Mozart </td> <td style="text-align:center;"> 5 </td> </tr> <tr> <td style="text-align:center;"> Classical </td> <td style="text-align:center;"> Vivaldi </td> <td style="text-align:center;"> 9 </td> </tr> <tr> <td style="text-align:center;"> New wave </td> <td style="text-align:center;"> Enya </td> <td style="text-align:center;"> 3 </td> </tr> <tr> <td style="text-align:center;"> Rock </td> <td style="text-align:center;"> Abba </td> <td style="text-align:center;"> 6 </td> </tr> <tr> <td style="text-align:center;"> Rock </td> <td style="text-align:center;"> Beatles </td> <td style="text-align:center;"> 6 </td> </tr> <tr> <td style="text-align:center;"> Rock </td> <td style="text-align:center;"> Eels </td> <td style="text-align:center;"> 5 </td> </tr> </tbody> </table> ] --- ## Trace 
Plot of Model <img src="slides_files/figure-html/unnamed-chunk-36-1.png" style="display: block; margin: auto;" /> --- ## Average Distances Vertical lines indicate the smallest average distance plus one standard deviation of the distances for each metric <img src="slides_files/figure-html/unnamed-chunk-37-1.png" style="display: block; margin: auto;" /> --- ## MDS Results <img src="slides_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" /> --- ## Covariate Metric <img src="slides_files/figure-html/unnamed-chunk-39-1.png" style="display: block; margin: auto;" /> --- ## Fit Metric <img src="slides_files/figure-html/unnamed-chunk-40-1.png" style="display: block; margin: auto;" /> --- ## Partition Metric <img src="slides_files/figure-html/unnamed-chunk-41-1.png" style="display: block; margin: auto;" /> --- ## Interactive Version <iframe src="fig-static/music-trace-plot-int.html" width="1400" height="550" id="igraph" scrolling="no" seamless="seamless" frameBorder="0"> </iframe> --- class: inverse, middle, center # .dark_grey[Conclusions:] Pros, Cons, and Possible Research Directions --- ## Summary **Proposed trace plot extensions** - Structural augmentations - Repurpose trace plots for visualizing tree summaries <br> **Implemented trace plots** - *TreeTracer* R package <br> **Benefits of trace plot extensions** - Help extract patterns from random forest architectures - Inspire new questions and hypotheses --- ## Strengths and Weaknesses .pull-left[ **Strengths** - Added organization of traces - Reduced cognitive load - Increased ability to visually compare trees ] .pull-right[ **Weaknesses** - Simplification leads to loss of information - May be worthwhile to view signal among noise - May present a view that is not practically helpful - Not simplified enough - Too much information to expose patterns - Finding the optimal balance - Can be challenging - Dependent on the model ] --- ## Future Work .pull-left-v2[ **Interactivity** - Link trace
plot to visualizations focused on more nuanced aspects of random forests: - Click on the intersection of a node depth and split variable - Produces a plot of the split in the data space - Zoom in on large trace plots **Computation** - R package for management of tree data - Create a geom for trace plots - Implementation in Python **Other** - Color branches based on the dominant class or average value of observations - How to select the maximum depth? - Consider other metrics more focused on topology ] .pull-right-v2[ <img src="fig-static/music-trace-plot-int-static.png" width="100%" style="display: block; margin: auto;" /> <br> <br> <img src="fig-static/demo-section-scatter.png" width="100%" style="display: block; margin: auto;" /> .right[.small[Sectioned scatter plot image source: Urbanek (2008)]] ] --- ## References .smallmedium[ Banerjee, M., Y. Ding, and A. Noone (2012). "Identifying representative trees from ensembles". In: _Statistics in Medicine_ 31.15, pp. 1601-1616. ISSN: 1097-0258. DOI: [10.1002/sim.4492](https://doi.org/10.1002%2Fsim.4492). Chipman, H. A., E. I. George, and R. E. McCulloch (1998). "Making sense of a forest of trees". In: _Proceedings of the 30th Symposium on the Interface_, pp. 84-92. URL: [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.42.2598](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.42.2598). Cook, D. and D. F. Swayne (2007). _Interactive and Dynamic Graphics for Data Analysis, With R and GGobi_. 1st ed. Springer-Verlag New York. ISBN: 9780387717616. DOI: [10.1007/978-0-387-71762-3](https://doi.org/10.1007%2F978-0-387-71762-3). French, J. W., R. B. Ekstrom, and L. A. Price (1963). _Kit of reference tests for cognitive factors_. Educational Testing Service. Princeton, NJ. Hare, E., H. Hofmann, and A. Carriquiry (2017). "Automatic matching of bullet land impressions". In: _Annals of Applied Statistics_ 11.4, pp. 2332-2356. DOI: [10.1214/17-AOAS1080](https://doi.org/10.1214%2F17-AOAS1080). Kuznetsova, N. (2014).
"Random forest visualization". Supervised by Michel Westenberg. Eindhoven, Netherlands. Miglio, R. and G. Soffritti (2004). "The comparison between classification trees through proximity measures". In: _Computational Statistics & Data Analysis_ 45.3, pp. 577-593. ISSN: 0167-9473. DOI: [10.1016/s0167-9473(03)00063-x](https://doi.org/10.1016%2Fs0167-9473%2803%2900063-x). Shannon, W. D. and D. Banks (1999). "Combining classification trees using MLE". In: _Statistics in Medicine_ 18.6, pp. 727-740. ISSN: 1097-0258. DOI: [10.1002/(sici)1097-0258(19990330)18:6<727::aid-sim61>3.0.co;2-2](https://doi.org/10.1002%2F%28sici%291097-0258%2819990330%2918%3A6%3C727%3A%3Aaid-sim61%3E3.0.co%3B2-2). URL: [https://onlinelibrary.wiley.com/doi/epdf/10.1002/%28SICI%291097-0258%2819990330%2918%3A6%3C727%3A%3AAID-SIM61%3E3.0.CO%3B2-2](https://onlinelibrary.wiley.com/doi/epdf/10.1002/%28SICI%291097-0258%2819990330%2918%3A6%3C727%3A%3AAID-SIM61%3E3.0.CO%3B2-2). Sies, A. and I. V. Mechelen (2020). "C443: a Methodology to See a Forest for the Trees". In: _Journal of Classification_ 37.3, pp. 730-753. ISSN: 0176-4268. DOI: [10.1007/s00357-019-09350-4](https://doi.org/10.1007%2Fs00357-019-09350-4). URL: [https://link.springer.com/article/10.1007/s00357-019-09350-4](https://link.springer.com/article/10.1007/s00357-019-09350-4). Urbanek, S. (2008). "Visualizing Trees and Forests". In: _Handbook of Data Visualization_. Ed. by C. Chen, W. Härdle and A. Unwin. Vol. 3. Berlin, Germany: Springer-Verlag, pp. 243-266. ISBN: 9783540330363. URL: [https://haralick.org/DV/Handbook\_of\_Data\_Visualization.pdf](https://haralick.org/DV/Handbook\_of\_Data\_Visualization.pdf). Weinberg, A. I. and M. Last (2019). "Selecting a representative decision tree from an ensemble of decision-tree models for fast big data classification". In: _Journal of Big Data_ 6.1, p. 23. DOI: [10.1186/s40537-019-0186-3](https://doi.org/10.1186%2Fs40537-019-0186-3). 
URL: [https://link.springer.com/article/10.1186/s40537-019-0186-3](https://link.springer.com/article/10.1186/s40537-019-0186-3). ] --- class: inverse, middle, center name: mylastslide # Thank you! <img src="fig-static/penguin-penguin.png" width="30%" style="display: block; margin: auto;" />
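--- ## Appendix: Distance Metric Sketch

The fit and partition metrics defined earlier can be sketched in a few lines of base R. The toy predictions and terminal-node assignments below are hypothetical, and TreeTracer's own implementation may differ:

```r
# Toy outputs from two hypothetical classification trees on n = 6 observations
pred_tree1 <- c("Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo", "Adelie")
pred_tree2 <- c("Adelie", "Gentoo", "Gentoo", "Chinstrap", "Adelie", "Adelie")
leaf_tree1 <- c(1, 1, 2, 3, 2, 1)  # terminal node of each observation in tree 1
leaf_tree2 <- c(1, 2, 2, 3, 1, 1)  # terminal node of each observation in tree 2

# Fit metric (classification case): proportion of observations
# on which the two trees' predictions disagree
fit_metric <- function(p1, p2) mean(p1 != p2)

# Partition metric: proportion of observation pairs on which the trees
# disagree about co-membership in a terminal node
partition_metric <- function(l1, l2) {
  same1 <- outer(l1, l1, "==")
  same2 <- outer(l2, l2, "==")
  lt <- lower.tri(same1)  # each pair i > j counted once, n-choose-2 pairs total
  mean(same1[lt] != same2[lt])
}

fit_metric(pred_tree1, pred_tree2)        # 2 of 6 disagree: 1/3
partition_metric(leaf_tree1, leaf_tree2)  # 6 of 15 pairs disagree: 0.4
```

The covariate metric follows the same pattern, counting mismatches between the trees' split-variable sets rather than their predictions.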