class: center, middle, inverse, title-slide .title[ #
.light_blue[Tracing Trees] ] .subtitle[ ## .ice_blue[Visualizing Random Forest Tree Variability with Trace Plots] ] .author[ ###
] .author[ ### Katherine Goode (5573) ] .author[ ###
kjgoode@sandia.gov
] .date[ ### July 11, 2022
.tiny[Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia LLC, a wholly owned subsidiary of Honeywell International Inc. for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. SAND 2022-12697 O.] ] --- <script> var slideshow = remark.create({slideNumberFormat : function (current, total) { return current + '/' + (this.getSlideByName("mylastslide").getSlideIndex() + 1); }, highlightStyle: "github", highlightLines: true, countIncrementalSlides: false}); </script> <style type="text/css"> .tiny{font-size: 30%} .small{font-size: 50%} .smallmedium{font-size: 65%} .medium{font-size: 75%} .left-code { color: #777; width: 39%; height: 92%; float: left; } .right-plot { width: 59%; float: right; padding-left: 2%; } .scroll-output { height: 90%; overflow-y: scroll; } .content-box { box-sizing: border-box; border-radius: 15px; margin: 0 0 15px; overflow: hidden; padding: 0px 20px 0px 20px; width: 100%; background-color: #c7cfb7; } .pull-left-v2 { width: 60%; height: 92%; float: left; } .pull-right-v2 { width: 35%; float: right; padding-left: 1%; } .pull-left-v3 { width: 75%; height: 92%; float: left; } .pull-right-v3 { width: 24%; float: right; padding-left: 1%; } .pull-left-v4 { width: 39%; height: 92%; float: left; } .pull-right-v4 { width: 60%; float: right; padding-left: 1%; } </style> ## Introduction to Katherine .pull-left-v3[ Research and development statistician in 5573 **Education** - BA in mathematics from [Lawrence University](https://www.lawrence.edu/academics/college/mathematics) - MS in statistics from [University of Wisconsin - Madison](https://stat.wisc.edu/) - PhD in statistics from [Iowa State University](https://www.stat.iastate.edu/) **Sandia Journey** - Dec 2019: Intern (mentored by Daniel Ries, 5574) - Sep 2021: Post-doc (mentored by J. Derek Tucker, 5573) - Dec 2021: FTE (mentored by J. 
Derek Tucker, 5573) **Research Interests** - Explainable machine learning - Data visualization - Model assessment Personal website: [goodekat.github.io](https://goodekat.github.io) Personal GitHub: [github.com/goodekat](https://github.com/goodekat) ] .pull-right-v3[ <img src="fig-static/kat-mina.JPEG" width="90%" style="display: block; margin: auto;" /> <br> <img src="fig-static/kat-kayak.JPG" width="90%" style="display: block; margin: auto;" /> ] --- ## Overview <br> - **Background**: Trace Plots <br> - **Methods**: Extending Trace Plots - .medium_grey[TreeTracer]: Implementation and Structural Augmentations in R - .medium_grey[Tree Summaries]: Identifying Representative Trees <br> - **Music Example**: Application with "larger" random forest <br> - **Conclusions**: Pros, Cons, and Possible Research Directions <br> .center[*Credits: Joint work with Heike Hofmann (Professor at Iowa State University)*] --- class: inverse, center, middle # .dark_grey[Background:] Trace Plots --- ## Common Tree Visualization <br> <img src="fig-static/demo-tree.png" width="100%" style="display: block; margin: auto;" /> .right[.small[Image source: Urbanek (2008)]] --- ## Visual Comparisons of Multiple Trees .pull-left[ **Issues with "traditional" visuals**: - Direct visual comparison is difficult - Inefficient use of space - Identifying patterns is cognitively demanding (a figure classification task .small[French, Ekstrom, and Price (1963)]) <br> <img src="fig-static/demo-traditional-trees.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="fig-static/demo-icicles.png" width="90%" style="display: block; margin: auto;" /> .right[.small[Image source: Kuznetsova (2014)]] ] --- ## Trace Plots (one tree) [.small[Urbanek (2008)]](https://link.springer.com/chapter/10.1007/978-3-540-33037-0_11) <img src="fig-static/demo-one-trace.png" width="100%" style="display: block; margin: auto;" /> .right[.small[Image source: Urbanek (2008)]] --- ## Trace Plots (ensemble of trees)
[.small[Urbanek (2008)]](https://link.springer.com/chapter/10.1007/978-3-540-33037-0_11) Designed to compare (1) variables used for splitting, (2) location of split points, and (3) hierarchical structure <img src="fig-static/demo-trace-plot.png" width="75%" style="display: block; margin: auto;" /> .right[.small[Image source: Urbanek (2008)]] --- ## Limitations of Trace Plots Example: - **Objective**: Were two bullets fired from the same gun? - **Model**: Random forest (300 trees) .small[Hare, Hofmann, and Carriquiry (2017)] - **Response variable**: Same gun? - **Predictor variables**: 9 characteristics comparing two signatures, such as the cross-correlation function (CCF) .pull-left[ <img src="fig-static/bullet-bullets.png" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="fig-static/bullet-signatures.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Limitations of Trace Plots .pull-left[ <img src="fig-static/bullet-trace-plot-full.png" width="40%" style="display: block; margin: auto;" /> ] .pull-right[ **Info gained** - Deep trees (max node depth of 39) - Certain variables more commonly used for the first split - All variables commonly used between node depths of 3 and 30 <br> **Difficult to extract patterns when...** - Many trees in a forest - Deep trees - Large number of predictors ] --- class: inverse, center, middle # .dark_grey[Methods:] Extending Trace Plots --- ## Overview **Objective**: Extend trace plots to improve the ability to find patterns in random forest architecture | | Intentions | | --- | :------ | | **Who** <br> <br> | Data analysts <br> <br> | | **What** <br> <br> | - Visualization of random forest architecture <br> - .red[One tool in the toolbox for explaining random forests] <br> <br> | | **When/Where** <br> <br> | - After model training <br> - Model assessment <br> - Model "explanation" <br> <br> | | **Why** <br> <br> | - Help understand how variables are used <br> - Compare variability in split locations at different
node depths <br> - Identify patterns to explore further <br> <br> | | **How** <br> <br> | Using `TreeTracer` R package <br> <br> | --- ## Approaches .pull-left[ .red[Structural Augmentations] - Highlight patterns - Lessen cognitive load <img src="fig-static/demo-structure.png" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ .medium_grey[Tree Summaries] - Identify summary trees - Re-purpose trace plots for highlighting summary trees <img src="fig-static/demo-summary.png" width="55%" style="display: block; margin: auto;" /> ] --- ## Example: Palmer Penguins - **Data**: 342 penguins from Palmer Archipelago in Antarctica - **Three species**: Adelie, Chinstrap, and Gentoo - **Four body measurements**: Bill length, bill depth, flipper length, body mass - **Random Forest**: Predict species using 50 trees .pull-left-v2[ <img src="slides_files/figure-html/unnamed-chunk-14-1.png" width="80%" style="display: block; margin: auto;" /> .small[ | | Adelie| Chinstrap| Gentoo| Class Error| |:---------|------:|---------:|------:|-----------:| |Adelie | 146| 4| 1| 0.03| |Chinstrap | 4| 64| 0| 0.06| |Gentoo | 0| 1| 122| 0.01| ] ] .pull-right-v2[ <br> <img src="slides_files/figure-html/unnamed-chunk-16-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Implementation of trace plots (and extensions) .pull-left[ **Overview** - R package `TreeTracer` - First readily available implementation in R - GitHub repo: [https://github.com/goodekat/TreeTracer](https://github.com/goodekat/TreeTracer) <br> **Functions** - Create trace plots from `randomForest` R package - Structural augmentations - Compute distances between trees ] .pull-right[ <img src="fig-static/penguin-trace.png" width="90%" style="display: block; margin: auto;" /> ] --- ## Extensions: .red[Structural Augmentations] **Ordering of split variables**: Provides different perspectives <img src="slides_files/figure-html/unnamed-chunk-18-1.png" width="75%" style="display: block; margin: auto;" /> 
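A minimal base-R sketch of how an ordering like the ones above could be derived from trace data in long format (the `traces` columns here are hypothetical, used only for illustration; TreeTracer's internal data structure may differ):

```r
# Hypothetical long-format trace data: one row per split across three small trees
traces <- data.frame(
  tree = c(1, 1, 1, 2, 2, 3, 3, 3),
  depth = c(1, 2, 2, 1, 2, 1, 2, 3),
  split_var = c("flipper_length", "bill_length", "bill_depth",
                "bill_length", "flipper_length",
                "flipper_length", "body_mass", "bill_length")
)

# One possible ordering: sort split variables by how often
# they are used for the first split (node depth 1)
first_splits <- traces$split_var[traces$depth == 1]
ordering <- names(sort(table(first_splits), decreasing = TRUE))
ordering  # "flipper_length" leads, used at depth 1 in two of three trees
```

Other orderings (e.g., by overall frequency of use, or by mean node depth) follow the same pattern with a different summary of `traces`.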
--- ## Extensions: .red[Structural Augmentations] **Subsets of trees**: Lessen cognitive load <img src="fig-static/penguin-trace-subsets.png" width="100%" style="display: block; margin: auto;" /> --- ## Extensions: .red[Structural Augmentations] **Facets**: Separate trees using facets **Use of color and line size**: Highlight individual or groups of trees <img src="slides_files/figure-html/unnamed-chunk-20-1.png" width="80%" style="display: block; margin: auto;" /> --- ## Extensions: .red[Structural Augmentations] **Maximum node depth**: Focus on upper node depths where global structures may exist (e.g., considering the "canopy") <img src="fig-static/bullet-trace-plot-canopy.png" width="80%" style="display: block; margin: auto;" /> --- ## Extensions: .medium_grey[Tree Summaries] **Background (summarizing tree ensembles)** .pull-left[ .red[Representative tree] .small[(Shannon and Banks, 1999; Banerjee, Ding, and Noone, 2012; Weinberg and Last, 2019)] - Identify a tree that is representative of the forest - One approach: Find tree that has smallest average distance to all other trees <img src="fig-static/penguin-ave-dists.png" width="90%" style="display: block; margin: auto;" /> ] .pull-right[ .red[Clusters of trees] .small[(Chipman, George, and McCulloch, 1998; Sies and Mechelen, 2020)] - Compute distances between trees - Identify clusters via MDS, K-means, etc. 
<br> <img src="fig-static/penguin-rep-tree.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Extensions: .medium_grey[Tree Summaries] **Benefits of trace plots:** Example of representative trees from clusters within a tree ensemble .small[(Chipman, George, and McCulloch, 1998; Sies and Mechelen, 2020)] .pull-left[ <img src="fig-static/demo-chipman.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="fig-static/demo-rep-trees.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Extensions: .medium_grey[Tree Summaries] **Benefits of trace plots:** Two scenarios of visualizing representative trees with variability .pull-left[ <img src="fig-static/demo-var-small.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="fig-static/demo-var-large.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Extensions: .medium_grey[Tree Summaries] **Background (distances between trees)**: Various metrics proposed .small[(Chipman, George, and McCulloch, 1998; Shannon and Banks, 1999; Miglio and Soffritti, 2004; Banerjee, Ding, and Noone, 2012; Sies and Mechelen, 2020)] .pull-left-v4[ .center[.red[Comparing Predictions]] <img src="fig-static/demo-predictions.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right-v4[ .center[.red[Comparing Topology]] <img src="fig-static/demo-topology.png" width="100%" style="display: block; margin: auto;" /> ] .right[.small[Image source: Sies and Mechelen (2020)]] --- ## Extensions: .medium_grey[Tree Summaries] **Example Distance Metrics** .medium[ .red[Covariate metric:] Compares split variables from two trees .small[(Banerjee, Ding, and Noone, 2012)] .medium[$$d_{CM}(T_1, T_2)=\frac{\mbox{Number of covariate mismatches between } T_1 \mbox{ and } T_2}{k}.$$] <br> .red[Fit metric:] Compares predictions from two trees .small[(Chipman, George, and McCulloch, 1998)] .medium[ `$$d_{FM}\left(T_1,T_2\right)=\frac{1}{n}\sum_{i=1}^n 
m\left(\hat{y}_{i1},\hat{y}_{i2}\right)$$` ] <br> .red[Partition metric:] Compares how observations are divided between leaves .small[(Chipman, George, and McCulloch, 1998)] .medium[ `$$d_{PM}\left(T_1, T_2\right)=\frac{\sum_{i>j}\left|I_1(i,j)-I_2(i,j)\right|}{{n\choose2}}$$` ] .medium[ `$$I_t(i,j) =\begin{cases} 1 & \mbox{if } T_t \mbox{ places observations } i \mbox{ and } j \mbox{ in the same terminal node} \\ 0 & \mbox{o.w.} \end{cases}$$` ] ] .medium[ .grey[ Details: .medium[ .pull-left[ - Observation: `\(i\)` with `\(i\in\{1,...,n\}\)` or `\(j\)` with `\(j\in\{1,...,n\}\)` - Response: `\(y_i\)` - Predictor variables: `\(\textbf{x}_{i}=(x_{i1},...,x_{ik})\)` - Fitted value: `\(\hat{y}_{it}\)` - Trees: `\(T_t\)` with `\(t\in\{1,2\}\)` ] .pull-right[ - Metric: `\(m\)` - Regression: `\(m\left(\hat{y}_{i1},\hat{y}_{i2}\right)=\left(\hat{y}_{i1}-\hat{y}_{i2}\right)^2\)` - Classification: `\(m\left(\hat{y}_{i1},\hat{y}_{i2}\right)=\begin{cases} 1 & \mbox{if} \ \ \hat{y}_{i1}\not=\hat{y}_{i2} \\ 0 & \mbox{o.w.} \end{cases}\)` ] ] ] ] --- ## Extensions: .medium_grey[Tree Summaries] **Penguins Example:** .red[Clusters] identified using *multidimensional scaling* with fit metric and .red[representative trees] from clusters based on smallest average fit metric distance to all other trees in cluster .pull-left[ <img src="fig-static/penguin-mds.png" width="75%" style="display: block; margin: auto;" /> <img src="fig-static/penguin-clusters.png" width="75%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="fig-static/penguin-rep-cluster-trees.png" width="65%" style="display: block; margin: auto;" /> ] --- ## Extensions: .medium_grey[Tree Summaries] **Example 1**: Visualizing representative trees with a trace plot .pull-left[ <img src="fig-static/penguin-rep-cluster-trees.png" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="fig-static/penguin-rep-trees-trace-plot.png" width="100%" style="display: block; margin: auto;" /> ] --- ## 
Extensions: .medium_grey[Tree Summaries] **Example 2**: Incorporating variability within a cluster <img src="fig-static/penguin-rep-trees-with-var.png" width="100%" style="display: block; margin: auto;" /> --- class: inverse, middle, center # .dark_grey[Music Example:] Application with "larger" random forest --- ## Music Example .pull-left[ **Objective/Response**: - Predict song genre of 40 songs **Features** - 70 numeric variables - Extracted from WAV files (Cook and Swayne, 2007) - Ex: left and right channel frequencies **Model** - Random forest (`randomForest` R package) - Default tuning parameters (e.g., 500 trees) - Out-of-bag class errors: - Classical = 0.15 - New wave = 0.67 - Rock = 0.24 ] .pull-right[ <br> <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:center;"> Genre </th> <th style="text-align:center;"> Artist </th> <th style="text-align:center;"> Number of Songs </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Classical </td> <td style="text-align:center;"> Beethoven </td> <td style="text-align:center;"> 6 </td> </tr> <tr> <td style="text-align:center;"> Classical </td> <td style="text-align:center;"> Mozart </td> <td style="text-align:center;"> 5 </td> </tr> <tr> <td style="text-align:center;"> Classical </td> <td style="text-align:center;"> Vivaldi </td> <td style="text-align:center;"> 9 </td> </tr> <tr> <td style="text-align:center;"> New wave </td> <td style="text-align:center;"> Enya </td> <td style="text-align:center;"> 3 </td> </tr> <tr> <td style="text-align:center;"> Rock </td> <td style="text-align:center;"> Abba </td> <td style="text-align:center;"> 6 </td> </tr> <tr> <td style="text-align:center;"> Rock </td> <td style="text-align:center;"> Beatles </td> <td style="text-align:center;"> 6 </td> </tr> <tr> <td style="text-align:center;"> Rock </td> <td style="text-align:center;"> Eels </td> <td style="text-align:center;"> 5 </td> </tr> </tbody> </table> ] --- ## Trace 
Plot of Model <img src="slides_files/figure-html/unnamed-chunk-36-1.png" style="display: block; margin: auto;" /> --- ## Average Distances Vertical lines indicate the smallest average distance plus one standard deviation of the distances for each metric <img src="slides_files/figure-html/unnamed-chunk-37-1.png" style="display: block; margin: auto;" /> --- ## MDS Results <img src="slides_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" /> --- ## Covariate Metric <img src="slides_files/figure-html/unnamed-chunk-39-1.png" style="display: block; margin: auto;" /> --- ## Fit Metric <img src="slides_files/figure-html/unnamed-chunk-40-1.png" style="display: block; margin: auto;" /> --- ## Partition Metric <img src="slides_files/figure-html/unnamed-chunk-41-1.png" style="display: block; margin: auto;" /> --- ## Interactive Version <iframe src="fig-static/music-trace-plot-int.html" width="1400" height="550" id="igraph" scrolling="no" seamless="seamless" frameBorder="0"> </iframe> --- class: inverse, middle, center # .dark_grey[Conclusions:] Pros, Cons, and Possible Research Directions --- ## Summary **Proposed trace plot extensions** - Structural augmentations - Repurpose trace plots for visualizing tree summaries <br> **Implemented trace plots** - *TreeTracer* R package <br> **Benefits of trace plot extensions** - Help extract patterns from random forest architectures - Inspire new questions and hypotheses --- ## Strengths and Weaknesses .pull-left[ **Strengths** - Added organization of traces - Reduced cognitive load - Increased ability to visually compare trees ] .pull-right[ **Weaknesses** - Simplification leads to loss of information - May be worthwhile to view signal among noise - May present a view that is not practically helpful - Not simplified enough - Too much information to expose patterns - Finding the optimal balance - Can be challenging - Dependent on the model ] --- ## Future Work .pull-left-v2[ **Interactivity** - Link trace
plot to visualizations focused on more nuanced aspects of random forests: - Click on the intersection of a node depth and split variable - Produces a plot of the split in the data space - Zoom in on large trace plots **Computation** - R package for management of tree data - Create a geom for trace plots - Implementation in Python **Other** - Color branches based on the dominant class or average value of observations - How to select the maximum depth? - Consider other metrics more focused on topology ] .pull-right-v2[ <img src="fig-static/music-trace-plot-int-static.png" width="100%" style="display: block; margin: auto;" /> <br> <br> <img src="fig-static/demo-section-scatter.png" width="100%" style="display: block; margin: auto;" /> .right[.small[Sectioned scatter plot image source: Urbanek (2008)]] ] --- ## References .smallmedium[ Banerjee, M., Y. Ding, and A. Noone (2012). "Identifying representative trees from ensembles". In: _Statistics in Medicine_ 31.15, pp. 1601-1616. ISSN: 1097-0258. DOI: [10.1002/sim.4492](https://doi.org/10.1002%2Fsim.4492). Chipman, H. A., E. I. George, and R. E. McCulloch (1998). "Making sense of a forest of trees". In: _Proceedings of the 30th Symposium on the Interface_, pp. 84-92. URL: [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.42.2598](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.42.2598). Cook, D. and D. F. Swayne (2007). _Interactive and Dynamic Graphics for Data Analysis, With R and GGobi_. 1st ed. Springer-Verlag New York. ISBN: 9780387717616. DOI: [10.1007/978-0-387-71762-3](https://doi.org/10.1007%2F978-0-387-71762-3). French, J. W., R. B. Ekstrom, and L. A. Price (1963). _Kit of reference tests for cognitive factors_. Educational Testing Service. Princeton, NJ. Hare, E., H. Hofmann, and A. Carriquiry (2017). "Automatic matching of bullet land impressions". In: _Annals of Applied Statistics_ 11.4, pp. 2332-2356. DOI: [10.1214/17-AOAS1080](https://doi.org/10.1214%2F17-AOAS1080). Kuznetsova, N. (2014).
"Random forest visualization". Supervised by Michel Westenberg. Eindhoven, Netherlands. Miglio, R. and G. Soffritti (2004). "The comparison between classification trees through proximity measures". In: _Computational Statistics & Data Analysis_ 45.3, pp. 577-593. ISSN: 0167-9473. DOI: [10.1016/s0167-9473(03)00063-x](https://doi.org/10.1016%2Fs0167-9473%2803%2900063-x). Shannon, W. D. and D. Banks (1999). "Combining classification trees using MLE". In: _Statistics in Medicine_ 18.6, pp. 727-740. ISSN: 1097-0258. DOI: [10.1002/(sici)1097-0258(19990330)18:6<727::aid-sim61>3.0.co;2-2](https://doi.org/10.1002%2F%28sici%291097-0258%2819990330%2918%3A6%3C727%3A%3Aaid-sim61%3E3.0.co%3B2-2). URL: [https://onlinelibrary.wiley.com/doi/epdf/10.1002/%28SICI%291097-0258%2819990330%2918%3A6%3C727%3A%3AAID-SIM61%3E3.0.CO%3B2-2](https://onlinelibrary.wiley.com/doi/epdf/10.1002/%28SICI%291097-0258%2819990330%2918%3A6%3C727%3A%3AAID-SIM61%3E3.0.CO%3B2-2). Sies, A. and I. V. Mechelen (2020). "C443: a Methodology to See a Forest for the Trees". In: _Journal of Classification_ 37.3, pp. 730-753. ISSN: 0176-4268. DOI: [10.1007/s00357-019-09350-4](https://doi.org/10.1007%2Fs00357-019-09350-4). URL: [https://link.springer.com/article/10.1007/s00357-019-09350-4](https://link.springer.com/article/10.1007/s00357-019-09350-4). Urbanek, S. (2008). "Visualizing Trees and Forests". In: _Handbook of Data Visualization_. Ed. by C. Chen, W. Härdle and A. Unwin. Vol. 3. Berlin, Germany: Springer-Verlag, pp. 243-266. ISBN: 9783540330363. URL: [https://haralick.org/DV/Handbook\_of\_Data\_Visualization.pdf](https://haralick.org/DV/Handbook\_of\_Data\_Visualization.pdf). Weinberg, A. I. and M. Last (2019). "Selecting a representative decision tree from an ensemble of decision-tree models for fast big data classification". In: _Journal of Big Data_ 6.1, p. 23. DOI: [10.1186/s40537-019-0186-3](https://doi.org/10.1186%2Fs40537-019-0186-3). 
URL: [https://link.springer.com/article/10.1186/s40537-019-0186-3](https://link.springer.com/article/10.1186/s40537-019-0186-3). ] --- class: inverse, middle, center name: mylastslide # Thank you! <img src="fig-static/penguin-penguin.png" width="30%" style="display: block; margin: auto;" />
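--- ## Appendix: Distance Metric Sketch

The fit and partition metrics defined earlier can be sketched in a few lines of base R. The toy predictions and terminal-node assignments below are hypothetical, and TreeTracer's own implementation may differ:

```r
# Toy outputs from two hypothetical classification trees on n = 6 observations
pred_tree1 <- c("Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo", "Adelie")
pred_tree2 <- c("Adelie", "Gentoo", "Gentoo", "Chinstrap", "Adelie", "Adelie")
leaf_tree1 <- c(1, 1, 2, 3, 2, 1)  # terminal node of each observation in tree 1
leaf_tree2 <- c(1, 2, 2, 3, 1, 1)  # terminal node of each observation in tree 2

# Fit metric (classification case): proportion of observations
# on which the two trees' predictions disagree
fit_metric <- function(p1, p2) mean(p1 != p2)

# Partition metric: proportion of observation pairs on which the trees
# disagree about co-membership in a terminal node
partition_metric <- function(l1, l2) {
  same1 <- outer(l1, l1, "==")
  same2 <- outer(l2, l2, "==")
  lt <- lower.tri(same1)  # each pair i > j counted once, n-choose-2 pairs total
  mean(same1[lt] != same2[lt])
}

fit_metric(pred_tree1, pred_tree2)        # 2 of 6 disagree: 1/3
partition_metric(leaf_tree1, leaf_tree2)  # 6 of 15 pairs disagree: 0.4
```

The covariate metric follows the same pattern, counting mismatches between the trees' split-variable sets rather than their predictions.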