FORESTR: Searching for patterns in random forests
Talk given at the University of Nebraska-Lincoln Statistics Department Seminar.
Random forests have become a popular tool for data-driven prediction and, as a result, are used or being considered for use in national security mission applications. While individual regression and decision trees are typically considered interpretable, random forests are inherently difficult to interpret because their predictions come from an ensemble of trees rather than a single tree. A lack of model transparency may be undesirable in high-consequence applications. We aim to increase the interpretability of random forests by finding patterns in the ensemble of trees. As a starting point, we develop a new distance metric that quantifies the similarity between trees based on their topologies (i.e., shapes). We base the metric on a novel distance metric for graphs that is a proper mathematical distance, is invariant to transformations, provides registration between graphs, and computes topological evolutions between graphs. The tree distance metric enables the computation of tree statistics (e.g., a “mean” tree) and the identification of tree clusters. We apply the methodology to a toy dataset and to a mission-relevant product inspection dataset to demonstrate how the metric provides insight into random forests. Finally, we discuss limitations of the approach and ideas for future research.
SNL is managed and operated by NTESS under DOE NNSA contract DE-NA0003525. SAND2025-02523A.
