Overview

ggResidpanel provides an easy way to create and view diagnostic plots from models in R using ggplot2 graphics. The goal in creating the package was to allow a model to be passed to a function that returns a panel of diagnostic plots that can be viewed simultaneously. The panel allows the user to scan plots of interest to check for violations of model assumptions or lack of fit. The idea to portray the plots in a grid was motivated by the residual panel plots provided in SAS procedures. In addition to displaying plots in a panel, ggResidpanel can create panels of interactive plots and can show plots from multiple models in the same panel. Each of these tasks is carried out by applying one of the four functions listed below to a model.

  • resid_panel: Creates a panel of diagnostic plots of the residuals from a model
  • resid_interact: Creates an interactive panel of diagnostic plots of the residuals from a model
  • resid_xpanel: Creates a panel of diagnostic plots of the predictor variables
  • resid_compare: Creates a panel of diagnostic plots from multiple models

Currently, these functions work with models of type “lm”, “glm”, “lme”, “lmer”, “glmer”, and “lmerTest”. The package also includes a function that can be used with any model type and produces output similar to that of resid_panel.

  • resid_auxpanel: Creates a panel of diagnostic plots for model types not included in the package

All functions in the package include the ability to select which plots to include in the panel, ways to adjust plot characteristics, and options to change the figure format. Each function has a section in this vignette with details on how to use the function and examples.

Installation

The package can be installed from CRAN, or the development version can be installed from GitHub. The code below shows both options.

# Installs ggResidpanel from CRAN
install.packages("ggResidpanel")

# Installs the development version of ggResidpanel from GitHub
devtools::install_github("goodekat/ggResidpanel")

To use the package in R, load the library into your R session with the following code.

# Loads the library
library(ggResidpanel)

Example Data

The functions in this vignette will be demonstrated by using the trees data included in base R. The dataset contains information on the volume, girth, and height of 31 black cherry trees. The first six rows of the data are shown below.

# Displays the first six rows of the dataset
head(trees)

A linear model is fit below to determine if there is a linear relationship between the volume of the tree and its height and girth. This model will be used for examples throughout this vignette.

# Fits a linear model with a response variable of volume and predictor
# variables of height and girth
tree_model <- lm(Volume ~ Height + Girth, data = trees)
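The diagnostic plots in the following sections are built from quantities stored in the fitted model. As a quick check before plotting, the sketch below extracts the residuals and fitted values using only base R functions. The object names (n_obs, tree_resid, tree_fitted) are illustrative and are not used elsewhere in the vignette.

```r
# Refit the example model so this block runs on its own
tree_model <- lm(Volume ~ Height + Girth, data = trees)

# Number of observations used in the fit
n_obs <- nobs(tree_model)

# Raw residuals and fitted (predicted) values
tree_resid <- resid(tree_model)
tree_fitted <- fitted(tree_model)

# Estimated coefficients and a numerical summary of the residuals
coef(tree_model)
summary(tree_resid)
```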

resid_panel

Overview

The function resid_panel is applied to a model and returns a panel of diagnostic plots. It currently accepts the following models.

  • “lm”: models fit using the function lm from base R
  • “glm”: models fit using the function glm from base R
  • “lmer”, “glmer”, and “lmerTest”: models fit using the lmer or glmer functions from the lme4 package, including models fit with the lmerTest package loaded
  • “lme”: models fit using the lme function from the nlme package

The first argument in resid_panel is the model. In its most basic use, resid_panel is given only the model. The code below shows the figure that is created when tree_model is passed to resid_panel with no other options specified. This produces a panel of four plots: a residual plot, a normal quantile plot, an index plot, and a histogram of the residuals.

# Creates the default panel of plots based on the tree_model
resid_panel(tree_model)

Plots

The plots option in resid_panel allows the user to select which plots to include in the panel. There are three ways a user can do this.

  1. Specify an individual plot to create one plot
  2. Specify a vector of plots included in the package to create a panel including these plots
  3. Specify the name of a prespecified panel of plots included in the package

Explanations and examples for each of these options are included in the next three sections.
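As a preview, the three approaches might look like the sketch below. It assumes ggResidpanel is installed, and it assumes that "R" is one of the prespecified panel names accepted by the plots option. Check the package documentation for the full set of panel names.

```r
library(ggResidpanel)

# Refit the example model so this block runs on its own
tree_model <- lm(Volume ~ Height + Girth, data = trees)

# 1. A single plot, given as one plot name
resid_panel(tree_model, plots = "hist")

# 2. A panel built from a vector of plot names
resid_panel(tree_model, plots = c("resid", "index", "hist"))

# 3. A prespecified panel (assumes "R" is an accepted panel name)
resid_panel(tree_model, plots = "R")
```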

Individual Plots

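An individual plot can be created by including the option plots = "name of plot" in the resid_panel function. The name of the plot must be in quotation marks. There are currently nine plots included in the package with resid_panel. Their names in the package are as follows.

  • boxplot: boxplot of the residuals
  • cookd: Cook’s distance plot
  • hist: histogram of the residuals
  • index: index plot of the residuals
  • lev: residual-leverage plot
  • ls: location-scale plot
  • qq: normal quantile plot
  • resid: residual plot
  • yvp: plot of the response variable versus the predicted values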

All plots are available to be used with “lm” and “glm” models, but cookd, lev, and ls are not available to be used with “lmer”, “glmer”, “lmerTest”, and “lme” models. The details and examples of each plot are included below.

boxplot: Boxplot of the Residuals

The option of plots = "boxplot" creates a boxplot of the residuals.

This can be used to visualize the distribution of the residuals from the model. It may help to identify outliers or determine if the distribution of the residuals is skewed.

# Creates a boxplot of the residuals
resid_panel(tree_model, plots = "boxplot")

cookd: Cook’s Distance Plot

The option of plots = "cookd" creates a plot of the Cook’s distance values versus the observation numbers. It is only available for “lm” and “glm” models. The blue dashed horizontal line is placed at 4/n where n is the number of observations used to fit the model (Rawlings, Pantula, and Dickey 1998).

This plot can be used to check for influential points. Points above the dashed blue line are considered to be influential, and points with Cook’s D values much larger than the rest are of particular interest.

# Creates a Cook's D plot
resid_panel(tree_model, plots = "cookd")
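The same values can be inspected numerically. The sketch below uses the base R function cooks.distance and the 4/n cutoff described above to list the observations that would fall above the reference line. The object names are illustrative.

```r
# Refit the example model so this block runs on its own
tree_model <- lm(Volume ~ Height + Girth, data = trees)

# Cook's distance for each observation
cooks_d <- cooks.distance(tree_model)

# Reference line used in the plot: 4 / n
n <- nobs(tree_model)
cutoff <- 4 / n

# Observation numbers whose Cook's distance exceeds the cutoff
flagged <- which(cooks_d > cutoff)
cooks_d[flagged]
```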

hist: Histogram of the Residuals

The option of plots = "hist" creates a histogram of the residuals. The blue line is a normal density curve with a mean of zero and a standard deviation equal to the standard deviation of the residuals.

resid_panel includes a bins option to specify the number of bins in the histogram. By default, bins = 30, which matches the default number of bins in the ggplot2 geom_histogram function.

This is another plot that can be used to visualize the distribution of the residuals. In particular, the normal density curve allows for the comparison of the residuals to a normal distribution.

# Creates a histogram of the residuals
resid_panel(tree_model, plots = "hist")

# Creates a histogram with 20 bins
resid_panel(tree_model, plots = "hist", bins = 20)
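To make the reference curve concrete, the sketch below draws a comparable figure with base R graphics: a histogram of the residuals on the density scale, overlaid with a normal density that has mean zero and a standard deviation equal to that of the residuals. This is an illustration of the idea only, not the code used inside the package.

```r
# Refit the example model so this block runs on its own
tree_model <- lm(Volume ~ Height + Girth, data = trees)
tree_resid <- resid(tree_model)

# Standard deviation of the residuals, used as the sd of the reference curve
resid_sd <- sd(tree_resid)

# Histogram on the density scale so the bars and the curve are comparable
# (base R treats `breaks` as a suggestion, unlike the bins option above)
hist(tree_resid, breaks = 20, freq = FALSE,
     main = "Histogram of residuals", xlab = "Residuals")

# Normal density with mean 0 and sd equal to the sd of the residuals
curve(dnorm(x, mean = 0, sd = resid_sd), add = TRUE, col = "blue")
```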

index: Index Plot of the Residuals

The option of plots = "index" creates a plot of the residuals versus the observation numbers. A solid blue horizontal line through 0 is included for reference.

resid_panel includes a smoother option. If set to TRUE, a loess smoother will be included on the index plot as a red solid line. If set to FALSE, it will not be included. By default, smoother = FALSE. (This option also affects the lev, ls, and resid plots.)

This plot can be used to look for patterns in the residuals with respect to the order of the data used to fit the model. Often the data are ordered in a meaningful way, such as by time of observation. This plot can help to check whether there is any relationship between the residuals and the order of the data. If a trend is found in this plot, it may suggest that a variable that would help to explain the variation in the response variable has been excluded from the model.

# Creates an index plot of the residuals
resid_panel(tree_model, plots = "index")

# Creates an index plot with a smoother added
resid_panel(tree_model, plots = "index", smoother = TRUE)

lev: Residual-Leverage Plot

The option of plots = "lev" creates a plot of the standardized residuals versus the leverage values. This plot is only available for “lm” and “glm” models. A horizontal line through 0 and a vertical line through 0 are included as black dashed lines to mimic the residual-leverage plot created by the plot.lm function from base R. The red dashed lines are Cook’s distance contour lines for Cook’s D values of 0.5 and 1. These values were chosen based on the default options used in plot.lm.

The smoother option in resid_panel also affects the residual-leverage plot. If set to TRUE, a loess smoother will be included on the residual-leverage plot as a red solid line. If set to FALSE, it will not be included. By default, smoother = FALSE. (This option also affects the index, ls, and resid plots.)

The Cook’s D contour lines are computed using the fact that Cook’s distance can be written as a function of the leverage and the standardized residual. For observation \(i\), let \(D_i\) represent the Cook’s distance, \(r_i\) represent the standardized residual, and \(h_i\) represent the leverage value. Finally, let \(p\) be the rank of the model. Cook’s distance can be computed as \[D_i = \frac{r^2_i}{p}\left(\frac{h_i}{1-h_i}\right)\] (Seber and Lee 2003). Thus, given a specified value of Cook’s D, a leverage value (\(h_i\)), and the rank of the model (\(p\)), it is possible to solve for the value of the standardized residual (\(r_i\)). The value of \(D_i=1\) is used since a data point with a value of Cook’s D larger than 1 is often considered to be highly influential (Chatterjee and Hadi 2012).
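Solving the expression above for the standardized residual gives \(r_i = \pm\sqrt{D_i \, p \, (1-h_i)/h_i}\). The sketch below checks the formula against base R for the example model and then evaluates the upper contour lines for Cook’s D values of 0.5 and 1. The helper contour_r is hypothetical and written only for this illustration; it is not a function from the package.

```r
# Refit the example model so this block runs on its own
tree_model <- lm(Volume ~ Height + Girth, data = trees)

r <- rstandard(tree_model)   # standardized residuals
h <- hatvalues(tree_model)   # leverage values
p <- tree_model$rank         # rank of the model

# Cook's distance written in terms of r, h, and p
d_formula <- (r^2 / p) * (h / (1 - h))

# Compare with the values returned by cooks.distance()
all.equal(d_formula, cooks.distance(tree_model))

# Hypothetical helper: standardized residual on the contour for a given
# Cook's D value and leverage (the lower branch is the negative of this)
contour_r <- function(d, h, p) sqrt(d * p * (1 - h) / h)

# Upper contour lines for D = 0.5 and D = 1 over the observed leverage range
h_grid <- seq(min(h), max(h), length.out = 50)
upper_half <- contour_r(0.5, h_grid, p)
upper_one <- contour_r(1, h_grid, p)
```

Because the contour value decreases as the leverage increases, the lines lie far from zero at low leverage values, which is why they can fall outside the plotted range.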

This plot can be used to look for trends in the residuals based on the leverage values and to identify influential points. Points that fall outside of the Cook’s D contour lines may be of interest. Points that fall outside of either contour line with Cook’s D set to 1 are considered to be highly influential. As seen in the plot below, some contour lines may not appear when the plot is created if they fall far outside of the range of the observed leverage values.

# Creates a residual-leverage plot
resid_panel(tree_model, plots = "lev")

# Creates a residual-leverage plot with a smoother added
resid_panel(tree_model, plots = "lev", smoother = TRUE)

ls: Location-Scale Plot

The option of plots = "ls" creates a location-scale plot of the residuals. This plot is only available for “lm” and “glm” models. It plots the square root of the absolute value of the standardized residuals on the y-axis and the predicted values on the x-axis. The predicted values are plotted on the original scale for “glm” models.

The smoother option in resid_panel affects the appearance of the location-scale plot. If set to TRUE, a loess smoother will be included on the location-scale plot as a red solid line. If set to FALSE, it will not be included. By default, smoother = FALSE. (This option also affects the index, lev, and resid plots.)

The location-scale plot can be used to check for patterns in the residuals in relation to the predicted values. For example, constant variance of the residuals can be assessed by determining whether the residuals show equal spread along the range of the predicted values. In the ideal situation, the loess curve would be a straight horizontal line with points evenly dispersed around it over the whole range of the predicted values.

# Creates a location-scale plot of the residuals
resid_panel(tree_model, plots = "ls")
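The quantities on each axis can also be computed directly. The sketch below uses base R to build the two axes described above and adds a loess curve with the default loess settings, which may differ from the settings used by the package. The object names are illustrative.

```r
# Refit the example model so this block runs on its own
tree_model <- lm(Volume ~ Height + Girth, data = trees)

# x-axis: predicted values; y-axis: sqrt of |standardized residuals|
predicted <- fitted(tree_model)
sqrt_abs_resid <- sqrt(abs(rstandard(tree_model)))

plot(predicted, sqrt_abs_resid,
     xlab = "Predicted values",
     ylab = "sqrt(|standardized residuals|)")

# Loess curve through the points, drawn in order of the predicted values
smooth_fit <- loess(sqrt_abs_resid ~ predicted)
ord <- order(predicted)
lines(predicted[ord], fitted(smooth_fit)[ord], col = "red")
```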