As consultants, we encounter projects from many different
disciplines. Our clients are often fitting linear models. They may have
data similar to the ChickWeight
data found in the R
datasets.
library(dplyr)
library(ggplot2)
# Look at the first few rows of the data
head(ChickWeight)
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
# Look at the structure of the data
str(data.frame(ChickWeight))
## 'data.frame': 578 obs. of 4 variables:
## $ weight: num 42 51 59 64 76 93 106 125 149 171 ...
## $ Time : num 0 2 4 6 8 10 12 14 16 18 ...
## $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
## $ Diet : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
Suppose that the client would like to determine if there is a difference in average chick weights measured on the final measurement day (day 21) between the diet treatments.
# Subset the data to only inlcude the measurements taken on day 21
# (the final measurement time)
ChickWeight_final <- ChickWeight %>% filter(Time == 21)
# Create boxplots of the chick weights for each diet
ggplot(ChickWeight_final, aes(x = Diet, y = weight)) +
geom_boxplot() +
labs(y = "Weight")
The client may come to us with the following analysis and ask if what they have done is correct.
# Fit a linear model
(lm_model <- lm(weight ~ Diet, data = ChickWeight_final))
##
## Call:
## lm(formula = weight ~ Diet, data = ChickWeight_final)
##
## Coefficients:
## (Intercept) Diet2 Diet3 Diet4
## 177.75 36.95 92.55 60.81
# Look at the anova table with type III sums of squares
# (there are a different number of chicks per treatment)
car::Anova(lm_model, type = "III")
## Anova Table (Type III tests)
##
## Response: weight
## Sum Sq Df F value Pr(>F)
## (Intercept) 505521 1 123.4892 6.069e-14 ***
## Diet 57164 3 4.6547 0.006858 **
## Residuals 167839 41
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Look at the lsmeans and pairwise differences
#emmeans::emmeans(lm_model, pairwise ~ Diet)
What has been done seems reasonable, but the client has forgotten to look at residual plots to check the model assumptions. This is a common oversight.
# Create a residual plot
plot(resid(lm_model) ~ fitted(lm_model))
abline(h = 0)
# Create a normal quantile plot
qqnorm(resid(lm_model))
qqline(resid(lm_model))
plot
function# Create residual diagnostic plots
plot(lm_model)
# Create a dataset with the residuals and fitted values
ggdata <- data.frame(resid = resid(lm_model),
fitted = fitted(lm_model))
head(ggdata)
## resid fitted
## 1 27.25 177.75
## 2 37.25 177.75
## 3 24.25 177.75
## 4 -20.75 177.75
## 5 45.25 177.75
## 6 -20.75 177.75
# Create a residual plot
ggplot(ggdata, aes(x = fitted, y = resid)) +
geom_point() +
geom_hline(yintercept = 0) +
labs(x = "Fitted Values", y = "Residuals")
# Create a normal quantile plot
library(qqplotr)
ggplot(ggdata, mapping = aes(sample = resid)) +
stat_qq_band() +
stat_qq_line() +
stat_qq_point() +
labs(x = "Theoretical Quantiles", y = "Sample Quantiles")
Many of the clients we work with use SAS. We both like the
residualpanel
plot option in proc mixed, because it
provides a panel of residual diagnostic plots.
The following SAS code fits the same linear model in SAS and outputs the residual panel of plots shown below.
proc import out = ChickWeight_final
datafile = '/folders/myfolders/ggResidpanel example/ChickWeight_final.csv'
dbms = csv
replace;
getnames = yes;
datarow = 2;
run;
proc mixed data = ChickWeight_final plots = residualpanel;
class Diet;
model weight = Diet;
run;
We were inspired by the SAS residual panel to create an R package that provided an easy way to view multiple diagnostic plots created using ggplot2 at the same time. Additionally, we wanted the package to…
This led to the creation of ggResidpanel.
library(ggResidpanel)
resid_panel(lm_model)
Follow these instructions to install ggResidpanel.
# (1) Install the package devtools (if not already installed)
install.packages("devtools")
# (2) Install ggResidpanel from the GitHub repository
devtools::install_github("goodekat/ggResidpanel")
# (3) Load the package
library(ggResidpanel)
resid_panel
Provides a variety of plot and appearance option to help in assessing the model:
resid_panel(model, plots = "default", type = NA, bins = NA, smoother = FALSE, qqline = TRUE,
qqbands = FALSE, scale = 1, theme = "bw", axis.text.size = 10, title.text.size = 12, title.opt = TRUE, ind.ncol = 2)
Currently accepts model types:
lm
: linear modelsglm
: generalized linear modelslmerMod
: mixed linear modelslmerModLmerTest
: mixed linear models using
lmerTest
glmerMod
: mixed generalized linear modelsDefault grid:
resid
: residual plotqq
: normal quantile plotindex
: residual verse observation numberhist
: histogram of residualsThe plots in this default grid were chosen since they can be created for any model/residual type:
library(lme4)
# Generate Poisson data
example_data2 <- data.frame(y = rpois(54, 3),
trt = rep(c("A", "B"), each = 27),
subject = rep(1:18, each = 3))
# Fit a generalized linear mixed effects model with a Poisson family to compare
# the response between the treatments with a random effect for subject to
# account for the dependence within a subject
glmer_model <- glmer(y ~ trt + (1|subject), family = "poisson", data = example_data2)
resid_panel(glmer_model)
The index plot was chosen as part of the default plot for checking for unexpected trends.
For example, a comment we often hear in consulting is that they ‘don’t care’ about that variable so they do not want to include it. This may be a time variable, and the data set is often ordered by time.
#Order Chick data
ChickWeight_ordered <- ChickWeight[order(ChickWeight$Time),]
lm_model2 <- lm(weight~Diet, data = ChickWeight_ordered)
resid_panel(lm_model2)
plots
The user is not limited to these four plots. The full list of available plots is:
boxplot
: A boxplot of residualscookd
: A plot of Cook’s D values versus observation
numbershist
: A histogram of residualsindex
: A plot of residuals versus observation
numbersls
: A location scale plot of the residualsqq
: A normal quantile plot of residualslev
: A plot of leverage values versus residualsresid
: A plot of residuals versus predicted valuesyvp
: A plot of observed response values versus
predicted valuesThere are also two alternative grids available:
R
: shows the four plots created by the
plot
functionSAS
: shows the four plots created by the
residualpanel
option in SASall
: shows all available plots for that model typeresid_panel(lm_model, plots = "R")
resid_panel(lm_model, plots = "SAS")
resid_panel(lm_model, plots = "all")
The user can also specify their own grid:
resid_panel(lm_model, plots = c("resid", "qq", "cookd", "yvp"))
type
There are a variety of options for residuals for each of the model types:
lm
residual options
pearson
:The Pearson residualsresponse
: The raw residuals (Default for
lm
)standardized
: The standardized raw residualsglm
residual options
pearson
: The Pearson residualsdeviance
: The deviance residuals (Default for
glm
)response
: The raw residualsstand.deviance
: The standardized deviance
residualsstand.pearson
: The standardized Pearson residualslmer
and lmerTest
residual options
pearson
: The Pearson residuals (Default for
lmer
and lmerTest
)response
: The raw residualsglmer
residual options
pearson
: The Pearson residualsdeviance
: The deviance residuals (Default for
glmer
)response
: The raw residualsFor example, the plot below creates a residual plot with the standardized residuals:
resid_panel(lm_model, plots = c("resid"), type = "standardized")
This plots looks at the standardized deviance residuals for the
glm
model:
# Fit a generalized linear regression model using a Poisson family to compare
# the insect counts between different sprays from the R "InsectSprays" data
glm_model <- glm(count ~ spray, family = "poisson", data = InsectSprays)
resid_panel(glm_model, plots = "R", type = "stand.deviance")
Note the default is automatically selected and labeled on the plots when fitting a mixed linear model to the chick data to incorporate the time:
lmer_model <- lmer(weight ~ Time + Diet + Time*Diet + (1|Chick), data = ChickWeight_ordered)
resid_panel(lmer_model)
Change the number of bins on the histogram:
resid_panel(glm_model, plots = "hist", bins = 40)
Add a smoother to the residual plot:
resid_panel(lmer_model, smoother = TRUE)
qq
plotqq
plotresid_panel(lmer_model, plots = "SAS", qqline = FALSE, qqbands = TRUE, scale = .75)
resid_panel(glm_model, plots = c("cookd", "resid"), theme = "classic", title.text.size = 16, axis.text.size = 12, nrow = 2, title.opt = FALSE)
resid_interact
The function resid_interact
creates interactive versions
of the residual diagnostic plots. resid_interact
makes use
of plotly to show additional information about a point when the mouse is
hovered over it. The tool tip shows:
All formatting options that were available with
resid_panel
are available with
resid_interact
.
For now, only one plot can be shown at a time. The available plots are as follows:
boxplot
: A boxplot of residualscookd
: A plot of Cook’s D values versus observation
numbershist
: A histogram of residualsindex
: A plot of residuals versus observation
numbersls
: A location scale plot of the residualsqq
: A normal quantile plot of residualslev
: A plot of leverage values versus residualsresid
: A plot of residuals versus predicted valuesyvp
: A plot of observed response values versus
predicted valuesBelow are some interactive versions of residual plots created using
resid_interact
.
resid_interact(lm_model, type = "standardized", title.opt = FALSE)
resid_interact(lm_model, plot = "cookd", theme = "grey")
resid_auxpanel
Since the function resid_panel
only works with certain
model types, we wanted to include an additional function (or an
auxiliary function) that can be used with any model. This led to the
creation of resid_auxpanel
.
Similar to resid_panel
, resid_auxpanel
creates a panel of residual diagnostic plots, but instead of inputting a
model in the function, the residuals and fitted values are input. This
causes some plot options to no longer be available. However, all
formatting options are still available with
resid_auxpanel
.
plots
:all
: This creates a panel of all plot types included in
the package that are available for resid_auxpanel
default
: This creates a panel of a residual plot, a
normal quantile plot of the residuals, an index plot of the residuals,
and a histogram of the residuals. (This is the default option.)SAS
: This creates a panel of a residual plot, a normal
quantile plot of the residuals, a histogram of the residuals, and a
boxplot of the residuals.boxplot
: A boxplot of residualshist
: A histogram of residualsqq
: A normal quantile plot of residualsresid
: A plot of residuals versus predicted valueindex
: A plot of residuals versus observation
numbersLet us fit a random forest model to the mtcars
data to
predict the mpg based on all other variables. We would like to look at
the residuals to see if there is any pattern in the residuals in
relationship to the fitted values. A histogram of the residuals is also
requested.
# Fit a random forest model to the mtcars data to predict the mpg
rf_model <- randomForest::randomForest(x = mtcars[,2:11], y = mtcars[,1])
# Obtain the predictions from the model on the observed data
rf_pred <- predict(rf_model, mtcars[,2:11])
# Obtain the residuals from the model
rf_resid <- mtcars[,1] - rf_pred
# Create a panel with the residual and index plot
resid_auxpanel(rf_resid, rf_pred, plots = c("resid", "index"), theme = "classic")