MAJOR UPDATE

As of Fall 2021, this wiki has been discontinued and is no longer being actively developed.

All updated materials and announcements for the QCBS R Workshop Series are now housed on the QCBS R Workshop website. Please update your bookmarks accordingly to avoid outdated material and/or broken links.

Thank you for your understanding,

Your QCBS R Workshop Coordinators.

QCBS R Workshops

This series of 10 workshops walks participants through the steps required to use R for a wide array of statistical analyses relevant to research in biology and ecology. These open-access workshops were created by members of the QCBS both for members of the QCBS and the larger community.

The content of this workshop has been peer-reviewed by several QCBS members. If you would like to suggest modifications, please contact the current series coordinators, listed on the main wiki page

Workshop 4: Linear models

Developed by: Catherine Baltazar, Bérenger Bourgeois, Zofia Taranu, Shaun Turney, Willian Vieira

Summary: In this workshop, you will learn how to implement basic linear models commonly used in ecology in R such as simple regression, analysis of variance (ANOVA), analysis of covariance (ANCOVA), and multiple regression. After verifying visually and statistically the assumptions of these models and transforming your data when necessary, the interpretation of model outputs and the plotting of your final model will no longer keep secrets from you!

Link to new Rmarkdown presentation

Link to old Prezi presentation

Download the R script and data for this lesson:

  1. Simple linear regression
  2. T-test
  3. ANOVA
  4. Two-way ANOVA
  5. Unbalanced ANOVA
    (advanced section/ optional)
  6. ANCOVA
  7. Multiple linear regression

1.1 Defining mean and variation

Scientists have always been interested in determining relationships between variables. In this workshop we will learn how to use linear models, a set of models that quantify relationships between variables.

To begin, we will define some important concepts that are central to linear models: mean and variation. Mean is a measure of the average value of a population. Suppose we have a random variable x, for example the height of the people in this room, and we would like to see some patterns of this variable. The first way we will present it will be by using the mean (be aware that there are many ways of measuring it):

But the mean alone will not represent it a population fully. We can also describe the population using measures of variation. Variation is the spread around the mean. For example, whether all people in the room are approximately the same height (low variation) or if there are many tall and short people (high variation). Mean deviation, variance, standard deviation and the coefficient of deviation are all measures of variation which we will define below. We can measure deviation of each element from the mean:

With the deviation for each value, we can calculate the mean deviation:

To convert all variables to positive values without the need of absolute values, we can also square the value. And that is where the variance comes from:

However, by squaring each value, our variables are no longer in meaningful units. Back to our example with the height of people in this room, the variance will be $ mˆ2 $ which is not what we’re measuring. To convert the value to meaningful units, we can calculate the standard deviation:

Finally, to express the coefficient of variation, known as the relative standard deviation, expressed in percentage, we have:

1.2 Linear models

In linear models, we use the concepts of mean and variation to describe the relationship between two variables. Linear models are so named because they describe the relationship between variables as lines:

{y_i} = {β_0} + {β_1}{x_{i1}} + ... + {β_p}{x_{ip}} + {ε_i}

where

{y_i} is the response variable,
{β_0} is the intercept of the regression line,
{β_1} is the coefficient of variation for the first explanatory variable,
{β_p} is the coefficient of variation for the pth explanatory variable,
{x_i1} is the first explanatory variable,
{x_ip} is the pth explanatory variable,
{ε_i} are the residuals of the model

The response variable is the variable you want to explain. It's also known as the dependent variable. There is only one response variable. The explanatory variables are the variables you think may explain your response variable. They're also known as independent variables. There can be one or many explanatory variables. For example, suppose we want to explain variation in the height of people in a room. Height is the response variable. Some possible explanatory variables could be gender or age.

In linear models the response variable must be continuous, while the explanatory variables can be continuous or categorical. A continuous variable has an infinite number of possible values. A categorical variable has a limited number of possible values. Age, temperature, and latitude are all continuous variables. Sex, developmental stage, and country are all categorical variables. For continuous explanatory variables, the linear model tests whether there is a significant correlation between the explanatory and response variable. For categorical explanatory variables, the linear model tests whether there is a significant difference between the different levels (groups) in their mean value of the response variable. This should become clearer as we learn about specific types of linear models in the sections below.

In almost all cases, the explanatory variables will not explain all of the variation in the response variable. Gender and age, for example, will not be enough to predict everyone's height perfectly. The remaining, unexplained variation is called error or residuals.

The goal of the linear model is to find the best estimation of the parameters (the β variables) and then assess the goodness of fit of the model. Several methods have been developed to calculate the intercept and coefficients of linear models, and the appropriate choice depends on the model. The general concept behind these methods is that the residuals are minimized.

Depending on the kind of explanatory variables considered and their number, different statistical tools can be used to assess these relationships. The table below lists the five types of statistical analysis that will be covered in this workshop:

Statistical analysis Type of response
variable Y
Type of explanatory
variable X
Number of
explanatory variables
Number of
levels k
Simple linear regression Continuous Continuous 1
t-test Categorical 1 2
ANOVA Categorical 1 (one-way ANOVA), 2 (two-
way ANOVA) or more
3 or more
ANCOVA Continuous
AND categorical
2 or more 2 or more
Multiple regression Continuous 2 or more

1.3 Linear model assumptions

To be valid, a linear models must meet 4 assumptions, otherwise the model results cannot be safely interpreted.

  1. The residuals are independent
  2. The residuals are normally distributed
  3. The residuals have a mean of 0
  4. The residuals are homoskedastic (they have constant variance)

Note that all of these assumptions concern the residuals, not the response or explanatory variables. The residuals must be independent, meaning that there isn't an underlying structure missing from the model (usually spatial or temporal autocorrelation). The residuals must be normally distributed with a mean of 0, meaning that the largest proportion of residuals have a value close to 0 (ie, the error is small) and the distribution is symmetrical (ie, the response variable is overestimated and underestimated equally often). The residuals must be homoskedastic, meaning that error doesn't change much as the value of the predictor variables change.

In the following sections, we do not always explicitly restate the above assumptions for every model. Be aware, however, that these assumption are implicit in all linear models, including all models presented below.

1.4 Test statistics and p-values

Once you've run your model in R, you will receive a model output that includes many numbers. It takes practice to understand what each of these numbers means and which to pay the most attention to. The model output includes the estimation of the parameters (the β variables). The output also includes test statistics. The particular test statistic depends on the linear model you are using (t is the test statistic for the linear regression and the t test, and F is the test statistic for ANOVA).

In linear models, the null hypothesis is typically that there is no relationship between two continuous variables, or that there is no difference in the levels of a categorical variable. The larger the absolute value of the test statistic, the more improbable that the null hypothesis is true. The exact probability is given in the model output and is called the p-value. You could think of the p-value as the probability that the null hypothesis is true, although that's a bit of a simplification. (Technically, the p-value is the probability that, given the assumption that the null hypothesis is true, the test statistic would be the same as or of greater magnitude than the actual observed test statistic.) By convention, we consider that if the p value is less than 0.05 (5%), then we reject the null hypothesis. This cut-off value is called α (alpha). If we reject the null hypothesis then we say that the alternative hypothesis is supported: there is a significant relationship or a significant difference. Note that we do not “prove” hypotheses, only support or reject them.

1.5 Work flow

Below we will explore several kinds of linear models. The way you create and interpret each model will differ in the specifics, but the principles behind them and the general work flow will remain the same. For each model we will work through the following steps:

  1. Visualize the data (data visualization could also come later in your work flow)
  2. Create a model
  3. Test the model assumptions
  4. Adjust the model if assumptions are violated
  5. Interpret the model results

Simple linear regression is a type of linear model which contains a single, continuous explanatory variable. The regression tests whether there is a significant correlation between the two variables.

Simple linear regression involves two parameters which must be estimated: an intercept ({β_0}) and a coefficient of correlation ({β_1}). Ordinary least squares is the most widely used estimation method, and also corresponds to the default method of the lm function in R. Ordinary least squares fits a line such that the sum of the squared vertical distances between the observed data and the linear regression model (i.e. the residuals) is minimized.

Click below to see the math in more detail.

Click to display ⇲

Click to hide ⇱

Using ordinary least squares, the intercept β0 and the slope β1 of the linear regression can be calculated as:

β_{1}={sum{i}{}{(x_{i}y_{i})}-overline{x}overline{y}}/sum{i}{}{(x_{i}-overline{x})}^2 = {Cov(x,y)}/{Var(x)}

β_{0}=overline{y}-β_{1}overline{x}

2.1 Running a linear model

Using the bird dataset, we will first examine the linear regression of maximum abundance as a function of mass.

In R, linear regression is implemented using the lm function from the stats package:

lm (y ~ x)

Note: before using a new function in R, users should refer to its help documentation (?functionname) to find out how to use the function as well as its preset default methods.
| Load and explore your data
# Loading libraries and bird dataset 
library(e1071)
library(MASS)
setwd("~/Desktop/...") # Don't forget to set your working directory (note: your directory will be different)
bird<-read.csv("birdsdiet.csv") 
 
# Visualize the dataframe
names(bird)
str(bird)
head(bird)
summary(bird) 
plot(bird)

The bird dataset contains 7 variables:

Variable Name Description Type
Family Common name of family String
MaxAbund The highest observed abundance at any site
in North America
Continuous/ numeric
AvgAbund The average abundance across all sites
where found in NA
Continuous/ numeric
Mass The body size in grams Continuous/ numeric
Diet Type of food consumed Discrete – 5 levels (Plant; PlantInsect;
Insect; InsectVert; Vertebrate)
Passerine Is it a songbird/ perching bird Boolean (0/1)
Aquatic Is it a bird that primarily lives in/ on/ next to the water Boolean (0/1)

Note that Family, Diet, Passerine, and Aquatic are all categorical variables although they are encoded in different ways (string, discrete, boolean).

We are now ready to run our linear model:

| Regression of Maximum Abundance on Mass
lm1 <- lm(bird$MaxAbund ~ bird$Mass) # where Y ~ X means Y "as a function of" X>

2.2 Verifying assumptions

| Diagnostic plots
opar <- par(mfrow=c(2,2)) # draws subsequent figures in a 2-by-2 panel
plot(lm1)
par(opar) # resets to 1-by-1 plot

Verifying independence

Linear models can only be applied to independent data. This means that the yi at a given xi value must not be influenced by other xi values. Violation of independence can happen if your data represent some form of dependence structure, such as spatial or temporal correlation.

There is no simple diagnostic plot for independence, unfortunately. Instead, you must consider your data carefully. Is there some underlying structure in your data that makes your data points dependent on each other? If you collect data from the same sites over time (ie, a time series) or if you collect multiple data points from the same organism, your data violates the assumption of independence. You will need to use a different type of model instead.

Verifying residual variance is constant and residual mean is 0

Residual vs Fitted plot - The first graph of the diagnostic plots is called by plot(lm1). This plot illustrates the spread of the residuals between each fitted values. Each point represents the distance of the response variable from the model prediction of the response variable. If the residuals spread randomly around the 0 line, this indicates that the relationship is linear and that the mean of the residuals is 0. If the residuals form an approximate horizontal band around the 0 line, this indicates homogeneity of error variance (ie, it is homoskedastic). If the residuals form a funnel shape, this indicates the residuals are not homoskedastic.

Scale-location plot - The third graph of the diagnostic plots enables one to verify whether the residual spread increases with a given fitted values (i.e. identifies whether the spread in the residuals is due to the selected explanatory variable). If the spread increases, the homoscedasticity assumption is not respected.

Verifying that residuals are normally distributed

QQ plot - Normality can be assessed from the QQplot of the diagnostic plots. This graph compares the probability distribution of the model residuals to the probability distribution of normal data series. If the standardized residuals lie linearly on the 1:1 line of the QQplot, the residuals can be considered normally distributed.

The points of the QQplot are nonlinear, which suggests that the residuals are not normally distributed.

Checking for high leverage

In addition to the assumption testing above, we are also interested in whether any of our data points have high leverage. This is not assumption testing per se, but it will affect our interpretation of the data. If some of the observations in a dataset possess strongly different values from others, a model fitting problem can arise such that these high leverage data influence the model calculation.

Residuals vs Leverage plot - High leverage data can be visualised on the fourth diagnostic plots (i.e. residuals vs leverage), which identifies the observation numbers of the high leverage data point(s). If (and only if!) these observations correspond to mis-measurements or represent exceptions, they can be removed from the original dataset.

2.3 Normalizing data

In the example provided above, the model residuals were not normally distributed and therefore the assumption of residual normality is violated. We may still be able to use a linear regression model if we can address this violation. The next step is to try to normalize the variables using transformations. Often if we can make the explanatory and/or response variables normally distributed then the model residuals will become normally distributed. In addition to QQ-plots we can assess the normality of a variable by drawing a histogram using the function hist, and check visually whether the data series appears to follow a normal distribution. For example:

| Testing Normality: hist() function
# Plot Y ~ X and the regression line
plot(bird$MaxAbund ~ bird$Mass, pch=19, col="coral", ylab="Maximum Abundance", 
     xlab="Mass")
abline(lm1, lwd=2) 
?plot # For further details on plot() arguments
# see colours() for list of colours
 
# Is the data normally distributed?
hist(bird$MaxAbund,col="coral", main="Untransformed data", 
     xlab="Maximum Abundance")
hist(bird$Mass, col="coral", main="Untransformed data", xlab="Mass")

A third way to assess normality is to use the Shapiro-Wilk normality test that compares the distribution of the observed data series to a normal distribution using the function shapiro.test.

The null and alternate hypotheses of this test are:

H0: the observed data series is normally distributed,
H1: the observed data series is not normally distributed,

The observed data series can be considered normally distributed when the p-value calculated by the Shapiro-Wilk normality test is greater than or equal to α, typically set to 0.05.

Testing Normality: shapiro.test() function
# Test null hypothesis that the sample came from a normally distributed population 
shapiro.test(bird$MaxAbund) 
shapiro.test(bird$Mass) 
# If p < 0.05, then distribution is not-normal
# if p > 0.05, then distribution is normal

We can also evaluate the skewness of each distribution using the Skewness function:

Testing Normality: skewness() function
skewness(bird$MaxAbund) 
skewness(bird$Mass) 
# where positive values indicate a left-skewed distribution, and negative value a right skew.

The histograms, Shapiro tests and Skewness all indicate that the variables need to be transformed to normalize (e.g. a log10 transformation).

2.4 Data transformation

In case of non-normality, response and explanatory variables can be transformed to enhance their normality following these rules:

Type of distribution Transformation R function
Moderately positive skewness sqrt{x} sqrt(x)
Substantially positive skewness log_10{(x)}log10(x)
Substantially positive skewness log_10{(x+C)}log10(x + C) where C is a constant
added to each value of x so that
the smallest score is 1
Moderately negative skewness sqrt{(K-x)} sqrt(K - x) where K is a constant
subtracted from each value of x
so that the smallest score is 1
Substantially negative skewness log_10{(K-x)} log10(K - x)

Thus, log10 transformations should be applied and saved in the bird data frame. The model can then be re-run, verified and interpreted.

| Data Transformation
# Add log10() transformed variables to your dataframe
bird$logMaxAbund <- log10(bird$MaxAbund)
bird$logMass <- log10(bird$Mass)
names(bird) # to view the dataframe + new transformed variables
 
hist(bird$logMaxAbund,col="yellowgreen", main="Log transformed", 
     xlab=expression("log"[10]*"(Maximum Abundance)"))
hist(bird$logMass,col="yellowgreen", main="Log transformed",
     xlab=expression("log"[10]*"(Mass)"))
shapiro.test(bird$logMaxAbund); skewness(bird$logMaxAbund)
shapiro.test(bird$logMass); skewness(bird$logMass)
 
# Re-run your analysis with the appropriate transformations
lm2 <- lm(bird$logMaxAbund ~ bird$logMass)
 
# Are there remaining problems with the diagnostics (heteroscedasticity, non-independence, high leverage)?
opar <- par(mfrow=c(2,2))
plot(lm2, pch=19, col="gray")
par(opar)

2.5 Model output

Once all these assumptions have been verified, the model results can be interpreted. These results are called in R using the function summary.

| Summary output
# Now we can look at the model coefficients and p-values
summary(lm2)
 
# You can also just call up the coefficients of the model
lm2$coef
 
# What else?
str(summary(lm2))
summary(lm2)$coefficients # where Std. Error is the standard error of each estimate
summary(lm2)$r.squared # Coefficient of determination
summary(lm2)$adj.r.squared # Adjusted coefficient of determination
summary(lm2)$sigma # Residual standard error (square root of Error Mean Square)
# etc…
 
# You can also check for yourself the equation for R2:
SSE = sum(resid(lm2)^2)
SST = sum((bird$logMaxAbund - mean(bird$logMaxAbund))^2)
R2 = 1 - ((SSE)/SST)
R2

The output of this function presents all the results of your validated model:

lm(formula = logMaxAbund ~ logMass, data = bird)
Residuals:
  		Min       	1Q   		Median       	3Q      	Max 
               -1.93562 	-0.39982  	0.05487  	0.40625  	1.61469
          	Estimate 	Std. Error 	t value 	Pr(>|t|)    
(Intercept)     1.6724     	0.2472   	6.767 		1.17e-08 ***
logMass	       -0.2361      	0.1170  	-2.019   	0.0487 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6959 on 52 degrees of freedom
Multiple R-squared:  0.07267,   Adjusted R-squared:  0.05484 
F-statistic: 4.075 on 1 and 52 DF,  p-value: 0.04869

The coefficients of the regression model and their associated standard error appear in the second and third columns of the regression table, respectively. Thus,

β0 = 1.6724 ± 0.2472 is the intercept (± se) of the regression model,
β1 = -0.2361 ± 0.1170 is the slope (± se) of the regression model.

and finally: logMaxAbund = 1.6724 (± 0.2472) - 0.2361 (± 0.1170) x logMass

The t-value and their associated p-value (in the fourth and fifth columns of the regression table, respectively) test for a significant difference between the calculated coefficients and zero. In this case, we can see that logMass has a significant influence on logMaxAbund because the p-value associated with the slope of the regression model is inferior to 0.05. Moreover, these two variables are negatively related as the slope of the regression model is negative.
Users must, however, be aware that significant relationship between two variables does not always imply causality. Conversely, the absence of significant linear regression between y and x does not always imply an absence of relationship between these two variables; this is for example the case when a relationship is not linear.

The goodness of fit of the linear regression model is assessed from the adjusted-R2 (here, 0.05484). This value is a measure of the proportion of variation explained by the model.

Click below to see the math in more detail.

Click to display ⇲

Click to hide ⇱

overline{R}^2=1-(1-R^2){n-1}/{n-p-1}

where

p is the total number of regressors and n is the sample size,
R^2={SS_reg}/{SS_tot}
{SS_tot}=sum{i}{}{({y_i}-overline{y})}^2 is the total sums of squares,
{SS_reg}=sum{i}{}{(hat{y_i}-overline{y})}^2 is the regression sums of squares - also called the explained sums of squares.

The higher the adjusted-R2 is, the better the data fit the statistical model, knowing that this coefficient varies between 0 and 1. In this case, the relationship between logMaxAbund and logMass is quite weak.
The last line of the R output represents the F-statistic of the model and its associated p-value. If this p-value is inferior to 0.05, the model explains the data relationship better than a null model.

2.6 Plotting

Linear regression results are generally represented by a plot of the response variable as a function of the explanatory variable on which the regression line is added (and if needed the confidence intervals), using the R code:

| Plot Y ~ X with regression line and CI
plot(logMaxAbund ~ logMass, data=bird, pch=19, col="yellowgreen", 
                   ylab = expression("log"[10]*"(Maximum Abundance)"), xlab = expression("log"[10]*"(Mass)"))
abline(lm2, lwd=2)
 
# You may also flag the previously identified high-leveraged points
points(bird$logMass[32], bird$logMaxAbund[32], pch=19, col="violet")
points(bird$logMass[21], bird$logMaxAbund[21], pch=19, col="violet")
points(bird$logMass[50], bird$logMaxAbund[50], pch=19, col="violet")
 
# We can also plot the confidence intervals
confit<-predict(lm2,interval="confidence")
points(bird$logMass,confit[,2]) 
points(bird$logMass,confit[,3])

2.7 Subsetting

We may also run the analysis on a subset of observations, for example, on terrestrial birds only.

| Regression on Subset of Observations
# Recall that you can exclude objects using "!" 
# We can analyze a subset of this data using this subset command in lm() 
lm3 <- lm(logMaxAbund ~ logMass, data=bird, subset =! bird$Aquatic) # removing the Aquatic birds
 
# or equivalently
lm3 <- lm(logMaxAbund ~ logMass, data=bird, subset=bird$Aquatic == 0)
 
# Examine the model
opar <- par(mfrow=c(2,2))
plot(lm3, pch=19, col=rgb(33,33,33,100,maxColorValue=225))
summary(lm3)
par(opar)
 
# Compare the two datasets
opar <- par(mfrow=c(1,2))
plot(logMaxAbund ~ logMass, data=bird, main="All birds", ylab = expression("log"[10]*"(Maximum Abundance)"), 
     xlab = expression("log"[10]*"(Mass)"))
abline(lm2,lwd=2)
 
plot(logMaxAbund ~ logMass, data=bird, subset=!bird$Aquatic, main="Terrestrial birds",
     ylab = expression("log"[10]*"(Maximum Abundance)"), xlab = expression("log"[10]*"(Mass)"))
abline(lm3,lwd=2)
opar(par)

CHALLENGE 1

Examine the relationship between log10(MaxAbund) and log10(Mass) for passerine birds.
HINT: Passerine is coded 0 and 1 just like Aquatic. You can verify this by viewing the structure str(bird).

Challenge 1: Solution


Analysis of Variance (ANOVA) is a type of linear model for a continuous response variable and one or more categorical explanatory variables. The categorical explanatory variables can have any number of levels (groups). For example, the variable “colour” might have three levels: green, blue, and yellow. ANOVA tests whether the means of the response variable differ between the levels. For example, if blueberries differ in their mass depending on their colour.

ANOVA calculations are based on the sum of squares partitioning and compares the within-treatment variance to the between-treatment variance. If the between-treatment variance is greater than the within-treatment variance, this means that the treatments affect the explanatory variable more than the random error (corresponding to the within-treatment variance), and that the explanatory variable is likely to be significantly influenced by the treatments.

In the ANOVA, the comparison of the between-treatment variance to the within-treatment variance is made through the calculation of the F-statistic that correspond to the ratio of the mean sum of squares of the treatment (MSTrt) on the mean sum of squares of the error (MSE). These two last terms are obtained by dividing their two respective sums of squares by their corresponding degrees of freedom, as is typically presented in a ANOVA table (click to see below). Finally, the p-value of the ANOVA is calculated from the F-statistic that follows a Chi-square (χ2) distribution.

Click to see the math in more detail below.

Click to display ⇲

Click to hide ⇱

Source of
variation
Degrees of
freedom (df)
Sums of squares Mean squares F-statistic
Total ra-1 {SS_Tot}=sum{i,j}{}({y_ij}-overline{y})^2
Factor Aa-1 {SS_FactorA}= r sum{i}{}({overline{y}_i}-overline{y})^2{MS_FactorA}={SS_FactorA}/{a-1}F={MS_FactorA}/{MS_E}
Error a(r-1) {SS_E}= sum{i,j}{}({y_ij}-{overline{y}_i})^2 {MS_E}={SS_E}/{a(r-1)}

a: number of levels of the explanatory variable A; r: number of replicates per treatment; overline{y}: general mean of the explanatory variable; overline{y}i : mean of the explanatory variable for all the replicates of the treatment i.

3.1 Types of ANOVA

  1. One-way ANOVA
    One categorical explanatory variable with 2 or more levels. If there are 2 levels a t-test can be used alternatively.
  2. Two-way ANOVA (see section below)
    - 2 categorical explanatory variables or more,
    - Each categorical explanatory variable can have multiple levels,
    - The interactions between each categorical explanatory variable must be tested.
  3. Repeated measures
    ANOVA can be used for repeated measures, but we won't cover this today. Linear Mixed-effect Models can also be used for this kind of data (see Workshop 6).

3.2 T-test

When you have a single explanatory variable with only two levels, you can run a student's T-test to test for a difference in the mean of the two levels. If appropriate for your data, you can choose to test a unilateral hypothesis. This means that you can test the more specific assumption that one level has a higher mean than the other, rather than that they simply have different means.

Click to see the math in more detail below.

Click to display ⇲

Click to hide ⇱

For the t-test, the t statistic used to find the p-value calculation is calculated as:
t= (overline{y}_1-overline{y}_2)/sqrt{{s_1}^2/n_1 + {s_2}^2/n_2}

where

overline{y}1 and overline{y}2 are the means of the response variable y for group 1 and 2, respectively,
s12 and s22 are the variances of the response variable y for group 1 and 2, respectively,
n1 and n2 are the sample sizes of groups 1 and 2, respectively.

Note that the t-test is mathematically equivalent to a one-way ANOVA with 2 levels.

Assumptions

If the assumptions of the t-test are not met, the test can give misleading results. Here are some important things to note when testing the assumptions of a t-test.

  1. Normality of data
    As with simple linear regression, the residuals need to be normally distributed. If the data are not normally distributed, but have reasonably symmetrical distributions, a mean which is close to the centre of the distribution, and only one mode (highest point in the frequency histogram) then a t-test will still work as long as the sample is sufficiently large (rule of thumb ~30 observations). If the data is heavily skewed, then we may need a very large sample before a t-test works. In such cases, an alternate non-parametric test should be used.
  2. Homoscedasticity
    Another important assumption of the two-sample t-test is that the variance of your two samples are equal. This allows you to calculate a pooled variance, which in turn is used to calculate the standard error. If population variances are unequal, then the probability of a Type I error is greater than α.
    The robustness of the t-test increases with sample size and is higher when groups have equal sizes.
    We can test for difference in variances among two populations and ask what is the probability of taking two samples from two populations having identical variances and have the two sample variances be as different as are s12 and s22.
    To do so, we must do the variance ratio test (i.e. an F-test).

Violation of assumptions

If variances between groups are not equal, it is possible to use corrections, like the Welch correction. If assumptions cannot be respected, the non-parametric equivalent of t-test is the Mann-Whitney test. Finally, if the two groups are not independent (e.g. measurements on the same individual at 2 different years), you should use a Paired t-test.

Running a t-test

In R, t-tests are implemented using the function t.test. For example, to test for a mass difference between aquatic and non-aquatic birds, you should write:

| T-test
# T-test
boxplot(logMass ~ Aquatic, data=bird, ylab=expression("log"[10]*"(Bird Mass)"),
        names=c("Non-Aquatic","Aquatic"),
        col=c("yellowgreen","skyblue"))
 
# First, let's test the assumption of equal variance 
# Note: we do not need to test the assumption of normally distributed data since 
# we already log transformed the data above 
tapply(bird$logMass,bird$Aquatic,var)
var.test(logMass~Aquatic,data=bird)
 
# We are now ready to run the t-test        
ttest1 <- t.test(Mass~Aquatic, var.equal=TRUE, data=bird)
 
# or equivalently
ttest1 <- t.test(x=bird$logMass[bird$Aquatic==0], y=bird$logMass[bird$Aquatic==1], var.equal=TRUE)
ttest1

Two Sample t-test
 data:  logMass by Aquatic
 t = -7.7707, df = 52, p-value = 2.936e-10
 alternative hypothesis: true difference in means is not equal to 0
 95 percent confidence interval:
 -1.6669697 -0.9827343
 sample estimates:
 mean of x  mean of y 
 1.583437   2.908289 

Here, we show that the ratio of variances is not statistically different from 1, therefore variances are equal, and we proceeded with our t-test. Since p < 0.05, the hypothesis of no difference between the two bird types (Aquatic vs. terrestrial) was rejected.

Unilateral t-test

The alternative option of the t.test function allows for the use of unilateral t-test. For example, if users want to test if non-aquatic birds are less heavy than aquatic birds, the function can be written:

| Unilateral t-test
# Alternative T-test
uni.ttest1 <- t.test(logMass~Aquatic, var.equal=TRUE, data=bird, alternative="less")
uni.ttest1

In the R output, called by uni.ttest1, the results of the t-test appear in the third line:

	Two Sample t-test
  data:  logMass by Aquatic
  t = -7.7707, df = 52, p-value = 1.468e-10
  alternative hypothesis: true difference in means is less than 0
  95 percent confidence interval:
    -Inf -1.039331
  sample estimates:
  mean in group 0   mean in group 1 
     1.583437          2.908289 
     

In this case, the calculated t-statistic is t = -7.7707 with df = 52 degrees of freedom that gives a p-value of p-value = 1.468e-10. As the calculated p-value is inferior to 0.05, the null hypothesis is rejected. Thus, aquatic birds are significantly heavier than non-aquatic birds.

Running a t-test with lm()

A t-test is a linear model and a specific case of ANOVA with one factor with 2 levels. As such, we can also run the t-test with the lm() function in R:

| T-test as a linear model
ttest.lm1 <- lm(logMass ~ Aquatic, data=bird)
anova(ttest.lm1)
  Analysis of Variance Table
  Response: logMass
            Df  Sum Sq  Mean Sq   F value    Pr(>F)    
  Aquatic    1  19.015  19.0150   60.385     2.936e-10 ***
  Residuals 52  16.375  0.3149                      
  ---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  

When variances are equal (i.e., two-sample t-test), we can show that t2 = F:

|
ttest1$statistic^2
anova(ttest.lm1)$F

3.3 Running an ANOVA

The t-test is only for a single categorical explanatory variable with 2 levels. For all other linear models with categorical explanatory variables we use ANOVA. First, let's visualize the data using boxplot(). Recall that by default, R will order you groups in alphabetical order. We can reorder the groups according to the median of each Diet level.
Another way to graphically view the effect sizes is to use plot.design(). This function will illustrate the levels of a particular factor along a vertical line, and the overall value of the response is drawn as a horizontal line.

| ANOVA
# Default alphabetical order
boxplot(logMaxAbund ~ Diet, data=bird)
 
# Relevel factors
med <- sort(tapply(bird$logMaxAbund, bird$Diet, median))
boxplot(logMaxAbund ~ factor(Diet, levels=names(med)), data=bird, col=c("white","lightblue1",
           "skyblue1","skyblue3","skyblue4"))
 
plot.design(logMaxAbund ~ Diet, data=bird, ylab = expression("log"[10]*"(Maximum Abundance)"))

Let's now run the ANOVA. In R, ANOVA can be called either directly with the aov function, or with the anova function performed on a linear model previously implemented with lm:

ANOVA in R
# Using aov()
aov1 <- aov(logMaxAbund ~ Diet, data=bird)
summary(aov1) 
 
# Using lm()
anov1 <- lm(logMaxAbund ~ Diet, data=bird)
anova(anov1) 

3.4 Verifying assumptions

As with the simple linear regression and t-test, ANOVA must meet the four assumptions of linear models. Below are some tips in how to test these assumptions for an ANOVA.

  1. Normal distribution
    The residuals of ANOVA model can once again be visualised in the normal QQplot. If the residuals lie linearly on the 1:1 line of the QQplot, they can be considered as normally distributed. If not, the ANOVA results cannot be interpreted.
  2. Homoscedasticity
    To be valid, ANOVA must be performed on models with homogeneous variance of the residuals. This homoscedasticity can be verified using either the residuals vs fitted plot or the scale-location plot of the diagnostic plots. If these plots present equivalent spread of the residuals for each of the fitted values, then the residuals variance can be considered homogeneous.
    A second way to assess the homogeneity of residuals variance is to perform a Bartlett test on the anova model using the function bartlett.test. If the p-value of this test is superior to 0.05, the null hypothesis H0: s12 = s22 =… = sj2 =… = sn2 is accepted and the homoscedasticity assumption is respected.
    Usual transformations of explanatory variables can be used if the homogeneity of residuals variance is not met.
  3. Additivity
    In addition to the assumption testing, it is important to consider whether the effects of two factors are additive. The effects are additive if the effect of one factor remains constant over all levels of the other factor, and that each factor influences the response variable independently of the other factor(s).

If assumptions are violated your can try to transform your data, which could potentially equalize variances and normalize residuals, and can convert a multiplicative effect into an additive effect. Or, if you can't (or don't want to) transform your data, the non-parametric equivalent of ANOVA is Kruskal-Wallis test.

Model diagnostics
# Plot for diagnostics
opar <- par(mfrow=c(2,2))
plot(anov1)
par(opar)
 
# Test assumption of normality of residuals
shapiro.test(resid(anov1))
 
# Test assumption of homogeneity of variance
bartlett.test(logMaxAbund ~ Diet, data=bird)

Ideally the first diagnostic plot should show similar scatter for each Diet level. The Shapiro and Bartlett tests are both non-significant, therefore residuals are assumed to be normally distributed and variances are assumed to be equal.

3.5 Model output

Once your ANOVA model has been validated, its results can be interpreted. The R output of ANOVA model depends of the function that has been used to implement the ANOVA. If the aov function is used to implement the ANOVA model

aov1 <- aov(logMaxAbund ~ Diet, data=bird)

the results of the ANOVA can be visualized using the function

summary(aov1)

On the other hand, if lm() is used

anov1 <- lm(logMaxAbund ~ Diet, data=bird)

the ANOVA results must be called using the function

anova(anov1)

In both cases, the R output is as follows:

         	Df 	Sum Sq 		Mean Sq 	F value 	Pr(>F)  
Diet        	4  	5.106   	1.276   	2.836 		0.0341 *
Residuals   	49 	22.052   	0.450                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This R output corresponds exactly to the ANOVA table of your model. This output also present the degrees of freedom, the sum of squares, the mean sum of squares and the F-value previously explained. For this example, the diet significantly influences the abundance of birds as the p-value is inferior to 0.05. The null hypothesis can then be rejected meaning that at least one of the diet treatments influenced the abundance differently than the other treatments.

3.6 Complementary test

Importantly, ANOVA cannot identify which treatment is different from the others in terms of response variable. It can only identify that a difference is present. To determine the location of the difference(s), post-hoc tests that compare the levels of the explanatory variables (i.e. the treatments) two by two, must be performed. While several post-hoc tests exist (e.g. Fischer’s least significant difference, Duncan’s new multiple range test, Newman-Keuls method, Dunnett’s test, etc.), the Tukey’s range test is used in this example using the function TukeyHSD as follows:

Post-hoc Tukey Test
# Where does the Diet difference lie? 
TukeyHSD(aov(anov1),ordered=T)
 
# or equivalently 
TukeyHSD(aov1,ordered=T) 

The R output for this test gives a table containing all the two by two comparisons of the explanatory variable levels and identify which treatment differ from the others:

Tukey multiple comparisons of means 95% family-wise confidence level
	 factor levels have been ordered
Fit: aov(formula = anov1)
$Diet
                              diff             lwr      	upr     	p adj
Vertebrate-InsectVert  	0.3364295 	-1.11457613 	1.787435 	0.9645742
Insect-InsectVert      	0.6434334 	-0.76550517 	2.052372 	0.6965047
Plant-InsectVert       	0.8844338 	-1.01537856 	2.784246 	0.6812494
PlantInsect-InsectVert 	1.0657336 	-0.35030287 	2.481770 	0.2235587
Insect-Vertebrate      	0.3070039 	-0.38670951 	1.000717 	0.7204249
Plant-Vertebrate       	0.5480043 	-0.90300137 	1.999010 	0.8211024
PlantInsect-Vertebrate 	0.7293041  	0.02128588 	1.437322 	0.0405485
Plant-Insect           	0.2410004 	-1.16793813 	1.649939 	0.9884504
PlantInsect-Insect     	0.4223003 	-0.19493574 	1.039536 	0.3117612
PlantInsect-Plant      	0.1812999 	-1.23473664 	1.597336 	0.9961844

In this case, the only significant difference in abundance occurs between the PlantInsect diet and the Vertebrate diet.

3.7 Plotting

After having verified the assumptions of your ANOVA model, interpreted the ANOVA table and differentiated the effect of the treatments using post-hoc tests or contrasts, the ANOVA results can be graphically illustrated using a barplot. This shows the response variable as a function of the explanatory variable levels, where standard errors can be superimposed on each bar as well as the different letters representing the treatment group (according to the post-hoc test).

Barplot
# Graphical illustration of ANOVA model using barplot()
 
sd <- tapply(bird$logMaxAbund,list(bird$Diet),sd) 
means <- tapply(bird$logMaxAbund,list(bird$Diet),mean)
n <- length(bird$logMaxAbund)
se <- 1.96*sd/sqrt(n)
 
bp <- barplot(means, col=c("white","lightblue1","skyblue1","skyblue3","skyblue4"), 
       ylab = expression("log"[10]*"(Maximum Abundance)"), xlab="Diet", ylim=c(0,1.8))
 
# Add vertical se bars
segments(bp, means - se, bp, means + se, lwd=2)
# and horizontal lines
segments(bp - 0.1, means - se, bp + 0.1, means - se, lwd=2)
segments(bp - 0.1, means + se, bp + 0.1, means + se, lwd=2)

3.8 Contrasts (advanced section/ optional)

Click to display ⇲

Click to hide ⇱

  • Contrasts are group mean comparisons based on an a priori hypothesis,
  • These groups can be compounded of one or many levels of a factor ,
  • We can test basic hypothesis (ex: μ1 = μ2) or more complex hypothesis (ex: (μ1 + μ2)/3 == μ3).

The number of comparisons has to be lower or equal to the number of degrees of freedom of the ANOVA. Comparisons have to be independent from one another.

Contrasts are given as an R output for ANOVA can be called to visualize the parameter estimates for each level of the explanatory variable in comparison to a baseline level. This output is called using the function summary.lm(aov1) when the ANOVA model was implemented with the aov function, and using the summary(anov1) when the ANOVA was implemented with the lm function. This output performs a linear regression for each level of the explanatory variable and calculates their associated parameters:

Call: 	lm(formula = logMaxAbund ~ Diet, data = bird)
Residuals:
   	Min       	1Q   		Median       	3Q     	 	Max 
     -1.85286 	-0.32972 	-0.08808  	0.47375  	1.56075 
Coefficients:
                  Estimate 	Std. Error  t value 	Pr(>|t|)    
(Intercept)       1.1539     	0.1500      7.692 	5.66e-10 ***
DietInsectVert   -0.6434      0.4975     -1.293  	0.2020    
DietPlant         0.2410      0.4975      0.484  	0.6303    
DietPlantInsect   0.4223      0.2180      1.938   	0.0585 .  
DietVertebrate   -0.3070      0.2450     -1.253   	0.2161    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6709 on 49 degrees of freedom
Multiple R-squared:  0.188,     Adjusted R-squared:  0.1217 
F-statistic: 2.836 on 4 and 49 DF,  p-value: 0.0341

Users can note that the last line of this R output corresponds exactly to the previous R output shown in the ANOVA table. The F-statistic of the ANOVA model and its associated p-value (i.e. 2.836 and 0.0341), are the same one as the values obtained from the ANOVA table, indicating diet explains the abundance better than a null model, and so, diet significantly influence the abundance. The goodness-of-fit of the ANOVA model (i.e. adjusted R-square value) appears in the second to last line of this output. In this case, diet explains 12.17% of the abundance variability.
Contrasts adjust a linear regression of the response variable as function of each level of the categorical explanatory variable separately. In this case, 5 linear regressions (corresponding to the five lines of the coefficients table of the R output) are calculated by the lm function as the diet variable contains 5 levels. By default, the baseline level corresponding to the intercept is the first level of explanatory variables ranked in alphabetical order. So in this case, the “Insect” diet is automatically used as a baseline in R. In the R output, the coefficient estimate of the baseline level (here, 1.1539) is first compared to 0 using a t-test (in this case, the t-test is significant with a p-value of 5.66e-10), while the coefficient estimates of the other explanatory variable level are compared to the baseline level. In this case, only the PlantInsect diet differs from the Insect diet, with an associated p-value of 0.0585.
In other words, this R output allows us to determine the mean of the response variable for each of the diet levels, for example:

LogMaxAbund = 1.1539 for the Insect diet,
LogMaxAbund = 1.1539 – 0.6434 for the InsectVert diet,
LogMaxAbund = 1.1539 + 0.2410 for the Plant diet,
etc.

As this type of contrasts compares each level of the explanatory variable to a baseline level, they are called contr.treatment and constitute the default method of the lm function in R. The baseline level can, however, be changed using the relevel function. For example,

Relevel Factors
bird$Diet2 <- relevel(bird$Diet, ref="Plant")
anov_rl <- lm(logMaxAbund ~ Diet2, data=bird)
summary(anov_rl)
anova(anov_rl) 

compares each diet treatment to the Plant diet, now defined as the baseline level.

The contrasts coefficient matrix of these contr.treatment comparisons can be called by

contrasts(bird$Diet2)
         		Insect 		InsectVert 	PlantInsect 	Vertebrate
Plant           	0     	 	0     		0           		0
Insect           	1     		0     		0           		0
InsectVert              0      		1     		0           		0
PlantInsect             0      		0     		1           		0
Vertebrate          	0      		0    		0          		1

where each column corresponds to a comparison to the baseline Plant and each line to a diet level. For example, the first comparison compares the Insect diet to the Plant diet, the second one compares the InsectVert diet to the Plant diet, etc.

Users can create their own contrasts coefficient matrix in order to perform the comparison they desire using the contrasts function. For example,

contrasts(bird$Diet2) <- cbind(c(4,-1,-1,-1,-1), c(0,1,1,-1,-1), c(0,0,0,1,-1), c(0,1,-1,0,0))

creates this contrasts coefficient matrix:

              [,1] 	[,2] 	[,3] 	[,4]
Plant          4    	 0    	 0   	 0
Insect        -1    	 1    	 0    	 1
InsectVert    -1    	 1    	 0   	-1
PlantInsect   -1   	-1    	 1    	 0
Vertebrate    -1   	-1   	-1    	 0

that compares:

  • the Plant diet to all the other diets in the first comparison,
  • the InsectVert and the Insect diets to the PlantInsect and the Vertebrate diets in the second one,
  • the PlantInsect diet to the Vertebrate diet in the third one,
  • and the Insect diet to the InsectVert diet in the fourth one.

Thus, for each column, the treatments with identical contrasts coefficient belong to the same two by two comparison group (e.g. in the column 1, the four treatment with a -1 coefficient belong to the first comparison group and are altogether compare to the second group corresponding to the treatment with a different coefficient, here the Plant diet with a coefficient 4). Thus, all comparison possible can be performed with personalized contrasts coefficient matrix. Only two conditions restrict the possible contrasts coefficient matrix:

  1. for each column, the sum of the coefficients must be equal to 0, and
  2. the sum of the product of each pair of column must be equal to 0.

This can be verified using the following R commands

sum(contrasts(bird$Diet)[,1]) # first condition for column 1 
sum(contrasts(bird$Diet)[,1]*contrasts(bird$Diet)[,2]) # second condition for column 1 and 2

Conventional contrast coefficient matrices are already programmed in R, such as the contr.helmert or the contr.poly matric (cf. help(contrasts) ).

In the above section, the ANOVA models had a single categorical variable. We can create ANOVA models with multiple categorical explanatory variables. When there are two categorical explanatory variables, we refer to the model as a two-way ANOVA. A two-way ANOVA tests several hypotheses: that there is no difference in mean among levels of variable A; that there is no difference in mean among levels of variable B; and that there is no interaction between variables A and B. A significant interaction means the mean value of the response variable for each level of variable A changes depending on the level of B. For example, perhaps relationship between the colour of a fruit and its mass will depend on the plant species: if so, we say there is an interaction between colour and species.

Click to see the math in more detail below.

Click to display ⇲

Click to hide ⇱

The one-way ANOVA table has to be rewritten to add the second explanatory term as well as the interaction term. Thus, a two-way ANOVA table corresponds to:

Source of
variation
Degrees of
freedom (df)
Sums of squares Mean squares F-statistic
Total abr-1 {SS_Tot}=sum{i,j,k}{}({y_ijk}-overline{y})^2
Cells ab-1 {SS_Cells}= sum{i,j}{}({overline{y}_ij}-overline{y})^2
Within-
cells (error)
ab(r-1) {SS_E}= sum{i,j,k}{}({y}_ijk-{overline{y}_ij})^2 {MS_E}={SS_E}/{ab(r-1)}
Factor A a-1 {SS_FactorA}= rb sum{i}{}({overline{y}_i.}-overline{y})^2{MS_FactorA}={SS_FactorA}/{a-1}{F_FactorA}={MS_FactorA}/{MS_E}
Factor B b-1 {SS_FactorB}= ra sum{j}{}({overline{y}_.j}-overline{y})^2{MS_FactorB}={SS_FactorB}/{b-1}{F_FactorB}={MS_FactorB}/{MS_E}
Interaction
AB
(a-1)(b-1) {SS_AB}= r sum{i,j,k}{}({overline{y}_{..k}}-{overline{y}_{.jk}}-{overline{y}_{i.k}})^2 {MS_AB}={SS_AB}/{(a-1)(b-1)}{F_AB}={MS_AB}/{MS_E}

a: number of levels of the explanatory variable A; b: number of levels of the explanatory variable B; r: number of replicates per treatment

4.1 Running a two-way ANOVA

In R, a two-way ANOVA model is implemented in the same fashion as a one-way ANOVA using the function lm.


CHALLENGE 2

Examine the effects of the factors Diet, Aquatic, and their interaction on the maximum bird abundance.

Recall: Before interpreting the ANOVA results, the model must first be validated by verifying the statistical assumptions of ANOVA, namely the:

  1. Normal distribution of the model residuals
  2. Homoscedasticty of the residuals variance

This verification can be done using the four diagnostic plots as previously explained for one-way ANOVA.

Challenge 2: Solution


4.2 Interaction plot

Interactions can also be viewed graphically using the function interaction.plot as:

Interaction Plot
interaction.plot(bird$Diet, bird$Aquatic, bird$logMaxAbund, col="black", 
                 ylab = expression("log"[10]*"(Maximum Abundance)"), xlab="Diet")

What do the gaps in the line for the Aquatic group mean?

Unbalanced design
table(bird$Diet, bird$Aquatic)
             0  1
Insect      14  6
InsectVert   1  1
Plant        2  0
PlantInsect 17  1
Vertebrate   5  7

The design is unbalanced; unequal observations among diet levels for Aquatic (coded as 1) and Terrestrial (coded as 0). See advanced section below for details on unbalanced ANOVA designs.


CHALLENGE 3

Test the significance of the Aquatic factor by comparing nested models with and without this categorical variable.

Challenge 3: Solution


Click to display ⇲

Click to hide ⇱

One-way and two-way ANOVA enabled us to determine the effect of categorical explanatory variables on a continuous response variable for the case of balanced experimental designs (i.e. when all levels of the explanatory variables contain the same number of replicates). However, loss of experimental units over the course of an experiment, or technical restriction of experimental designs can result in unbalanced designs. In this case, the above-mentioned ANOVA tests lead to misleading results related due to incorrect sum of squares calculations. For unbalanced experimental design, ANOVA must be modified to correctly account for the missing values of the response variable.

While mathematical model, statistical hypothesis and statistical assumptions for ANOVA with unbalanced designs remain the same as for ANOVA with balanced designs the sum of squares calculation changes.

For unbalanced design, ANOVA thus test the hypothesis:

H0: µ1 = µ2 =… = µi =… = µn
H1: there is at least one µi that differs from the others.

Using this mathematical model:

{y_{ijk}} = µ + {A_i} + {B_j} + {A_i}{B_j} + {ε_{ijk}}

Recall the sum of squares calculation of the ANOVA with balanced design:

{SS_FactorA} = rb sum{i}{}{({overline{y}_{i.}}-overline{y})}^2 = SS(A)
{SS_FactorB} = ra sum{j}{}{({overline{y}_{.j}}-overline{y})}^2 = SS(B|A) = SS(A,B)-SS(B)
{SS_AB} = r sum{i,j,k}{}({overline{y}_{..k}}-{overline{y}_{.jk}}-{overline{y}_{i.k}})^2 = SS(A,B,AB)-SS(A,B)

This corresponds to sequential sum of squares, or type I sum of squares, as the main effect of B is calculated after removing the main effect of A, and the interaction effect is calculated after removing the two main effects. These calculations are sample size dependent as the effect of each factor is calculated after removing the effect of the precedent factor.

For unbalanced design, ANOVA results will depend on the order in which each explanatory variable appears in the model. This can be seen by comparing the results of the following two models:

Unbalanced ANOVA
unb_anov1 <- lm(logMaxAbund ~ Aquatic + Diet, data=bird)
unb_anov2 <- lm(logMaxAbund ~ Diet + Aquatic, data=bird)
anova(unb_anov1)
anova(unb_anov2)

While the same explanatory variables are used in these two models, the ANOVA tables show different results due to the unbalanced design (i.e. different number of observations for aquatic and non-aquatic birds).

For unbalanced designs, marginal sum of squares, or type III sum of squares that perform calculations of main effect after removing the effect of all other factors ensure independence from sample size effects:

{SS_FactorA}={SS(A|B,AB)}=SS(A,B,AB)-SS(B,AB)
{SS_FactorB}={SS(B|A,AB)}=SS(A,B,AB)-SS(A,AB)
{SS_AB}={SS(AB|B,A)}=SS(A,B,AB)-SS(B,AB)

In R, ANOVA with type III sum of squares can be implemented using the Anova function of package ‘car’ and specifying “III” in the type option, for example:

Type III SS
Anova(unb_anov1,type="III")

By comparing ANOVA tables of models with different order of explanatory variables, users can now note that the ANOVA table results always remain the same. Type III sum of squares thus correctly calculates the ANOVA results by becoming independent of sample sizes.

After having verifying the model assumptions, results can finally be safely interpreted.

Analysis of covariance (ANCOVA) is a linear model that tests the influence of one categorical explanatory variable (or more) and one continuous explanatory variable (or more) on a continuous response variable. Each level of the categorical variable is described by its own slope and intercept. In addition to testing if the response variable differs for at least one level of the categorical variable, ANCOVA also tests whether the response variable might be influenced by its relationship with the continuous variable (called the covariate in ANCOVA), and by any differences between group levels in the way that the continuous variable influences the response (i.e. the interaction). The ANCOVA hypotheses are thus: that there is no difference in the mean among levels of the categorical variable; there is no correlation between the response variable and the continuous explanatory variable; there is no interaction between the categorical and continuous explanatory variables.

6.1 Assumptions

As with models seen above, to be valid ANCOVA models must meet the statistical assumptions of linear models that can be verified using diagnostic plots. In addition, ANCOVA models must have:

  1. The same value range for all covariates
  2. Variables that are fixed
  3. No interaction between categorical and continuous variables

Note: A fixed variable is one that you are specifically interested in (i.e. bird mass). In contrast, a random variable is noise that you want to control for (i.e. site a bird was sampled in). If you have random variables, see the workshop on Linear Mixed-effect Models!

6.2 Types of ANCOVA

You can have any number of factors and/or covariates, but as their number increases, the interpretation of results gets more complex.

The most frequently used ANCOVAs are those with:

  1. one covariate and one factor
  2. one covariate and two factors
  3. two covariates and one factor

The different possible goals of the ANCOVA are to determine the effects of:

  1. the categorical and continuous variables on the response variable
  2. the categorical variable(s) on the response variable(s) after removing the effect of the continuous variable
  3. the categorical variable(s) on the relationship between the continuous variables(s) and the response variable

Importantly, these goals are only met if there is no significant interaction between the categorical and continuous variables! Examples of significant interactions between the categorical and continuous variables (for an ANCOVA with one factor and one covariate) are illustrated by the second and third panels below:

The same logic follows for ANCOVAs with multiple categorical and/or continuous variables.

6.3 Running an ANCOVA

Running an ANCOVA in R is comparable to running a two-way ANOVA, using the function lm. However, instead of using two categorical variables (Diet and Aquatic), we now use one categorical and one continuous variable.

For example, using a build in dataset called CO2, where the response variable is uptake, the continuous variable is conc and the factor is Treatment, the ANCOVA is:

ANCOVA example
ancova.example <- lm(uptake ~ conc*Treatment, data=CO2)
anova(ancova.example)

If only your categorical variable is significant, drop your continuous variable from the model: you will then have an ANOVA.
If only your continuous variable is significant, drop your categorical variable from the model, you will then have a simple linear regression.
If your interaction is significant, you might want to test which levels of your categorical variables ha(s)ve different slopes and to question whether ANCOVA is the most appropriate model.

In the CO2 example above, both the continuous and categorical variable are significant, but the interaction is non-significant. If your replace Treatment with Type, however, you will see an example of a significant interaction.

If you want to compare means across factors, you can use adjusted means, which uses the equations given by the ANCOVA to estimate the means of each level of the categorical variable, corrected for the effect of the categorical variable:

Adjusted means
install.packages("effects")
library(effects)
adj.means <- effect('Treatment', ancova.example)
plot(adj.means)
 
adj.means <- effect('conc*Treatment', ancova.example)
plot(adj.means)

CHALLENGE 4

Run an ANCOVA to test the effect of Diet, Mass, and their interaction on MaxAbund.

Challenge 4: Solution


Multiple regression tests the effects of several continuous explanatory variables on a response variable.

7.1 Assumptions

In addition to the usual assumptions of linear models, it is important to test for orthogonality because it will affect model interpretation. Variables are not orthogonal when explanatory variables are collinear. If one explanatory variable is correlated to another, they are likely to explain the same variability of the response variable, and the effect of one variable will be masked by the other.

If you see any pattern between two explanatory variables, they are collinear. Collinearity must be avoided as the effect of each explanatory variable will be confounded! Possible solutions are:

  1. Keep only one of the collinear variables,
  2. Try multidimensional analysis (see workshop 9),
  3. Try a pseudo-orthogonal analysis.

Collinearity between explanatory variables can be assessed based on the variance inflation factor using the vif function of package ‘HH’:

Variance Inflation Factor
vif(clDD ~ clFD + clTmi + clTma + clP + grass, data=Dickcissel)

which gives the following output:

  clFD     	clTmi     	clTma       	clP     	grass 
 13.605855  	9.566169  	4.811837  	3.196599  	1.165775

As variance inflation factor higher than 5 represents collinear variables, the R output shows that clDD, clFD and clTmi are highly collinear. Only one of these explanatory variables can thus be retained in the final regression model.

7.2 Dickcissel dataset

The Dickcissel dataset explores environmental variables that drive the abundance and presence/ absence of a grassland bird with peak abundances in Kansas, USA. It contains 15 variables:

Variable Name Description Type
abund The number of individuals
observed at each route
Continuous/ numeric
Present Presence/ absence of the
species
Boolean (“Present”/ “Absent”)
broadleaf, conif, crop, grass,
shrub, urban, wetland
Land use variables within 20 km radius
of the center route
Continuous/ numeric
NDVI Vegetation index (a measure of productivity) Interger
clDD, clFD, clTma, clTmi, clP Climate date (DD = degree days,
FD = frost days, Tma = max temperature,
Tmi = min temperature,
P = precipitation)
Continuous/ numeric

In R, multiple regression are implemented using the lm function and its results are viewed using the summary function. Using, for example, the Dickcissel data, we can test the effects of climate, productivity and land cover on the abundance of the Dickcissel species abundance by applying the model:


CHALLENGE 5

Is a transformation needed for the response variable abund?

Challenge 5: Solution


As you likely noticed in Challenge 5, the abund variable could not be normalized, suggesting that we might need to relax the assumptions of a normally distributed response variable and move on to Generalized Linear Models, but that will wait until later!

For now, let's simply use the untransformed abund and compare the relative importance of the three variables (climate, productivity, and land cover) on abund

Multiple Regression
lm.mult <- lm(abund ~ clTma + NDVI + grass, data=Dickcissel)
summary(lm.mult)

The R output enables one to visualize the significant explanatory variables:

lm(formula = abund ~ clTma + NDVI + grass, data = Dickcissel)
Residuals:
 Min      	1Q  		Median      	3Q    		 Max 
-35.327 	-11.029  	-4.337   	2.150 		180.725 
Coefficients:
           	Estimate 	Std. Error 	t value 	Pr(>|t|)    
(Intercept) 	-83.60813   	11.57745 	-7.222 		1.46e-12 ***
clTma         3.27299    	0.40677   	8.046 		4.14e-15 ***
NDVI          0.13716    	0.05486   	2.500   	0.0127 *  
grass        	10.41435    	4.68962   	2.221   	0.0267 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.58 on 642 degrees of freedom
Multiple R-squared:  0.117,     Adjusted R-squared:  0.1128 
F-statistic: 28.35 on 3 and 642 DF,  p-value: < 2.2e-16

In this case, the three explanatory variables significantly influence the abundance of the Dickcissel species, the most significant one being the climate (p-value=4.14e-15). Altogether these variables explain 11.28% of the Dickcissel abundance variability (Adjusted R-squared= 0.1128). The overall model is also significant and explains the Dickcissel abundance variability better than a null model (p-value: < 2.2e-16).

A plot of the response variable as a function of each explanatory variable can be used to represent graphically the model results:

plot(abund ~ clTma, data=Dickcissel, pch=19, col="orange")
plot(abund ~ NDVI, data=Dickcissel, pch=19, col="skyblue")
plot(abund ~ grass, data=Dickcissel, pch=19, col="green")

7.3 Polynomial regression (advanced section/ optional)

Click to display ⇲

Click to hide ⇱

Response variables are not always linearly related to explanatory variables. In this case, linear regression that fits a straight line through the two variables is unable to correctly represent the data. Instead, polynomial regression that fits a polynomial curves between the response variable and the explanatory variables can be used to represent non-linear relationship based on the mathematical model:

{y_i} = {β_0} + {β_1}{x_i} + {β_2}{{x_i}^2} + {β_3}{{x_i}^3} + {ε_i} for a polynomial of Degree 3
{y_i} = {β_0} + {β_1}{x_i} + {β_2}{{x_i}^2} + {ε_i} for a polynomial of Degree 2

where

β0 is the intercept of the regression line,
β1 is effect of the variable x,
β2 is effect of the variable x2,
β3 is effect of the variable x3,
εi are the residuals of the model (i.e. the unexplained variation).

The Degree is the largest exponent of that variable. When you know a Degree, you can also give the polynomial a name:

In R, these models are implemented with the lm function and can be compared to a linear regression with the anova function:

Polynomial Regression
lm.linear <- lm(abund ~ clDD, data=Dickcissel)
lm.quad <- lm(abund ~ clDD + I(clDD^2), data=Dickcissel)
lm.cubic <- lm(abund ~ clDD + I(clDD^2) + I(clDD^3), data=Dickcissel)

CHALLENGE 7

Compare the different polynomial models in the previous example, and determine which model is the most appropriate. Extract the adjusted R squared, the regression coefficients, and the p-values of this chosen model.

Challenge 7: Solution


The model comparison shows that the quadratic regression (i.e. the polynomial of degree 2) is the best model. The cubic term can thus be dropped from the final model:

Analysis of Variance Table
Model 1: abund ~ clDD
Model 2: abund ~ clDD + I(clDD^2)
Model 3: abund ~ clDD + I(clDD^2) + I(clDD^3)
Model     Res.Df    	RSS 	        Df 	Sum of Sq      	 F   		Pr(>F)    
1    	    644 	365039                                  
2    	    643 	355871  	1    	9168.3 		16.5457         5.34e-05 ***
3    	    642 	355743  	1     	127.7 		0.2304   	0.6314

The R output for the final model is:

Call: lm(formula = abund ~ clDD + I(clDD^2), data = Dickcissel)
Residuals:
    Min      		1Q  		Median      	3Q     		Max 
   -14.057 		-12.253  	-8.674   	1.495 		190.129 
Coefficients:
            	Estimate 	Std. Error 	t value 	Pr(>|t|)    
(Intercept) 	-1.968e+01  	5.954e+00  	-3.306    	0.001 ** 
clDD         	1.297e-02  	2.788e-03   	4.651 		4.00e-06 ***
I(clDD^2)   	-1.246e-06  	3.061e-07 	-4.070 		5.28e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 23.53 on 643 degrees of freedom
Multiple R-squared:  0.04018,   Adjusted R-squared:  0.0372 
F-statistic: 13.46 on 2 and 643 DF,  p-value: 1.876e-06

In this example, the linear term influenced the response variable more than the quadratic term, as their p-values are respectively 4.00e-06 and 5.28e-05. They explained together 3.72% of the abundance variability (Adjusted R-squared).

7.4 Stepwise regression

Click to display ⇲

Click to hide ⇱

To obtain a final multiple regression model, users can first implement a full regression model containing all the explanatory variables and then drop the non-significant variable using a stepwise selection procedure. In this method, non-significant variables are successively dropped one by one from the model and the goodness-of-fit of each successive model are compared based on AIC (Akaike’s Information Criterion), until all the explanatory variables of the model are significant. Note that a lower AIC value indicates a better goodness of fit (i.e., the best model is the one with the lowest AIC). In R, stepwise selection is implemented using the function step:

Stepwise Regression
lm.full <- lm(abund ~ . - Present, data=Dickcissel) 
lm.step <- step(lm.full)
summary(lm.step)

According to the stepwise selection, only 6 significant explanatory variables among the 13 tested are retained in the final model:

 Call: 	lm(formula = abund ~ clDD + clFD + clTmi + clTma + clP + grass,  data = Dickcissel)
 Residuals:
  Min      	1Q  		Median      	3Q     		Max 
  -30.913  	-9.640  	-3.070   	4.217 		172.133 
Coefficients:
            	Estimate 	Std. Error 	t value 	Pr(>|t|)    
(Intercept) 	-4.457e+02  	3.464e+01 	-12.868  	< 2e-16 ***
clDD         	5.534e-02  	8.795e-03   	6.292 		5.83e-10 ***
clFD         	1.090e+00  	1.690e-01   	6.452 		2.19e-10 ***
clTmi       	-6.717e+00  	7.192e-01  	-9.339  	< 2e-16 ***
clTma      	3.172e+00  	1.288e+00   	2.463 		0.014030 *  
clP        	1.562e-01  	4.707e-02   	3.318 		0.000959 ***
grass       	1.066e+01  	4.280e+00   	2.490 		0.013027 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19.85 on 639 degrees of freedom
Multiple R-squared:  0.3207,    Adjusted R-squared:  0.3144 
F-statistic: 50.29 on 6 and 639 DF,  p-value: < 2.2e-16

The model now accounts for 31.44% of the Dickcissel abundance variability, the clTmi being the most significant explanatory variable.
Nevertheless, some of the selected explanatory variables are highly correlated and should be dropped from the final model in order to remove uninformative variables.

Click to display ⇲

Click to hide ⇱

To assess the relative contribution of two (or more) explanatory datasets to a response variable, the varpart function of vegan package can be used. This function partitions the explained variance of a response variable to compare the contribution of various sets of explanatory variables. For example, the contribution of land cover variables on the first hand, and of climate variables on the other hand on Dickcissel abundance can be implemented using:

Variance Partitioning
part.lm = varpart(Dickcissel$abund, Dickcissel[,c("clDD", "clFD", "clTmi", "clTma", "clP")], 
                  Dickcissel[,c("broadleaf", "conif", "grass", "crop", "urban", "wetland")])	
part.lm

The R output of this function enables one to visualize the variance partitioning results:

Partition of variation in RDA
Call: varpart(Y = Dickcissel$abund, X = Dickcissel[, c("clDD", "clFD", "clTmi", "clTma", "clP")],
Dickcissel[, c("broadleaf", "conif", "grass", "crop", "urban", "wetland")])
Explanatory tables:
X1:  Dickcissel[, c("clDD", "clFD", "clTmi", "clTma", "clP")]
X2:  Dickcissel[, c("broadleaf", "conif", "grass", "crop", "urban", "wetland")] 
No. of explanatory tables: 2 
Total variation (SS): 370770 
Variance: 574.84 
No. of observations: 646 
Partition table:
                     Df       R.squared       Adj.R.squared        Testable
[a+b] = X1           5        0.31414         0.30878              TRUE
[b+c] = X2           6        0.03654         0.02749              TRUE
[a+b+c] = X1+X2      11       0.32378         0.31205              TRUE
Individual fractions                                    
[a] = X1|X2          5                        0.28456              TRUE
[b]                  0                        0.02423              FALSE
[c] = X2|X1          6                        0.00327              TRUE
[d] = Residuals                               0.68795              FALSE
---

Use function rda to test significance of fractions of interest

This R output shows that the two explanatory datasets explain 31.205% ([a+b+c] = X1+X2) of the Dickcissel abundance variability, while the climate dataset contributes to 28.46% of the Dickcissel abundance variability ([a] = X1|X2) and the land cover dataset only contributes to 0.33% of the Dickcissel abundance variability ([c] = X2|X1). The interaction of these two datasets also explained 2.42% ([b]) of the variability. The significance of each fraction can be tested using partial RDA and permutational ANOVA using the functions rda and anova:

Partial RDA and Permutational ANOVA
# Climate set
out.1 = rda(Dickcissel$abund, Dickcissel[,c("clDD", "clFD", "clTmi", "clTma", "clP")], 
          	Dickcissel[,c("broadleaf", "conif", "grass", "crop", "urban", "wetland")])
anova(out.1, step=1000, perm.max=1000)
 
# Land cover set
out.2 = rda(Dickcissel$abund, Dickcissel[,c("broadleaf", "conif", "grass", "crop", "urban", "wetland")], 
        Dickcissel[,c("clDD", "clFD", "clTmi", "clTma", "clP")])
anova(out.2, step=1000, perm.max=1000)

with the following R ouputs:

Climate set
Permutation test for rda under reduced model
Model: rda(X = Dickcissel$abund, Y = Dickcissel[, c("clDD", "clFD", "clTmi", "clTma", "clP")], 
       Z = Dickcissel[, c("broadleaf", "conif", "grass", "crop", "urban", "wetland")])
       	Df    	Var      F 	   N.Perm 	Pr(>F)    
Model      	5 	165.12 	 53.862    999 		0.001 ***
Residual 	634 	388.72

Land cover set
Permutation test for rda under reduced model
Model: rda(X = Dickcissel$abund, Y=Dickcissel[, c("broadleaf", "conif", "grass", "crop", "urban", "wetland")],
        Z = Dickcissel[, c("clDD", "clFD", "clTmi", "clTma", "clP")])
      	Df    	Var      F 	   N.Perm	Pr(>F)
Model     	6   	5.54 	 1.5063    999  	0.152
Residual 	634 	388.72    

In this case, the fraction of variance explained by the climate set is significant (p-value=0.001) while the fraction explained by the land cover set is not (p-value=0.152).

The results of variance partitioning are generally graphically represented using Venn diagrams on which each explanatory datasets is represented by a circle inside which their corresponding fraction of explained variance is annotated:

Graphical Representation of Varpart
showvarparts(2)
plot(part.lm,digits=2, bg=rgb(48,225,210,80,maxColorValue=225), col="turquoise4") 

Amazing! You are now ready to perform your own regression, ANOVA and ANCOVA! But never forget to correctly specify your model and verify its statistical assumptions before interpreting its results according to the ecological background of your data.

Some exciting books about linear regression and ANOVA:

  • Myers RH - Classical and Modern Regression with Application
  • Gotelli NJ - A Primer of Ecological Statistics