__MAJOR UPDATE__
As of Fall 2021, this wiki has been discontinued and is no longer being actively developed.
All updated materials and announcements for the QCBS R Workshop Series are now housed on the [[https://r.qcbs.ca/workshops/r-workshop-08/|QCBS R Workshop website]]. Please update your bookmarks accordingly to avoid outdated material and/or broken links.
Thank you for your understanding,
Your QCBS R Workshop Coordinators.
====== QCBS R Workshops ======
[[http://qcbs.ca/|{{:logo_text.png?nolink&500|}}]]
This series of [[r|10 workshops]] walks participants through the steps required to use R for a wide array of statistical analyses relevant to research in biology and ecology.
These open-access workshops were created by members of the Quebec Centre for Biodiversity Science both for members of the QCBS and the larger community.
//The content of this workshop has been peer-reviewed by several QCBS members. If you would like to suggest modifications, please contact the current series coordinators, listed on the main wiki page.//
IMPORTANT NOTICE: MAJOR UPDATES
**March 2021 update:** This wiki has been discontinued and is no longer being actively developed or updated. Updated materials for the QCBS R Workshop Series are now housed on the QCBS R Workshop [[https://github.com/QCBSRworkshops/workshop08|GitHub page]].
Available materials include:
- The [[https://qcbsrworkshops.github.io/workshop08/pres-en/workshop08-pres-en.html|Rmarkdown presentation]] for this workshop;
- An [[https://qcbsrworkshops.github.io/workshop08/book-en/workshop08-script-en.R|R script]] that follows the presentation - //*under construction*//;
- [[https://qcbsrworkshops.github.io/workshop08/book-en/index.html|Written materials]] that accompany the presentation in bookdown format - //*under construction*//.
====== Workshop 8: Generalized additive models ======
Developed by: Eric Pedersen and Zofia Taranu
**Summary:** The goal of today's workshop is to first examine what we mean by a non-linear model, and how GAMs (Generalized Additive Models) allow us to handle non-linear relationships. We will also go over how to plot and interpret these non-linear relationships, how to include interaction terms and autocorrelated errors, and how to expand on previous workshops by briefly examining a mixed modelling framework. Lastly, we will briefly touch upon what GAMs are doing behind the scenes.
**Link to new [[https://qcbsrworkshops.github.io/workshop08/workshop08-en/workshop08-en.html|Rmarkdown presentation]]**
Link to old [[https://prezi.com/ktlh8t8-03yt/|Prezi presentation]]
Download the R script and data for this lesson:
- [[http://qcbs.ca/wiki/_media/gam_e.r|Script]]
- [[http://qcbs.ca/wiki/_media/other_dist.csv|Data]]
===== Learning objectives =====
- Use the //mgcv// package to fit non-linear relationships,
- Understand the output of a GAM to help you understand your data,
- Use tests to determine if a non-linear model fits better than a linear one,
- Include smooth interactions between variables,
- Understand the idea of a basis function, and why it makes GAMs so powerful,
- Account for dependence in data (autocorrelation, hierarchical structure) using GAMMs.
**Prerequisites:** Some experience in R (enough to be able to run a script and examine data and R objects) and a basic knowledge of regression (you should know what we mean by linear regression and ANOVA).
===== 0. The linear model... and where it fails =====
What do we mean by the linear model? Regression is the workhorse of statistics. It allows us to model a response variable as a function of predictors plus error. Linear regression is what most people first encounter in statistics. As we saw in the [[http://qcbs.ca/wiki/r_workshop4|linear models]] workshop, regression makes four major assumptions:
- normally distributed error
- the variance of the errors doesn't change (homoscedastic)
- independent errors (i.i.d.: each error is independent of the others)
- a linear response: y = β_0 + β_1 x
There's only one way for the linear model to be right:
{{::graphic0.1.png?350|}}
And yet so many ways for it to fail:
{{::graphic0.2.png?525|}}
So how can we fix it? We must first know what the regression model is trying to do:
* fit a line that goes through the middle of the data,
* do this without overfitting the data (which is what would happen if we simply drew a line between each point and its neighbours).
Linear models do this by finding the best fit straight line that passes through the data. In contrast, additive models do this by fitting a curve through the data, but controlling how wiggly the line can get //(more on this later)//.
===== 1. Introduction to GAMs =====
Let's look at an example. First, we'll generate some data, and plot it.
library(ggplot2)
set.seed(10)
n = 250
x = runif(n,0,5)
y_model = 3*x/(1+2*x)
y_obs = rnorm(n,y_model,0.1)
data_plot = qplot(x, y_obs) +
geom_line(aes(y=y_model)) +
theme_bw()
print(data_plot)
{{::graphic1.1.jpg?350|}}
We can appreciate that if we were to fit this as a linear regression model, we would violate the assumptions listed above. Let's have a look, using the ''gam()'' command from the //mgcv// package here to model an ordinary least squares regression //(we will see below how to use ''gam()'' to specify a smoothed, non-linear term)//.
library(mgcv)
linear_model = gam(y_obs~x)
model_summary=summary(linear_model)
print(model_summary)
data_plot = data_plot+
geom_line(colour="red",
aes(y=fitted(linear_model)))
print(data_plot)
We can see in the model summary that our linear model explains quite a bit of variance (R2(adj) = 0.639). However, diagnostic plots of the model residuals would quickly show that the residuals are neither normally distributed nor homoscedastic, and that an important non-linear pattern remains. Let's now try to fix this by fitting the data using a smoothed (non-linear) term.
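Before doing so, if you want to see these violations for yourself, here is a minimal sketch of such diagnostics (base R plotting only; it assumes ''linear_model'' has been fitted as above):
# Quick residual diagnostics for the linear fit (a sketch; linear_model as above).
# The residuals-vs-fitted plot reveals the leftover non-linear pattern and the
# changing spread; the QQ plot shows departures from normality.
par(mfrow=c(1,2))
plot(fitted(linear_model), resid(linear_model),
     xlab="Fitted values", ylab="Residuals"); abline(h=0, lty=2)
qqnorm(resid(linear_model)); qqline(resid(linear_model))
par(mfrow=c(1,1))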
We will revisit this later, but briefly, GAMs are effectively a nonparametric form of regression in which the β·x_i term of a linear regression is replaced by an arbitrary smooth function of the explanatory variable, f(x_i), and the model becomes:
y_i = f(x_i) + ε_i
where y_i is the response variable, x_i is the covariate, and //f// is the smooth function.
Importantly, given that the smooth function f(x_i) is non-linear and local, the magnitude of the effect of the explanatory variable can vary over its range, depending on the relationship between the variable and the response. That is, as opposed to one fixed coefficient β, the function //f// can continually change over the range of x_i. The degree of smoothness (or wiggliness) of //f// is controlled using penalized regression, determined automatically in //mgcv// using a generalized cross-validation (GCV) routine (Wood 2006).
In ''gam()'' smooth terms are specified by expressions of the form: ''s(x)'' where x is the covariate which the smooth is a function of.
gam_model = gam(y_obs~s(x))
summary(gam_model)
data_plot = data_plot +
geom_line(colour="blue",aes(y=fitted(gam_model)))
print(data_plot)
The variance explained by our model has increased by more than 20 percentage points (R2(adj) = 0.859), and when we compare the fit of the linear (red) and non-linear (blue) models, it is clear that the latter is the winner.
{{::graphic1.3.jpg?375|}}
The //mgcv// package also includes a default plot to look at the smooths:
plot(gam_model)
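If you are curious about how much smoothing //mgcv// settled on, the selected smoothing parameter and the resulting effective degrees of freedom can be pulled out of the fitted object; a minimal sketch (same ''gam_model'' as above):
gam_model$sp            # smoothing parameter chosen by GCV
summary(gam_model)$edf  # effective degrees of freedom of s(x)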
How do we test whether the non-linear model offers a significant improvement over the linear model? We can use the ''gam()'' and ''anova()'' commands to formally test whether an assumption of linearity is justified. To do so, we simply set up our smoothed model so that it is nested within our linear model; that is, we create a model object that includes both ''x'' (linear) and ''s(x)'' (non-linear), and we ask whether adding ''s(x)'' to the model with only ''x'' as a covariate is supported by the data.
linear_model = gam(y_obs~x)
nested_gam_model = gam(y_obs~s(x)+x)
print(anova(linear_model, nested_gam_model, test="Chisq"))
The non-linear term is significant:
Analysis of Deviance Table
Model 1: y_obs ~ x
Model 2: y_obs ~ s(x) + x
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 248.00 6.5846
2 240.68 2.4988 7.3168 4.0858 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that the model ''y_obs~s(x)'' gives exactly the same results as ''y_obs~s(x)+x'', as shown above for the ''nested_gam_model''. We used the ''s(x)+x'' to illustrate the nestedness of the model, however, we will omit the ''+x'' in the nested model comparisons that follow.
----
**CHALLENGE 1**
We will now try this test with some new simulated data, just to get a handle on it. First generate the data using the code below, then fit a linear and smoothed GAM model to the relation between ''x_test'' and ''y_test_obs''.
What are the effective degrees of freedom (edf) of the smoothed term? Determine whether linearity is justified for these data.
n <- 250
x_test <- runif(n,-5,5)
y_test_fit <- 4*dnorm(x_test)
y_test_obs <- rnorm(n,y_test_fit, 0.2)
++++ Challenge 1 Solution |
data_plot <- qplot(x_test, y_test_obs) +
geom_line(aes(y=y_test_fit))+
theme_bw()
print(data_plot)
linear_model_test <- gam(y_test_obs~x_test)
nested_gam_model_test <- gam(y_test_obs~s(x_test)+x_test)
print(anova(linear_model_test, nested_gam_model_test, test="Chisq"))
summary(nested_gam_model_test)$s.table
Analysis of Deviance Table
Model 1: y_test_obs ~ x_test
Model 2: y_test_obs ~ s(x_test) + x_test
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 248.0 81.09
2 240.5 7.46 7.5012 73.629 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
edf Ref.df F p-value
s(x_test) 7.602145 8.029057 294.0944 0
Answer: Yes, non-linearity is justified. The effective degrees of freedom (edf) are >> 1.
{{::challenge_1_soln.jpg?350|}}
++++
===== 2. Multiple smooth terms =====
GAMs make it easy to include both smooth and linear terms, multiple smoothed terms, and smoothed interactions. For this section, we'll be using a dataset automatically generated by the ''gamSim()'' command of the //mgcv// package. For further details on what type of data you can simulate with this function, see ''?gamSim''. Here we will use example 5 (''eg=5'') of ''gamSim()'', which simulates an additive example plus a factor variable.
gam_data = gamSim(eg=5)
head(gam_data)
We will now look at how we can predict ''y'' based on the other variables. Start with a basic model, with one smoothed term (x1) and one categorical predictor (x0, which has 4 levels).
basic_model = gam(y~x0+s(x1), data= gam_data)
basic_summary = summary(basic_model)
print(basic_summary$p.table)
print(basic_summary$s.table)
plot(basic_model)
Here, the ''p.table'' provides the significance table for each parametric term and the ''s.table'' is the significance table for the smooths. Note that for the latter, the wiggliness of the smoothed term ''s(x1)'' is indicated by the edf parameter (effective degrees of freedom); the higher the edf, the more non-linear the smoothing spline. A high value (8–10 or higher) means that the curve is highly non-linear, whereas a smoother with 1 effective degree of freedom is a straight line. In contrast, in linear regression the //model// degrees of freedom are equivalent to the number of non-redundant free parameters, p, in the model (and the //residual// degrees of freedom are given by n-p). We will revisit the edf later in this workshop.
print(basic_summary$p.table)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.550030 0.3655849 23.387258 1.717989e-76
x02 2.418682 0.5165515 4.682364 3.908046e-06
x03 4.486193 0.5156501 8.700072 9.124666e-17
x04 6.528518 0.5204234 12.544629 1.322632e-30
print(basic_summary$s.table)
edf Ref.df F p-value
s(x1) 1.923913 2.406719 42.84268 1.076396e-19
In our basic model the edf of smooth function s(x1) is ~2, which suggests a non-linear curve. The plot of the model nicely illustrates the shape of this non-linear smoother:
{{::graphic2.2.jpg?425|}}
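As an aside, the maximum wiggliness a smooth is allowed is capped by the basis dimension ''k'' of ''s()'' (we return to this in Section 7). A hypothetical sketch with the same data (''capped_model'' is just an illustrative name):
# Hypothetical illustration: capping the basis dimension of s(x1).
# k - 1 sets the upper limit on the edf of the smooth; the penalty can only
# shrink the edf below that limit.
capped_model <- gam(y~x0+s(x1, k=3), data=gam_data)
summary(capped_model)$s.table  # edf can be at most k - 1 = 2 here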
We can add a second term, x2, but specify a linear relationship with y (that is, both linear and non-linear terms can be included in a GAM). The new linear term, x2, will appear in the ''p.table'', where its regression coefficient estimate is reported. In the ''s.table'', we will once again find the non-linear smoother, s(x1), and its wiggliness (edf).
two_term_model <- gam(y~x0+s(x1)+x2, data=gam_data)
two_term_summary <- summary(two_term_model)
print(two_term_summary$p.table)
print(two_term_summary$s.table)
If we wish to explore whether the relationship between y and x2 is non-linear, we can model x2 as a non-linear smooth term instead. As before, we can use an ANOVA to test whether the smoothed term is necessary.
two_smooth_model <- gam(y~x0+s(x1)+s(x2), data=gam_data)
two_smooth_summary <- summary(two_smooth_model)
print(two_smooth_summary$p.table)
print(two_smooth_summary$s.table)
plot(two_smooth_model,page=1)
{{::graphic2.4.jpg?600|}}
When more than one covariate is included in the model, as above, the fitted response can be partitioned into the contributions of each variable, as shown. Here we can appreciate the varying magnitude of the effect of each variable: the y-axis represents the contribution (effect) of each covariate to the fitted response, centred on 0. If the confidence intervals overlapped zero for certain values of x (or throughout its entire range), this would imply a non-significant effect at those x values (or of x as a whole). Where the contribution of an individual covariate changes along the x-axis, a change in that covariate is associated with a change in the response.
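To draw these partial effects with shaded confidence bands on a single page, the standard ''plot.gam'' options ''shade'' and ''seWithMean'' can be used; a small sketch:
# Partial effect plots with shaded 95% confidence bands (a sketch).
# seWithMean=TRUE includes the uncertainty about the overall mean, which makes
# the "does the band overlap zero?" reading more reliable.
plot(two_smooth_model, pages=1, shade=TRUE, seWithMean=TRUE)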
anova(basic_model,two_term_model,two_smooth_model, test="Chisq")
Analysis of Deviance Table
Model 1: y ~ x0 + s(x1)
Model 2: y ~ x0 + s(x1) + x2
Model 3: y ~ x0 + s(x1) + s(x2)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 394.08 5231.6
2 393.10 4051.3 0.97695 1180.2 < 2.2e-16 ***
3 385.73 1839.5 7.37288 2211.8 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The best fit model is the model with both smooth terms for x1 and x2.
----
**CHALLENGE 2**
Create two new models, one with x3 as a linear term and one with x3 as a smoothed term.
Use plots, coefficient tables and the anova function to determine if x3 is an important term to include.
++++ Challenge 2 solution |
three_term_model <- gam(y~x0+s(x1)+s(x2)+x3, data=gam_data)
three_smooth_model <- gam(y~x0+s(x1)+s(x2)+s(x3), data=gam_data)
three_smooth_summary <- summary(three_smooth_model)
print(three_smooth_summary$p.table)
print(three_smooth_summary$s.table)
plot(three_smooth_model,page=1)
# edf = 1 therefore term is linear.
anova(two_smooth_model,three_term_model,test="Chisq")
# term x3 is not significant
++++
===== 3. Interactions =====
There are two ways to include interactions between variables:
* if one variable is smoothed, and the other isn't: using the 'by' argument,
* fully smoothing both variables.
The ''by'' argument lets a smooth term vary between different levels of a factor. We will examine this using our categorical variable x0, asking whether the non-linear smoother s(x2) varies across the different levels of x0. To determine whether smooth terms differ significantly among factor levels, we will use an ANOVA test on the interaction term.
categorical_interact <- gam(y~x0+s(x1)+s(x2,by=x0),data=gam_data)
categorical_interact_summary <- summary(categorical_interact)
print(categorical_interact_summary$s.table)
plot(categorical_interact,page=1)
# or alternatively: plot using vis.gam function, where theta is the degree rotation on the x-y plane
vis.gam(categorical_interact,view=c("x2","x0"),theta=40,n.grid=500,border=NA)
anova(two_smooth_model, categorical_interact,test="Chisq")
{{::graphic3.1b.png?350|}}
Visually, we see that the shapes of the smooth terms are comparable among the four levels of x0. The ANOVA test confirms this as well (Deviance = 98.6, p = 0.2347).
Finally, we'll look at the interaction between two smooth terms, x1 and x2. Here the ''by'' argument is dropped, since both variables are quantitative and are smoothed together.
smooth_interact <- gam(y~x0+s(x1,x2),data=gam_data)
smooth_interact_summary <- summary(smooth_interact)
print(smooth_interact_summary$s.table)
plot(smooth_interact,page=1,scheme=3)
# plot(smooth_interact,page=1,scheme=1) will give a similar plot to the vis.gam()
vis.gam(smooth_interact,view=c("x1","x2"),theta=40,n.grid=500,border=NA)
anova(two_smooth_model,smooth_interact,test="Chisq")
{{::graphic3.2b.png?350|}}
The interaction between s(x1) and s(x2) is significant, and the 2D plot nicely illustrates this non-linear interaction: y is greatest at high values of x1 but low to mid values of x2. (Try playing around with ''theta'' to rotate the figure; if you plan to run a lot of these, drop the argument ''n.grid=500'' to make the plots render faster.)
===== 4. Changing basis =====
We won't go too much further into this in this workshop, but you should know you can expand on the basic model we've looked at today with:
* more complicated smooths, by changing the basis,
* other distributions: anything you can do with a GLM using the family argument,
* mixed effect models, using gamm or the gamm4 package.
We will first have a look at changing the basis, followed by a quick intro to using other distributions and GAMMs (Generalized Additive Mixed Models). Let's look at one place where knowing how to change the basis can help: cyclical data. That is, data where you want the predictor to match at the ends. Imagine you have a time series of climate data, broken down by monthly measurements, and you want to see if there's a time trend in yearly temperature. We'll use the Nottingham temperature time series for this:
data(nottem)
n_years <- length(nottem)/12
nottem_month <- rep(1:12, times=n_years)
nottem_year <- rep(1920:(1920+n_years-1),each=12)
nottem_plot <- qplot(nottem_month,nottem,
colour=factor(nottem_year),
geom="line") + theme_bw()
print(nottem_plot)
Using the nottem data, we have created three new vectors; ''n_years'' corresponds to the number of years of data (20 years), ''nottem_month'' is a categorical variable coding for the 12 months of the year, for every year sampled (sequence 1 to 12 repeated 20 times), and ''nottem_year'' is a variable where the year corresponding to each month is provided.
Monthly data from 1920 through 1939:
{{::graphic4.1.jpg?500|}}
To model this, we need to use what's called a cyclic cubic spline, or "cc", to model the month effect, as well as a term for year.
year_gam <- gam(nottem~s(nottem_year)+s(nottem_month, bs="cc"))
summary(year_gam)$s.table
plot(year_gam,page=1, scale=0)
{{::graphic4.2.jpg?700|}}
There is a rise of about 1–1.5 degrees in temperature over the period, but within a given year there is about 20 degrees of variation in temperature, on average.
The actual data vary around these values and that is the unexplained variance. Here we can see one of the very interesting bonuses of using GAMs. We can either plot the response surface (fitted values) or the terms (contribution of each covariate) as shown here. You can imagine these as plots of the changing regression coefficients, and how their contribution (or effect size) varies over time. In the first plot, we see that positive contributions of temperature occurred post-1930.
Over longer time scales, for example using paleolimnological data, others ([[http://www.aslo.info/lo/toc/vol_54/issue_6_part_2/2529.pdf|Simpson & Anderson 2009; Fig. 3c]]) have used GAMs to plot the contribution (effect) of temperature on algal assemblages in lakes, illustrating how significant contributions only occurred during two extreme cold events (that is, the contribution is significant when the confidence intervals do not overlap zero, at around 300 and 100 B.C.). This allowed the authors not only to determine how much variance was explained by temperature over the last few centuries, but also to pinpoint when in time this effect was significant. If this is of interest to you, the code to plot either the response surface (''type = "response"'') or the terms (''type = "terms"'') is given below. When ''type = "terms"'' is selected, you obtain the same figure as above.
++++ Contribution plot|
pred<-predict(year_gam, type = "terms", se = TRUE)
I<-order(nottem_year)
plusCI<-I(pred$fit[,1] + 1.96*pred$se[,1])
minusCI<-I(pred$fit[,1] - 1.96*pred$se[,1])
xx <- c(nottem_year[I],rev(nottem_year[I]))
yy <- c(plusCI[I],rev(minusCI[I]))
plot(xx,yy,type="n",cex.axis=1.2)
polygon(xx,yy,col="light grey",border="light grey")
lines(nottem_year[I], pred$fit[,1][I],lty=1)
abline(h=0)
++++
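For the response-surface version, a minimal sketch along the same lines is given below (it assumes the ''year_gam'' model and ''nottem'' vectors created above; ''pred_resp'' and ''time_axis'' are just illustrative names).
++++ Response plot|
# Fitted values on the response scale (a sketch; objects as above).
pred_resp <- predict(year_gam, type = "response", se = TRUE)
time_axis <- nottem_year + (nottem_month - 1)/12  # decimal years
plot(time_axis, nottem, type = "l", col = "grey",
     xlab = "Year", ylab = "Temperature")
lines(time_axis, pred_resp$fit, lwd = 2)
++++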
===== 5. Other distributions =====
To provide a brief overview of how to use GAMs when the response variable does not follow a normal distribution, being for example count or proportion data (binomial, Poisson, negative binomial) or strictly positive continuous data (Gamma), the example that follows uses a dataset where a binomial family distribution is needed and a non-linear relationship with the explanatory variable is evident. Here, the response variable represents the number of successes (event happened) versus failures over the course of an experiment.
gam_data3 <- read.csv("other_dist.csv")
summary(gam_data3)
str(gam_data3)
'data.frame': 514 obs. of 4 variables:
$ prop : num 1 1 1 1 0 1 1 1 1 1 ...
$ total: int 4 20 20 18 18 18 20 20 20 20 ...
$ x1 : int 550 650 750 850 950 650 750 850 950 550 ...
$ fac : Factor w/ 4 levels "f1","f2","f3",..: 1 1 1 1 1 1 1 1 1 1 ...
''prop'' is the response variable, equal to the proportion of successes: //successes / (successes + failures)//. Note that there are numerous cases where the proportion equals 1 or 0, which indicates that the outcomes were always successes or failures, respectively, at a given time point in the experiment.\\
''x1'' is the time since the start of experiment (our explanatory variable).\\
''total'' represents the number of //successes + failures// observed at any time point of the experiment.\\
''fac'' is a factor coding for trials 1 through 4 of the experiment (we will not use this variable in this section).
Let's start by visualizing the data. We are interested in the number of successes in comparison to failures as x1 increases.
We will calculate the grand averages of the proportions of successes per time bin (x1) given that there are repeated measures per x1 value (numerous trials and observations per trial).
library(itsadug)  # emptyPlot() and plot_smooth() (used below) become available once itsadug is loaded
emptyPlot(range(gam_data3$x1), c(0,1), h=.5,
main="Probability of successes", ylab="Probability",xlab="x1")
avg <- aggregate(prop ~ x1, data=gam_data3, mean, na.rm=TRUE)
lines(avg$x1, avg$prop, col="orange",lwd=2)
{{::graphic1_otherdist.jpg?450|}}
As x1 increases, so does the probability of successes. Would you say that this trend is linear or non-linear? We will test this using a logistic GAM (we use a ''binomial'' family distribution given that our response is proportion data).
prop_model <- gam(prop~ s(x1), data=gam_data3, weights=total, family="binomial")
prop_summary <- summary(prop_model)
print(prop_summary$p.table)
print(prop_summary$s.table)
plot(prop_model)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.173978 0.02709613 43.32641 0
edf Ref.df Chi.sq p-value
s(x1) 4.591542 5.615235 798.9407 2.027751e-164
What does the intercept represent in this model?
* Recall that the model uses the count outcomes to calculate the //logit//, which is the log odds ratio between successes and failures:
logit = log(N_success / N_failures)
- If successes = failures, the ratio is 1 and the logit is 0 (log(1) = 0).\\
- If successes have a larger count than failures, the ratio is greater than 1 and the logit has a positive value (e.g., log(2) = 0.69).\\
- If successes have a smaller count than failures, the ratio is lower than 1 and the logit has a negative value (e.g., log(0.5) = -0.69).\\
Thus, the intercept is the log odds ratio of successes to failures (logit), and indicates whether on average there are more successes than failures. Here, the estimated intercept coefficient is positive, which means that there are more successes than failures overall.
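To put this intercept back on the proportion scale, apply the inverse logit; a quick sketch (''plogis()'' is base R's inverse-logit function):
# Back-transform the intercept from log odds to a proportion (a sketch).
plogis(coef(prop_model)[1])  # inverse logit of ~1.17, i.e. roughly 0.76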
What does the smooth term indicate?
* This represents how the log odds of successes versus failures changes over x1 (time in this example). Since the edf > 1, this change is non-linear: the proportion of successes increases faster and faster over time (if, for example, the response represents the count of species A versus species B, and nutrient concentrations are increased over time, these results would indicate that species A is observed increasingly often as nutrient concentrations approach its niche optimum over the course of the experiment).
=== Visualizing the trend over time ===
Lastly, we will see the different ways this relationship could be represented graphically.
par(mfrow=c(1,2))
plot(prop_model, select=1, scale=0, shade=TRUE)
abline(h=0)
out <- plot_smooth(prop_model, view="x1",main="")
(diff <- find_difference(out$fv$fit, out$fv$CI, xVals=out$fv$x1))
addInterval(0, lowVals=diff$start, highVals = diff$end, col='red', lwd=2)
abline(v=c(diff$start, diff$end), lty=3, col='red')
text(mean(c(diff$start, diff$end)), 2.1, "sign. more \n success", col='red', font=3)
{{::graphic2_otherdist.jpg?800|}}
What do these plots tell us about successes versus failures?
* Left plot: the contribution, or partial effect, of the smooth (what we would look at if we had more than one explanatory variable). Over time the log odds increase, so over time successes increase and failures decrease.
* Right plot: fitted values, intercept included (summed effect if we had more than one explanatory variable included in the model). Here we see that the log odds ratio is estimated around zero at the start of the experiment; this means there are equal amounts of successes and failures. Gradually successes increase, and at around x1=400 there are significantly more successes than failures (the smooth is significantly different from zero). We also illustrated how we could use the plot to determine at which x1 value this occurs.
Lastly, to help interpret the results, we could transform the summed effects back to proportions with the function ''plot_smooth'' from the //itsadug// package:
par(mfrow=c(1,1))
plot_smooth(prop_model, view="x1", main="",
transform=plogis, ylim=c(0,1))
abline(h=.5, v=diff$start, col='red', lty=2)
{{::graphic3_otherdist.jpg?450|}}
As we already derived from the logit plot, we see that at around x1=400 the proportion of successes increases significantly above 0.5.
===== 6. Quick intro to GAMMs =====
==== Dealing with non-independence ====
When observations are not independent, GAMs can be extended to incorporate either:
* a serial correlation structure to model residual autocorrelation (autoregressive AR, moving average MA, or a combination of the two ARMA),
* random effects that model the non-independence among observations from the same site.
That is, in addition to changing the basis as with the nottem example above, we can also add complexity to the model by incorporating an autocorrelation structure or mixed effects using the ''gamm()'' function or the //gamm4// package.
To start, let's have a look at the first case; a model with temporal autocorrelation in the residuals. We will revisit the Nottingham temperature model and test for correlated errors using the (partial) autocorrelation function.
par(mfrow=c(1,2))
acf(resid(year_gam), lag.max = 36, main = "ACF")
pacf(resid(year_gam), lag.max = 36, main = "pACF")
{{::graphic5.1.jpg?700|}}
The autocorrelation function plot (ACF; first panel above) provides the cross-correlation of a time series with itself at different points in time (i.e. similarity between observations at increasingly large time lags). In contrast, the partial autocorrelation function (pACF) gives the partial correlation of a time series with its own lagged values after controlling for the values of the time series at all shorter lags. The ACF and pACF plots are thus used to identify after how many time steps observations become independent of one another. The ACF plot of our model residuals suggests a significant lag of 1, and perhaps a lag of 2. Therefore, a low-order AR model is likely needed. We can test this by adding AR structures to the Nottingham temperature model, one with an AR(1) (correlation at 1 time step) and one with an AR(2) (correlation at 2 time steps), and testing for the best-fit model using ANOVA.
year_gam <- gamm(nottem~s(nottem_year)+s(nottem_month, bs="cc"))
year_gam_AR1 <- gamm(nottem~s(nottem_year)+s(nottem_month, bs="cc"),
correlation = corARMA(form = ~ 1|nottem_year, p = 1))
year_gam_AR2 <- gamm(nottem~s(nottem_year)+s(nottem_month, bs="cc"),
correlation = corARMA(form = ~ 1|nottem_year, p = 2))
anova(year_gam$lme,year_gam_AR1$lme,year_gam_AR2$lme)
Model df AIC BIC logLik Test L.Ratio p-value
year_gam$lme 1 5 1109.908 1127.311 -549.9538
year_gam_AR1$lme 2 6 1101.218 1122.102 -544.6092 1 vs 2 10.689206 0.0011
year_gam_AR2$lme 3 7 1101.598 1125.962 -543.7988 2 vs 3 1.620821 0.2030
The AR(1) provides a significant increase in fit over the naive model (LRT = 10.69, p = 0.0011), but there is very little improvement in moving to the AR(2) (LRT = 1.62, p = 0.203). So it is best to include only the AR(1) structure in our model.
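To verify that the AR(1) structure has soaked up the residual autocorrelation, we can inspect the ''$lme'' part of the fit; a minimal sketch (''type = "normalized"'' asks //nlme// for residuals corrected for the correlation structure):
# Check the AR(1) model (a sketch; objects as above).
summary(year_gam_AR1$lme)   # reports the estimated AR(1) coefficient (Phi)
acf(resid(year_gam_AR1$lme, type = "normalized"),
    lag.max = 36, main = "ACF of normalized residuals")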
==== Mixed modelling ====
As we saw in the previous section, ''bs'' specifies the type of underlying basis function. For random intercepts and linear random slopes we use ''bs="re"'', but for random smooths we use ''bs="fs"''.
Three different types of random effects are distinguished when using GAMMs (where //fac// is a categorial variable coding for the random effect and //x0// is a continuous fixed effect):
* //random intercepts// adjust the height of other model terms with a constant value: s(fac, bs="re")
* //random slopes// adjust the slope of the trend of a numeric predictor: s(fac, x0, bs="re")
* //random smooths// adjust the trend of a numeric predictor in a nonlinear way: s(x0, fac, bs="fs", m=1), where the argument m=1 sets a heavier penalty for the smooth moving away from 0, causing shrinkage towards the mean.
We will first examine a GAMM with only a random intercept. As before, we will use the ''gamSim()'' function to automatically generate a dataset, this time with a random effect component included, and then run a model with a random intercept using //fac// as the random factor.
# generate and view data
gam_data2 <- gamSim(eg=6)
head(gam_data2)
# run random intercept model
gamm_intercept <- gam(y ~ s(x0) + s(fac, bs="re"), data=gam_data2)
# examine model output
summary(gamm_intercept)$s.table
Note that there is now a smooth term for the random intercept in the summary table. You can plot and view the random intercepts for each level of //fac// as follows:
plot(gamm_intercept, select=2)
# select=2 because the random effect appears as the second entry in the summary table.
We can also use the ''plot_smooth'' function to visualize the model, which in contrast to the default ''plot.gam'', allows us to plot a smooth of the summed effects of a GAM (based on predictions) as we saw earlier, but in addition, optionally removes the random effects. Here, we will plot the summed effects for the x0 without random effects, and then plot the predictions of all four levels of the random //fac// effect:
par(mfrow=c(1,2), cex=1.1)
plot_smooth(gamm_intercept, view="x0", rm.ranef=TRUE, main="intercept + s(x0)", rug=FALSE)
plot_smooth(gamm_intercept, view="x0", cond=list(fac="1"),
main="... + s(fac)", col='orange', ylim=c(8,21), rug=FALSE)
plot_smooth(gamm_intercept, view="x0", cond=list(fac="2"), add=TRUE, col='red')
plot_smooth(gamm_intercept, view="x0", cond=list(fac="3"), add=TRUE, col='purple')
plot_smooth(gamm_intercept, view="x0", cond=list(fac="4"), add=TRUE, col='turquoise')
{{::graphic5.3.jpg?650|}}
Next we will run and plot a model with random slopes:
gamm_slope <- gam(y ~ s(x0) + s(x0, fac, bs="re"), data=gam_data2)
summary(gamm_slope)$s.table
plot_smooth(gamm_slope, view="x0", rm.ranef=TRUE, main="intercept + s(x0)", rug=FALSE)
plot_smooth(gamm_slope, view="x0", cond=list(fac="1"),
main="... + s(fac)", col='orange',ylim=c(7,22), rug=FALSE)
plot_smooth(gamm_slope, view="x0", cond=list(fac="2"), add=TRUE, col='red')
plot_smooth(gamm_slope, view="x0", cond=list(fac="3"), add=TRUE, col='purple')
plot_smooth(gamm_slope, view="x0", cond=list(fac="4"), add=TRUE, col='turquoise')
{{::graphic5.4.jpg?650|}}
We will now include both a random intercept and slope term.
gamm_int_slope <- gam(y ~ s(x0) + s(fac, bs="re")
+ s(fac, x0, bs="re"), data=gam_data2)
summary(gamm_int_slope)$s.table
plot_smooth(gamm_int_slope, view="x0", rm.ranef=TRUE, main="intercept + s(x0)", rug=FALSE)
plot_smooth(gamm_int_slope, view="x0", cond=list(fac="1"),
main="... + s(fac) + s(fac, x0)", col='orange', ylim=c(7,22), rug=FALSE)
plot_smooth(gamm_int_slope, view="x0", cond=list(fac="2"), add=TRUE, col='red', xpd=TRUE)
plot_smooth(gamm_int_slope, view="x0", cond=list(fac="3"), add=TRUE, col='purple', xpd=TRUE)
plot_smooth(gamm_int_slope, view="x0", cond=list(fac="4"), add=TRUE, col='turquoise', xpd=TRUE)
{{::graphic5.5.jpg?650|}}
Note that the random slope is static in this case:
plot(gamm_int_slope, select=3)
# select=3 because the random slope appears as the third entry in your summary table.
Lastly, we will examine a model with a random smooth.
gamm_smooth <- gam(y ~ s(x0, fac, bs="fs", m=1), data=gam_data2)
summary(gamm_smooth)$s.table
Here, if the random slope varied along x0, we would see different curves for each level:
plot(gamm_smooth, select=1)
# select=1 because the smooth slope appears as the first entry in your summary table.
All of the above models can in turn be compared using ''anova()'' as in the previous sections to determine the best fit model.
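Since all of these random-effect models were fitted with ''gam()'', ''AIC()'' also gives a quick way to compare them; a minimal sketch (lower AIC indicates a better fit, penalized for complexity):
# Compare the random-effect structures (a sketch; models as fitted above).
AIC(gamm_intercept, gamm_slope, gamm_int_slope, gamm_smooth)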
===== 7. GAMs behind the scenes =====
We will now take a few minutes to look at what GAMs are doing behind the scenes. Let's first consider a model containing one smooth function of one covariate, x_i:
{y_i} = f(x_{i}) + {ε_i}
To estimate the smooth function //f//, we need to represent the above equation in such a way that it becomes a linear model. This can be done by choosing a basis, b_i(x), defining the space of functions of which //f// is an element:
f(x) = Σ_{i=1}^{q} b_i(x) β_i
**A simple example: a polynomial basis**
Suppose that //f// is believed to be a 4th order polynomial, so that the space of polynomials of order 4 and below contains //f//. A basis for this space would then be:
b_{1}(x)=1, b_{2}(x)=x, b_{3}(x)=x^{2}, b_{4}(x)=x^{3}, b_{5}(x)=x^{4}
so that f(x) becomes:
f(x_{i}) = β_{1} + x_{i}β_{2} + x^{2}_{i}β_{3} + x^{3}_{i}β_{4} + x^{4}_{i}β_{5}
and the full model now becomes:
y_{i} = β_{1} + x_{i}β_{2} + x^{2}_{i}β_{3} + x^{3}_{i}β_{4} + x^{4}_{i}β_{5} + ε_{i}
The basis functions are each multiplied by a real valued parameter, βi, and are then summed to give the **final curve f(x)**.
{{::polynomial_basis_example.png?600|}}
By varying the βi we can vary the form of f(x) to produce any polynomial function of order 4 or lower.
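To make the idea concrete, here is a hypothetical sketch that builds this polynomial basis by hand for the ''x'' and ''y_obs'' simulated in Section 1 and estimates the β's with an ordinary linear model (''X_basis'' and ''poly_fit'' are illustrative names):
# Hypothetical sketch: a 4th-order polynomial basis fitted as a linear model.
X_basis  <- cbind(1, x, x^2, x^3, x^4)   # the basis functions b_1(x) ... b_5(x)
poly_fit <- lm(y_obs ~ X_basis - 1)      # estimates beta_1 ... beta_5
plot(x, y_obs)
points(x, fitted(poly_fit), pch = 16, cex = 0.5)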
**Another example: a cubic spline basis**
A cubic spline is a curve constructed from sections of a cubic polynomial joined together so that they are continuous in value. Each section of cubic has different coefficients.
{{::cubic_spline.png?450|}}
Here's a representation of a smooth function using a rank 5 cubic spline basis with knot locations at increments of 0.2:
{{::graphic6.1.jpg?300|}}
In this example, the knots are evenly spaced through the range of observed x values.
However, the choice of the degree of model smoothness is controlled by the number of knots, which was arbitrary. Is there a better way to select the knot locations?
**Controlling the degree of smoothing with penalized regression splines**
Instead of controlling smoothness by altering the number of knots, we keep the number of knots fixed at a size a little larger than reasonably necessary, and we control the model's smoothness by adding a "wiggliness" penalty.
So, rather than fitting the model by minimizing (as with least squares regression):
||y − Xβ||^2
it can be fit by minimizing:
||y − Xβ||^2 + λ ∫_0^1 [f''(x)]^2 dx
As λ goes to infinity, the model becomes linear. The selection of the best-fit smoothing parameter, λ, makes use of a cross-validation approach: if λ is too high, the data will be over-smoothed, and if it is too low, the data will be under-smoothed. Ideally, it would be good to choose λ so that the //predicted f// is as close as possible to the true //f//. A suitable criterion might be to choose λ to minimize:
M = (1/n) Σ_{i=1}^{n} (f̂_i − f_i)^2
Since //f// is unknown, M is estimated using a generalized cross validation technique that leaves out each datum from the data in turn and considers the average ability of models fitted to the remaining data to predict the left out datum (for further details, see Wood 2006).
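To see the role of λ directly, ''gam()'' lets you fix the smoothing parameter through its ''sp'' argument rather than letting GCV choose it; a hypothetical sketch with the Section 1 data (the ''sp'' values are arbitrary, chosen only to show under- and over-smoothing):
# Sketch: fixing lambda by hand via 'sp' (x and y_obs from Section 1).
gam_wiggly <- gam(y_obs ~ s(x, k=20), sp=1e-6)  # tiny penalty: very wiggly fit
gam_stiff  <- gam(y_obs ~ s(x, k=20), sp=1e3)   # large penalty: nearly a straight line
gam_gcv    <- gam(y_obs ~ s(x, k=20))           # penalty chosen by GCV
plot(x, y_obs, col="grey")
points(x, fitted(gam_wiggly), col="red",  pch=16, cex=0.4)
points(x, fitted(gam_stiff),  col="blue", pch=16, cex=0.4)
points(x, fitted(gam_gcv),    pch=16, cex=0.4)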
**Illustration of the principle behind cross validation**
{{::illustration_of_smooth_sel.png?600|}}
In the first panel, the curve fits many of the data poorly and does no better with the missing point. In the third panel, the curve fits the noise as well as the signal and the extra variability induced causes it to predict the missing datum rather poorly. In the second panel, however, we see that the curve fits the underlying signal quite well, smoothing through the noise and the missing datum is reasonably well predicted.
{{::gcv.png?600|}}
**Brief note on effective degrees of freedom (edf)**
How many degrees of freedom does a fitted GAM have?
Instead of reporting the output of the cross-validation in terms of λ (the tuning parameter modulating the fit's complexity), the ''gam()'' function in the //mgcv// package uses a quantity called the effective degrees of freedom (edf) as a consistent way to quantify model complexity (that is, to justify our intuition that degrees of freedom track model complexity).
Because the number of free parameters in GAMs is difficult to define, the edf are instead related to λ, where the effect of the penalty is to reduce the effective degrees of freedom. If the arbitrarily large number of knots is set to k = 10, then k − 1 sets the upper limit on the degrees of freedom associated with a smooth term. This number then decreases as the penalty λ increases, until the best-fit penalty is found by cross-validation.
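In practice, ''gam.check()'' from //mgcv// reports k' and the edf for each smooth (along with a test of whether the basis dimension was set too low) and draws residual diagnostic plots; a quick sketch with the Section 1 model:
# Sketch: checking whether the basis dimension k was large enough.
# A low p-value with edf close to k' suggests increasing k; the plots are
# standard residual diagnostics.
gam.check(gam_model)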
===== References =====
There's a great deal more out there on GAMs; we have only scratched the surface here.
Simon Wood, the author of the //mgcv// package, has a very useful website with introductory talks and notes on how to use GAMs:
* http://people.bath.ac.uk/sw283/mgcv/
He has also written a book, //Generalized Additive Models: An Introduction with R//, which we used as a reference for this workshop.
Materials for this workshop were also drawn from the following fantastic blogs and tutorials:
* http://www.fromthebottomoftheheap.net/blog/
* http://www.sfs.uni-tuebingen.de/~jvanrij/Tutorial/GAMM.html
* http://www.sfs.uni-tuebingen.de/~jvanrij/LSA2015/AnswersLab2.html
Finally, the help pages, available through ?gam in R, are an excellent resource.